Important disanalogies remain between our current empirical setup and the ultimate problem of aligning superhuman models. For example, it may be easier for future models to imitate the errors of weak human supervisors than it is for current strong models to imitate the errors of current weak models, which could make generalization harder in the future.
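The setup described above can be illustrated with a minimal sketch: train a deliberately weak "supervisor" model on ground-truth labels, then train a higher-capacity "student" only on the weak model's noisy labels, and ask whether the student merely imitates the supervisor's errors or generalizes beyond them. This is an illustrative stand-in, not the released code; the dataset and model choices below are arbitrary assumptions.

```python
# Toy weak-to-strong setup (illustrative sketch, not OpenAI's implementation).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic task; shuffle=False keeps informative features in the first columns.
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=4, shuffle=False, random_state=0
)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Weak supervisor: under-capacity on purpose (sees only 2 of 20 features),
# trained on ground truth, so its labels for the student contain errors.
weak = LogisticRegression(max_iter=200).fit(X_sup[:, :2], y_sup)
weak_labels = weak.predict(X_train[:, :2])

# Strong student: higher capacity, trained ONLY on the weak (noisy) labels.
strong = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
strong.fit(X_train, weak_labels)

# The interesting comparison: does the student recover performance beyond
# its supervisor, or does it simply imitate the supervisor's mistakes?
print("weak supervisor accuracy:", weak.score(X_test[:, :2], y_test))
print("strong student accuracy :", strong.score(X_test, y_test))
```

Whether the student exceeds its supervisor here depends on the task and capacity gap; the analogue of the disanalogy above is a student expressive enough to fit the supervisor's errors exactly.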
Nevertheless, we believe our setup captures some of the key difficulties of aligning future superhuman models, and that we can start making empirical progress on this problem today. There are many promising directions for future work, including fixing the disanalogies in our setup, developing more scalable methods, and advancing our scientific understanding of when and how we should expect good weak-to-strong generalization.
We believe this is a great opportunity for the ML research community to make progress on alignment. To kickstart further research in this area:
- We are releasing open source code to make it easy to start experimenting with weak-to-strong generalization today.
- We are launching a $10 million grants program for graduate students, academics, and other researchers working broadly on superhuman AI alignment. We are especially excited to support research related to weak-to-strong generalization.
Figuring out how to safely align future superhuman AI systems has never been more important, and it has never been easier to make empirical progress on this problem. We are excited to see what breakthroughs researchers discover.