Scaling laws for reward model overoptimization
Reinforcement learning from human feedback typically optimizes against a reward model that has been trained to predict human preferences. Since the reward model is an imperfect proxy, overoptimizing its value…
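To make the failure mode concrete, here is a minimal toy sketch, offered purely as an illustration and not as the setup studied in this work: candidates are drawn from a fixed base distribution, the candidate that scores highest under an imperfect proxy reward is kept, and a separate gold reward stands in for true human preference. The functions gold_reward, proxy_reward, and mean_gold_after_selection, the one-dimensional rewards, and the rejection-sampling style of optimization are all hypothetical choices made for the sketch.

```python
# Toy sketch (illustrative assumptions only): apply increasing selection
# pressure against an imperfect proxy reward and observe the effect on a
# separate "gold" reward that stands in for true human preference.
import numpy as np

rng = np.random.default_rng(0)

def gold_reward(x):
    """Stand-in for true human preference; highest near x = 1."""
    return -(x - 1.0) ** 2

def proxy_reward(x):
    """Imperfect proxy: roughly agrees with the gold reward near the base
    distribution but systematically over-rewards extreme outputs."""
    return gold_reward(x) + 0.12 * x ** 3

def mean_gold_after_selection(n, trials=1000):
    """Average gold reward of the proxy-selected candidate over many trials."""
    total = 0.0
    for _ in range(trials):
        candidates = rng.normal(loc=0.0, scale=2.0, size=n)       # base "policy" samples
        winner = candidates[np.argmax(proxy_reward(candidates))]  # select by proxy only
        total += gold_reward(winner)
    return total / trials

for n in (1, 4, 16, 64, 256, 1024):
    print(f"n={n:5d}  mean gold reward of proxy-selected candidate: "
          f"{mean_gold_after_selection(n):7.2f}")
```

Here larger n plays the role of stronger optimization against the proxy: in this toy setting the mean gold reward should typically improve for small n and then degrade once selection starts favoring the tail region the proxy mis-scores.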