Scaling laws for reward model over-optimization


In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can harm ground-truth performance, in accordance with Goodhart's law. This effect is frequently observed but rarely measured carefully, owing to the expense of collecting human preference data. In this paper, we use a synthetic setup in which a fixed "gold standard" reward model plays the role of humans, providing the labels used to train a proxy reward model. We study how the gold reward model's score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the optimization method, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
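To make the best-of-n side of this setup concrete, below is a minimal sketch in Python. It is an illustration under assumed interfaces, not the paper's implementation: `generate_samples`, `proxy_reward`, and `gold_reward` are hypothetical placeholders standing in for the policy, the learned proxy reward model, and the fixed gold reward model. For each prompt, n candidate completions are drawn, the one scoring highest under the proxy is selected, and its score under the gold model measures ground-truth performance.

```python
# Hypothetical sketch of best-of-n (BoN) optimization against a proxy reward
# model, with a fixed "gold" reward model standing in for human preferences.
# All functions here are illustrative placeholders, not the paper's code.

import math
import random


def generate_samples(prompt: str, n: int) -> list[str]:
    """Placeholder: draw n completions from the unoptimized policy."""
    return [f"{prompt} [completion {i}]" for i in range(n)]


def proxy_reward(completion: str) -> float:
    """Placeholder: score from the learned proxy reward model."""
    return random.gauss(0.0, 1.0)


def gold_reward(completion: str) -> float:
    """Placeholder: score from the fixed gold reward model (the 'human')."""
    return random.gauss(0.0, 1.0)


def best_of_n(prompt: str, n: int) -> tuple[float, float]:
    """Pick the completion the proxy likes best; report proxy and gold scores."""
    candidates = generate_samples(prompt, n)
    proxy_scores = [proxy_reward(c) for c in candidates]
    best_idx = max(range(n), key=lambda i: proxy_scores[i])
    return proxy_scores[best_idx], gold_reward(candidates[best_idx])


def bon_kl(n: int) -> float:
    """Standard analytic KL divergence from the base policy induced by BoN:
    KL = log(n) - (n - 1) / n."""
    return math.log(n) - (n - 1) / n
```

Sweeping n (and hence the induced KL from the initial policy) while tracking both scores is what lets one plot proxy versus gold reward as a function of optimization pressure.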


