We trained a model to achieve a new state of the art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to improving performance over outcome supervision, process supervision has an important alignment benefit: it directly trains the model to produce a chain of reasoning that humans endorse.
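To make the contrast between the two reward schemes concrete, here is a minimal Python sketch. It is only an illustration, not the training setup itself: the function names, the exact-match answer check, and the boolean per-step labels are all assumptions introduced for this example.

```python
from typing import List


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome-based reward (hypothetical helper): a single score for the
    whole solution, 1.0 only if the final answer matches the reference."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process-based reward (hypothetical helper): one score per reasoning
    step, taken from per-step correctness labels (assumed to be given)."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


# Illustrative case: the second step is flawed, yet the final answer is right.
step_labels = [True, False, True]

print(outcome_reward("42", "42"))    # 1.0: the flawed step goes unnoticed
print(process_rewards(step_labels))  # [1.0, 0.0, 1.0]: the flawed step is penalized
```

In this toy case the outcome-based score gives full credit to a solution that contains a wrong step, while the per-step scores single that step out, which is the sense in which step-level rewards push the model toward reasoning that a human would sign off on.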