Improving mathematical reasoning with process supervision



We trained a model that achieves a new state of the art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to improving performance over outcome supervision, process supervision has an important alignment benefit: it directly trains the model to produce a chain of thought that humans endorse.
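
To make the distinction concrete, here is a minimal sketch of the two reward signals in Python. It is an illustration under stated assumptions, not the training code behind the result: the function names, the 0/1 reward values, and the hand-written step labels are all hypothetical.

```python
from typing import List, Tuple


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome-based signal: one reward, decided only by the final answer."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_rewards(labelled_steps: List[Tuple[str, bool]]) -> List[float]:
    """Process-based signal: one reward per reasoning step.

    Each step is paired with a correctness label. In this sketch the labels
    are supplied by hand (an assumption); in practice they would come from
    human reviewers or a model trained on their judgments.
    """
    return [1.0 if is_correct else 0.0 for _, is_correct in labelled_steps]


# Toy solution that reaches the right answer through a faulty middle step.
solution = [
    ("2 + 3 = 5", True),
    ("5 * 4 = 25", False),          # arithmetic error
    ("So the answer is 20", True),  # the right result, stated despite the error
]

print(outcome_reward("20", "20"))  # the final answer matches the reference
print(process_rewards(solution))   # the faulty step is singled out
```

In this toy case the outcome-based signal cannot tell the flawed solution from a sound one, while the per-step signal marks exactly where the reasoning went wrong.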


