We aim to improve mathematical problem solving by rewarding each correct step of reasoning (“process monitoring”) rather than only rewarding a correct final answer (“outcome monitoring”). We trained a model with this approach that achieves a new state of the art in mathematical problem solving. Beyond improving performance relative to outcome monitoring, process monitoring also has an important alignment benefit: it directly trains the model to generate chains of thought that humans approve of.
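The difference between the two reward signals can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the training setup itself: the function names `outcome_reward` and `process_rewards`, the exact-match answer check, and the hand-written set of approved steps (standing in for human step-level labels) are all hypothetical.

```python
from typing import Callable, List


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome-style signal: a single reward based only on the final answer."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_rewards(
    steps: List[str], step_is_correct: Callable[[str], bool]
) -> List[float]:
    """Process-style signal: one reward per reasoning step."""
    return [1.0 if step_is_correct(step) else 0.0 for step in steps]


# Toy solution to "2x + 3 = 11" whose last step contains an arithmetic slip.
steps = ["2x + 3 = 11", "2x = 8", "x = 5"]

# Hypothetical human labels: the set of steps a reviewer approved.
approved_steps = {"2x + 3 = 11", "2x = 8"}

outcome = outcome_reward(final_answer="5", reference_answer="4")
per_step = process_rewards(steps, lambda step: step in approved_steps)

print(outcome)   # 0.0
print(per_step)  # [1.0, 1.0, 0.0]
```

In this toy case the outcome signal only says that the solution failed, while the per-step signal credits the two valid steps and points at the step where the reasoning went wrong.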