
Improving mathematical reasoning with process supervision



We have trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
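To make the distinction concrete, here is a minimal sketch of the two reward signals on a toy problem. It is an illustration only, not the training setup used for the model: the `Step` class, the per-step `is_correct` labels, and both reward functions are hypothetical names introduced for this example, and the exact-match check on the final answer is an assumed simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step plus a hypothetical human correctness label."""
    text: str
    is_correct: bool


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome supervision (sketch): one reward, based only on the final answer."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_rewards(steps: List[Step]) -> List[float]:
    """Process supervision (sketch): one reward per reasoning step."""
    return [1.0 if step.is_correct else 0.0 for step in steps]


# Toy solution to "solve 2x + 6 = 10": the final answer (x = 2) is right,
# but the second step subtracts where it should divide.
steps = [
    Step("2x + 6 = 10, so 2x = 4", True),
    Step("2x = 4, so x = 4 - 2 = 2", False),
]

print(outcome_reward("2", "2"))  # 1.0
print(process_rewards(steps))    # [1.0, 0.0]
```

In this toy case the outcome signal gives full reward to a solution that contains a faulty step, while the per-step signal flags it. That is the sense in which step-level rewards favor a chain-of-thought that people would endorse.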

