PPO actor-critic loop: query q enters the Policy Model, which generates output o. The output is evaluated by a frozen Reference Model (KL penalty), a frozen Reward Model, and a trainable Value Model. The combined reward r and value estimate v feed into GAE to produce the advantage A. Dashed arcs show the actor and critic gradient paths back to their respective models.

Editable Mermaid source: images/ppo-loop.mermaid.md
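
The reward and advantage steps in the loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the figure: the function names `kl_penalized_rewards` and `gae`, the coefficient `beta`, and the defaults for `gamma` and `lam` are hypothetical. It also assumes the common convention of applying the KL penalty per token and adding the Reward Model score at the final token, and it treats the value after the last token as zero.

```python
def kl_penalized_rewards(score, logp_policy, logp_ref, beta=0.1):
    """Combine the Reward Model score with a per-token KL penalty.

    Assumption: the penalty is -beta * (log pi(o_t) - log pi_ref(o_t)) at
    every token, and the scalar score is added at the last token only.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += score
    return rewards


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one generated sequence.

    rewards[t] and values[t] are the combined reward r and the Value Model
    estimate v at token t. Returns (advantages, returns), where
    returns[t] = advantages[t] + values[t] is the critic's regression target.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # assumed: no value after the final token
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

In this sketch the advantages would drive the actor (Policy Model) update and the returns would serve as targets for the critic (Value Model) update, matching the two dashed gradient paths in the figure.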

Analytical Derivations

Further reading