Group Relative Policy Optimization (GRPO) - aegean.ai

GRPO training loop: query q enters the Policy Model, which samples G outputs o_1 to o_G; a frozen Reward Model scores each output into r_1 to r_G; Group Computation normalizes the scores into advantages A_1 to A_G; a frozen Reference Model provides a per-token KL penalty fed back to the Policy Model

Editable Mermaid source: images/grpo-loop.mermaid.md
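
The Group Computation and KL penalty steps in the figure can be sketched in a few lines of NumPy. This is a minimal illustration, not the page's own implementation. The function names, the `eps` stabilizer, and the choice of mean/standard-deviation normalization are assumptions. So is the per-token KL estimator (`exp(d) - d - 1` with `d = log pi_ref - log pi_theta`), which is the form commonly used with GRPO. Neither is specified in the diagram.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Normalize the G rewards of one query into advantages A_1..A_G.

    Assumption: normalization is (r - mean) / (std + eps) within the group;
    eps is a hypothetical stabilizer for groups where all rewards are equal.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def per_token_kl(logp_policy, logp_ref):
    """Per-token KL penalty between the policy and the frozen reference.

    Assumption: uses the non-negative estimator exp(d) - d - 1, where
    d = log pi_ref(token) - log pi_theta(token).
    """
    d = np.asarray(logp_ref, dtype=np.float64) - np.asarray(logp_policy, dtype=np.float64)
    return np.exp(d) - d - 1.0


# Example: G = 4 sampled outputs for one query, scored by the reward model.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Example: log-probabilities of three tokens of one output under each model.
kl = per_token_kl(logp_policy=[-1.2, -0.7, -2.0], logp_ref=[-1.0, -0.7, -2.5])
```

Outputs with above-average reward in their group receive positive advantages and below-average ones receive negative advantages, so no separate value network is needed to form the baseline in this sketch.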

