images/grpo-loop.mermaid.md
Policy-Value-based Control
Group Relative Policy Optimization
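GRPO removes PPO's learned value critic. For each prompt it samples a group of G outputs and computes each output's advantage by normalizing its reward against the rewards of the other outputs in the same group. With no critic to train or store, memory and compute drop while policy updates remain stable.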
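The group-normalized advantage can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the sample rewards are hypothetical, and it assumes normalization by the group's mean and population standard deviation (some implementations use the sample standard deviation instead).

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's G rewards against that group's own statistics.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation. `eps` is a hypothetical stabilizer that
    avoids division by zero when every output gets the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one sampled output")
    mean = fmean(rewards)
    std = pstdev(rewards, mu=mean)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for G = 4 outputs sampled from a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The group mean acts as the baseline that a critic would otherwise supply, so outputs that score above their group's average get positive advantages and those below get negative ones. When all G outputs receive the same reward, every advantage is zero and that prompt contributes no gradient signal.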

