GRPO (Group Relative Policy Optimization) removes PPO's learned value function (the critic). Instead, it samples G outputs per prompt and computes each output's advantage from rewards normalized within that group, which reduces memory and compute while retaining stable policy updates.
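As a rough illustration of the group normalization step, the sketch below turns the G scalar rewards for one prompt into advantages by subtracting the group mean and dividing by the group standard deviation. The function name `group_advantages`, the `eps` stabilizer, and the choice of population (rather than sample) standard deviation are assumptions made for this example, not details taken from the text above.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Return one advantage per sampled output for a single prompt.

    Hypothetical helper: normalizes rewards within the group, so no
    learned value function is needed to provide a baseline.
    """
    if not rewards:
        raise ValueError("need at least one reward per group")
    mean = fmean(rewards)   # group mean acts as the baseline
    std = pstdev(rewards)   # population std (an assumption of this sketch)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# Example: G = 4 outputs sampled for one prompt, scored 0 or 1
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

Outputs that score above the group mean get positive advantages and those below get negative ones. When all G rewards are identical, every advantage is zero, so that group contributes no learning signal.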
images/grpo-loop.mermaid.md