This lab is under construction. Track progress in AURA-654.
This lab fine-tunes DeepSeek-R1-Distill-Qwen-1.5B with Group Relative Policy Optimization (GRPO) on a compact mathematical reasoning dataset, reproducing the three experiments from the paper.
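GRPO does not use the learned value critic that PPO relies on. For each prompt it samples a group of completions, scores each one, and uses the completion's reward relative to the rest of the group as its advantage. The snippet below is a minimal sketch of that normalization step only. It assumes a simple binary correctness reward. The function name, the `eps` value, and the use of the population standard deviation are illustrative choices and are not taken from the open-rs code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize one prompt's group of rewards into GRPO-style advantages.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std here; implementations vary).
    `eps` guards against division by zero when all rewards are equal.
    This is a hypothetical helper for illustration, not the open-rs API.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for 4 sampled answers to one math prompt
    # (assumed binary reward: 1.0 = correct final answer, 0.0 = incorrect).
    rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(rewards)
    print([round(a, 3) for a in advantages])  # prints [1.0, -1.0, -1.0, 1.0]
```

If every answer in a group receives the same reward, all advantages are zero, so that prompt contributes no reward signal to that update step.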
Key results
| Benchmark | Baseline | After GRPO |
|---|---|---|
| AMC23 | 63% | 80% |
| AIME24 | — | 46.7% |
Resources
- open-rs repository
- arXiv paper
- Models: Open-RS1, Open-RS2, Open-RS3
- Datasets: open-s1, open-deepscaler, open-rs

