This lab is under construction. Track progress in AURA-654.
This lab is based on the AAAI 2026 paper Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t and the accompanying open-rs repository. You will fine-tune DeepSeek-R1-Distill-Qwen-1.5B using Group Relative Policy Optimization (GRPO) on a compact mathematical reasoning dataset, reproducing the three experiments from the paper.
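As a rough illustration of the idea behind GRPO, the sketch below computes group-relative advantages: several completions are sampled for the same prompt, each is scored, and each reward is normalized against the mean and standard deviation of its own group. This is a simplified, assumption-laden sketch and not the open-rs training code; the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the sample rewards are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math problem
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Completions that beat the group average get a positive advantage,
# the others a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason the method suits small models and modest GPU budgets. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.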

Key results

Benchmark   Baseline   After GRPO
AMC23       63%        80%
AIME24      -          46.7%
Training runs on 4× NVIDIA A40 GPUs (48 GB VRAM each) in under 24 hours at a cost of ~$42.

Resources