RL for Reasoning in Small LLMs - aegean.ai

This lab is under construction. Track progress in AURA-654.

This lab is based on the AAAI 2026 paper "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t" and the accompanying open-rs repository. You will fine-tune DeepSeek-R1-Distill-Qwen-1.5B with Group Relative Policy Optimization (GRPO) on a compact mathematical reasoning dataset, reproducing the paper's three experiments.
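To make the "group relative" part of GRPO concrete, here is a minimal sketch of the advantage computation: several completions are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions for illustration; the open-rs training code may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other completions for the same prompt.

    `rewards` holds one scalar reward per sampled completion (one group).
    `eps` is a hypothetical stabilizer that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    # Population std is an assumption; some implementations use the sample std.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 completions of one prompt (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline. When all completions in a group score the same, every advantage is zero and that prompt contributes no learning signal.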

Key results

Benchmark	Baseline	After GRPO
AMC23	63%	80%
AIME24	—	46.7%

Training runs on 4× NVIDIA A40 GPUs (48 GB VRAM each) and finishes in under 24 hours, at a cost of about $42.
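For budgeting your own run, the figures above can be turned into a rough per-GPU-hour rate. This is only a back-of-the-envelope sketch: it assumes the $42 covers GPU rental alone and that the run uses the full 24 hours. Because the run finishes in under 24 hours, the result is a lower bound on the implied hourly rate. The function names are hypothetical.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total GPU-hours consumed by a run."""
    return num_gpus * wall_clock_hours


def implied_rate(total_cost_usd, num_gpus, wall_clock_hours):
    """Average cost per GPU-hour, assuming the cost is GPU rental only."""
    return total_cost_usd / gpu_hours(num_gpus, wall_clock_hours)


budget_hours = gpu_hours(4, 24)   # upper bound on GPU-hours for this lab
rate = implied_rate(42, 4, 24)    # lower bound on USD per GPU-hour
print(budget_hours, rate)
```

To estimate your own cost, multiply `gpu_hours(...)` for your setup by your provider's actual hourly price rather than relying on the implied rate.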

Resources

open-rs repository
arXiv paper
Models: Open-RS1, Open-RS2, Open-RS3
Datasets: open-s1, open-deepscaler, open-rs
