This section covers emergent reasoning capabilities in large language models and vision-language models, and how to elicit and extend them through structured prompting and interaction patterns.

LLM Reasoning

Chain-of-Thought

A prompting technique that elicits intermediate, step-by-step reasoning from LLMs before they commit to a final answer.
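A minimal sketch of chain-of-thought prompting: a question is wrapped with a step-by-step instruction, and the final answer is parsed out of the resulting reasoning trace. The helper names, the `Answer:` convention, and the sample trace are illustrative assumptions, not a fixed API.

```python
import re

def make_cot_prompt(question: str) -> str:
    """Wrap a question with a chain-of-thought instruction."""
    return f"{question}\nLet's think step by step."

def extract_final_answer(trace: str) -> str:
    """Pull the final answer out of a step-by-step reasoning trace.

    Assumes the model ends its trace with a line like 'Answer: <value>';
    otherwise falls back to the last line of the trace.
    """
    match = re.search(r"Answer:\s*(.+)", trace)
    return match.group(1).strip() if match else trace.strip().splitlines()[-1]

# A trace a model might produce for a CoT prompt (hypothetical):
trace = (
    "There are 3 boxes with 4 apples each, so 3 * 4 = 12 apples.\n"
    "Giving away 5 leaves 12 - 5 = 7.\n"
    "Answer: 7"
)
print(extract_final_answer(trace))  # → 7
```

The key design point is that the instruction changes only the prompt, not the model: the same LLM produces the intermediate steps itself.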

Tool Use

Extending LLM capabilities with external tools such as calculators, search engines, and code interpreters.
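One common pattern for tool use is a dispatcher: the model emits a tool name plus an argument, and the runtime routes it to the matching function. The `calculator` tool and the `name: argument` wire format below are illustrative assumptions.

```python
import ast
import operator

def calculator(expression: str) -> str:
    """Safely evaluate a simple arithmetic expression (no eval)."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")

    return str(ev(ast.parse(expression, mode="eval").body))

# Registry of tools the model is allowed to call.
TOOLS = {"calculator": calculator}

def dispatch(tool_call: str) -> str:
    """Parse a 'tool_name: argument' string emitted by the model."""
    name, _, arg = tool_call.partition(":")
    return TOOLS[name.strip()](arg.strip())

print(dispatch("calculator: 12 * 7 + 3"))  # → 87
```

In a real system the tool's string output is appended back into the model's context so it can continue reasoning with the result.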

ReAct

Interleaving reasoning traces and actions, enabling LLMs to dynamically plan and interact with environments.
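The ReAct loop can be sketched as follows: the policy alternates Thought/Action steps, tool observations are fed back into the history, and a `Finish` action ends the episode. The scripted `policy` stands in for an LLM, and the `Lookup` tool and its toy knowledge base are illustrative assumptions.

```python
def lookup(entity: str) -> str:
    facts = {"Mount Everest": "height 8849 m"}  # toy knowledge base
    return facts.get(entity, "no result")

TOOLS = {"Lookup": lookup}

def policy(history: list) -> tuple:
    """Stub LLM: decide the next (thought, action, argument) from history."""
    if not history:
        return ("I should look up the mountain.", "Lookup", "Mount Everest")
    return ("The observation answers the question.", "Finish",
            history[-1][1])  # answer with the last observation

def react(question: str, max_steps: int = 5) -> str:
    history = []  # list of (action, observation) pairs
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        if action == "Finish":
            return arg
        observation = TOOLS[action](arg)   # act on the environment
        history.append((action, observation))
    return "no answer"

print(react("How tall is Mount Everest?"))  # → height 8849 m
```

The interleaving is the point: each observation can change the next thought, which a single up-front plan cannot do.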

Lab

RL for Reasoning in Small LLMs

AAAI 2026 — Fine-tune DeepSeek-R1-Distill-Qwen-1.5B with GRPO on a compact math dataset. AMC23 accuracy improves from 63% to 80%; AIME24 reaches 46.7%, surpassing o1-preview. Full training run costs ~$42 on 4× A40 GPUs.
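The group-relative advantage at the heart of GRPO can be sketched in a few lines: several completions are sampled per prompt, and each completion's advantage is its reward normalized against the group's mean and standard deviation. This is a simplification; real pipelines also add a KL penalty and clipped policy ratios.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each reward against its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Four completions for one math prompt, rewarded 1.0 if the answer verified:
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because advantages are computed within the group, no learned value model is needed, which is what keeps runs like the one above cheap.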

VLM Reasoning

VLM Reasoning

Visual chain-of-thought and grounded tool use for vision-language models.

Further reading

RL-for-reasoning training loop — the four canonical stages that every GRPO / RLHF / RLAIF pipeline implements: (1) Generation — Generators roll out a batch of sequences under policy πᵢ; (2) Scoring — Verifiers judge each completion and emit a reward; (3) Batching — scored sequences from the current policy are assembled into the step-i batch; (4) Weight update — Trainers consume the batch and produce the next policy πᵢ₊₁ = πᵢ + ∇J. Dotted lines mark policy changes between batches; faded horizontal tracks carry over from previous policies.
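The four-stage loop in the caption above can be sketched as a toy skeleton. The scalar "policy", the fixed rollout offsets, and the reward threshold are all illustrative stand-ins, not a real GRPO/RLHF implementation.

```python
def generate(policy: float, prompts: list) -> list:
    """Generation: Generators roll out completions under policy π_i."""
    return [(p, policy + d) for p in prompts for d in (-0.2, 0.2)]

def score(completions: list) -> list:
    """Scoring: Verifiers judge each completion and emit a reward."""
    return [(p, c, 1.0 if c > 0.1 else 0.0) for p, c in completions]

def make_batch(scored: list) -> list:
    """Batching: assemble scored sequences from the current policy."""
    return scored  # trivially a single batch here

def update(policy: float, batch: list, lr: float = 0.1) -> float:
    """Weight update: Trainers produce π_{i+1} = π_i + ∇J (toy gradient)."""
    grad = sum(r * (c - policy) for _, c, r in batch)
    return policy + lr * grad

policy = 0.0
for step in range(12):
    policy = update(policy, make_batch(score(generate(policy, ["q1", "q2"]))))
print(round(policy, 2))  # the policy drifts toward higher-reward rollouts
```

The structure, not the arithmetic, is the takeaway: each stage consumes the previous stage's output, and only the final stage changes the policy, which is why the stages can run on separate workers.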