Post training a VLM for reasoning with GRPO using TRL
Authored by: Sergio Paniego
🚨 WARNING: This notebook is resource-intensive and requires substantial computational power. If you’re running this in Colab, it will utilize an A100 GPU.
In this recipe, we’ll demonstrate how to post-train a Vision Language Model (VLM) with GRPO to add reasoning capabilities, using the Hugging Face ecosystem and specifically the Transformer Reinforcement Learning library (trl).
We’ll be fine-tuning Qwen3-VL-2B-Instruct using a subset of the lmms-lab/multimodal-open-r1-8k-verified dataset. This dataset pairs images with problem descriptions, along with the solution and the thinking trace that reaches it. We’ll leverage this data format, together with the GRPO reward functions, to teach the model how to reason its way to the solution.
1. Install Dependencies
Let’s start by installing the essential libraries we’ll need for fine-tuning.
The original recipe installs trl from source, since the VLM GRPO trainer hadn’t been included in an official release at the time of writing. In this container, trl, peft, and torch are already available, so we only need the extra dependencies (math_verify, qwen-vl-utils).
# Container already has trl, peft (>=0.19), torch. Extra deps (math_verify, qwen-vl-utils) from pyproject notebooks extras.
import sys
print(f"Python: {sys.version}")
Python: 3.11.13 | packaged by conda-forge | (main, Jun 4 2025, 14:48:23) [GCC 13.3.0]
Authenticate using your Hugging Face 🤗 account to save and share the trained model.
import os
from huggingface_hub import whoami
try:
    info = whoami()
    name = info.get("fullname") or info.get("name", "unknown")
    print(f"Logged in as: {name}")
except Exception:
    if "HF_TOKEN" in os.environ:
        print("HF_TOKEN found in environment")
    else:
        print("WARNING: HF_TOKEN not set")
Logged in as: Pantelis Monogioudis
2. Load Dataset 📁
We leverage lmms-lab/multimodal-open-r1-8k-verified for this recipe. This dataset contains 8k multimodal RL training examples focused on math reasoning. The data was created using GPT-4o and includes image, problem, solution, original question, and original answer for each sample. It was created in this project.
For our particular case where we want the model to learn to reason using images, we use image and problem as input and solution as output.
For this educational resource, we’ll only use 5% of the dataset and divide it into train and test sets to make it faster to train. In a real training, we’d use the full dataset.
We’ll load the dataset and divide it.
from datasets import load_dataset
dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'
dataset = load_dataset(dataset_id, split='train[:5%]')
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']
Let’s check the structure of the dataset.
print(train_dataset)
Dataset({
features: ['image', 'problem', 'solution', 'original_question', 'original_answer'],
num_rows: 307
})
Let’s check one sample:
print(train_dataset[0])
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=716x200 at 0x7617A24E5950>, 'problem': 'Based on the image, determine the constant term after combining all the polynomial expressions representing the side lengths of the triangle. Choose the correct answer from the options provided.\n\nChoices:\nA. 3\nB. 5\nC. 8\nD. 13', 'solution': "<think>Let's examine the polynomial expressions given for each side of the triangle. The side labeled \\(4x^2 + x\\) does not have a constant term. The side labeled \\(2x + 3\\) has a constant term of 3. The side labeled \\(4x^3 + 2x^2 + 5\\) has a constant term of 5. To find the total constant term, we need to add the constant terms from these expressions. So, we add 3 and 5 together. 3 + 5 = 8</think>\n\n<answer>The correct answer is C</answer>", 'original_question': 'According to the question shown in the image, please first perform reasoning, then finally select the right answer from the choices, e.g., Answer: xxx.\nQuestion: Based on the image, find the constant term after combining the side lengths.\nChoices:\nA. 3\nB. 5\nC. 8\nD. 13', 'original_answer': 'The constant terms from the sides $2 x + 3$ and $4 x^3 + 2 x^2 + 5$ are combined as $3 + 5 = 8$. So the answer is C\nAnswer: C'}
In addition to the problem and image columns, we also include a custom system prompt to tell the model how we’d like the generation.
The system prompt is extracted from DeepSeek R1. Refer to this previous recipe for more details.
We convert the dataset samples into conversation samples, including the system prompt and one image and problem description per sample, since this is how the GRPO trainer expects them.
We also set padding_side="left" so that, during batched generation, prompts are padded on the left and each completion begins immediately after its prompt tokens. This is essential for GRPO, which needs to compute per-token log-probabilities of the generated completions correctly.
from transformers import AutoProcessor
model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")
SYSTEM_PROMPT = (
"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
"first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
"process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
"<think> reasoning process here </think><answer> answer here </answer>"
)
def make_conversation(example):
    # Return a raw conversational prompt - TRL's GRPOTrainer applies the chat template internally
    return {
        "prompt": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": example["problem"]},
                ],
            },
        ],
        "image": example["image"],
    }
train_dataset = train_dataset.map(make_conversation)
Let’s take a look at a converted example:
print(train_dataset[0]['prompt'])
[{'content': [{'type': 'text', 'text': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>'}], 'role': 'system'}, {'content': [{'type': 'image'}, {'type': 'text', 'text': 'Based on the image, determine the constant term after combining all the polynomial expressions representing the side lengths of the triangle. Choose the correct answer from the options provided.\n\nChoices:\nA. 3\nB. 5\nC. 8\nD. 13'}], 'role': 'user'}]
We’ll remove the columns that we don’t need for training.
Dataset({
features: ['image', 'problem', 'solution', 'original_question', 'original_answer', 'prompt'],
num_rows: 307
})
Let’s drop them and check that the columns are gone.
train_dataset = train_dataset.remove_columns(['problem', 'original_question', 'original_answer'])
print(train_dataset)
Dataset({
features: ['image', 'solution', 'prompt'],
num_rows: 307
})
3. Post-Training the VLM Using GRPO
The main difference between PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization) is that GRPO removes the value model: instead of a learned baseline, it normalizes rewards within a group of completions sampled for the same prompt. For more detailed information on the key differences, you can refer to this further explanation.
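To make the group-relative idea concrete, here is a minimal illustrative sketch (not TRL's internal implementation) of how the rewards for a group of completions sampled for one prompt can be normalized into advantages, replacing the learned value baseline:

```python
import statistics

# Illustrative sketch (not TRL's internal code): GRPO samples G completions
# per prompt, scores each with the reward functions, and normalizes the
# rewards within the group - the group mean replaces the value model's
# learned baseline.
def group_relative_advantages(rewards):
    """Normalize a group of scalar rewards to zero mean and unit std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four completions for one prompt, e.g. scored by a binary format reward
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Completions that beat their group's average get a positive advantage and are reinforced; the rest are discouraged, with no critic network involved.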
To implement the training pipeline, we leverage trl, Hugging Face’s library for reinforcement learning, which provides a streamlined interface and built-in support for key training algorithms. In our case, we use the GRPOConfig and GRPOTrainer classes. A crucial step in this process is defining custom reward functions that guide the model’s behavior and help it align with our specific objectives.
But first, let’s load the model. In this case, we use Qwen/Qwen3-VL-2B-Instruct, a capable VLM developed by the Qwen team. For better results, consider models with a larger number of parameters.
There are other VLM projects with reasoning capabilities that are worth exploring as well.
3.1 Loading the Baseline Model
Let’s load the baseline model first. As previously introduced, we use Qwen/Qwen3-VL-2B-Instruct.
import torch
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=model_id,
    dtype=torch.bfloat16,  # `torch_dtype` was deprecated in favor of `dtype`
    device_map="auto",
)
3.2 Configuring LoRA
We’ll leverage LoRA for training the model, so let’s configure it.
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
trainable params: 1,605,632 || all params: 2,129,137,664 || trainable%: 0.0754
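As a quick sanity check on the numbers reported above, the trainable percentage follows directly from the two parameter counts:

```python
# Sanity check on print_trainable_parameters(): with LoRA rank 8 applied
# only to q_proj and v_proj, the trainable fraction of weights is tiny.
trainable, total = 1_605_632, 2_129_137_664  # numbers reported above
pct = 100 * trainable / total
print(f"trainable%: {pct:.4f}")  # 0.0754
```

Training well under 0.1% of the weights is what makes GRPO feasible on a single Colab GPU.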
3.3 Loading Reward Functions
For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between <think> </think> tags. You can find more details here. We can simply define and implement these reward functions as generic Python functions.
In this case, we will utilize the following reward functions, directly extracted from the Open R1 implementation:
- Format Enforcement: Ensures that the generation follows a specific format using
<think> </think> <answer> </answer> tags for reasoning.
import re
def _extract_text(c):
    """Extract text from a completion, which may be a string or a list of messages."""
    if isinstance(c, str):
        return c
    if isinstance(c, list):
        parts = []
        for m in c:
            content = m.get("content") if isinstance(m, dict) else m
            if isinstance(content, str):
                parts.append(content)
            elif isinstance(content, list):
                for el in content:
                    if isinstance(el, dict) and "text" in el:
                        parts.append(el["text"])
        return "".join(parts)
    return str(c)

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    texts = [_extract_text(c) for c in completions]
    matches = [re.match(pattern, t, re.DOTALL | re.MULTILINE) for t in texts]
    return [1.0 if m else 0.0 for m in matches]
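It’s worth sanity-checking the format reward on hand-written strings before training. This standalone snippet reuses the same regex as format_reward above and shows that only completions with the exact newline-delimited tag layout earn the reward:

```python
import re

# Same regex as format_reward: the tags must be separated by newlines,
# exactly as the DeepSeek-R1 system prompt format prescribes.
pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"

good = "<think>\nsome reasoning\n</think>\n<answer>\nC\n</answer>"
bad = "<think>reasoning</think><answer>C</answer>"  # tags present, newlines missing

scores = [1.0 if re.match(pattern, t, re.DOTALL | re.MULTILINE) else 0.0 for t in (good, bad)]
print(scores)  # [1.0, 0.0]
```

Note that the reward is strict: a completion with the right tags but without the surrounding newlines scores 0.0.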
- Solution Accuracy: Verifies whether the solution to the problem is correct, comparing it to the
solution column in the dataset.
from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig
from typing import Optional
def accuracy_reward(completions: list[list[dict[str, str]]], solution: list[str], **kwargs) -> list[Optional[float]]:
    """Reward function that checks if the completion matches the ground truth.
    - If both gold and prediction are parseable → use math verification.
    - If not parseable → compare as normalized text.
    """
    rewards = []
    for completion, sol in zip(completions, solution):
        completion = _extract_text(completion)  # handle messages format
        try:
            gold_parsed = parse(sol, extraction_mode="first_match")
        except Exception:
            gold_parsed = []
        if len(gold_parsed) != 0:
            # Try parsing the predicted answer too
            try:
                answer_parsed = parse(
                    completion,
                    extraction_config=[
                        LatexExtractionConfig(
                            normalization_config=NormalizationConfig(
                                nits=False,
                                malformed_operators=False,
                                basic_latex=True,
                                boxed="all",
                                units=True,
                            ),
                            boxed_match_priority=0,
                            try_extract_without_anchor=False,
                        )
                    ],
                    extraction_mode="first_match",
                )
                reward = float(verify(gold_parsed, answer_parsed))
            except Exception as e:
                print(f"verify failed: {e}, answer: {completion}, gold: {sol}")
                reward = None
        else:
            # Fall back to normalized text match
            reward = float(completion.strip().lower() == sol.strip().lower())
        rewards.append(reward)
    return rewards
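The fallback branch of accuracy_reward is easy to probe in isolation. Here is a minimal sketch of that normalized exact-match comparison (text_match_reward is a hypothetical helper for illustration, not part of TRL or Open R1):

```python
# Minimal illustration of the fallback path in accuracy_reward: when the
# gold solution is not parseable as LaTeX, the comparison degrades to a
# normalized exact-match, so the completion must reproduce the whole
# solution string (up to case and surrounding whitespace) to score 1.0.
def text_match_reward(completion: str, solution: str) -> float:
    return float(completion.strip().lower() == solution.strip().lower())

print(text_match_reward("  The correct answer is C ", "the correct answer is c"))  # 1.0
print(text_match_reward("C", "The correct answer is C"))  # 0.0
```

This is why parseable, math_verify-friendly solutions give a much denser reward signal than free-form text.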
3.4 Configuring GRPO Training Parameters
Next, let’s configure the training parameters for GRPO. We recommend experimenting with the max_completion_length, num_generations, and max_prompt_length parameters to find the combination that trains best within your memory budget.
The parameter selection has been adjusted to fit within the hardware limitations of a Google Colab session. To observe the full potential of reward improvements, especially in the second objective function, and to further improve the model’s reasoning capabilities in a real-world scenario, a more ambitious setup would be required. This would involve larger models, an increased number of generations, and a high-quality, diverse dataset.
from trl import GRPOConfig
# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="./Qwen3-VL-2B-Instruct-Thinking",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    per_device_train_batch_size=2,
    max_completion_length=1024,  # default: 256
    num_generations=2,  # default: 8
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    hub_model_id=os.environ.get("HF_ACCOUNT", "aegean-ai") + "/Qwen3-VL-2B-Instruct-Thinking",
    save_strategy="steps",
    save_steps=10,
)
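One constraint worth knowing: at the time of writing, GRPOTrainer requires the effective batch size to be divisible by num_generations, since each prompt’s group of generations must be scored together. A quick check for the configuration above (assuming a single GPU and no gradient accumulation):

```python
# A constraint GRPOTrainer enforces (at the time of writing): the effective
# batch size must be divisible by num_generations, because each prompt's
# group of completions has to land in the same batch to be scored together.
per_device_train_batch_size = 2  # matches the GRPOConfig above
num_devices = 1                  # assumption: single-GPU Colab session
gradient_accumulation_steps = 1  # GRPOConfig default
num_generations = 2              # matches the GRPOConfig above

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
assert effective_batch % num_generations == 0, "batch size must be divisible by num_generations"
print(f"{effective_batch // num_generations} unique prompt(s) per batch")  # 1
```

With these values, each batch holds 2 generations for a single prompt, which is the minimum viable group size for GRPO’s relative comparison.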
3.5 Training the Model 🏃
Now, let’s configure the trainer and start training the model!
In this case, we pass the two reward functions we previously defined to the trainer, along with the model, training arguments, and dataset.
The training procedure we’ll be reproducing follows the one in the Open-R1 project.
from trl import GRPOTrainer
trainer = GRPOTrainer(
    model=model,
    processing_class=processor,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
)
Time to train the model!
[307/307 1:48:37, Epoch 1/1]
| Step | Training Loss |
|---|---|
| 10 | 0.000000 |
| 20 | 0.000000 |
| 30 | 0.040757 |
| 40 | 0.000000 |
| 50 | 0.067859 |
| 60 | 0.000000 |
| 70 | 0.000000 |
| 80 | 0.000000 |
| 90 | 0.000000 |
| 100 | 0.000000 |
| 110 | 0.000000 |
| 120 | 0.000000 |
| 130 | 0.000000 |
| 140 | 0.000000 |
| 150 | 0.000000 |
| 160 | 0.015370 |
| 170 | 0.000000 |
| 180 | 0.000000 |
| 190 | 0.000000 |
| 200 | 0.000000 |
| 210 | 0.000000 |
| 220 | 0.052498 |
| 230 | 0.000000 |
| 240 | 0.000000 |
| 250 | 0.000000 |
| 260 | 0.013925 |
| 270 | 0.000000 |
| 280 | 0.000000 |
| 290 | 0.013467 |
| 300 | 0.000000 |
TrainOutput(global_step=307, training_loss=0.0066409140631119665, metrics={'train_runtime': 6547.7047, 'train_samples_per_second': 0.047, 'train_steps_per_second': 0.047, 'total_flos': 0.0, 'train_loss': 0.0066409140631119665})
We can review the training metrics directly in TensorBoard on the model page (https://huggingface.co/aegean-ai/Qwen3-VL-2B-Instruct-Thinking/tensorboard). While the loss curve might look a bit off (the GRPO loss hovering near zero early on is expected), the reward curves tell a clearer story: the model steadily increases the amount of reward it receives over time.
Now, let’s save the results in our account 💾
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)
CommitInfo(commit_url='https://huggingface.co/aegean-ai/Qwen3-VL-2B-Instruct-Thinking/commit/f8de9b07c8cebce3240a85660d6a45ee26265ecb', commit_message='End of training', commit_description='', oid='f8de9b07c8cebce3240a85660d6a45ee26265ecb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/aegean-ai/Qwen3-VL-2B-Instruct-Thinking', endpoint='https://huggingface.co', repo_type='model', repo_id='aegean-ai/Qwen3-VL-2B-Instruct-Thinking'), pr_revision=None, pr_num=None)
Now that we’ve trained our model, we can check its performance and evaluate it qualitatively.
If you plan to reload the model from the Hub, we recommend restarting your session to free the resources used for training; here we keep the session alive and reuse the in-memory trained model directly.
import os
trained_model_id = os.environ.get("HF_ACCOUNT", "aegean-ai") + "/Qwen3-VL-2B-Instruct-Thinking"
For that, we will use the test subset of our dataset. Let’s first get our trained model and its processor.
# Reuse the trained model directly (avoids any transformers<->peft reload issues)
from transformers import AutoProcessor
trained_model = trainer.model
trained_processor = processor
print("Using trainer model for inference")
Using trainer model for inference
We’ll write a helper function for generating responses: given a problem and an image, it returns the model’s response, which should include the reasoning trace and final answer.
import time
import torch
from qwen_vl_utils import process_vision_info
def generate_with_reasoning(problem, image):
    # Conversation to send to the model
    conversation = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": problem},
            ],
        },
    ]
    prompt = trained_processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=False,
    )
    # Process images using process_vision_info from qwen_vl_utils
    image_inputs, video_inputs = process_vision_info(conversation)
    inputs = trained_processor(  # use the trained processor consistently
        text=[prompt],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(trained_model.device)
    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_new_tokens=500)
    end_time = time.time()
    # Decode the full sequence (prompt + completion)
    generated_text = trained_processor.decode(output_ids[0], skip_special_tokens=True)
    # Inference time
    inference_duration = end_time - start_time
    # Number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens
    return generated_text, inference_duration, num_generated_tokens
Let’s check it!
generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(test_dataset[0]['problem'], test_dataset[0]['image'])
print(generated_text)
system
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>
user
Based on the image, determine the sine value of angle AOB if it measures 120 degrees. Choose the correct answer from the options provided.
Choices:
A. $\frac{\sqrt{3}}{2}$
B. $\frac{1}{2}$
C. $-\frac{\sqrt{3}}{2}$
D. $\sqrt{2}$
assistant
< 7.18933359196699362000515. A.993396 767880996935. 819965095. A A B18009923895865.298656 1933567237550.1933688859932039 95581633199 581837 36335196266. **369937319 5606.9977 60589889 20756633588517608. 15752800.5900001951968800015. 519909963927 8319919609689359978859733907975. I 1919607626 6190. . 972885835787 6677859597.2000979 9 20.75955622009196189078519969192920193678 778815968919997629 778.9779889589999925616953.8519825099371978669800538 219620919539095 89 196197568532000577 9292539385315053.7 399
Ideally, the answer would follow the constraints we added during training with the reward functions, producing something like <think>reasoning</think><answer>solution</answer>. In this run, however, the completion is degenerate and doesn’t follow the format, which suggests this short training run isn’t enough: a larger model, more generations per prompt, and more data would be needed. Let’s check the reference solution to see what a correct answer looks like.
test_dataset[0]['solution']
'<think>Let me think about this. The angle AOB is given as 120 degrees. To find the sine of this angle, I can use the unit circle or trigonometric identities. In the unit circle, the sine of an angle is the y-coordinate of the corresponding point. For 120 degrees, which is in the second quadrant, the reference angle is 180 - 120 = 60 degrees. The sine of 60 degrees is $\\frac{\\sqrt{3}}{2}$. Since sine is positive in the second quadrant, the sine of 120 degrees is also $\\frac{\\sqrt{3}}{2}$. Therefore, the correct answer is A.</think>\n\n<answer>A</answer>'
The reference solution shows the reasoning trace and final answer the model should learn to produce. Let’s also check the inference time and the number of generated tokens, to further characterize the model.
print(f"Inference time: {inference_duration:.2f} seconds")
print(f"Generated tokens: {num_generated_tokens}")
Inference time: 21.45 seconds
Generated tokens: 500
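From the two numbers above we can derive a rough decoding throughput, which is handy when comparing runs, hardware, or generation settings:

```python
# Rough decoding throughput derived from the run above; the completion hit
# the max_new_tokens=500 cap, so this is a lower bound on tokens the model
# would have produced.
inference_duration = 21.45   # seconds, reported above
num_generated_tokens = 500   # reported above
tokens_per_second = num_generated_tokens / inference_duration
print(f"{tokens_per_second:.1f} tokens/s")  # 23.3
```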
4. Continuing Your Learning Journey 🧑‍🎓
The learning journey does not stop here!
If you’re eager to discover more about GRPO, reasoning, or VLMs, we recommend the following materials: