Prerequisites
You should be comfortable with transformer architectures, attention mechanisms, and the basics of LLM fine-tuning (LoRA, DeepSpeed). Familiarity with CLIP and contrastive learning is helpful but not required.
All code in this tutorial runs on a single GPU with ≥24 GB VRAM (e.g. an RTX 4090 or A5000) when using 4-bit quantization. Full-precision training of the 13B model requires an 8×A100 node.
The VLM Landscape
Before dissecting LLaVA, it helps to see where it sits among competing approaches. BLIP-2 uses a learned Q-Former with dozens of query tokens to distill visual features before handing them to the LLM. Flamingo uses a Perceiver Resampler to compress visual tokens. LLaVA takes the most direct route: it projects the full set of vision encoder patch embeddings into the LLM’s token embedding space using a simple trainable layer, then lets the LLM attend over both visual and text tokens natively. This simplicity is a feature, not a limitation — it avoids information bottlenecks and lets the LLM decide how to weight visual versus linguistic evidence.
LLaVA Architecture
The architecture has three components: a vision encoder, a projection module, and a language model. The vision encoder and LLM are initialized from powerful pre-trained checkpoints, and the projector is the bridge trained to translate between their representation spaces.
Vision Encoder — CLIP ViT
LLaVA uses OpenAI’s CLIP ViT-L/14 as its visual backbone. CLIP was trained via contrastive learning on 400M image-text pairs, so its patch embeddings already carry rich semantic information aligned (at a coarse level) with natural language. The original LLaVA uses ViT-L/14 at 224×224 resolution, producing a 16×16 grid of patch tokens (256 tokens, each 1024-dim). LLaVA-1.5 upgrades to the 336×336 variant, yielding a 24×24 grid (576 tokens) — a 2.25× increase in visual tokens that meaningfully improves fine-grained understanding.
The vision encoder is kept frozen throughout both training stages. Its pre-trained representations are treated as a fixed visual vocabulary.
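You can verify the token arithmetic directly. The snippet below is a minimal sketch that loads the CLIP vision tower from the Hugging Face Hub (assuming the openai/clip-vit-large-patch14-336 checkpoint) and inspects the output shape; LLaVA itself takes the penultimate layer's patch tokens and discards the [CLS] token.

```python
import torch
from transformers import CLIPVisionModel

# Load the 336px CLIP ViT-L/14 vision tower (the LLaVA-1.5 backbone).
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
pixel_values = torch.randn(1, 3, 336, 336)   # dummy 336x336 RGB image

with torch.no_grad():
    out = vision_tower(pixel_values, output_hidden_states=True)

# 577 tokens = 1 [CLS] token + a 24x24 grid of 576 patch tokens, each 1024-dim.
print(out.last_hidden_state.shape)           # torch.Size([1, 577, 1024])

# LLaVA selects the penultimate layer's hidden states and drops [CLS],
# leaving the 576 patch embeddings that are handed to the projector.
patch_features = out.hidden_states[-2][:, 1:, :]
print(patch_features.shape)                  # torch.Size([1, 576, 1024])
```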
Projection Module
This is where the key innovation lies. The original LLaVA uses a single linear projection that maps each CLIP patch embedding from the vision dimension (1024) into the LLM’s embedding dimension (4096 for Vicuna-7B, 5120 for Vicuna-13B). LLaVA-1.5 replaces this with a two-layer MLP with GELU activation, providing a non-linear transformation that significantly improves cross-modal alignment; both variants are sketched below. This seemingly small change — from a linear layer to a two-layer MLP — produced substantial benchmark improvements. The insight echoes findings from self-supervised learning (SimCLR, BYOL), where MLP projection heads consistently outperform linear ones for representation alignment.
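To make the difference concrete, here is a minimal PyTorch sketch of both projector variants. The dimensions follow the CLIP ViT-L patch size (1024) and Vicuna-7B's hidden size (4096); the module names are illustrative, not the ones used in the official codebase.

```python
import torch.nn as nn

VISION_DIM = 1024   # CLIP ViT-L/14 patch embedding size
LLM_DIM = 4096      # Vicuna-7B hidden size (5120 for Vicuna-13B)

# Original LLaVA: a single trainable linear layer.
linear_projector = nn.Linear(VISION_DIM, LLM_DIM)

# LLaVA-1.5: two-layer MLP with GELU, still only a few million parameters.
mlp_projector = nn.Sequential(
    nn.Linear(VISION_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

# Either projector maps (batch, 576, 1024) patch features to (batch, 576, 4096)
# pseudo-word embeddings that are spliced into the LLM's input sequence.
```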
Language Model — Vicuna / LLaMA
The LLM consumes a mixed sequence of visual tokens interleaved with text token embeddings. From the LLM’s perspective, the projected visual tokens are indistinguishable from word embeddings — they share the same dimensionality and occupy positions in the same sequence. The original LLaVA uses Vicuna (a chat-tuned LLaMA). LLaVA-1.5 scales to Vicuna-13B. Later variants swap in LLaMA-3, Qwen-2, and other stronger base models, showing the architecture’s modularity.
Two-Stage Training
LLaVA’s training is efficient and conceptually clean: Stage 1 learns the projector in isolation, and Stage 2 fine-tunes the full model for instruction following.
Stage 1 — Feature Alignment Pre-training
Goal: Teach the MLP projector to translate CLIP patch embeddings into the LLM’s representation space.
Data: 558K image-caption pairs (LCS-558K, a subset filtered from LAION, CC, and SBU; the original LLaVA used a 595K subset of CC3M instead). Each sample is a simple (image, caption) pair — no complex instructions yet.
What is trained: Only the MLP projector. Both the CLIP encoder and the LLM remain frozen; a short sketch of this freezing setup follows below.
Intuition: This is a “language grounding” stage. The projector learns that a certain
pattern of CLIP activations (e.g., a dog on grass) should map to token embeddings
that the LLM would associate with text like “a golden retriever sitting on a lawn.”
Cost: Approximately 3.5 hours for the 7B model or 5.5 hours for 13B on a single
8×A100 node.
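As a concrete illustration of what "only the projector is trained" means, here is a hedged PyTorch sketch with stand-in modules; the real code paths differ, but the freezing pattern is the same.

```python
import torch
import torch.nn as nn

# Stand-in modules, not the official classes: the point is the freezing pattern.
vision_tower = nn.Linear(3 * 14 * 14, 1024)      # pretend CLIP ViT
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))
llm = nn.Linear(4096, 32000)                     # pretend Vicuna stack + LM head

for p in vision_tower.parameters():
    p.requires_grad_(False)                      # CLIP is frozen in both stages
for p in llm.parameters():
    p.requires_grad_(False)                      # the LLM is frozen only in stage 1

def trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print(trainable(vision_tower), trainable(llm), trainable(projector))
# -> 0 0 <projector parameter count>: only the projector receives gradient updates.

# The optimizer only sees projector parameters; the loss is the usual next-token
# cross-entropy on the caption, with projected patch embeddings prepended to the
# caption's token embeddings.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```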
Stage 2 — Visual Instruction Tuning
Goal: Turn the system into a conversational visual assistant that follows multi-turn instructions.
Data (LLaVA): 158K GPT-4-generated visual instruction-following samples built from COCO images — 58K conversations, 23K detailed descriptions, 77K complex reasoning chains.
Data (LLaVA-1.5): A richer 665K mixture that adds academic VQA datasets (VQAv2, GQA, OKVQA, A-OKVQA, OCR-VQA, TextCaps), ShareGPT conversations, and region-level perception data.
What is trained: The MLP projector and the full LLM are fine-tuned jointly. The CLIP encoder stays frozen.
Cost: Approximately 10 hours for 7B or 20 hours for 13B on 8×A100.
Instruction Data Generation
A distinctive contribution of the original LLaVA paper is the method for generating visual instruction data. Since GPT-4 (at the time) could not process images, the authors fed it COCO image captions and bounding box annotations as text, then prompted it to generate three types of instruction-following conversations:
- Multi-turn conversation — A user asks about the image; the assistant responds naturally, and the conversation continues.
- Detailed description — A thorough account of the image content.
- Complex reasoning — Questions that require inference beyond what is literally visible (e.g., “Why might the person be wearing a raincoat?”).
Key Improvements in LLaVA-1.5
LLaVA-1.5 achieved state-of-the-art results on 11 of 12 benchmarks with minimal architectural changes. The gains came from three targeted modifications:
Higher Resolution
Switching from CLIP ViT-L/14 at 224px to the 336px variant increases the visual token count from 256 to 576. This is critical for tasks requiring fine-grained perception: reading small text, counting objects, understanding spatial relationships.
MLP Projector
As discussed, replacing the linear layer with a two-layer MLP with GELU activation yielded consistent improvements across all benchmarks. The learning rate for pre-training was halved (from 2e-3 to 1e-3) to stabilize MLP training.
Academic VQA Data
Adding task-oriented VQA datasets (with response format prompts like “Answer with the option’s letter from the given choices directly”) taught the model to produce both short factual answers and long conversational responses — a capability the original LLaVA lacked.
Scaling to High Resolution — LLaVA-1.5-HD
A fixed 336×336 input discards detail in high-resolution images. LLaVA-1.5-HD introduces AnyRes (Any Resolution) processing (a simplified sketch follows the list):
- Slice the input image into a grid of patches, each 336×336 (matching the encoder’s native input).
- Encode each patch independently through the frozen CLIP ViT.
- Encode a downsampled version of the full image for global context.
- Concatenate all local patch features with the global features.
- Project the combined features through the MLP into the LLM.
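Here is a simplified sketch of that tiling logic, assuming a fixed 2×2 grid for illustration; the real AnyRes implementation selects the best grid from a set of candidate resolutions and pads rather than stretching.

```python
import torch
from PIL import Image
from torchvision.transforms import functional as TF

TILE = 336  # native input size of CLIP ViT-L/14-336

def anyres_tiles(image: Image.Image, grid=(2, 2)) -> torch.Tensor:
    """Split an image into 336x336 tiles plus one downsampled global view.

    Simplified illustration: the official AnyRes picks the grid that best matches
    the image's aspect ratio and pads instead of stretching.
    """
    rows, cols = grid
    resized = image.resize((cols * TILE, rows * TILE))       # stretch onto the tile grid
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
            tiles.append(TF.to_tensor(resized.crop(box)))
    global_view = TF.to_tensor(image.resize((TILE, TILE)))   # low-res view of the whole image
    return torch.stack(tiles + [global_view])                # (rows*cols + 1, 3, 336, 336)

# Each crop is encoded independently by the frozen CLIP ViT (576 tokens apiece),
# the features are concatenated, and the result goes through the same MLP projector.
```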
The LLaVA-NeXT and OneVision Lineage
After LLaVA-1.5, the project evolved rapidly.
LLaVA-NeXT (January 2024) scaled AnyRes to 4× more pixels and swapped in stronger LLMs (LLaMA-3-8B, Qwen-1.5-72B/110B), yielding models that outperformed Gemini Pro on some benchmarks.
LLaVA-NeXT-Video (May 2024) showed that image-only-trained LLaVA-NeXT transferred surprisingly well to video understanding via zero-shot modality transfer, and added DPO training with AI feedback on videos.
LLaVA-OneVision (August 2024) unified single-image, multi-image, and video understanding in one model. It uses SigLIP-SO400M as the vision encoder, Qwen2 as the LLM, and a multi-stage training curriculum:
- Stage 1: LCS-558K, projector only
- Stage 1.5: 4.7M high-quality synthetic data, full model
- Stage 2 (Single-Image): 3.6M instruction-following samples
- Stage 3 (OneVision): 1.6M single-image + multi-image + video samples
Hands-On: Running LLaVA-1.5 Inference
Let’s load LLaVA-1.5-7B and run visual question answering on a sample image.
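The snippet below is a minimal sketch using the llava-hf/llava-1.5-7b-hf checkpoint on the Hugging Face Hub and a COCO sample image as a placeholder; the 4-bit quantization config (via bitsandbytes) is what keeps the model within a single 24 GB GPU.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"

# 4-bit quantization keeps the 7B model comfortably within ~24 GB of VRAM.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

# Example image; swap in your own.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The llava-hf 1.5 checkpoints expect the Vicuna-style chat format below.
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```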
Hands-On: Running LLaVA-OneVision
For the latest multi-modal capabilities, use LLaVA-OneVision through the Hugging Face transformers library:
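A minimal sketch, assuming the llava-hf/llava-onevision-qwen2-7b-ov-hf checkpoint and a recent transformers release that includes the LLaVA-OneVision model class; the bundled chat template places the image token for you.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# OneVision checkpoints ship a chat template; apply_chat_template builds the prompt
# with the image placeholder in the correct position.
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image and count the animals."}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```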
Fine-Tuning LLaVA with LoRA
For domain-specific tasks (medical imaging, remote sensing, industrial inspection), you can fine-tune LLaVA efficiently using LoRA — learning only low-rank updates to the LLM’s attention weights while keeping everything else frozen.
Preparing Your Dataset
LLaVA expects instruction-following data in a specific JSON format, where <image> is a placeholder token replaced by projected visual features at runtime.
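An illustrative record in that format is shown below; the paths, ids, and conversation text are made up for the example, but the keys (id, image, conversations, from, value) follow the structure the LLaVA training code expects.

```python
import json

# One illustrative training record in LLaVA's conversation format.
sample = {
    "id": "example-0001",
    "image": "images/defect_0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nIs there a visible defect on this part?"},
        {"from": "gpt", "value": "Yes, there is a hairline crack near the upper-left mounting hole."},
        {"from": "human", "value": "How severe is it?"},
        {"from": "gpt", "value": "It appears superficial, but it should be flagged for manual inspection."},
    ],
}

# The training file is a JSON list of such records.
with open("my_dataset.json", "w") as f:
    json.dump([sample], f, indent=2)
```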
LoRA Configuration
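The exact hyperparameters live in the official training scripts; the sketch below shows a representative configuration expressed with the peft library's LoraConfig, with values that are a reasonable starting point rather than the official recipe.

```python
from peft import LoraConfig

# Representative LoRA settings for the language model's attention projections.
lora_config = LoraConfig(
    r=128,                 # rank of the low-rank update matrices
    lora_alpha=256,        # scaling factor (alpha / r scales the update)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Target the attention projections of the LLaMA/Vicuna backbone.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

Targeting only the attention projections keeps the trainable parameters to a small fraction of the 7B backbone; if the model underfits your domain, raise r (and lora_alpha proportionally) or extend target_modules to the MLP projections.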
Training Script
The official LLaVA repository (linked in the References) provides DeepSpeed-compatible training scripts for both full fine-tuning and LoRA; point them at the JSON dataset and LoRA settings described above.
Merging and Deploying
After training, merge the LoRA weights back into the base model for deployment:
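If your fine-tuning run produced a peft adapter, the generic merge looks like the sketch below; the paths and names are placeholders, and the official repository also provides its own merging utility for checkpoints produced by its scripts.

```python
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

# Placeholder paths; point these at your base checkpoint and trained adapter.
base_id = "llava-hf/llava-1.5-7b-hf"
adapter_dir = "./checkpoints/llava-lora-mydomain"

base = LlavaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)

merged = model.merge_and_unload()        # fold the low-rank updates into the base weights
merged.save_pretrained("./llava-merged-mydomain")
```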
Architecture Comparison Table
| Feature | LLaVA (v1) | LLaVA-1.5 | LLaVA-OneVision | LLaVA-OV-1.5 |
|---|---|---|---|---|
| Vision Encoder | CLIP ViT-L/14 224px | CLIP ViT-L/14 336px | SigLIP-SO400M | MLCD ViT + 2D RoPE |
| Projector | Linear | 2-layer MLP (GELU) | 2-layer MLP | 2-layer MLP + pooling |
| LLM | Vicuna-7B/13B | Vicuna-7B/13B | Qwen2 0.5B/7B/72B | Qwen2.5/3 4B/8B |
| Resolution | 224×224 | 336×336 | Up to 2304×2304 | Native resolution |
| Visual Tokens | 256 | 576 | 729 per crop | Dynamic |
| Pre-train Data | CC3M 595K | LCS-558K | LCS-558K + 4.7M | 558K + 85M |
| Instruct Data | 158K | 665K | 3.2M + 1.6M | Multi-stage |
| Modalities | Image | Image | Image + Multi-image + Video | Image + Multi-image + Video |
| Training Cost | ~1 day, 8×A100 | ~1 day, 8×A100 | 256×A100 | ~$16K on A100 |
Key Takeaways
The LLaVA family demonstrates several principles that extend beyond this specific model:
- Simplicity wins. A direct projection from vision features to LLM token space outperforms more complex cross-modal fusion mechanisms (Q-Former, Perceiver) when combined with high-quality instruction data.
- Data quality over architecture. The jump from LLaVA to LLaVA-1.5 came primarily from better data mixing (academic VQA tasks, response format prompts) rather than fundamental architectural changes.
- Modularity enables rapid iteration. Because the vision encoder, projector, and LLM are cleanly separated, each can be upgraded independently. Swapping Vicuna for LLaMA-3 or Qwen2 requires no architectural changes.
- Visual instruction tuning is more impactful than large-scale pre-training. The LLaVA-1.5 paper showed that a model using only 1.2M publicly available data samples and completing training in about one day on a single 8×A100 node could match or exceed models trained on orders of magnitude more data.
References
- H. Liu, C. Li, Q. Wu, Y.J. Lee. Visual Instruction Tuning. NeurIPS 2023 (Oral). arXiv:2304.08485
- H. Liu, C. Li, Y. Li, Y.J. Lee. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744
- B. Li et al. LLaVA-OneVision: Easy Visual Task Transfer. TMLR 2025. arXiv:2408.03326
- X. An et al. LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. arXiv:2509.23661
- A. Radford et al. Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021.
- LLaVA Project Page: https://llava-vl.github.io/
- LLaVA GitHub: https://github.com/haotian-liu/LLaVA
- LLaVA-NeXT GitHub: https://github.com/LLaVA-VL/LLaVA-NeXT

