Prerequisites
You should be comfortable with transformer architectures, attention mechanisms, and the basics of LLM fine-tuning (LoRA, DeepSpeed). Familiarity with CLIP and contrastive learning is helpful but not required.
All code in this tutorial runs on a single GPU with ≥24 GB VRAM (e.g. an RTX 4090 or A5000) when using 4-bit quantization. Full-precision training of the 13B model requires an 8×A100 node.
The VLM Landscape
Before dissecting LLaVA, it helps to see where it sits among competing approaches. BLIP-2 uses a learned Q-Former with dozens of query tokens to distill visual features before handing them to the LLM. Flamingo uses a Perceiver Resampler to compress visual tokens. LLaVA takes the most direct route: it projects the full set of vision encoder patch embeddings into the LLM’s token embedding space using a simple trainable layer, then lets the LLM attend over both visual and text tokens natively. This simplicity is a feature, not a limitation — it avoids information bottlenecks and lets the LLM decide how to weight visual versus linguistic evidence.
LLaVA Architecture
The architecture has three components: a vision encoder, a projection module, and a language model. The vision encoder and LLM are initialized from powerful pre-trained checkpoints, and the projector is the bridge trained to translate between their representation spaces.
Vision Encoder — CLIP ViT
LLaVA uses OpenAI’s CLIP ViT-L/14 as its visual backbone. CLIP was trained via contrastive learning on 400M image-text pairs, so its patch embeddings already carry rich semantic information aligned (at a coarse level) with natural language. The original LLaVA uses ViT-L/14 at 224×224 resolution, producing a 16×16 grid of patch tokens (256 tokens, each 1024-dim). LLaVA-1.5 upgrades to the 336×336 variant, yielding a 24×24 grid (576 tokens) — a 2.25× increase in visual tokens that meaningfully improves fine-grained understanding.
The vision encoder is kept frozen throughout both training stages. Its pre-trained representations are treated as a fixed visual vocabulary.
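You can verify the token arithmetic directly. The snippet below is a minimal sketch that loads the CLIP vision tower from the Hugging Face Hub (assuming the openai/clip-vit-large-patch14-336 checkpoint) and inspects the output shape; LLaVA itself takes the penultimate layer's patch tokens and discards the [CLS] token.

```python
import torch
from transformers import CLIPVisionModel

# Load the 336px CLIP ViT-L/14 vision tower (the LLaVA-1.5 backbone).
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
pixel_values = torch.randn(1, 3, 336, 336)   # dummy 336x336 RGB image

with torch.no_grad():
    out = vision_tower(pixel_values, output_hidden_states=True)

# 577 tokens = 1 [CLS] token + a 24x24 grid of 576 patch tokens, each 1024-dim.
print(out.last_hidden_state.shape)           # torch.Size([1, 577, 1024])

# LLaVA selects the penultimate layer's hidden states and drops [CLS],
# leaving the 576 patch embeddings that are handed to the projector.
patch_features = out.hidden_states[-2][:, 1:, :]
print(patch_features.shape)                  # torch.Size([1, 576, 1024])
```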
Projection Module
This is where the key innovation lies. The original LLaVA uses a single linear projection that maps each CLIP patch embedding from the vision dimension (1024) into the LLM’s embedding dimension (4096 for Vicuna-7B, 5120 for Vicuna-13B). LLaVA-1.5 replaces this with a two-layer MLP with GELU activation, providing a non-linear transformation that significantly improves cross-modal alignment; both variants are sketched below. This seemingly small change — from a linear layer to a two-layer MLP — produced substantial benchmark improvements. The insight echoes findings from self-supervised learning (SimCLR, BYOL), where MLP projection heads consistently outperform linear ones for representation alignment.
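To make the difference concrete, here is a minimal PyTorch sketch of both projector variants. The dimensions follow the CLIP ViT-L patch size (1024) and Vicuna-7B's hidden size (4096); the module names are illustrative, not the ones used in the official codebase.

```python
import torch.nn as nn

VISION_DIM = 1024   # CLIP ViT-L/14 patch embedding size
LLM_DIM = 4096      # Vicuna-7B hidden size (5120 for Vicuna-13B)

# Original LLaVA: a single trainable linear layer.
linear_projector = nn.Linear(VISION_DIM, LLM_DIM)

# LLaVA-1.5: two-layer MLP with GELU, still only a few million parameters.
mlp_projector = nn.Sequential(
    nn.Linear(VISION_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

# Either projector maps (batch, 576, 1024) patch features to (batch, 576, 4096)
# pseudo-word embeddings that are spliced into the LLM's input sequence.
```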
Language Model — Vicuna / LLaMA
The LLM consumes a mixed sequence of visual tokens interleaved with text token embeddings. From the LLM’s perspective, the projected visual tokens are indistinguishable from word embeddings — they share the same dimensionality and occupy positions in the same sequence. The original LLaVA uses Vicuna (a chat-tuned LLaMA). LLaVA-1.5 scales to Vicuna-13B. Later variants swap in LLaMA-3, Qwen-2, and other stronger base models, showing the architecture’s modularity.
Two-Stage Training
LLaVA’s training is efficient and conceptually clean: Stage 1 learns the projector in isolation, and Stage 2 fine-tunes the full model for instruction following.
Stage 1 — Feature Alignment Pre-training
Goal: Teach the MLP projector to translate CLIP patch embeddings into the LLM’s representation space.
Data: 558K image-caption pairs (LCS-558K, a subset filtered from LAION, CC, and SBU; the original LLaVA used a 595K subset of CC3M instead). Each sample is a simple (image, caption) pair — no complex instructions yet.
What is trained: Only the MLP projector. Both the CLIP encoder and the LLM remain frozen; a short sketch of this freezing setup follows below.
Intuition: This is a “language grounding” stage. The projector learns that a certain
pattern of CLIP activations (e.g., a dog on grass) should map to token embeddings
that the LLM would associate with text like “a golden retriever sitting on a lawn.”
Cost: Approximately 3.5 hours for the 7B model or 5.5 hours for 13B on a single
8×A100 node.
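As a concrete illustration of what "only the projector is trained" means, here is a hedged PyTorch sketch with stand-in modules; the real code paths differ, but the freezing pattern is the same.

```python
import torch
import torch.nn as nn

# Stand-in modules, not the official classes: the point is the freezing pattern.
vision_tower = nn.Linear(3 * 14 * 14, 1024)      # pretend CLIP ViT
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))
llm = nn.Linear(4096, 32000)                     # pretend Vicuna stack + LM head

for p in vision_tower.parameters():
    p.requires_grad_(False)                      # CLIP is frozen in both stages
for p in llm.parameters():
    p.requires_grad_(False)                      # the LLM is frozen only in stage 1

def trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print(trainable(vision_tower), trainable(llm), trainable(projector))
# -> 0 0 <projector parameter count>: only the projector receives gradient updates.

# The optimizer only sees projector parameters; the loss is the usual next-token
# cross-entropy on the caption, with projected patch embeddings prepended to the
# caption's token embeddings.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```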
Stage 2 — Visual Instruction Tuning
Goal: Turn the system into a conversational visual assistant that follows multi-turn instructions.
Data (LLaVA): 158K GPT-4-generated visual instruction-following samples built from COCO images — 58K conversations, 23K detailed descriptions, 77K complex reasoning chains.
Data (LLaVA-1.5): A richer 665K mixture that adds academic VQA datasets (VQAv2, GQA, OKVQA, A-OKVQA, OCR-VQA, TextCaps), ShareGPT conversations, and region-level perception data.
What is trained: The MLP projector and the full LLM are fine-tuned jointly. The CLIP encoder stays frozen.
Cost: Approximately 10 hours for 7B or 20 hours for 13B on 8×A100.
Instruction Data Generation
A distinctive contribution of the original LLaVA paper is the method for generating visual instruction data. Since GPT-4 (at the time) could not process images, the authors fed it COCO image captions and bounding box annotations as text, then prompted it to generate three types of instruction-following conversations:
- Multi-turn conversation — A user asks about the image; the assistant responds naturally, and the conversation continues.
- Detailed description — A thorough account of the image content.
- Complex reasoning — Questions that require inference beyond what is literally visible (e.g., “Why might the person be wearing a raincoat?”).
Key Improvements in LLaVA-1.5
LLaVA-1.5 achieved state-of-the-art results on 11 of 12 benchmarks with minimal architectural changes. The gains came from three targeted modifications:
Higher Resolution
Switching from CLIP ViT-L/14 at 224px to the 336px variant increases the visual token count from 256 to 576. This is critical for tasks requiring fine-grained perception: reading small text, counting objects, understanding spatial relationships.
MLP Projector
As discussed, replacing the linear layer with a two-layer MLP with GELU activation yielded consistent improvements across all benchmarks. The learning rate for pre-training was halved (from 2e-3 to 1e-3) to stabilize MLP training.
Academic VQA Data
Adding task-oriented VQA datasets (with response format prompts like “Answer with the option’s letter from the given choices directly”) taught the model to produce both short factual answers and long conversational responses — a capability the original LLaVA lacked.
Scaling to High Resolution — LLaVA-1.5-HD
A fixed 336×336 input discards detail in high-resolution images. LLaVA-1.5-HD introduces AnyRes (Any Resolution) processing (a simplified sketch follows the list):
- Slice the input image into a grid of patches, each 336×336 (matching the encoder’s native input).
- Encode each patch independently through the frozen CLIP ViT.
- Encode a downsampled version of the full image for global context.
- Concatenate all local patch features with the global features.
- Project the combined features through the MLP into the LLM.
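Here is a simplified sketch of that tiling logic, assuming a fixed 2×2 grid for illustration; the real AnyRes implementation selects the best grid from a set of candidate resolutions and pads rather than stretching.

```python
import torch
from PIL import Image
from torchvision.transforms import functional as TF

TILE = 336  # native input size of CLIP ViT-L/14-336

def anyres_tiles(image: Image.Image, grid=(2, 2)) -> torch.Tensor:
    """Split an image into 336x336 tiles plus one downsampled global view.

    Simplified illustration: the official AnyRes picks the grid that best matches
    the image's aspect ratio and pads instead of stretching.
    """
    rows, cols = grid
    resized = image.resize((cols * TILE, rows * TILE))       # stretch onto the tile grid
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
            tiles.append(TF.to_tensor(resized.crop(box)))
    global_view = TF.to_tensor(image.resize((TILE, TILE)))   # low-res view of the whole image
    return torch.stack(tiles + [global_view])                # (rows*cols + 1, 3, 336, 336)

# Each crop is encoded independently by the frozen CLIP ViT (576 tokens apiece),
# the features are concatenated, and the result goes through the same MLP projector.
```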
The LLaVA-NeXT and OneVision Lineage
After LLaVA-1.5, the project evolved rapidly.
LLaVA-NeXT (January 2024) scaled AnyRes to 4× more pixels and swapped in stronger LLMs (LLaMA-3-8B, Qwen-1.5-72B/110B), yielding models that outperformed Gemini Pro on some benchmarks.
LLaVA-NeXT-Video (May 2024) showed that image-only-trained LLaVA-NeXT transferred surprisingly well to video understanding via zero-shot modality transfer, and added DPO training with AI feedback on videos.
LLaVA-OneVision (August 2024) unified single-image, multi-image, and video understanding in one model. It uses SigLIP-SO400M as the vision encoder, Qwen2 as the LLM, and a multi-stage training curriculum:
- Stage 1: LCS-558K, projector only
- Stage 1.5: 4.7M high-quality synthetic data, full model
- Stage 2 (Single-Image): 3.6M instruction-following samples
- Stage 3 (OneVision): 1.6M single-image + multi-image + video samples
Hands-On: Running LLaVA-1.5 Inference
Let’s load LLaVA-1.5-7B and run visual question answering on a sample image.
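The snippet below is a minimal sketch using the llava-hf/llava-1.5-7b-hf checkpoint on the Hugging Face Hub and a COCO sample image as a placeholder; the 4-bit quantization config (via bitsandbytes) is what keeps the model within a single 24 GB GPU.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"

# 4-bit quantization keeps the 7B model comfortably within ~24 GB of VRAM.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

# Example image; swap in your own.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The llava-hf 1.5 checkpoints expect the Vicuna-style chat format below.
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```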
Hands-On: Running LLaVA-OneVision
For the latest multi-modal capabilities, use LLaVA-OneVision through the Hugging Face transformers library:
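A minimal sketch, assuming the llava-hf/llava-onevision-qwen2-7b-ov-hf checkpoint and a recent transformers release that includes the LLaVA-OneVision model class; the bundled chat template places the image token for you.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# OneVision checkpoints ship a chat template; apply_chat_template builds the prompt
# with the image placeholder in the correct position.
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image and count the animals."}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```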
Fine-Tuning LLaVA with LoRA
For domain-specific tasks (medical imaging, remote sensing, industrial inspection), you can fine-tune LLaVA efficiently using LoRA — learning only low-rank updates to the LLM’s attention weights while keeping everything else frozen.
Preparing Your Dataset
LLaVA expects instruction-following data in a specific JSON format, where <image> is a placeholder token replaced by projected visual features at runtime.
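An illustrative record in that format is shown below; the paths, ids, and conversation text are made up for the example, but the keys (id, image, conversations, from, value) follow the structure the LLaVA training code expects.

```python
import json

# One illustrative training record in LLaVA's conversation format.
sample = {
    "id": "example-0001",
    "image": "images/defect_0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nIs there a visible defect on this part?"},
        {"from": "gpt", "value": "Yes, there is a hairline crack near the upper-left mounting hole."},
        {"from": "human", "value": "How severe is it?"},
        {"from": "gpt", "value": "It appears superficial, but it should be flagged for manual inspection."},
    ],
}

# The training file is a JSON list of such records.
with open("my_dataset.json", "w") as f:
    json.dump([sample], f, indent=2)
```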
LoRA Configuration
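The exact hyperparameters live in the official training scripts; the sketch below shows a representative configuration expressed with the peft library's LoraConfig, with values that are a reasonable starting point rather than the official recipe.

```python
from peft import LoraConfig

# Representative LoRA settings for the language model's attention projections.
lora_config = LoraConfig(
    r=128,                 # rank of the low-rank update matrices
    lora_alpha=256,        # scaling factor (alpha / r scales the update)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Target the attention projections of the LLaMA/Vicuna backbone.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

Targeting only the attention projections keeps the trainable parameters to a small fraction of the 7B backbone; if the model underfits your domain, raise r (and lora_alpha proportionally) or extend target_modules to the MLP projections.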
Training Script
The official LLaVA repository (linked in the References) provides DeepSpeed-compatible training scripts for both full fine-tuning and LoRA; point them at the JSON dataset and LoRA settings described above.
Merging and Deploying
After training, merge the LoRA weights back into the base model for deployment:
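If your fine-tuning run produced a peft adapter, the generic merge looks like the sketch below; the paths and names are placeholders, and the official repository also provides its own merging utility for checkpoints produced by its scripts.

```python
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

# Placeholder paths; point these at your base checkpoint and trained adapter.
base_id = "llava-hf/llava-1.5-7b-hf"
adapter_dir = "./checkpoints/llava-lora-mydomain"

base = LlavaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)

merged = model.merge_and_unload()        # fold the low-rank updates into the base weights
merged.save_pretrained("./llava-merged-mydomain")
```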
Architecture Comparison Table
| Feature | LLaVA (v1) | LLaVA-1.5 | LLaVA-OneVision | LLaVA-OV-1.5 |
|---|---|---|---|---|
| Vision Encoder | CLIP ViT-L/14 224px | CLIP ViT-L/14 336px | SigLIP-SO400M | MLCD ViT + 2D RoPE |
| Projector | Linear | 2-layer MLP (GELU) | 2-layer MLP | 2-layer MLP + pooling |
| LLM | Vicuna-7B/13B | Vicuna-7B/13B | Qwen2 0.5B/7B/72B | Qwen2.5/3 4B/8B |
| Resolution | 224×224 | 336×336 | Up to 2304×2304 | Native resolution |
| Visual Tokens | 256 | 576 | 729 per crop | Dynamic |
| Pre-train Data | CC3M 595K | LCS-558K | LCS-558K + 4.7M | 558K + 85M |
| Instruct Data | 158K | 665K | 3.2M + 1.6M | Multi-stage |
| Modalities | Image | Image | Image + Multi-image + Video | Image + Multi-image + Video |
| Training Cost | ~1 day, 8×A100 | ~1 day, 8×A100 | 256×A100 | ~$16K on A100 |
Key Takeaways
The LLaVA family demonstrates several principles that extend beyond this specific model:
- Simplicity wins. A direct projection from vision features to LLM token space outperforms more complex cross-modal fusion mechanisms (Q-Former, Perceiver) when combined with high-quality instruction data.
- Data quality over architecture. The jump from LLaVA to LLaVA-1.5 came primarily from better data mixing (academic VQA tasks, response format prompts) rather than fundamental architectural changes.
- Modularity enables rapid iteration. Because the vision encoder, projector, and LLM are cleanly separated, each can be upgraded independently. Swapping Vicuna for LLaMA-3 or Qwen2 requires no architectural changes.
- Visual instruction tuning is more impactful than large-scale pre-training. The LLaVA-1.5 paper showed that a model using only 1.2M publicly available data samples and completing training in about one day on a single 8×A100 node could match or exceed models trained on orders of magnitude more data.
References
- H. Liu, C. Li, Q. Wu, Y.J. Lee. Visual Instruction Tuning. NeurIPS 2023 (Oral). arXiv:2304.08485
- H. Liu, C. Li, Y. Li, Y.J. Lee. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744
- B. Li et al. LLaVA-OneVision: Easy Visual Task Transfer. TMLR 2025. arXiv:2408.03326
- X. An et al. LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. arXiv:2509.23661
- A. Radford et al. Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021.
- LLaVA Project Page: https://llava-vl.github.io/
- LLaVA GitHub: https://github.com/haotian-liu/LLaVA
- LLaVA-NeXT GitHub: https://github.com/LLaVA-VL/LLaVA-NeXT

