- How LLaVA connects a vision encoder to a language model
- How the training pipeline enables multimodal behavior
- What architectural trade-offs shape vision-language model design
## Core References

You are expected to read:

- Visual Instruction Tuning (LLaVA) — Liu et al., 2023
- Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) — Liu et al., 2023
- CLIP: Learning Transferable Visual Models From Natural Language Supervision — Radford et al., 2021
- Architecture (how components connect)
- Training pipeline (how the model is aligned)
## Deliverables

A tutorial-style written report (4–6 pages, `.md` or `.ipynb`) that renders cleanly on GitHub. The report should guide the reader through each concept with clear explanations, diagrams, and worked examples.
## Part 1 — Architecture Understanding

### Task 1.1: Forward Pass Explanation

Describe the full data flow in LLaVA:

- Image input
- Vision encoder (CLIP)
- Projection layer
- Language model input
- Text generation
- Provide a simple diagram
- Explain what each component does
- Describe how image features become text tokens
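The data flow above can be sketched in a few lines. This is a minimal shape-level illustration, not LLaVA's actual implementation; the dimensions are assumptions roughly in the spirit of LLaVA-1.5 (CLIP ViT-L/14 at 336px yields 576 patch tokens of width 1024; a Vicuna-7B-class LLM uses embedding width 4096), and random arrays stand in for the real encoder and embedding table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (see lead-in): patches from the vision encoder,
# vision feature width, and LLM embedding width.
N_PATCHES, D_VISION, D_LLM = 576, 1024, 4096

# 1. Vision encoder output: one feature vector per image patch.
patch_features = rng.standard_normal((N_PATCHES, D_VISION))

# 2. Projection layer: a linear map into the LLM's embedding space.
W_proj = rng.standard_normal((D_VISION, D_LLM)) * 0.02
visual_tokens = patch_features @ W_proj            # shape (576, 4096)

# 3. Text prompt embeddings (stand-in for the LLM's embedding lookup).
text_tokens = rng.standard_normal((12, D_LLM))     # e.g. a short question

# 4. The LLM consumes visual and text tokens as one interleaved sequence.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)                             # (588, 4096)
```

The key observation for your report: after the projection, visual features are indistinguishable in shape from ordinary token embeddings, which is exactly what lets the LLM process them without architectural changes.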
### Task 1.2: Projection Layer Intuition

LLaVA maps vision features into the language model's embedding space. Explain:

- Why a simple linear or MLP projection works
- What assumption is made about embedding spaces
- What could go wrong if alignment is poor
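As a concrete anchor for this task: the original LLaVA used a single linear layer, and LLaVA-1.5 replaced it with a two-layer MLP. A minimal numpy sketch of the MLP variant follows; the GELU approximation, initialization, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_project(v, W1, b1, W2, b2):
    """Two-layer MLP projection: vision width -> LLM embedding width."""
    return gelu(v @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
D_VISION, D_LLM = 1024, 4096          # assumed widths, as in the Task 1.1 sketch

W1 = rng.standard_normal((D_VISION, D_LLM)) * 0.02
b1 = np.zeros(D_LLM)
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02
b2 = np.zeros(D_LLM)

v = rng.standard_normal((576, D_VISION))   # one feature per image patch
out = mlp_project(v, W1, b1, W2, b2)
print(out.shape)                           # (576, 4096)
```

Note the implicit assumption worth discussing in your answer: a single learned map (linear or shallow MLP) can only succeed if the two embedding spaces are already structured compatibly enough that a simple transformation bridges them.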
### Task 1.3: Key Design Choice

Answer: why does LLaVA avoid cross-attention (used in models like Flamingo) and instead inject projected visual tokens directly into the LLM's input sequence? Discuss in terms of:

- Simplicity
- Efficiency
- Limitations
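One way to ground the efficiency argument is rough parameter arithmetic. The numbers below are illustrative assumptions (Flamingo actually inserts gated cross-attention only every few layers, and exact sizes differ), but they convey the order-of-magnitude contrast between a single shared projection and per-layer cross-attention blocks.

```python
# Illustrative parameter counts; not the real models' exact sizes.
d_vision, d_llm, n_layers = 1024, 4096, 32

# Direct injection: one two-layer MLP projection shared by the whole model.
mlp_projection = d_vision * d_llm + d_llm * d_llm        # ~21M parameters

# Cross-attention: new Q, K, V, O matrices at (say) every LLM layer.
per_layer_xattn = 4 * d_llm * d_llm
cross_attention = n_layers * per_layer_xattn             # ~2.1B parameters

print(f"projection: {mlp_projection:,}  cross-attn: {cross_attention:,}")
```

Even under these simplified assumptions, the direct-injection design adds orders of magnitude fewer new parameters, at the cost of spending LLM context length on visual tokens.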
## Part 2 — Training Pipeline

### Task 2.1: Two-Stage Training

Explain the two stages:

- Feature alignment

- Visual instruction tuning
- What is the model learning in each stage?
- Why are both stages needed?
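The two stages can be summarized as a freezing schedule. The sketch below is a minimal stand-in (module names are illustrative, not LLaVA's code): stage 1 trains only the projection while the vision encoder and LLM stay frozen; stage 2 unfreezes the LLM as well, while the vision encoder remains frozen throughout.

```python
ALL_MODULES = {"vision_encoder", "projection", "llm"}

def trainable(stage):
    """Which modules receive gradient updates in each training stage."""
    if stage == 1:   # Stage 1, feature alignment: only the projection learns.
        return {"projection"}
    if stage == 2:   # Stage 2, visual instruction tuning: projection + LLM learn.
        return {"projection", "llm"}
    raise ValueError(f"unknown stage: {stage}")

# The vision encoder stays frozen in both stages.
for stage in (1, 2):
    assert "vision_encoder" not in trainable(stage)
print(trainable(1), trainable(2))
```

This schedule is worth connecting to your answer on why both stages are needed: stage 1 cheaply calibrates the projection without disturbing the LLM, and stage 2 teaches the now-aligned model to follow visual instructions.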
### Task 2.2: Synthetic Data

LLaVA uses GPT-4-generated instruction data. Answer:

- Why is synthetic data used instead of human annotation?
- What biases might this introduce?
- Does this limit generalization?
## Part 3 — Reflection

Answer the following clearly:

- Is LLaVA truly multimodal, or is it a language model conditioned on visual features?
- Where does alignment actually happen — projection layer or instruction tuning?
- What is the biggest limitation of this architecture?
## Evaluation Rubric

| Component | Weight |
|---|---|
| Architecture understanding | 40% |
| Training pipeline clarity | 35% |
| Reflection | 25% |
## Notes

- Keep explanations precise and grounded in the paper
- Avoid overly long descriptions; focus on clarity
- Use figures or diagrams where helpful