In this assignment you will study how a modern vision-language model (VLM) integrates visual and textual information. The focus is on understanding:
  • How LLaVA connects a vision encoder to a language model
  • How the training pipeline enables multimodal behavior
  • What architectural trade-offs shape vision-language model design

Core References

You are expected to read:
  1. Visual Instruction Tuning (LLaVA) — Liu et al., 2023
  2. Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)
  3. CLIP: Learning Transferable Visual Models From Natural Language Supervision — Radford et al., 2021
Focus on:
  • Architecture (how components connect)
  • Training pipeline (how the model is aligned)

Deliverables

A tutorial-style written report (4–6 pages, .md or .ipynb) that renders correctly on GitHub. The report should guide the reader through each concept with clear explanations, diagrams, and worked examples.

Part 1 — Architecture Understanding

Task 1.1: Forward Pass Explanation

Describe the full data flow in LLaVA:
  1. Image input
  2. Vision encoder (CLIP)
  3. Projection layer
  4. Language model input
  5. Text generation
You should:
  • Provide a simple diagram
  • Explain what each component does
  • Describe how image features become text tokens
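To make the data flow concrete, here is a minimal NumPy sketch of the pipeline. All dimensions, function names, and the random stand-ins for CLIP and the LLM are illustrative assumptions, not LLaVA's real implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, much smaller than the real model)
d_vision, d_llm, n_patches, vocab = 64, 128, 16, 1000

def vision_encoder(image):
    # Stand-in for the CLIP ViT: one feature vector per image patch
    return rng.standard_normal((n_patches, d_vision))

# Stand-in projection matrix mapping vision features -> LLM space
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.02

def project(patch_feats):
    # Projection layer: turns patch features into "visual tokens"
    return patch_feats @ W_proj

def llm_next_token_logits(embeddings):
    # Stand-in for the language model head: logits over the vocabulary
    return rng.standard_normal(vocab)

image = np.zeros((224, 224, 3))                     # 1. image input
patch_feats = vision_encoder(image)                 # 2. CLIP features (16, 64)
visual_tokens = project(patch_feats)                # 3. projection   (16, 128)
text_embeds = rng.standard_normal((5, d_llm))       # embedded prompt tokens
llm_input = np.concatenate([visual_tokens, text_embeds])  # 4. LLM input (21, 128)
logits = llm_next_token_logits(llm_input)           # 5. text generation step
next_token = int(np.argmax(logits))
```

The key point the sketch shows: after projection, visual tokens and text tokens live in the same embedding space and are simply concatenated into one sequence.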

Task 1.2: Projection Layer Intuition

LLaVA maps vision features into the language model's embedding space:

$z = W \cdot f_{\text{vision}}(x)$

Explain:
  • Why a simple linear or MLP projection works
  • What assumption is made about embedding spaces
  • What could go wrong if alignment is poor
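As a reference point for your explanation, here is a minimal sketch of the projection. LLaVA uses a single linear layer; LLaVA-1.5 replaces it with a two-layer MLP with GELU. The sizes and initialization below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 64, 128  # illustrative sizes, not the real dimensions

# Two-layer MLP projector (LLaVA-1.5 style); the original LLaVA
# would be just a single matrix W.
W1 = rng.standard_normal((d_vision, d_llm)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(f_vision):
    # Maps CLIP patch features into the LLM's token embedding space
    return gelu(f_vision @ W1) @ W2

z = project(rng.standard_normal((16, d_vision)))  # (16, 128) visual tokens
```

Note how little machinery this is: the implicit assumption is that the two embedding spaces are close enough that a shallow learned map can bridge them.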

Task 1.3: Key Design Choice

Answer: Why does LLaVA avoid cross-attention (used in models like Flamingo) and instead inject projected tokens directly into the LLM? Discuss in terms of:
  • Simplicity
  • Efficiency
  • Limitations
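The two injection styles you are asked to compare can be contrasted in a few lines. This is a schematic sketch (single-head, no scaling tricks, assumed dimensions), not either model's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
vis = rng.standard_normal((4, d))   # projected visual tokens
txt = rng.standard_normal((3, d))   # text token embeddings

# LLaVA-style: visual tokens are prepended to the text sequence;
# the LLM's ordinary self-attention mixes modalities, with no new
# parameters inside the LLM itself.
prefix_input = np.concatenate([vis, txt])  # (7, d)

# Flamingo-style: text queries attend to visual keys/values through
# added cross-attention layers; Wq, Wk, Wv are *new* parameters.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
scores = (txt @ Wq) @ (vis @ Wk).T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
fused = attn @ (vis @ Wv)  # (3, d): text enriched with visual context
```

Use this contrast to ground your discussion: the prefix approach adds nothing but input tokens (simple, but lengthens the sequence), while cross-attention adds trainable layers (more capacity, more cost).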

Part 2 — Training Pipeline

Task 2.1: Two-Stage Training

Explain the two stages:
  1. Feature alignment
  2. Visual instruction tuning
Write the training objective:

$\mathcal{L} = - \sum_{t} \log P(y_t \mid x_{\text{image}}, y_{<t})$

In simple terms:
  • What is the model learning in each stage?
  • Why are both stages needed?
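The objective above is standard autoregressive cross-entropy, conditioned on the image tokens. A minimal NumPy sketch (the logits and targets are made-up placeholders for the model's outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, T = 10, 4  # tiny illustrative vocabulary and sequence length

def log_softmax(logits):
    # Numerically stable log-probabilities over the vocabulary
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Stand-in for model outputs at each step t, conditioned on
# (x_image, y_<t); in the real model these come from the LLM head.
logits = rng.standard_normal((T, vocab))
targets = np.array([1, 4, 2, 7])  # ground-truth response tokens y_t

logp = log_softmax(logits)
loss = -logp[np.arange(T), targets].sum()  # L = -sum_t log P(y_t | ...)
```

The same loss is used in both stages; what differs is which parameters are trained (projector only in stage 1, projector plus LLM in stage 2) and what data supervises it.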

Task 2.2: Synthetic Data

LLaVA uses GPT-generated instruction data. Answer:
  • Why is synthetic data used instead of human annotation?
  • What biases might this introduce?
  • Does this limit generalization?

Part 3 — Reflection

Answer the following clearly:
  1. Is LLaVA truly multimodal, or is it a language model conditioned on visual features?
  2. Where does alignment actually happen — projection layer or instruction tuning?
  3. What is the biggest limitation of this architecture?

Evaluation Rubric

Component                     Weight
Architecture understanding    40%
Training pipeline clarity     35%
Reflection                    25%

Notes

  • Keep explanations precise and grounded in the paper
  • Avoid overly long descriptions; focus on clarity
  • Use figures or diagrams where helpful