In this assignment you will study how a modern vision-language model (VLM) integrates visual and textual information. The focus is on understanding:
  • How LLaVA connects a vision encoder to a language model
  • How the training pipeline enables multimodal behavior
  • What architectural trade-offs shape vision-language model design

Core References

You are expected to read:
  1. Visual Instruction Tuning (LLaVA), Liu et al., 2023
  2. Improved Baselines with Visual Instruction Tuning (LLaVA-1.5), Liu et al., 2023
  3. Learning Transferable Visual Models From Natural Language Supervision (CLIP), Radford et al., 2021
Focus on:
  • Architecture (how components connect)
  • Training pipeline (how the model is aligned)

Deliverables

A tutorial-style written report (4-6 pages, .md or .ipynb) that renders correctly on GitHub. The report should guide the reader through each concept with clear explanations, diagrams, and worked examples.

Part 1, Architecture Understanding

Task 1.1: Forward Pass Explanation

Describe the full data flow in LLaVA:
  1. Image input
  2. Vision encoder (CLIP)
  3. Projection layer
  4. Language model input
  5. Text generation
You should:
  • Provide a simple diagram
  • Explain what each component does
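  • Describe how image features become inputs to the language model (continuous embeddings placed alongside the text token embeddings, not discrete text tokens)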
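
As a starting point for the diagram, the data flow can be traced as a sequence of tensor shapes. The sketch below is illustrative only: the function names are hypothetical, and the numbers in the usage example (336 px input, 14 px patches, 1024-dimensional vision features, 4096-dimensional LLM embeddings, 32 text tokens) are assumptions that roughly match a CLIP ViT-L/14 encoder paired with a 7B language model. It also assumes only patch features are passed on, without the class token. Check the papers for the exact configuration you describe.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts the image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def trace_shapes(
    image_size: int,
    patch_size: int,
    vision_dim: int,
    llm_dim: int,
    num_text_tokens: int,
) -> List[Tuple[str, Shape]]:
    """Return (stage name, tensor shape) pairs for steps 1-4 of the forward pass.

    Step 5 (text generation) is ordinary autoregressive decoding over the
    combined sequence, so it does not introduce a new shape here.
    """
    n = num_patches(image_size, patch_size)
    return [
        ("1. image (C, H, W)", (3, image_size, image_size)),
        ("2. vision encoder patch features", (n, vision_dim)),
        ("3. projected visual tokens", (n, llm_dim)),
        ("4. LLM input: visual + text embeddings", (n + num_text_tokens, llm_dim)),
    ]


if __name__ == "__main__":
    # Assumed, illustrative configuration (not taken from the assignment text).
    for name, shape in trace_shapes(336, 14, 1024, 4096, 32):
        print(f"{name:42s} {shape}")
```

The point to notice is that the projection changes only the feature dimension, not the number of visual tokens, so every image patch occupies one position in the language model's input sequence.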

Task 1.2: Projection Layer Intuition

LLaVA maps vision features into the language model space:

$$z = W \cdot f_{\text{vision}}(x)$$

Explain:
  • Why a simple linear or MLP projection works
  • What assumption is made about embedding spaces
  • What could go wrong if alignment is poor
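
To make the two projector variants concrete, here is a minimal dependency-free sketch. It assumes the original LLaVA uses a single linear layer and LLaVA-1.5 uses a two-layer MLP with a GELU activation, as described in the references. All function names, the tiny dimensions, and the random weights are hypothetical and chosen only for illustration; a real implementation would use a tensor library and trained weights.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(x: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """Compute x @ w + b for x of shape (n, d_in) and w of shape (d_in, d_out)."""
    columns = list(zip(*w))
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) + bj for col, bj in zip(columns, b)]
        for row in x
    ]


def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear_projector(feats: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """z = W * f_vision(x): one linear map from vision space to LLM space."""
    return linear(feats, w, b)


def mlp_projector(
    feats: Matrix, w1: Matrix, b1: List[float], w2: Matrix, b2: List[float]
) -> Matrix:
    """Two-layer MLP variant: linear -> GELU -> linear."""
    hidden = [[gelu(v) for v in row] for row in linear(feats, w1, b1)]
    return linear(hidden, w2, b2)


if __name__ == "__main__":
    rng = random.Random(0)
    vision_dim, llm_dim, n_patches = 8, 16, 4  # toy sizes, not real model sizes
    feats = random_matrix(n_patches, vision_dim, rng)
    z = linear_projector(feats, random_matrix(vision_dim, llm_dim, rng), [0.0] * llm_dim)
    print(len(z), len(z[0]))  # one llm_dim-sized vector per patch
```

Both variants act on each patch vector independently, which is one way to phrase the assumption the task asks about: the vision features already carry the needed semantics, and only a change of coordinates is learned.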

Task 1.3: Key Design Choice

Answer: Why does LLaVA avoid cross-attention (used in models like Flamingo) and instead inject projected tokens directly into the LLM? Discuss in terms of:
  • Simplicity
  • Efficiency
  • Limitations
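
For the efficiency discussion, a rough back-of-the-envelope count can help. The sketch below compares the number of query-key pairs per attention layer when visual tokens are injected into the sequence versus when text tokens attend to visual features through a separate cross-attention layer. It is a deliberately crude model under stated assumptions: the function names are hypothetical, the causal mask is ignored, and so are the extra parameters of cross-attention layers and any resampling of visual features to a smaller set. The token counts in the usage example are assumed values.

```python
def injected_attention_pairs(n_img: int, n_txt: int) -> int:
    """Self-attention over one sequence holding both visual and text tokens."""
    n = n_img + n_txt
    return n * n


def cross_attention_pairs(n_img: int, n_txt: int) -> int:
    """Text-only self-attention plus text-to-image cross-attention."""
    return n_txt * n_txt + n_txt * n_img


if __name__ == "__main__":
    n_img, n_txt = 576, 64  # assumed, illustrative token counts
    print(injected_attention_pairs(n_img, n_txt))  # 409600
    print(cross_attention_pairs(n_img, n_txt))     # 40960
```

Under these assumptions, direct injection pays more attention compute per layer because the visual tokens lengthen the sequence, but it needs no new layers inside the language model. That is the simplicity versus efficiency tension the task asks you to discuss.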

Part 2, Training Pipeline

Task 2.1: Two-Stage Training

Explain the two stages:
  1. Feature alignment
  2. Visual instruction tuning
Write the training objective:

$$\mathcal{L} = - \sum_{t} \log P(y_t \mid x_{\text{image}}, y_{<t})$$

In simple terms:
  • What is the model learning in each stage?
  • Why are both stages needed?
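
The objective and the two-stage schedule can be sketched in a few lines. This is a minimal illustration, not the authors' training code: `autoregressive_nll` and `TRAINABLE` are hypothetical names, and the probabilities in the usage example are made up. It assumes, following the LLaVA paper, that the loss is summed only over the tokens the model should generate, that stage 1 updates only the projection, and that stage 2 updates the projection and the language model while the vision encoder stays frozen.

```python
import math
from typing import Dict, List


def autoregressive_nll(target_probs: List[float], is_answer: List[bool]) -> float:
    """L = -sum_t log P(y_t | x_image, y_<t), restricted to answer positions.

    target_probs[t] is the probability the model assigned to the correct
    token y_t; is_answer[t] marks positions that contribute to the loss.
    """
    if len(target_probs) != len(is_answer):
        raise ValueError("target_probs and is_answer must have the same length")
    return -sum(math.log(p) for p, keep in zip(target_probs, is_answer) if keep)


# Which components receive gradient updates in each stage (assumed schedule).
TRAINABLE: Dict[str, Dict[str, bool]] = {
    "stage1_feature_alignment": {"vision_encoder": False, "projection": True, "llm": False},
    "stage2_instruction_tuning": {"vision_encoder": False, "projection": True, "llm": True},
}


if __name__ == "__main__":
    # Made-up probabilities: one prompt token (masked out), two answer tokens.
    loss = autoregressive_nll([0.5, 0.25, 0.5], [False, True, True])
    print(round(loss, 4))
```

The loss function is identical in both stages; what changes is the data and the set of trainable parameters, which is a useful framing for the question of why both stages are needed.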

Task 2.2: Synthetic Data

LLaVA uses GPT-generated instruction data. Answer:
  • Why is synthetic data used instead of human annotation?
  • What biases might this introduce?
  • Does this limit generalization?

Part 3, Reflection

Answer the following clearly:
  1. Is LLaVA truly multimodal, or is it a language model conditioned on visual features?
  2. Where does alignment actually happen: in the projection layer or during instruction tuning?
  3. What is the biggest limitation of this architecture?

Evaluation Rubric

Component                      Weight
Architecture understanding     40%
Training pipeline clarity      35%
Reflection                     25%

Notes

  • Keep explanations precise and grounded in the core references
  • Avoid overly long descriptions; focus on clarity
  • Use figures or diagrams where helpful