In this assignment you will study how a modern vision-language model (VLM) integrates visual and textual information. The focus is on understanding:
  • How LLaVA connects a vision encoder to a language model
  • How the training pipeline enables multimodal behavior
  • What architectural trade-offs shape vision-language model design

Core References

You are expected to read:
  1. Visual Instruction Tuning (LLaVA) — Liu et al., 2023
  2. Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)
  3. CLIP: Learning Transferable Visual Models From Natural Language Supervision — Radford et al., 2021
Focus on:
  • Architecture (how components connect)
  • Training pipeline (how the model is aligned)

Deliverables

A tutorial-style written report (4–6 pages, .md or .ipynb) that renders correctly on GitHub. The report should guide the reader through each concept with clear explanations, diagrams, and worked examples.

Part 1 — Architecture Understanding

Task 1.1: Forward Pass Explanation

Describe the full data flow in LLaVA:
  1. Image input
  2. Vision encoder (CLIP)
  3. Projection layer
  4. Language model input
  5. Text generation
You should:
  • Provide a simple diagram
  • Explain what each component does
  • Describe how image features become text tokens
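To make the data flow concrete, here is a minimal NumPy sketch of the pipeline. All dimensions, function names, and the random stand-ins for CLIP and the LLM are illustrative assumptions, not LLaVA's real implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, much smaller than the real model)
d_vision, d_llm, n_patches, vocab = 64, 128, 16, 1000

def vision_encoder(image):
    # Stand-in for the CLIP ViT: one feature vector per image patch
    return rng.standard_normal((n_patches, d_vision))

# Stand-in projection matrix mapping vision features -> LLM space
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.02

def project(patch_feats):
    # Projection layer: turns patch features into "visual tokens"
    return patch_feats @ W_proj

def llm_next_token_logits(embeddings):
    # Stand-in for the language model head: logits over the vocabulary
    return rng.standard_normal(vocab)

image = np.zeros((224, 224, 3))                     # 1. image input
patch_feats = vision_encoder(image)                 # 2. CLIP features (16, 64)
visual_tokens = project(patch_feats)                # 3. projection   (16, 128)
text_embeds = rng.standard_normal((5, d_llm))       # embedded prompt tokens
llm_input = np.concatenate([visual_tokens, text_embeds])  # 4. LLM input (21, 128)
logits = llm_next_token_logits(llm_input)           # 5. text generation step
next_token = int(np.argmax(logits))
```

The key point the sketch shows: after projection, visual tokens and text tokens live in the same embedding space and are simply concatenated into one sequence.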

Task 1.2: Projection Layer Intuition

LLaVA maps vision features into the language model's embedding space:

$z = W \cdot f_{\text{vision}}(x)$

Explain:
  • Why a simple linear or MLP projection works
  • What assumption is made about embedding spaces
  • What could go wrong if alignment is poor
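As a reference point for your explanation, here is a minimal sketch of the projection. LLaVA uses a single linear layer; LLaVA-1.5 replaces it with a two-layer MLP with GELU. The sizes and initialization below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 64, 128  # illustrative sizes, not the real dimensions

# Two-layer MLP projector (LLaVA-1.5 style); the original LLaVA
# would be just a single matrix W.
W1 = rng.standard_normal((d_vision, d_llm)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(f_vision):
    # Maps CLIP patch features into the LLM's token embedding space
    return gelu(f_vision @ W1) @ W2

z = project(rng.standard_normal((16, d_vision)))  # (16, 128) visual tokens
```

Note how little machinery this is: the implicit assumption is that the two embedding spaces are close enough that a shallow learned map can bridge them.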

Task 1.3: Key Design Choice

Answer: Why does LLaVA avoid cross-attention (used in models like Flamingo) and instead inject projected tokens directly into the LLM? Discuss in terms of:
  • Simplicity
  • Efficiency
  • Limitations
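The two injection styles you are asked to compare can be contrasted in a few lines. This is a schematic sketch (single-head, no scaling tricks, assumed dimensions), not either model's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
vis = rng.standard_normal((4, d))   # projected visual tokens
txt = rng.standard_normal((3, d))   # text token embeddings

# LLaVA-style: visual tokens are prepended to the text sequence;
# the LLM's ordinary self-attention mixes modalities, with no new
# parameters inside the LLM itself.
prefix_input = np.concatenate([vis, txt])  # (7, d)

# Flamingo-style: text queries attend to visual keys/values through
# added cross-attention layers; Wq, Wk, Wv are *new* parameters.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
scores = (txt @ Wq) @ (vis @ Wk).T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
fused = attn @ (vis @ Wv)  # (3, d): text enriched with visual context
```

Use this contrast to ground your discussion: the prefix approach adds nothing but input tokens (simple, but lengthens the sequence), while cross-attention adds trainable layers (more capacity, more cost).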

Part 2 — Training Pipeline

Task 2.1: Two-Stage Training

Explain the two stages:
  1. Feature alignment
  2. Visual instruction tuning
Write the training objective:

$\mathcal{L} = - \sum_{t} \log P(y_t \mid x_{\text{image}}, y_{<t})$

In simple terms:
  • What is the model learning in each stage?
  • Why are both stages needed?
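The objective above is standard autoregressive cross-entropy, conditioned on the image tokens. A minimal NumPy sketch (the logits and targets are made-up placeholders for the model's outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, T = 10, 4  # tiny illustrative vocabulary and sequence length

def log_softmax(logits):
    # Numerically stable log-probabilities over the vocabulary
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Stand-in for model outputs at each step t, conditioned on
# (x_image, y_<t); in the real model these come from the LLM head.
logits = rng.standard_normal((T, vocab))
targets = np.array([1, 4, 2, 7])  # ground-truth response tokens y_t

logp = log_softmax(logits)
loss = -logp[np.arange(T), targets].sum()  # L = -sum_t log P(y_t | ...)
```

The same loss is used in both stages; what differs is which parameters are trained (projector only in stage 1, projector plus LLM in stage 2) and what data supervises it.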

Task 2.2: Synthetic Data

LLaVA uses GPT-generated instruction data. Answer:
  • Why is synthetic data used instead of human annotation?
  • What biases might this introduce?
  • Does this limit generalization?

Part 3 — Reflection

Answer the following clearly:
  1. Is LLaVA truly multimodal, or is it a language model conditioned on visual features?
  2. Where does alignment actually happen — projection layer or instruction tuning?
  3. What is the biggest limitation of this architecture?

Evaluation Rubric

Component                     Weight
Architecture understanding    40%
Training pipeline clarity     35%
Reflection                    25%

Notes

  • Keep explanations precise and grounded in the paper
  • Avoid overly long descriptions; focus on clarity
  • Use figures or diagrams where helpful