In this chapter,
  • We start with CLIP, the classic image-text alignment model trained with contrastive learning (see the sketch of the contrastive objective after this list).
  • We then cover BLIP-2, which introduces the idea of pairing a frozen vision encoder (often CLIP or a ViT) with a powerful LLM (such as FlanT5 or Llama) through a “querying” module that connects the two.
  • Finally, we present LLaVA, which goes a step further: it directly combines a vision encoder (CLIP ViT) with a large language model (Llama 2, Vicuna, etc.) for instruction following, dialog, and rich vision-language reasoning.
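
To make the first bullet concrete, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style training uses, written in PyTorch. The function name `clip_contrastive_loss` and the fixed `temperature` value are illustrative assumptions, not CLIP's actual API; the original model learns the temperature as a trainable parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) outputs of the two encoders.
    """
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix scaled by temperature:
    # logits[i, j] = similarity of image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for real encoder outputs.
imgs = torch.randn(8, 512)
txts = torch.randn(8, 512)
print(clip_contrastive_loss(imgs, txts))
```

Pulling each image toward its paired caption and away from the other captions in the batch (and vice versa) is what aligns the two embedding spaces; BLIP-2 and LLaVA then build on encoders trained this way.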
