In this chapter,
  • We start with CLIP, the classic image-text alignment model trained with contrastive learning (see the sketch of the contrastive objective after this list).
  • We then cover BLIP-2, which introduces the idea of pairing a frozen vision encoder (often CLIP or a ViT) with a powerful LLM (such as FlanT5 or Llama) through a “querying” module that connects the two.
  • Finally, we present LLaVA, which goes a step further: it directly combines a vision encoder (CLIP ViT) with a large language model (Llama 2, Vicuna, etc.) for instruction following, dialog, and rich vision-language reasoning.
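
To make the first bullet concrete, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style training uses, written in PyTorch. The function name `clip_contrastive_loss` and the fixed `temperature` value are illustrative assumptions, not CLIP's actual API; the original model learns the temperature as a trainable parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) outputs of the two encoders.
    """
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix scaled by temperature:
    # logits[i, j] = similarity of image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for real encoder outputs.
imgs = torch.randn(8, 512)
txts = torch.randn(8, 512)
print(clip_contrastive_loss(imgs, txts))
```

Pulling each image toward its paired caption and away from the other captions in the batch (and vice versa) is what aligns the two embedding spaces; BLIP-2 and LLaVA then build on encoders trained this way.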
