BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models) is a recent vision-language model (VLM) that achieves strong performance on multimodal tasks such as image captioning and visual question answering by integrating pre-trained vision and language models. Its central contribution is a lightweight Querying Transformer (Q-Former) that connects the visual features of a frozen image encoder to a frozen large language model (LLM). Because both backbones stay frozen and only the Q-Former is trained, BLIP-2 leverages the strengths of off-the-shelf vision and language models without expensive end-to-end fine-tuning of the entire architecture. Key references: (Lu et al., 2016; Abu-El-Haija et al., 2016; Szegedy et al., 2015; Xu et al., 2015)
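
To make the bridging idea concrete, here is a minimal PyTorch sketch: a small set of learned query tokens cross-attends to the frozen encoder's patch features, and the query outputs are projected into the LLM's input-embedding space so the LLM can consume them as "soft" visual tokens. This is illustrative only; the class name QFormerBridge, the dimensions (ViT-style features of width 1024, an OPT-2.7B-style LLM width of 2560), and the layer count are assumptions, and the real Q-Former is a BERT-style transformer with alternating self- and cross-attention rather than a generic transformer decoder.

```python
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    """Illustrative sketch of BLIP-2's bridging idea (not the actual Q-Former):
    learned queries attend to frozen image features and are projected into
    the frozen LLM's token-embedding space."""

    def __init__(self, vision_dim=1024, qformer_dim=768, llm_dim=2560,
                 num_queries=32, num_heads=12, num_layers=2):
        super().__init__()
        # Learnable query tokens (BLIP-2 uses 32 of them).
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        # Map frozen image features into the Q-Former width.
        self.vision_proj = nn.Linear(vision_dim, qformer_dim)
        # Decoder layers: self-attention over the queries plus
        # cross-attention to the image features.
        layer = nn.TransformerDecoderLayer(
            d_model=qformer_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Final projection into the (frozen) LLM's embedding space.
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen encoder.
        memory = self.vision_proj(image_feats)
        queries = self.queries.expand(image_feats.size(0), -1, -1)
        out = self.decoder(tgt=queries, memory=memory)
        # (batch, num_queries, llm_dim): prepended to the LLM's text embeddings.
        return self.llm_proj(out)


if __name__ == "__main__":
    bridge = QFormerBridge()
    fake_image_feats = torch.randn(2, 257, 1024)  # e.g. ViT-L patch features
    visual_tokens = bridge(fake_image_feats)
    print(visual_tokens.shape)  # torch.Size([2, 32, 2560])
```

In this sketch only the bridge parameters would receive gradients, mirroring the point above that the image encoder and LLM remain frozen while the small connector module is trained.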

References

  • Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., et al. (2016). YouTube-8M: A Large-Scale Video Classification Benchmark.
  • Lu, J., Xiong, C., Parikh, D., Socher, R. (2016). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision.
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.