Visual Instruction Tuning - LLaVA

LLaVA (Large Language and Vision Assistant) is a vision-language model (VLM) that generates text responses conditioned on visual inputs. It builds on large language models by adding visual instruction tuning, which trains the model to follow instructions that refer to an image and to give contextually relevant answers. LLaVA combines a pre-trained vision encoder with a large language model and is fine-tuned on a diverse set of multimodal datasets so that it follows visual instructions reliably. This makes LLaVA well suited to tasks such as image captioning, visual question answering, and other applications that require joint understanding of visual and textual information. Key references: (Xu et al., 2015; Johnson et al., 2016; Lu et al., 2016; Vinyals et al., 2016; Anderson et al., 2017)
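The page does not spell out how the vision encoder and the language model are connected. One common design, and the one described in the LLaVA paper, is a learned linear projection that maps image patch features into the language model's embedding space, after which the projected "visual tokens" are placed in the same sequence as the text token embeddings. The following is a minimal sketch of that idea in plain Python. The sizes, the random values standing in for encoder outputs and learned weights, and the helper names `project` and `build_input_sequence` are all hypothetical and chosen only for illustration.

```python
import random


def project(patch_features, weight):
    """Linearly map each patch feature (length d_vis) to length d_llm.

    `weight` is a d_vis x d_llm matrix stored as a list of rows.
    """
    columns = list(zip(*weight))  # d_llm columns, each of length d_vis
    return [
        [sum(x * w for x, w in zip(patch, col)) for col in columns]
        for patch in patch_features
    ]


def build_input_sequence(patch_features, text_embeddings, weight):
    """Place projected image patches in front of the text token embeddings."""
    visual_tokens = project(patch_features, weight)
    return visual_tokens + text_embeddings


random.seed(0)
D_VIS, D_LLM = 4, 6            # toy feature sizes (hypothetical)
NUM_PATCHES, NUM_TEXT = 3, 5   # toy sequence lengths (hypothetical)

# Random stand-ins for vision-encoder outputs, text embeddings, and weights.
patch_features = [
    [random.gauss(0, 1) for _ in range(D_VIS)] for _ in range(NUM_PATCHES)
]
text_embeddings = [
    [random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(NUM_TEXT)
]
weight = [[random.gauss(0, 0.1) for _ in range(D_LLM)] for _ in range(D_VIS)]

sequence = build_input_sequence(patch_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # 8 6
```

The resulting sequence has one entry per image patch plus one per text token, all in the language model's embedding dimension, so the language model can attend over image and text positions together. In a real system the weights would be learned during training rather than sampled at random.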

References

Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., et al. (2017). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., et al. (2016). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.
Lu, J., Xiong, C., Parikh, D., Socher, R. (2016). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Vinyals, O., Toshev, A., Bengio, S., Erhan, D. (2016). Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
