What most people call AI is about to change. We will move from today's conversational agents, which help with daily tasks such as
- coding,
- information retrieval on public or private data,
- knowledge distillation
to tasks where agents perform decision making and long-term planning by consuming data streams, including real-time multimodal percepts (audio, video, formatted data). Figure 1 provides a good overview (LangChain 2024) of the current state of affairs in Retrieval Augmented Generation (RAG) built on top of autoregressive Large Language Models (LLMs). Some of the components shown are in widespread use, while others are under active R&D.
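At its core, the retrieve-then-generate loop behind such RAG systems is simple: rank stored documents against the query, then prepend the top hits to the prompt before it reaches the LLM. The following toy sketch illustrates that loop; the term-overlap retriever, the sample documents, and the prompt template are all hypothetical stand-ins (production systems use dense vector embeddings and an actual LLM call, which is omitted here).

```python
# Toy sketch of the retrieve-then-generate loop in a RAG system.
# All names and data below are illustrative, not from any real framework.
from collections import Counter

DOCS = [
    "The quarterly report shows revenue grew 12 percent.",
    "Employee onboarding requires a signed NDA and a laptop request.",
    "The incident postmortem recommends adding rate limiting.",
]

def score(query: str, doc: str) -> int:
    """Count overlapping terms between query and document (toy retriever;
    real systems rank by embedding similarity instead)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the top-k documents by term overlap with the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Augment the query with retrieved context; in a full pipeline this
    string would be sent to an autoregressive LLM for generation."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

query = "What does the incident postmortem recommend?"
prompt = build_prompt(query, retrieve("incident postmortem recommendation", DOCS))
```

Note that the loop only augments the prompt with retrieved facts; nothing in it plans, acts, or carries a decision forward, which is precisely the gap discussed below.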
Despite the excitement around enterprise use of such RAG systems, there is a plethora of use cases for which this approach is of limited value, for a simple but important reason: in many mission-critical industries, decision making (the acting) is an essential part of the problem statement, and it is currently absent as an output. In other words, not only is the reasoning of autoregressive LLMs delegated to external agents in a mixture-of-experts architecture, but these models are incapable of the sequential planning required to reach any non-trivial decision.
In addition, although OpenAI and others have released conversational agents that offer two modes of interaction as part of their smartphone apps, their utility is limited to Automatic Speech Recognition (ASR), i.e., a speech-to-text interface that interrogates an autoregressive LLM. This makes them unusable in multi-party, real-time conversations between workers and agents where the end goal, beyond information retrieval, is a decision or conclusion. For many industries, internal processes provide the context for decision making: decisions are not made in a vacuum but build on earlier decisions and on the composition of the process. Such contextualization is not possible in today's RAG-driven systems, largely because the information is stored in data lakes while the decision or conclusion is often verbal or implicit.
- About the author
- Pantelis Monogioudis is the CEO of Aegean AI Inc, an AI solutions provider specializing in multimodal perception, a professor in the Department of Data Science at NJIT, and an adjunct at the Computer Science Department at NYU. His research covers a wide range of topics at the intersection of computer vision and information and communications theory. A Bell Labs alumnus, he has decades of R&D experience in machine learning and its intersection with mission-critical industries such as wireless communications and security.
- Copyright and licence
- © 2024 Pantelis Monogioudis
Text, code, and figures are licensed under a Creative Commons Attribution 4.0 International licence (CC BY 4.0), except where otherwise noted. Thumbnail photo by Marc Sendra Martorell on Unsplash.