
The Reality of RAG

Finney

Product

Apr 11, 2025

If you're interested in the AI space, you've probably heard about Retrieval Augmented Generation (RAG). Like many others, our team was initially captivated by its potential. This is our story of how we came to understand both its strengths and limitations.

Before we jump in, here's a mini lesson on RAG:

Imagine two students taking an exam. The first student – let's call this a regular LLM – relies entirely on what they've memorized. They might be brilliant, but they're limited to their general knowledge and might confidently write answers that sound right but contain inaccuracies.

The second student – this is our RAG-powered LLM – is taking an open-book test. When a question comes up:

  • They quickly flip through their textbooks to find the relevant chapters and passages
  • They place these helpful references right next to their answer sheet
  • They then craft their response using both their general understanding and these specific references

It's like the difference between a student who had to memorize everything versus one who can consult specific sources during the test. The RAG student will generally provide more accurate, verifiable answers – especially on specialized topics. The main limitation is that, just like a real student, the AI can only review a limited amount of textbook material at once (this is called the "context window").
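The open-book loop above can be sketched in a few lines of Python. Everything here is illustrative: the keyword-overlap scorer stands in for real vector-embedding similarity, the tiny corpus stands in for a document store, and in a real system the final prompt would be sent to an LLM rather than printed.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set -- a toy stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))

def score(query: str, chunk: str) -> int:
    """Toy relevance score: how many words the query and chunk share."""
    return len(tokens(query) & tokens(chunk))

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Step 1: flip through the 'textbook' and keep the best passages."""
    ranked = sorted(corpus, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Step 2: place the references right next to the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The kickoff meeting is scheduled for Monday at 10am.",
    "Invoices are sent on the first of each month.",
    "The staging server redeploys automatically on every merge.",
]
query = "When is the kickoff meeting?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)  # step 3 would hand this prompt to the LLM
```

The quality of the whole pipeline hinges on that `retrieve` step, which is exactly where our troubles began.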

Our Initial Excitement

When we first encountered RAG, it seemed like the perfect solution for our AI needs. The concept was elegant: combine the power of large language models with our own internal data to create more accurate, context-aware AI responses.

The Real-World Test: Aloa Manage AI Assistant

We decided to put RAG to the test with an ambitious project: building an AI chatbot that would integrate with Aloa Manage (our internal project management tool) and Slack data. The goal was to create a powerful assistant that would help both our clients and developers navigate project information effortlessly.

Aloa Assistant in Manage (Project management tool)

The Unexpected Challenges

Accuracy and Context Window Limitations

Our first major hurdle came in the form of accuracy issues. The RAG system often retrieved irrelevant data, which quickly consumed the valuable context window of our language model. Think of our RAG student frantically grabbing random textbook pages during the test – some completely off-topic, others only vaguely relevant. With limited desk space, these unhelpful references crowd out the good stuff. So while our assistant had access to our internal data, it was often looking at the wrong parts, leading to responses that missed the mark despite having "sources."
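To make the crowding concrete, here's a toy sketch (not our production code) of packing a fixed token budget. The chunks, their scores, and the four-characters-per-token estimate are all made up; the point is that without a relevance cutoff, marginal chunks spend budget that genuinely relevant ones need.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (~4 chars per token), not a real tokenizer."""
    return max(1, len(text) // 4)

def pack_context(scored_chunks, budget_tokens, min_score=0.0):
    """Greedily pack the highest-scoring chunks that fit the budget,
    skipping anything below the relevance cutoff."""
    packed, used = [], 0
    for score, text in sorted(scored_chunks, reverse=True):
        if score < min_score:
            continue  # with min_score=0.0 this never triggers
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
    return packed

chunks = [
    (0.91, "Client asked to move the demo to Thursday."),
    (0.32, "Lunch options near the office were shared in #random."),
    (0.28, "Reminder: update your Slack profile photo."),
    (0.89, "Demo environment password rotated this morning."),
]

# No cutoff: the lunch chunk squeezes in and crowds out the rest.
print(pack_context(chunks, budget_tokens=40))
# With a cutoff, only genuinely relevant chunks spend the budget.
print(pack_context(chunks, budget_tokens=40, min_score=0.6))
```

A score threshold helps, but picking it is its own tuning problem: too low and junk gets in, too high and real answers get filtered out.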

The Debugging Dilemma

When things went wrong (and they did), we found ourselves staring at what felt like a black box. Our system would confidently provide an answer, but when it was incorrect, we had no clear way to understand why. Was it retrieving the wrong documents? Misinterpreting good documents? We couldn't easily peer into its thought process. It's like our student not being able to explain how they arrived at their answer beyond "I used these pages." Without visibility into the system's reasoning, improving performance meant starting from scratch rather than making targeted adjustments.

Scaling Issues with Complex Queries

Simple questions like "When is the next client meeting?" worked fine. But ask something like "What's the overall status of the project and what should we prioritize next?" and things fell apart. One major issue was that vector retrieval pulls disconnected bits and pieces without maintaining any linear data flow. The system grabs relevant-seeming chunks from various sources but loses the coherency between them. This made it extremely difficult for the generative LLM to piece together a coherent narrative when attempting to answer complex questions. Our student might find all the right pages, but with no understanding of how they connect or which should come first. The result was responses that contained correct individual facts but lacked the logical structure and flow needed to truly answer multi-faceted questions.
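One common mitigation, sketched below with made-up data, is to re-sort the top-k chunks back into source order before prompting, so the model at least sees each document's passages in their original sequence rather than shuffled by similarity score. It helps with flow within a document, though it can't restore connections across documents.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc: str      # which source document the chunk came from
    pos: int      # the chunk's position within that document
    score: float  # similarity score assigned at retrieval time
    text: str

# Illustrative retrieval results for a "project status" question.
retrieved = [
    Chunk("status-report", 3, 0.92, "Remaining work: payment integration and QA."),
    Chunk("status-report", 1, 0.88, "Milestones 1 and 2 shipped on schedule."),
    Chunk("slack-thread", 7, 0.85, "Client flagged QA as the top priority."),
    Chunk("status-report", 2, 0.81, "Milestone 3 slipped by one week."),
]

# Score order (what vector retrieval gives you): facts arrive shuffled.
score_order = [c.text for c in sorted(retrieved, key=lambda c: c.score, reverse=True)]

# Source order: group by document, then restore each document's own flow.
source_order = [c.text for c in sorted(retrieved, key=lambda c: (c.doc, c.pos))]
print(source_order)
```

In score order the report's conclusion arrives before its setup; in source order each document reads front to back.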

The Important Lesson

Our experience taught us a valuable distinction: RAG isn't a one-size-fits-all solution. It excels in specific scenarios, particularly:

  • Quick chatbot implementations that rely on custom external data
  • Systems heavily reliant on semantic search
  • Simple question-answering tasks

However, for more complex applications like:

  • AI recommendation systems
  • True AI-powered assistants
  • Systems requiring deep knowledge and multi-step reasoning

a more sophisticated approach is needed. We ended up transitioning to a multi-layered AI workflow that separates reasoning from retrieval, allowing our system to handle complex decision-making with connected rather than fragmented information.
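As a rough sketch of what "separating reasoning from retrieval" can look like, the layered flow below first plans sub-queries, then retrieves each one narrowly, then synthesizes the findings. The planner, the in-memory knowledge dict, and every function name here are simplified stand-ins for LLM calls and real data sources, not the production system.

```python
def plan(question: str) -> list[str]:
    """Reasoning layer: break a complex question into ordered sub-queries.
    In practice this would itself be an LLM call."""
    return [
        "What milestones are complete?",
        "What work remains?",
        "What has the client flagged as urgent?",
    ]

def retrieve(sub_query: str) -> str:
    """Retrieval layer: answer one narrow sub-query at a time,
    so each lookup stays focused instead of grabbing everything at once."""
    knowledge = {
        "What milestones are complete?": "Milestones 1 and 2 shipped.",
        "What work remains?": "Payment integration and QA remain.",
        "What has the client flagged as urgent?": "Client flagged QA as urgent.",
    }
    return knowledge.get(sub_query, "No data found.")

def synthesize(question: str, findings: list[str]) -> str:
    """Generation layer: compose the connected findings into one answer.
    In practice, another LLM call with the findings in plan order."""
    body = "\n".join(f"{i + 1}. {f}" for i, f in enumerate(findings))
    return f"{question}\n{body}"

question = "What's the project status and what should we prioritize next?"
answer = synthesize(question, [retrieve(q) for q in plan(question)])
print(answer)
```

Because the plan imposes an order before anything is retrieved, the findings arrive connected: the answer reads as a narrative instead of a pile of disconnected facts.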

Aloa assistant pulling sources from Slack and Manage using RAG

👉 Want to learn more? Check out more articles and blogs from the Aloa Team.