Is RAG Dead? Why Long Context Windows Haven't Killed Retrieval Augmented Generation
LLMs now support million-token context windows, so teams are questioning whether RAG is still worth the infrastructure complexity. Here's a breakdown of the real trade-offs.
LLMs are stuck in time
Here's something easy to overlook: large language models are frozen at their training cutoff. An LLM knows nothing about what happened five minutes ago — and nothing about your private data, your internal wikis, or your proprietary codebase.
This creates the context injection problem: how do you get the right data into the model at the right time? Two pretty different approaches have emerged.
Approach 1: RAG
RAG is the engineering approach. You chunk your documents into smaller pieces, embed each chunk into vectors, store those vectors in a database, then at query time retrieve the most relevant chunks and inject them into the context window alongside the user's prompt.
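Concretely, the pipeline looks something like the sketch below. It's a minimal illustration, not a production system: the bag-of-words `embed()` function and the in-memory list standing in for a vector database are toy stand-ins for a real embedding model and a real store.

```python
# Minimal sketch of a RAG pipeline: chunk -> embed -> index -> retrieve -> prompt.
# embed() is a toy bag-of-words vector so the example runs on its own;
# a real system would call an embedding model and store vectors in a database.
from collections import Counter
import math

def chunk(text: str, size: int = 200) -> list[str]:
    """Fixed-size chunking by word count; sliding-window or recursive splitting are common alternatives."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. Swap in a real embedding model in practice."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = ["...your internal wikis, manuals, and docs go here..."]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]  # the "vector database"

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    scored = sorted(index, key=lambda pair: cosine(embed(query), pair[1]), reverse=True)
    return [c for c, _ in scored[:k]]

question = "How do I rotate the API keys?"
prompt = (
    "Answer using only this context:\n\n"
    + "\n---\n".join(retrieve(question))
    + f"\n\nQuestion: {question}"
)
# `prompt` is what actually gets sent to the LLM
```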
It works. But it rests on an assumption that often goes unexamined — that your retrieval actually found the right information.
Approach 2: Long context
Long context is brute force. Skip the database, skip the embedding model, dump documents directly into the context window and let the model figure it out.
For a long time this wasn't viable. Early LLMs had 4K token context windows — you couldn't fit a novel, let alone a corporate knowledge base. Today's models support 1M+ tokens. That's roughly 700,000 words, enough for the entire Lord of the Rings series plus The Hobbit.
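In code, the whole approach collapses to concatenation. The sketch below is illustrative only: the `docs/` directory and the question are made up, and the actual model call is left as a comment because it depends on whichever long-context provider you use.

```python
# Minimal sketch of the long-context approach: no chunking, no embeddings,
# no vector store. Just load everything and ship it with the question.
from pathlib import Path

# Assumed layout: a docs/ directory of markdown files (hypothetical).
documents = [p.read_text() for p in sorted(Path("docs").glob("*.md"))]

prompt = (
    "The full document set is below. Use all of it to answer the question.\n\n"
    + "\n\n=== DOCUMENT ===\n\n".join(documents)
    + "\n\nQuestion: Which security requirements were cut from the final release?"
)

# send `prompt` to a long-context model here; the entire corpus travels
# inside the context window on every single query
```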
So do we still need RAG?
The case for long context
The obvious win is infrastructure. A production RAG system has a lot of moving parts: a chunking strategy (fixed size? sliding window? recursive?), an embedding model, a vector database, a reranker, and sync logic to keep everything current. Long context removes all of it. The architecture becomes: get the data, send it to the model.
The less obvious win is failure modes. RAG has a quiet one — silent failure. Semantic search is probabilistic, and sometimes it just returns the wrong results. The answer existed in the data, the model never saw it, and nothing tells you that happened. With long context there's no retrieval step to go wrong.
The most interesting limitation, though, is a class of question RAG genuinely can't answer: questions about what's missing. Say you have product requirements in one document and release notes in another, and you ask "which security requirements got cut from the final release." RAG will surface snippets about security and requirements — but it can't retrieve the gap between them. That reasoning requires both documents in full, side by side. You can't chunk your way to that answer.
Key takeaway: For bounded datasets where you need to reason across the whole thing, long context is often simpler and more reliable.
The case for RAG
The cost argument is blunt. Putting a 500-page manual in a context window costs roughly 250K tokens — on every query. RAG pays that cost once at indexing time. At any real query volume, the difference adds up.
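A back-of-the-envelope calculation makes the gap concrete. The price, query volume, and chunk sizes below are assumptions chosen for illustration, not any provider's actual rates, and they ignore RAG's one-time embedding and hosting costs.

```python
# Rough per-month input-token cost comparison (all numbers are illustrative).
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed $/1K input tokens
MANUAL_TOKENS = 250_000             # the 500-page manual from above
QUERY_TOKENS = 200                  # question plus instructions
QUERIES_PER_MONTH = 10_000

# Long context: the whole manual rides along on every query.
long_context = (MANUAL_TOKENS + QUERY_TOKENS) * QUERIES_PER_MONTH / 1000 * PRICE_PER_1K_INPUT_TOKENS

# RAG: only a few retrieved chunks per query reach the model.
RETRIEVED_TOKENS = 2_000
rag = (RETRIEVED_TOKENS + QUERY_TOKENS) * QUERIES_PER_MONTH / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(f"long context: ${long_context:,.0f}/month")  # ~$7,500
print(f"RAG queries:  ${rag:,.0f}/month")           # ~$66
```

Even with assumptions that flatter long context, the recurring cost of resending the full document dwarfs the per-query cost of sending a few retrieved chunks.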
Less obvious is what happens to attention as context grows. The intuition is that if data is in the window, the model will use it. Past a few hundred thousand tokens, that breaks down. Ask a specific question about one paragraph buried in a 2,000-page document and the model often misses it — sometimes hallucinating from nearby text instead. RAG narrows the window to the relevant chunks and actually focuses the model on what matters.
And then raw scale just ends the debate. A million tokens sounds enormous until you remember that enterprise data lakes are measured in terabytes. No context window is touching that. If you need to query across an organization's full knowledge base, you need a retrieval layer — there's no getting around it.
Key takeaway: For large datasets, dynamic data, or high query volume, RAG is still the only option that fits.
So which should you use?
| Use case | Approach |
|---|---|
| Bounded dataset, global reasoning (legal analysis, book summarization) | Long context |
| Large enterprise knowledge base | RAG |
| Dynamic, frequently changing data | RAG |
| Small team, minimal infrastructure overhead | Long context |
Long context wins when the dataset fits in a window and the task requires reasoning across the whole thing. RAG wins when data is too large, changes often, or when token costs at scale matter.
Most serious systems end up using both — long context for high-stakes reasoning, RAG underneath as the retrieval layer that keeps the window manageable.
Based on "Is RAG Still Needed? Choosing the Best Approach for LLMs" by Martin Keen, IBM Technology.