RAG & Vector Search

Retrieval-augmented generation, vector databases, chunking

Advanced · 30 min

Prerequisites

  • large-language-models

LLMs have a problem: they make stuff up. Ask one about your company's refund policy and it'll confidently invent one. Ask about a news event from last week and it'll hallucinate details. That's because an LLM only knows what it learned during training, which ended months or years ago. On its own, it has no access to your private data, your database, or yesterday's headlines.

RAG addresses this. Retrieval-Augmented Generation gives the model actual documents to reference before answering, like giving a student an open-book exam instead of relying on memory. A retriever searches your knowledge base, pulls the most relevant chunks, and passes them to the model as context for generating the answer. The result is better grounded, easier to verify, and as current as your knowledge base.

This is the idea behind ChatGPT's web browsing, behind customer support bots that answer questions about your product docs, and behind search engines evolving beyond blue links. RAG is the bridge between LLMs and real-world knowledge, and it's one of the most practical patterns in production AI today.

Open-book AI

Imagine you're taking a history exam. In a closed-book exam, you can only use what you memorized. If you forgot a date or mixed up two events, tough luck. But in an open-book exam, you can look things up. You still need to understand the material well enough to find the right page and synthesize an answer, but you're not relying purely on memory.

That's RAG. Without RAG, an LLM is stuck in closed-book mode. It can only draw on patterns it learned during training. With RAG, it gets to consult external sources before answering. It retrieves relevant documents from a knowledge base, reads them, and then generates a response grounded in what it just read. The generation step is still done by the LLM, but now it has actual facts to work with.

RAG Pipeline

User Question ("What is your refund policy?") → Retrieve Docs (vector search) → Augment Prompt (add docs as context) → Generate Answer (LLM reads docs and responds)

The beauty of RAG is that the LLM never needs to memorize your company's refund policy or last quarter's earnings report. Those facts live in a database or a document store, and the model retrieves them on demand. This separation of knowledge (retrieval) from reasoning (generation) is what makes RAG so powerful. You can update your knowledge base without retraining the model, and the model can answer questions about information it has never seen before.
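The "augment prompt" stage of the pipeline can be sketched in a few lines. This is a minimal illustration, not a prescribed template: the function name `build_prompt`, the instruction wording, and the sample chunks are all assumptions made for the example, and the call to an actual LLM is left out.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble an augmented prompt: numbered context chunks, then the question.

    Hypothetical helper for illustration; real systems tune this wording.
    """
    # Number each chunk so the model can refer back to its sources.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Chunks that a retrieval step is assumed to have returned.
retrieved = [
    "Our company offers a 30-day money-back guarantee on all purchases.",
    "To return an item, contact support with your order number.",
]

prompt = build_prompt("What is your refund policy?", retrieved)
print(prompt)
```

Numbering the chunks is one simple way to let the model point at the passage it used, which helps when you want answers to carry citations.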

Building a RAG pipeline

A RAG system has two phases: indexing (offline) and retrieval (online). In the indexing phase, you break your documents into chunks, convert them to vectors, and store them in a database. In the retrieval phase, you convert the user's query to a vector, find the most similar chunks, and pass them to the LLM as context. Let's walk through each step.
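To make the two phases concrete, here is a rough sketch using only the standard library. The `embed` function is a toy stand-in (a bag-of-words count vector), not a real embedding model, and the in-memory `index` list stands in for a vector database; both are assumptions for illustration, as are the sample chunks.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Indexing phase (offline): embed every chunk once and store the vectors.
chunks = [
    "A refund is available within 30 days of purchase.",
    "Orders ship within 3-5 business days.",
    "We accept credit cards and PayPal.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


# Retrieval phase (online): embed the query and rank stored chunks by similarity.
def retrieve(query: str, k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


top = retrieve("How many days do I have to get a refund?", k=1)
print(top)  # ['A refund is available within 30 days of purchase.']
```

Because this toy embedding only matches exact words, it would miss a paraphrase such as "money back". Learned embedding models exist to close that gap, but the index-then-query structure stays the same.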

Chunking — Breaking documents into pieces

LLMs have limited context windows. You can't stuff an entire 500-page manual into the prompt. So you split documents into smaller chunks, typically 200-500 tokens each. Chunk size matters. Too large and the context gets diluted with irrelevant information. Too small and you lose the surrounding meaning.

  • Too Small: "Our refund policy is..." (lost context)
  • Just Right: full policy paragraph (complete thought)
  • Too Large: entire 50-page doc (diluted signal)

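A common trick is to add overlap between chunks. With 300-token chunks and a 50-token overlap, each new chunk starts 50 tokens before the previous one ends, so the last 50 tokens of one chunk are repeated at the start of the next. A sentence that straddles a chunk boundary then appears whole in at least one chunk, as long as it is shorter than the overlap. Overlap costs some extra storage but helps preserve continuity.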
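A sliding window is one straightforward way to implement fixed-size chunks with overlap. The sketch below works on a list of tokens. The name `chunk_tokens` and the synthetic tokens are illustrative assumptions, and a real pipeline would take its tokens from the tokenizer of its embedding model.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 300, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic 700-token document, used only to show the window boundaries.
tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens, chunk_size=300, overlap=50)

print([len(c) for c in chunks])            # [300, 300, 200]
print(chunks[0][-50:] == chunks[1][:50])   # True: the 50 shared tokens
```

Splitting on fixed token counts ignores sentence and paragraph boundaries. Many pipelines split on those boundaries first and fall back to a window like this only for oversized passages.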


Now try it yourself

Below is a RAG pipeline visualizer. We have 5 pre-indexed documents about a company's policies. Select a question, toggle RAG on or off, and run the pipeline. Watch each stage animate: embedding the query, searching for similar documents, retrieving the top-3 most relevant chunks, and generating an answer. Compare the grounded answer (with RAG) to the hallucinated answer (without RAG).

RAG Pipeline Visualizer

See how Retrieval-Augmented Generation retrieves relevant documents before answering. Toggle RAG on/off to compare grounded vs hallucinated answers.

Pre-indexed Knowledge Base (5 documents)

  • Refund Policy: Our company offers a 30-day money-back guarantee on all purchases. No questions asked.
  • Shipping Information: We ship worldwide within 3-5 business days. Express shipping is available for an additional fee.
  • Return Process: To return an item, contact support@company.com with your order number. We will send a prepaid label.
  • Payment Methods: We accept credit cards, PayPal, and Apple Pay. All transactions are encrypted and secure.
  • Warranty Coverage: All products come with a 1-year warranty against manufacturing defects. Extended warranties available.

AI connection: RAG combines retrieval and generation to give LLMs access to external knowledge. Without RAG, the model can only rely on what it learned during training, leading to hallucinations. With RAG, it retrieves real documents and cites them, producing grounded, verifiable answers. This is how modern AI assistants stay up-to-date and accurate.

Key Takeaways

  • RAG (Retrieval-Augmented Generation) gives LLMs access to external knowledge by retrieving relevant documents before generating an answer. This grounds the response in real data instead of relying on training memory.
  • Chunking strategy matters. Documents are split into 200-500 token chunks with overlap to preserve context. Chunk size is a tradeoff: too large dilutes relevance, too small loses meaning.
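  • Vector databases (Pinecone, Weaviate, Chroma) store document embeddings and enable fast similarity search. Given a query vector, they return the K most similar chunks, typically in milliseconds, usually by searching an approximate nearest-neighbor index rather than comparing against every stored vector.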
  • Hybrid search combines vector similarity (semantic) with keyword matching (BM25) to catch both conceptual and exact matches. Re-ranking with a cross-encoder further improves precision.
  • RAG separates knowledge (retrieval) from reasoning (generation). You can update your knowledge base without retraining the model, making RAG systems easier to maintain and scale.
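The hybrid search takeaway leaves open how the two result lists get merged. One widely used option is reciprocal rank fusion (RRF), which needs only each document's rank in each list. RRF is not named in this lesson, so treat the sketch below as one illustrative choice. The document IDs and both input rankings are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a vector search and a keyword (BM25-style) search.
vector_ranking = ["refund", "warranty", "returns"]
keyword_ranking = ["returns", "refund", "shipping"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)  # ['refund', 'returns', 'warranty', 'shipping']
```

Documents that both searches rank highly rise to the top, while a document found by only one search still makes the list. A cross-encoder re-ranker, if used, would then rescore this fused shortlist.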

Common Misconceptions

  • "RAG eliminates hallucinations completely." -- Not quite. RAG reduces hallucinations by providing grounded context, but the LLM can still misinterpret documents or fabricate details not present in the retrieved chunks. You still need guardrails and citations.
  • "Bigger chunk size is always better." -- Nope. Larger chunks mean more context, but also more noise. The LLM has to sift through irrelevant sentences to find the answer. Smaller, focused chunks often yield better retrieval precision.
  • "Vector search is always better than keyword search." -- Not true. Vector search is great for semantic similarity, but it can miss exact keyword matches or domain jargon. Hybrid search (vector + BM25) often outperforms either method alone.
  • "RAG works out of the box." -- Rarely. Production RAG systems require tuning: chunk size, overlap, embedding model choice, top-K value, re-ranking, metadata filtering, and prompt engineering. What works for FAQs may not work for legal docs.

Quick check

Why does chunking strategy (chunk size and overlap) matter in a RAG system?