Build Your First RAG App: A Practical Python Tutorial

Last updated: June 2026

Quick answer

To build your first RAG app, you need five steps in Python: load your documents, chunk them into smaller passages, embed each chunk and store the vectors in a database like Chroma, retrieve the closest chunks at query time, and pass them to Claude as context to generate the answer. The full minimum viable RAG app fits in about 80 lines of Python. The hard part is not the code. It is choosing what to retrieve.

TL;DR

RAG is five steps: load, chunk, embed, retrieve, generate. Everything else is tuning.
A working RAG app in Python fits in 80 lines using Chroma for storage and the Anthropic SDK for generation with claude-opus-4-7.
The bottleneck is retrieval quality, not the LLM. If your RAG answers are bad, your chunks are wrong before your prompt is wrong.

Who this is for

This article is for working developers, analysts, and product managers who have written some Python and want to build their first retrieval-augmented generation system end-to-end. If you are an engineer adding AI to an existing job, a PM building an internal prototype, or a career changer working through the LLM stack, this tutorial is for you.

If you do not yet know what RAG is conceptually, start with "What is RAG, explained in plain English" first, then come back here to build it.

What you will build

You will build a Python RAG app that answers questions about a small set of internal documents. Picture a folder of company markdown files: HR policies, engineering runbooks, product specs. You want a single command-line tool where you ask "what is our remote work policy" and Claude answers using only your documents, with the source passages cited.

This is the exact use case most of my LLM Engineering students start with. The LLM Engineering course has 5 BIG portfolio-ready projects, and a private-document RAG is usually the first one.

The stack:

Python 3.11+ as the runtime
Anthropic SDK (anthropic>=0.39) for generation with claude-opus-4-7 (embeddings come from Chroma's default embedding function, since Anthropic does not ship an embedding model)
Chroma (chromadb) as the local vector store
A few markdown files as the corpus

You can swap Chroma for Qdrant, Pinecone, or Postgres with pgvector later. For your first RAG app, run-it-on-your-laptop is the right choice.

Step 1: Set up the project

Open a fresh folder and install the dependencies. I am using uv here, but pip works the same way.

uv init rag-app
cd rag-app
uv add anthropic chromadb

Drop a few .md files into a docs/ subfolder. Anything will do. For this tutorial, imagine you have three files: remote-work-policy.md, vacation-policy.md, expense-policy.md.

Set your Anthropic key in the environment:

export ANTHROPIC_API_KEY="sk-ant-..."

(On Windows PowerShell use $env:ANTHROPIC_API_KEY = "sk-ant-...".)

Step 2: Load the documents

The "load" step is the most boring and the most underrated. Garbage in, garbage out. For markdown files, plain text reads are fine. For PDFs, use a library like pypdf. For HTML, strip the markup first.

# load.py
from pathlib import Path

def load_documents(folder: str) -> list[dict]:
    """Read every .md file in folder and return source-tagged records."""
    docs = []
    for path in Path(folder).glob("*.md"):
        docs.append({"source": path.name, "text": path.read_text(encoding="utf-8")})
    return docs

if __name__ == "__main__":
    docs = load_documents("docs")
    print(f"Loaded {len(docs)} documents")

Each document carries its filename as source. That source tag is what lets your final answer cite where it came from. Always keep it.
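Step 2 mentions stripping the markup before you load HTML. Here is a minimal sketch of that idea using only the standard library's html.parser. The names html_to_text and _TextExtractor are hypothetical helpers, not part of the tutorial's files, and real pages usually need more care (navigation menus, tables, boilerplate footers).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Join the fragments, then collapse runs of whitespace.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<h1>Remote work</h1><script>var x = 1;</script><p>Two days on site.</p>"
    print(html_to_text(page))  # Remote work Two days on site.
```

You would call html_to_text on each file's contents inside load_documents and keep the same source tag.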

Step 3: Chunk the text

LLMs work best with focused passages. If you embed an entire 5,000-word policy as one vector, that vector blurs every topic in the document together and retrieval cannot pinpoint the passage that answers the question. Break each document into chunks of roughly 500 to 800 characters with about 100 characters of overlap.

# chunk.py
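def chunk_text(text: str, size: int = 700, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly 'size' characters."""
    if not 0 <= overlap < size:
        # overlap >= size would stop the window from advancing (infinite loop)
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # last window reached; skip a trailing chunk that only repeats the overlap
        start += size - overlap  # step forward, keeping 'overlap' characters of shared context
    return chunks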

This is the simplest possible chunker. It splits by character count and overlaps a bit so that ideas that span a chunk boundary still get captured in at least one window. For production you would chunk by paragraph or sentence boundary, but for your first RAG app this works.
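To make the paragraph-boundary idea concrete, here is a minimal sketch of a chunker that packs whole paragraphs into chunks up to a size limit. The name chunk_by_paragraph and the max_chars default are my own assumptions for illustration. It treats a blank line as a paragraph break, adds no overlap, and keeps an oversized paragraph whole rather than splitting it.

```python
def chunk_by_paragraph(text: str, max_chars: int = 700) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Remote work is allowed.\n\nTwo days on site.\n\nExpenses are filed monthly."
    for piece in chunk_by_paragraph(sample, max_chars=45):
        print(repr(piece))
```

Because it returns a list of strings just like chunk_text, you can swap it into the combine loop without touching anything else.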

Combine load and chunk:

from load import load_documents
from chunk import chunk_text

records = []
for doc in load_documents("docs"):
    for i, piece in enumerate(chunk_text(doc["text"])):
        records.append({
            "id": f"{doc['source']}::chunk-{i}",
            "source": doc["source"],
            "text": piece,
        })

print(f"Created {len(records)} chunks")

Step 4: Embed and store the chunks

An embedding is a numerical vector that captures the meaning of a passage. Two passages with similar meaning end up near each other in vector space. That is the whole magic of retrieval.
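To see what "near each other" means, here is a small sketch that computes cosine similarity, one common closeness measure, between toy vectors. The three-dimensional vectors are made up for illustration. Real embeddings have hundreds of dimensions, and your vector store may use a different distance metric by default.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
remote_policy = [0.9, 0.1, 0.0]
work_from_home = [0.8, 0.2, 0.1]
expense_report = [0.0, 0.2, 0.9]

if __name__ == "__main__":
    # The two "remote work" vectors score much higher than the unrelated pair.
    print(cosine_similarity(remote_policy, work_from_home))
    print(cosine_similarity(remote_policy, expense_report))
```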

We will use Chroma's built-in default embedding function so your first app does not need a separate embedding API. Once you understand the pipeline, you can swap in OpenAI embeddings or Voyage AI embeddings (which the Anthropic docs recommend for production Claude RAG).

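# index.py
import chromadb

from load import load_documents
from chunk import chunk_text

# Rebuild the chunk records from Step 3 so this script runs on its own.
records = [
    {"id": f"{doc['source']}::chunk-{i}", "source": doc["source"], "text": piece}
    for doc in load_documents("docs")
    for i, piece in enumerate(chunk_text(doc["text"]))
]

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="company_docs")

# upsert (rather than add) overwrites chunks whose ids already exist,
# so re-running after a document edit refreshes the stored text.
collection.upsert(
    ids=[r["id"] for r in records],
    documents=[r["text"] for r in records],
    metadatas=[{"source": r["source"]} for r in records],
)

print(f"Indexed {collection.count()} chunks")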

Chroma handles the embedding under the hood when you pass raw text. The data persists to ./chroma_db, so you only run this once per document change.
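If you want a script to decide when the documents have changed, one option is to fingerprint the docs folder and compare it with the fingerprint saved at the last indexing run. This is only a sketch under my own assumptions: the helper names and the .index_state.json state file are hypothetical, and it has nothing to do with Chroma's own API.

```python
import hashlib
import json
from pathlib import Path


def corpus_fingerprint(folder: str) -> str:
    """Hash the names and contents of every .md file in folder."""
    digest = hashlib.sha256()
    for path in sorted(Path(folder).glob("*.md")):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_reindex(folder: str, state_file: str = ".index_state.json") -> bool:
    """True if the corpus differs from the last recorded fingerprint."""
    state_path = Path(state_file)
    if not state_path.exists():
        return True
    saved = json.loads(state_path.read_text(encoding="utf-8"))
    return saved.get("fingerprint") != corpus_fingerprint(folder)


def mark_indexed(folder: str, state_file: str = ".index_state.json") -> None:
    """Record the current fingerprint after a successful indexing run."""
    state = {"fingerprint": corpus_fingerprint(folder)}
    Path(state_file).write_text(json.dumps(state), encoding="utf-8")
```

A typical flow would be: check needs_reindex("docs"), run the indexing script if it returns True, then call mark_indexed("docs").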

Step 5: Retrieve and generate

This is the part where most people overcomplicate things. The retrieval step is just a similarity search. The generation step is just an Anthropic API call with the retrieved passages stuffed into the user message, while the system prompt carries the rules.

# ask.py
import os
import chromadb
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_collection(name="company_docs")

SYSTEM_PROMPT = """You answer questions using only the provided context.
If the context does not contain the answer, say so clearly.
Always cite the source filename for any claim you make."""

def ask(question: str, k: int = 4) -> str:
    """Retrieve top-k chunks and ask Claude to answer using them."""
    results = collection.query(query_texts=[question], n_results=k)
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]

    context = "\n\n".join(
        f"[Source: {src}]\n{txt}" for src, txt in zip(sources, chunks)
    )

    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Context:\n\n{context}\n\nQuestion: {question}",
            }
        ],
    )
    return message.content[0].text

if __name__ == "__main__":
    print(ask("What is our remote work policy?"))

That is your first RAG app. Run python ask.py and Claude will answer from the chunks Chroma returned. The system prompt instructs it to cite the source filename and to say so clearly when the context lacks the answer. Both behaviors are non-negotiable for trustworthy RAG.

How do you test that your RAG app actually works?

A working RAG app should fail on questions outside the corpus. That is not a bug. That is the feature.

Try three kinds of question:

Inside the corpus and specific. "What is our vacation policy for new hires?" Claude should answer correctly and cite vacation-policy.md.
Inside the corpus but vague. "What does our company think about flexibility?" Claude should synthesize across the relevant chunks.
Outside the corpus. "What is the capital of France?" Claude should say it does not have that information in the provided context.

If the third question returns "Paris," your system prompt is being ignored. Tighten it.
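The specific and outside-the-corpus checks can be sketched as a tiny smoke test that takes any ask function. The refusal phrases in REFUSAL_HINTS are my assumptions about how a refusal might be worded. Claude's actual phrasing will vary, so treat a failing check as a prompt to read the answer, not as proof of a bug. The vague question needs a human read, so it is left out.

```python
from typing import Callable

# Assumed phrasings of a refusal; adjust after reading real outputs.
REFUSAL_HINTS = ("do not have", "does not contain", "not in the provided context")


def smoke_test(ask: Callable[[str], str]) -> dict[str, bool]:
    """Run the specific and outside-the-corpus checks against ask()."""
    specific = ask("What is our vacation policy for new hires?")
    outside = ask("What is the capital of France?").lower()
    refused = any(hint in outside for hint in REFUSAL_HINTS)
    return {
        "cites_source": "vacation-policy.md" in specific,
        "refuses_outside": refused and "paris" not in outside,
    }


if __name__ == "__main__":
    # Stand-in for the real ask() from ask.py, so this file runs offline.
    def fake_ask(question: str) -> str:
        if "vacation" in question:
            return "New hires accrue vacation from day one. [Source: vacation-policy.md]"
        return "The provided context does not contain that information."

    print(smoke_test(fake_ask))
```

To run it against the real app, replace fake_ask with the ask function imported from ask.py.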

Where the LLM Engineering course goes from here

This article covers the minimum viable RAG. In the LLM Engineering course we extend this exact same pipeline across 5 BIG portfolio-ready projects:

A private-document RAG over a real internal corpus with source citation
A RAG with hybrid retrieval (keyword plus vector) for better recall
A RAG with re-ranking using a cross-encoder model
An agentic RAG where Claude decides when to retrieve versus when to answer directly
A production RAG with evaluation, logging, and a real user interface

Each project ships to GitHub with a clean README. By the end you have a portfolio that proves you can ship LLM systems, not just prompt them. That distinction matters a lot in the 2026 job market.

What does a real RAG look like in production?

The 80-line app above is real, but production RAG has more moving parts. The ones that matter most, in order of return on effort:

Better chunking. Chunk by paragraph or sentence, not character count. Use a library like langchain_text_splitters or write a 30-line splitter that respects markdown headings.
Better embeddings. Voyage AI's voyage-3 is what Anthropic recommends for Claude RAG in production. The free tier is enough to evaluate.
Re-ranking. Retrieve 20 chunks, then re-rank them with a cross-encoder, then keep the top 4 for the LLM. This single change usually gives the biggest accuracy jump.
Citation in the answer. Force Claude to quote the chunk it used. This makes hallucinations easy to spot.
Evaluation. Build a 50-question test set with known answers. Run it after every change. Without this, you are guessing.

These are the upgrades my students work through after they have shipped the minimum viable version.
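The re-ranking upgrade has a simple shape: retrieve a wide candidate set, score each candidate against the question, keep the best few. Here is a sketch of that shape. The word_overlap_score function is a deliberately crude stand-in of my own so the example runs without any model. In a real pipeline you would pass a cross-encoder's scoring function instead.

```python
from typing import Callable


def rerank(
    question: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 4,
) -> list[str]:
    """Sort chunks by score_fn(question, chunk), best first, and keep the top few."""
    ranked = sorted(chunks, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[:keep]


def word_overlap_score(question: str, chunk: str) -> float:
    """Toy scorer: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    candidates = [
        "Expenses must be filed monthly.",
        "Remote work is allowed two days per week.",
        "Vacation accrues from day one.",
    ]
    print(rerank("remote work policy", candidates, word_overlap_score, keep=1))
```

In ask.py this would sit between collection.query (with n_results=20) and the step that builds the context string.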

Common mistakes I see

Chunks too big or too small. Chunks of 2,000+ characters dilute the embedding. Chunks of 100 characters lose context. Start at 500 to 800 characters with overlap, then tune based on your corpus.
No source citation in the answer. Without "Source: filename.md" in the output, you cannot verify the answer, and you cannot trust it. Force it in the system prompt.
Skipping evaluation. People tweak chunk size, embedding model, top-k, and prompt without measuring. After three changes they have no idea which one helped. Build a small test set on day one.
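A day-one test set does not need a framework. Here is a minimal sketch, assuming each test case is a question plus a short phrase the correct answer must contain. The evaluate function and the sample cases are hypothetical, and substring matching is a blunt metric, but it is enough to tell you whether a change helped or hurt.

```python
from typing import Callable


def evaluate(ask: Callable[[str], str], test_set: list[dict[str, str]]) -> float:
    """Return the fraction of cases whose answer contains the expected phrase."""
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        answer = ask(case["question"]).lower()
        if case["expected"].lower() in answer:
            hits += 1
    return hits / len(test_set)


if __name__ == "__main__":
    # Made-up cases; write yours from facts you have verified in your own docs.
    cases = [
        {"question": "How many vacation days do new hires get?", "expected": "15 days"},
        {"question": "How many days per week can I work remotely?", "expected": "two days"},
    ]

    # Stand-in for the real ask() so the example runs offline.
    def fake_ask(question: str) -> str:
        return "New hires get 15 days. [Source: vacation-policy.md]"

    print(f"Accuracy: {evaluate(fake_ask, cases):.0%}")
```

Run it before and after every change to chunk size, top-k, or the prompt, and write the score down.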

What to do next

Pick the path that matches your situation.

If you have not run the code above yet, do that first. Copy each step into a file, point it at three real markdown documents, and get the first answer back. That single completed loop is worth ten more hours of reading.

If your RAG works but the answers are inconsistent, the issue is almost always retrieval, not the LLM. Print the retrieved chunks before the API call and read them. You will see immediately whether the right context is being pulled.
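Printing the retrieved chunks is easier to read with a little formatting. The sketch below assumes the result dictionary has the shape used in Step 5 (lists nested one level per query under "documents" and "metadatas"), plus an optional "distances" list. The function name format_retrieval is hypothetical, and the sample data is invented so the example runs without a database.

```python
def format_retrieval(results: dict) -> str:
    """Render one query's retrieved chunks as a ranked, readable list."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    # Distances may be absent; fall back to placeholders of the same length.
    dists = (results.get("distances") or [[None] * len(docs)])[0]
    lines = []
    for rank, (doc, meta, dist) in enumerate(zip(docs, metas, dists), start=1):
        score = f"{dist:.3f}" if dist is not None else "n/a"
        preview = " ".join(doc.split())[:80]  # collapse whitespace, truncate
        lines.append(f"{rank}. [{meta['source']}] distance={score} {preview}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample in the same shape as a single-question query result.
    sample = {
        "documents": [["Remote work is allowed\ntwo days per week.", "Expenses are filed monthly."]],
        "metadatas": [[{"source": "remote-work-policy.md"}, {"source": "expense-policy.md"}]],
        "distances": [[0.21, 0.87]],
    }
    print(format_retrieval(sample))
```

Inside ask(), you would call print(format_retrieval(results)) right after collection.query.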

If you want a structured path from here to production RAG systems, that is exactly what the LLM Engineering course is built for. Book a free 15-minute Discovery Call and we will map your starting point.

Frequently Asked Questions

Do I need a paid vector database for my first RAG app?

No. Chroma runs locally with zero infrastructure and is fine for tens of thousands of chunks. Move to Qdrant or Pinecone only when you have a clear reason (multi-tenant filtering, very large corpora, hosted SLAs). For learning, local Chroma is the right call.

Which embedding model should I use with Claude?

Anthropic does not ship an embedding model. They recommend Voyage AI's voyage-3 for production RAG with Claude. For your first app, Chroma's default embedding (a small sentence-transformer) is fine and free. Upgrade to Voyage when accuracy matters.

Can I use this with private company data?

Yes, with care. Run Chroma locally so your full corpus and index stay on your machine. When you call the Anthropic API, the retrieved chunks plus the question do travel over the network, but not your entire corpus. For regulated data, check your company's data policy and Anthropic's enterprise terms first.

How big can the corpus get before I need to upgrade?

Local Chroma handles 100,000+ chunks comfortably on a laptop. The bottleneck is usually retrieval quality at scale, not Chroma's performance. Once you cross a few hundred thousand chunks or need real-time concurrent users, move to a hosted vector database.

Is RAG better than fine-tuning Claude?

For almost every use case in 2026, yes. RAG is cheaper, faster to update, easier to debug, and does not require training infrastructure. Fine-tuning makes sense when you need a specific tone or style at scale, not when you need the model to know your documents. Update RAG documents in seconds. Re-fine-tuning a model takes hours and burns money.

Why does my RAG answer questions outside my documents?

Your system prompt is too permissive, or Claude is using its parametric knowledge. The fix is a strict system prompt: "Answer only from the provided context. If the context lacks the answer, say so." Plus a low temperature on generation. The example in Step 5 includes the right constraint, but production prompts can be even stricter.

Ready to move from reading to building?

If you are serious about LLM engineering, stop consuming content and start working with a tutor who will hold you accountable through the 5 portfolio-ready projects in the LLM Engineering course. Book a free 15-minute Discovery Call. No pitch, just a conversation about your goals.


Written by AI Tutor Code, private 1-on-1 online tutoring for professionals learning Python, AI, and modern ML tools. 200+ students taught. 3,000+ hours of private tutoring delivered. 4.9/5 average rating.