Baseline RAG

Why we start dumb

The whole point of this track is that you ship a RAG system better than the one you build in Module 1. To know when you have beaten it, you need Module 1 to exist, run, and be measured. That is the only reason this module exists.

We are going to build the stupidest RAG that still answers questions. Nothing fancy. Fixed-size chunks. One embedding model. Cosine top-k. Stuff everything into a Claude prompt and ask it to cite sources. No reranker. No query rewriter. No evaluation harness. No caching. Nothing that would make a good demo.

You will watch this RAG give a confidently wrong answer before the hour is out. That is the lesson. Every later module is motivated by one of the failures you see here. Skip the baseline and you will spend the rest of the track adding techniques without knowing which problem each one solves.

The four moves

Every RAG, no matter how sophisticated, does four things: ingest, retrieve, generate, and log. Ingest covers both chunking and embedding, so it gets two subsections below. Module 1 does each one in its simplest form.

Ingest

Read files off disk. For Module 1, we only handle .md and .txt. PDFs, DOCX, HTML — all more annoying, all deferred to Module 2. The question at ingest time is: how do you cut the documents into pieces?

We cut on character count. 1800 characters per chunk, 200 characters of overlap between neighbours. Why character count and not tokens? Because token-aware chunking needs a tokenizer that matches your embedding model, and Module 1 would rather not take that dependency yet. Why 1800 and 200? Because they work. They are also wrong for roughly a third of the questions you will ask. Module 2 is where you meet semantic and hierarchical chunking and see what fixed-size actually costs you.
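A minimal sketch of the fixed-size chunking described above. The function name chunk_text and its signature are assumptions for illustration; the real chunker in app/rag/chunk.py may be shaped differently.

```python
def chunk_text(text: str, size: int = 1800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many characters after the previous one
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Note what the chunker does not look at: sentences, headings, code blocks. A window boundary can land in the middle of any of them.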

Each chunk gets an id of the form source/slug#n. The id matters because the model is going to be asked to cite sources by id, and the ship gate grades on whether the expected source id appears among the top-k sources you return.
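One way to build ids of that form, as a sketch. The slug rule here (lowercased file stem, runs of other characters collapsed to hyphens) and the name make_chunk_id are assumptions; check the reference code for the rule it actually uses.

```python
import re
from pathlib import Path


def make_chunk_id(source: str, path: str, n: int) -> str:
    """Build a 'source/slug#n' id. The slug rule is an assumption, not the reference's."""
    stem = Path(path).stem.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", stem).strip("-")
    return f"{source}/{slug}#{n}"
```

Whatever rule you pick, keep it deterministic. Re-ingesting the same file should produce the same ids, or upserts turn into duplicates.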

Embed

For each chunk, call an embedding model and get back a vector. Module 1 defaults to Voyage's voyage-3. OpenAI's text-embedding-3-small is the fallback if you only have an OpenAI key. Which one is better? It depends on your corpus. Module 2 takes up that question: you will compare them against each other on your own documents and see the delta.

For now: one embedding, one provider, one corpus. Write the vectors and the chunk text into Chroma, keyed by chunk id. Chroma runs in its own container in the compose stack and persists to a volume, so you can ingest once and keep querying.
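The write step can be sketched as a batching loop. This assumes the collection exposes upsert(ids=..., embeddings=..., documents=...), which is the keyword shape of a Chroma collection's upsert as far as we know. The embed callable, index_chunks, and InMemoryCollection are hypothetical names; the in-memory class is only a stand-in so the sketch runs without a Chroma container.

```python
from typing import Callable, Protocol


class VectorCollection(Protocol):
    """The one method this sketch needs from the store."""

    def upsert(self, *, ids: list[str], embeddings: list[list[float]],
               documents: list[str]) -> None: ...


def index_chunks(collection: VectorCollection,
                 embed: Callable[[list[str]], list[list[float]]],
                 chunks: dict[str, str],
                 batch_size: int = 64) -> int:
    """Embed chunks in batches and upsert vectors plus text, keyed by chunk id."""
    ids = list(chunks)
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i:i + batch_size]
        texts = [chunks[cid] for cid in batch_ids]
        vectors = embed(texts)  # one provider call per batch
        collection.upsert(ids=batch_ids, embeddings=vectors, documents=texts)
    return len(ids)


class InMemoryCollection:
    """Stand-in for a real collection, for offline runs only."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[list[float], str]] = {}

    def upsert(self, *, ids, embeddings, documents) -> None:
        for cid, vec, doc in zip(ids, embeddings, documents):
            self.rows[cid] = (vec, doc)
```

Because the write is an upsert keyed by chunk id, running ingest twice over the same files should overwrite rows rather than duplicate them.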

Retrieve

When a question comes in, embed it the same way you embedded the chunks. Ask Chroma for the top five nearest neighbours by cosine similarity. That is it. No reranking. No hybrid with BM25. No query rewriting. Five chunks come out, five chunks go into the prompt.
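Chroma does the nearest-neighbour search for you, but it helps to see what it computes. A brute-force sketch of cosine top-k in plain Python, with hypothetical names (cosine, top_k) and a plain dict standing in for the store:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]],
          k: int = 5) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The cutoff is hard. The sixth-best chunk never reaches the prompt, no matter how close its score is to the fifth.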

This is the step that will bite you first. Watch for it on your own corpus.

Generate

Build a prompt that looks roughly like this:

System: You are a careful technical assistant answering from a provided corpus.
Use only the numbered sources. Cite sources inline by id.
If the sources do not contain the answer, say so.

User:
Question: <the question>

Sources:
[1] id=anthropic/prompt-caching
<chunk text>

[2] id=stack-docs/fastapi
<chunk text>

...

Send this to Claude Sonnet 4.6. Read the response. Done.

No decomposition. No agentic loops. No iterative retrieval. One turn. This is called a "stuff" prompt because you stuff all the retrieved chunks in at once. It works when chunks fit in context and when the right chunk is in the retrieved set. When either of those assumptions fails, the answer is wrong or empty. Module 3 and Module 4 exist to make those assumptions fail less often.
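Assembling the stuff prompt is plain string building. A sketch that follows the layout above; the names SYSTEM_PROMPT and build_user_prompt are assumptions, and the reference in app/rag/generate.py may format things differently.

```python
SYSTEM_PROMPT = (
    "You are a careful technical assistant answering from a provided corpus.\n"
    "Use only the numbered sources. Cite sources inline by id.\n"
    "If the sources do not contain the answer, say so."
)


def build_user_prompt(question: str, sources: list[dict]) -> str:
    """Render the question followed by numbered sources, each tagged with its chunk id."""
    parts = [f"Question: {question}", "", "Sources:"]
    for n, src in enumerate(sources, start=1):
        parts.append(f"[{n}] id={src['id']}")
        parts.append(src["text"])
        parts.append("")  # blank line between sources
    return "\n".join(parts).rstrip()
```

With the Anthropic Python SDK, these two strings would go in as the system argument and a single user message to messages.create. The model id and max_tokens are yours to set.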

Log cost

Every chunk-embedding call, every query embedding, and every generation writes one line to data/costs.jsonl. Tokens in, tokens out, dollars spent. The RAG works without this; you need it anyway.

The reason you need it: when Module 2 asks you to compare three embedding providers across a corpus of five hundred documents, you want the cost answer to be a tail of a log file, not a story. Same for Module 6's cost dashboard. Start logging now, read the log later.
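A sketch of an append-only cost log. The record fields and the function names are assumptions, not the reference schema. Prices are passed in as arguments because they change; look them up on the provider's pricing page rather than hard-coding them.

```python
import json
import time
from pathlib import Path


def log_cost(path: str, step: str, tokens_in: int, tokens_out: int,
             usd_per_mtok_in: float, usd_per_mtok_out: float) -> dict:
    """Append one JSON line describing a single embed or generate call."""
    record = {
        "ts": time.time(),
        "step": step,  # e.g. "embed", "query_embed", "generate"
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": (tokens_in * usd_per_mtok_in
                     + tokens_out * usd_per_mtok_out) / 1_000_000,
    }
    log_path = Path(path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def total_cost(path: str) -> float:
    """Sum cost_usd over every line in the log."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(json.loads(line)["cost_usd"] for line in f if line.strip())
```

One line per call means the comparison in Module 2 is a filter and a sum, not a reconstruction from memory.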

Walk through the reference code

The annotated reference lives at reference/module_1/. It's thin on purpose — one query() function that delegates to the primitives under app/rag/. Open both and read them together.

app/rag/chunk.py — the fixed-size chunker. Forty lines, readable start to finish.
app/rag/embed.py — the Voyage and OpenAI adapters behind a small Embedder protocol.
app/rag/store.py — the Chroma wrapper. Upsert, query, count.
app/rag/ingest.py — the loop that turns a directory into a populated Chroma collection.
app/rag/generate.py — the Claude prompt and the generation call. Read the system prompt carefully; it is the only thing telling the model to cite sources by id.
app/rag/pipeline.py — the answer() function that ties retrieval and generation together.

Read these in order: chunk, embed, store, ingest, generate, pipeline. That is the dataflow.
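If you have not met typing.Protocol before, here is roughly what "adapters behind a small protocol" means. The method name embed and its signature are assumptions, so check app/rag/embed.py for the real ones. HashEmbedder is a hypothetical offline stand-in with no semantic value; it exists so you can exercise the pipeline without an API key.

```python
import hashlib
from typing import Protocol


class Embedder(Protocol):
    """Anything with a matching embed() method satisfies this protocol."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Toy embedder: hashes each word into one of `dim` buckets and counts hits."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for word in text.lower().split():
                digest = hashlib.sha256(word.encode("utf-8")).digest()
                vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
            vectors.append(vec)
        return vectors


def embed_all(embedder: Embedder, texts: list[str]) -> list[list[float]]:
    """Callers depend on the protocol, not on a specific provider."""
    return embedder.embed(texts)
```

The payoff is that swapping providers changes one constructor call and nothing downstream.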

Your build task

Write learner/module_1/__init__.py. Expose a query(question: str, top_k: int = 5) -> dict function that returns:

{
    "answer": str,
    "sources": [{"id": "...", "text": "...", "score": float}, ...],
    "metrics": {"retrieval_ms": int, "generation_ms": int, "cost_usd": float, ...},
}

Do not copy the reference. Import the primitives (app.rag.embed, app.rag.store, the Anthropic SDK) and wire them yourself. The point is to feel the shape — which arguments go where, which step depends on which, what the failure mode looks like when one of them is off by a character.
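To make the return shape concrete without doing the wiring for you, here is a sketch that takes the retrieval and generation steps as injected callables. The name query_with and both callable signatures are hypothetical. Your real query() still has to call app.rag.embed, app.rag.store, and the Anthropic SDK itself.

```python
import time
from typing import Callable


def query_with(question: str,
               retrieve: Callable[[str, int], list[dict]],
               generate: Callable[[str, list[dict]], tuple[str, float]],
               top_k: int = 5) -> dict:
    """Time each step and assemble the dict the build task asks for."""
    t0 = time.perf_counter()
    sources = retrieve(question, top_k)  # [{"id": ..., "text": ..., "score": ...}, ...]
    t1 = time.perf_counter()
    answer, cost_usd = generate(question, sources)
    t2 = time.perf_counter()
    return {
        "answer": answer,
        "sources": sources,
        "metrics": {
            "retrieval_ms": int((t1 - t0) * 1000),
            "generation_ms": int((t2 - t1) * 1000),
            "cost_usd": cost_usd,
        },
    }
```

Timing retrieval and generation separately matters later. When latency goes up, you want to know which half to blame.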

Ship gate

Run this when you think you are done:

python content/module-1-baseline/ship-gate.py \
    --learner-module learner.module_1 \
    --fixtures content/module-1-baseline/fixtures.jsonl

The gate loads your module, calls query() for each fixture, and checks:

The return shape matches.
The sources list is non-empty.
At least one returned source id matches the fixture's expected_source_id.
The answer is non-empty.

Pass all fixtures and you unlock Module 2. Passing does not mean your RAG is good — it means the pipes are connected. The boss challenge is where you find out how not-good it is.
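You can run the same four checks yourself before invoking the gate. This is a sketch of what they might look like, not the contents of ship-gate.py. The name check_result is hypothetical, and exact string equality on ids is an assumption about what "matches" means.

```python
REQUIRED_KEYS = {"answer", "sources", "metrics"}


def check_result(result: object, expected_source_id: str) -> list[str]:
    """Return failure messages for one fixture; an empty list means it passes."""
    if not isinstance(result, dict) or not REQUIRED_KEYS <= result.keys():
        return ["return shape does not match"]
    failures = []
    sources = result["sources"]
    if not isinstance(sources, list) or not sources:
        failures.append("sources list is empty")
    else:
        returned_ids = [s.get("id") for s in sources if isinstance(s, dict)]
        if expected_source_id not in returned_ids:
            failures.append("expected source id not in returned sources")
    answer = result["answer"]
    if not isinstance(answer, str) or not answer.strip():
        failures.append("answer is empty")
    return failures
```

Notice that nothing here reads the answer for correctness. That is why passing the gate only proves the pipes are connected.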

Boss challenge tease

The boss is ten adversarial questions designed to break exactly what you just built. Cross-chunk synthesis (the answer lives in two chunks that fixed-size chunking split apart). Definitional drift (the corpus says "context caching", you ask about "prompt caching"). Multi-hop. Needle-in-haystack with distractors. Negation.
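The cross-chunk case is easy to reproduce on toy text. In this sketch the document, the sizes, and the helper fixed_chunks are all made up for illustration; the window is shrunk to 80 characters so the effect shows on a short string.

```python
def fixed_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Character windows of `size`, each starting `size - overlap` after the last."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Two facts that must be combined to answer "Who can open the vault?",
# separated by enough filler to land in different chunks.
doc = (
    "Alice owns the red key. "
    + "This sentence is filler. " * 6
    + "The red key opens the vault."
)

chunks = fixed_chunks(doc, size=80, overlap=10)
has_both = [c for c in chunks if "Alice" in c and "vault" in c]
```

has_both comes back empty: no single chunk carries both facts, so retrieval has to surface both chunks and the model has to join them. Overlap only helps when the two facts sit within the overlap width of each other.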

The baseline RAG should score below fifty percent on the boss. That is the intended result and the number you are going to beat in the next five modules. Record it.

See boss.md for the full set.