Evaluation framework

Every module has been handing you numbers. Recall@5. MRR. Judge score. Cost. Latency. You have been using those numbers to make choices and to pass ship gates. You did not write the code that produces them. A reference harness in app/evals/ has been running behind the scenes, and you have been trusting its output.

Module 5 makes you build the harness. You replace the ghost with a system you own, understand, and can extend. By the end of this module, you have something you would use to gate production deploys at a company — not a notebook, a pipeline.

Why this is Module 5 and not Module 1

A junior engineer's instinct is to build the eval framework first. Measure before you ship. It sounds responsible.

It is wrong. Here is why.

You do not know what to measure until you have seen your system fail on real questions on a real corpus. Retrieval metrics are easy to compute and hard to interpret in the abstract. Judge scores are expensive to run and their value depends on what you are judging. If you build the eval framework before you have a working RAG, you will over-index on the easy metrics (latency, cost) and under-index on the hard ones (faithfulness, groundedness) because you have no examples of when the hard ones matter.

By Module 5, you have four module boss challenges worth of data. You have felt the difference between a question that fails because of retrieval and a question that fails because of generation. You know what kind of failure hurts most on your corpus. Now you build the thing that counts those failures.

This is the order good engineering teams actually end up following. They ship a working system, break it a few ways, and then build the test harness that catches those break patterns. Reverse the order and you build a harness that tests the wrong things.

Four pieces, one pipeline

Piece one — retrieval metrics

Three functions. Each takes a ranked list of retrieved source ids and an expected source id, and returns a number.

Recall@k — 1 if the expected id is in the top k, 0 otherwise. Mean across a question set.

Mean Reciprocal Rank (MRR) — 1/rank of the expected id in the retrieved list, 0 if not retrieved. Mean across a question set. Rewards ranking the answer high, not just getting it in the top-k somewhere.

nDCG@k — optional, more nuanced. Penalizes answers lower in the list logarithmically. Useful when you have graded relevance (some retrieved chunks are partially relevant). For binary relevance, MRR tells you most of what nDCG would.

Write all three. Thirty lines of code total. Keep them pure — take ids in, return numbers out. No side effects.
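As a rough sketch of what those three functions could look like, assuming binary relevance with a single expected id per question: the function names follow the build task list, while `mean_score` is a hypothetical helper added here for the "mean across a question set" step.

```python
import math
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], expected: str, k: int = 5) -> float:
    """1.0 if the expected id appears in the top k retrieved ids, else 0.0."""
    return 1.0 if expected in retrieved[:k] else 0.0


def mrr(retrieved: Sequence[str], expected: str) -> float:
    """Per-question reciprocal rank: 1/rank of the expected id (ranks start at 1),
    or 0.0 if it was not retrieved. Average over the question set to get MRR."""
    for rank, source_id in enumerate(retrieved, start=1):
        if source_id == expected:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], expected: str, k: int = 5) -> float:
    """Binary-relevance nDCG: a hit at rank r scores 1/log2(r + 1).

    With exactly one relevant id the ideal DCG is 1.0 (a hit at rank 1),
    so DCG and nDCG are the same number in this simplified case.
    """
    for rank, source_id in enumerate(retrieved[:k], start=1):
        if source_id == expected:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def mean_score(values: Sequence[float]) -> float:
    """Average per-question scores into a set-level metric (0.0 for an empty set)."""
    return sum(values) / len(values) if values else 0.0
```

Graded relevance would need a different `ndcg_at_k` that sums discounted gains over several relevant ids; this version only covers the single-expected-id case described above.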

Piece two — LLM-as-judge

Retrieval metrics only tell you whether the right chunk was retrieved. They say nothing about whether the generated answer is correct, grounded, complete, or useful. For that you need a judge.

The pattern: send the question, the reference answer (from your fixtures), and the candidate answer to a cheap model. Ask it to score on faithfulness (does the candidate only make claims supported by the reference?) and completeness (does the candidate cover the full reference answer?).

Use Haiku 4.5 as the judge. The prompt looks roughly like this:

You are evaluating a generated answer against a reference answer.

Question: {question}
Reference: {reference_answer}
Candidate: {candidate_answer}

Score the candidate on two axes, 0-5 integer each:

Faithfulness (0 = fabricated, 5 = entirely supported by reference)
Completeness (0 = nothing covered, 5 = fully covers reference)

Output exactly this JSON: {"faithfulness": N, "completeness": N, "reason": "..."}
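One way to wrap that prompt in code is sketched below. It assumes the model call is injected as `call_model`, a hypothetical callable that takes the prompt text and returns the model's text reply; no real SDK call is shown, and the extra parameter is an addition to the `judge_answer` signature in the build task. `parse_judge_output` is likewise a hypothetical helper.

```python
import json
from typing import Callable

# The template above, with the literal JSON braces doubled so str.format leaves them alone.
JUDGE_PROMPT = """You are evaluating a generated answer against a reference answer.

Question: {question}
Reference: {reference_answer}
Candidate: {candidate_answer}

Score the candidate on two axes, 0-5 integer each:

Faithfulness (0 = fabricated, 5 = entirely supported by reference)
Completeness (0 = nothing covered, 5 = fully covers reference)

Output exactly this JSON: {{"faithfulness": N, "completeness": N, "reason": "..."}}
"""


def parse_judge_output(raw: str) -> dict:
    """Parse and validate the judge's reply; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"judge reply is not a JSON object: {raw!r}")
    for axis in ("faithfulness", "completeness"):
        score = data.get(axis)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError(f"{axis} must be an integer 0-5, got {score!r}")
    return {
        "faithfulness": data["faithfulness"],
        "completeness": data["completeness"],
        "reason": str(data.get("reason", "")),
    }


def judge_answer(question: str, reference: str, candidate: str,
                 call_model: Callable[[str], str]) -> dict:
    """Fill the template, ask the judge model, and return validated scores."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        reference_answer=reference,
        candidate_answer=candidate,
    )
    return parse_judge_output(call_model(prompt))
```

Validating the reply strictly means a malformed judge response surfaces as an error instead of silently becoming a score.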

Two practical gotchas.

Judge drift. Judge prompts are code. Version them. Freeze them when you publish a scoreboard. If you tweak the prompt halfway through a module's evaluation, you'll think your RAG got better when really the judge just got more generous.
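A lightweight way to version a judge prompt, offered as a sketch and not a prescribed mechanism, is to fingerprint the template text and record the fingerprint with every run. `prompt_version` and `same_judge` are hypothetical helper names.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Short, stable fingerprint of a judge prompt template.

    Leading and trailing whitespace is stripped so a stray final newline does
    not register as a new version; any other edit changes the fingerprint.
    """
    digest = hashlib.sha256(template.strip().encode("utf-8")).hexdigest()
    return digest[:12]


def same_judge(version_a: str, version_b: str) -> bool:
    """Two runs are only comparable if the same judge prompt scored both."""
    return version_a == version_b
```

Where the fingerprint is stored is a design choice. Recording it next to each run's results makes it possible to refuse a comparison between runs scored by different prompts.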

Judge bias. Judges tend to reward verbosity. A three-sentence answer gets a higher completeness score than a one-sentence answer even when the one-sentence answer is complete. Calibrate with a small hand-graded set; know the bias is there.
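The calibration step can be as simple as comparing judge scores with your hand grades, split by answer length. The sketch below assumes a 40-word cutoff for "long" answers, which is an arbitrary illustrative threshold; `GradedAnswer` and `verbosity_gap` are hypothetical names.

```python
from statistics import mean
from typing import NamedTuple


class GradedAnswer(NamedTuple):
    answer: str
    human_completeness: int   # your hand grade, 0-5
    judge_completeness: int   # the judge's score, 0-5


def verbosity_gap(graded: list[GradedAnswer], long_threshold_words: int = 40) -> dict:
    """Mean (judge - human) completeness gap for short versus long answers.

    A clearly larger gap on long answers suggests the judge is rewarding length.
    """
    short_gaps, long_gaps = [], []
    for item in graded:
        gap = item.judge_completeness - item.human_completeness
        is_long = len(item.answer.split()) >= long_threshold_words
        (long_gaps if is_long else short_gaps).append(gap)
    return {
        "short_gap": mean(short_gaps) if short_gaps else None,
        "long_gap": mean(long_gaps) if long_gaps else None,
        "n_short": len(short_gaps),
        "n_long": len(long_gaps),
    }
```

With a small hand-graded set the two gaps are noisy, so treat the result as a warning sign to investigate, not a correction factor.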

Piece three — the run store

An eval "run" is: a configuration (which module, which transform, which chunking), a question set (which fixtures file), a timestamp, and a result table (per-question scores and per-config aggregates).

Store runs in Postgres. One row per run in eval_runs, one row per question in eval_results. You will regret storing them in JSON files because you will want to query "best config across the last 30 days" and a SQL query is one line.

Schema sketch:

create table eval_runs (
    id uuid primary key,
    config_name text not null,
    fixture_path text not null,
    fixture_count int not null,
    started_at timestamptz default now(),
    finished_at timestamptz,
    aggregate jsonb not null,  -- {"recall@5": 0.85, "mrr": 0.72, ...}
    cost_usd numeric(10,6),
    notes text
);

create table eval_results (
    run_id uuid references eval_runs(id) on delete cascade,
    question text not null,
    expected_source_id text,
    recall_at_5 boolean,
    rank int,
    judge_faithfulness int,
    judge_completeness int,
    latency_ms int,
    cost_usd numeric(10,6),
    primary key (run_id, question)
);

Tiny, reliable, queryable. You will build a dashboard on top of this in Module 6.
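To make the "best config across the last 30 days" query concrete, here is a sketch that runs it against an in-memory SQLite database. SQLite is a stand-in so the example runs without a server; the tables are trimmed to the columns the query needs, and `best_config` is a hypothetical helper. In Postgres the cutoff would more typically be written as `now() - interval '30 days'`, and the boolean `recall_at_5` column would need a cast to int before averaging.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Trimmed-down versions of the two tables, using SQLite types (stand-in for Postgres).
SCHEMA = """
create table eval_runs (
    id text primary key,
    config_name text not null,
    started_at text not null          -- ISO 8601 timestamp in UTC
);
create table eval_results (
    run_id text references eval_runs(id),
    question text not null,
    recall_at_5 integer,              -- 1 or 0
    primary key (run_id, question)
);
"""

# Mean Recall@5 per config over runs started since the cutoff, best first.
BEST_CONFIG = """
select r.config_name, avg(e.recall_at_5) as mean_recall
from eval_runs r
join eval_results e on e.run_id = r.id
where r.started_at >= ?
group by r.config_name
order by mean_recall desc
limit 1
"""


def best_config(conn: sqlite3.Connection, now: datetime, days: int = 30):
    """Return (config_name, mean_recall) for the best recent config, or None."""
    cutoff = (now - timedelta(days=days)).isoformat()
    return conn.execute(BEST_CONFIG, (cutoff,)).fetchone()
```

The same question asked of a directory of JSON files would mean loading and filtering every file by hand, which is the regret the paragraph above warns about.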

Piece four — regression testing

Two runs, one diff. Given run A (Module 3 full) and run B (Module 4 hyde), produce:

Per-question: did faithfulness change? Completeness? Recall?
Aggregate deltas and their p-values (a simple paired t-test is enough; scipy provides it as scipy.stats.ttest_rel).
A list of questions that regressed (B worse than A) and a list that improved.
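The three outputs listed above can be sketched for a single per-question metric as follows. This version uses only the standard library, so it computes the paired t statistic but not the p-value; `scipy.stats.ttest_rel` on the same paired scores should give the same statistic along with a p-value. `diff_scores` is a hypothetical name, and each input is assumed to map question text to a numeric score.

```python
import math
from statistics import mean, stdev


def diff_scores(run_a: dict[str, float], run_b: dict[str, float]) -> dict:
    """Compare one per-question metric between run A (before) and run B (after)."""
    shared = sorted(set(run_a) & set(run_b))      # only questions scored in both runs
    deltas = {q: run_b[q] - run_a[q] for q in shared}
    values = list(deltas.values())
    n = len(values)
    spread = stdev(values) if n > 1 else 0.0
    # Paired t statistic: mean delta divided by its standard error.
    # It is undefined when the deltas have no spread, so report None then.
    t_stat = mean(values) / (spread / math.sqrt(n)) if spread > 0 else None
    return {
        "n": n,
        "mean_delta": mean(values) if values else 0.0,
        "t_stat": t_stat,
        "deltas": deltas,
        "regressed": [q for q in shared if deltas[q] < 0],
        "improved": [q for q in shared if deltas[q] > 0],
    }
```

Calling it once per metric (faithfulness, completeness, recall) gives the per-question view, the aggregate delta, and the two lists in one report.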

The regression diff is the thing you will stare at for the rest of your RAG life. When you ship a change and it helps on aggregate but regresses on three specific questions, those three questions are the signal. They tell you where your change breaks.

Fail the regression check automatically if more than 10% of questions regressed, even if aggregate improved. That is the production bar — you do not ship a change that makes the system worse for 10% of users even if it's better for the other 90%.
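The gate itself is a one-line comparison. A minimal sketch, with `regression_gate` as a hypothetical name, deliberately ignores the aggregate so that an overall improvement cannot mask per-question regressions:

```python
def regression_gate(n_regressed: int, n_total: int, max_percent: int = 10) -> bool:
    """Return True if the change may ship: at most max_percent of questions regressed.

    Integer arithmetic keeps the boundary exact: 3 of 30 is exactly 10% and
    passes, while 4 of 30 is more than 10% and fails.
    """
    if n_total <= 0:
        raise ValueError("cannot gate on an empty question set")
    return n_regressed * 100 <= max_percent * n_total
```

On a 30-question set this means a change can regress at most three questions and still pass.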

The 30-question eval set

At the end of Module 5 you have a fixtures file with 30 questions, curated from your own corpus, that collectively stress every technique across Modules 1-4:

5 cross-chunk synthesis (Module 2 targets)
5 keyword / acronym (Module 3 BM25 targets)
5 near-duplicate / reranker targets
5 short-query (Module 4 HyDE targets)
5 vocabulary-mismatch (Module 4 multi-query targets)
5 multi-hop (Module 4 decomposition targets)

Each question has: the question text, the expected source id(s), a reference answer, and a tag for its failure category. You store these in a single JSONL file (one JSON object per line) at /data/eval/standard.jsonl.
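A loader for that file might look like the sketch below. The field names (`question`, `expected_source_id`, `reference_answer`, `category`) are assumptions chosen to match the description above and the `eval_results` columns; rename them to whatever your fixtures actually use, and switch to a list field if a question has several expected ids.

```python
import json
from collections import Counter
from typing import Iterable

# Hypothetical field names; the source only describes the four pieces of data.
REQUIRED_KEYS = {"question", "expected_source_id", "reference_answer", "category"}


def parse_fixtures(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines into fixture dicts, skipping blank lines."""
    fixtures = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        fixtures.append(record)
    return fixtures


def load_fixtures(path: str) -> list[dict]:
    """Read and validate a JSONL fixtures file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_fixtures(handle)


def category_counts(fixtures: list[dict]) -> Counter:
    """Count questions per failure category, to check that all six are populated."""
    return Counter(fixture["category"] for fixture in fixtures)
```

Checking `category_counts` against the six categories above is a cheap guard against a fixtures file that quietly drifts toward the easy question types.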

The score on this set is what you track for the rest of the track and beyond. Every configuration change is scored against it. You will extend it over time as you discover new failure modes.

Your build task

Build app/evals/. The module already has stub READMEs from Mini's scaffold; fill in:

metrics.py — recall_at_k, mrr, ndcg_at_k
judge.py — judge_answer(question, reference, candidate) -> dict
runner.py — run(config_callable, fixtures_path, name) -> run_id
regression.py — diff_runs(run_a_id, run_b_id) -> ReportDict
db.py — SQLAlchemy models for eval_runs and eval_results

Then write and apply your first Alembic migration for this schema (Alembic is already in pyproject).
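The core loop of the runner can be sketched without the database. The version below assumes `config_callable` takes a question string and returns a ranked list of source ids, which is a guess at its contract, and it returns rows shaped like `eval_results` plus an aggregate dict shaped like the `aggregate` column. Judge scoring, cost tracking, and persistence are left out, and `run_questions` is a hypothetical name.

```python
import time
from typing import Callable, Iterable

K = 5  # matches the recall_at_5 column in eval_results


def run_questions(config_callable: Callable[[str], list[str]],
                  fixtures: Iterable[dict]) -> tuple[list[dict], dict]:
    """Score each fixture and return (per-question rows, aggregate metrics)."""
    rows = []
    for fixture in fixtures:
        start = time.perf_counter()
        retrieved = config_callable(fixture["question"])
        latency_ms = int((time.perf_counter() - start) * 1000)
        expected = fixture["expected_source_id"]
        # 1-based rank of the expected id, or None if it was not retrieved.
        rank = retrieved.index(expected) + 1 if expected in retrieved else None
        rows.append({
            "question": fixture["question"],
            "expected_source_id": expected,
            "recall_at_5": rank is not None and rank <= K,
            "rank": rank,
            "latency_ms": latency_ms,
        })
    n = len(rows)
    aggregate = {
        "recall@5": sum(row["recall_at_5"] for row in rows) / n if n else 0.0,
        "mrr": sum(1.0 / row["rank"] for row in rows if row["rank"]) / n if n else 0.0,
    }
    return rows, aggregate
```

Keeping this loop free of database code makes it easy to test with a fake config; the real `run` can then wrap it, write the rows, and return the run id.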

Ship gate

No boss challenge this module. Module 5's ship gate IS the 30-question eval set.

Pass when:

All four pieces implemented and tested (at minimum, pytest app/tests/test_evals.py green)
Your 30-question fixtures file exists in /data/eval/standard.jsonl with all six categories populated
You run the 30-question set against every prior module's best config and produce one comparison table at data/reports/module-5-scoreboard.md
That scoreboard shows Module 4 best beating Module 1 baseline on Recall@5 by at least 25 percentage points (the track's success criterion)
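The last two conditions can be checked in code. The sketch below assumes Recall@5 is stored as a fraction between 0 and 1 and that "points" means percentage points; the function names and the column layout of the table are hypothetical.

```python
def recall_gain_points(baseline_recall: float, best_recall: float) -> float:
    """Recall@5 gain in percentage points, e.g. going from 0.55 to 0.85 is +30."""
    # Rounding absorbs floating-point noise such as 29.999999999999996.
    return round((best_recall - baseline_recall) * 100, 6)


def passes_ship_gate(baseline_recall: float, best_recall: float,
                     min_points: float = 25.0) -> bool:
    """True if the best config beats the baseline by at least min_points."""
    return recall_gain_points(baseline_recall, best_recall) >= min_points


def render_scoreboard(rows: list[tuple[str, float, float]]) -> str:
    """Render (config name, Recall@5, MRR) rows as a markdown table."""
    lines = ["| Config | Recall@5 | MRR |", "| --- | --- | --- |"]
    for name, recall, mrr in rows:
        lines.append(f"| {name} | {recall:.2f} | {mrr:.2f} |")
    return "\n".join(lines) + "\n"
```

Writing the output of `render_scoreboard` to the scoreboard file and asserting `passes_ship_gate` in a test turns the ship gate into something a script can check.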

What this unlocks

Module 6's cost dashboard reads from eval_runs. The completion artifact at end-of-track pulls from this scoreboard. Every change you make to your RAG after this module is measurable against the same 30 questions, always. The measurement discipline is the asset; the RAG is just its client.