BlueWave School RAG & Evals Track
RAG & Evals Track

Production concerns

This is the last module. You have a RAG that outperforms the Module 1 baseline by a measurable margin on your own corpus. You have an eval harness that tells you whether any change helps or hurts. You have a query-transform layer that handles the three common question failure modes.

Module 6 takes all of that and makes it something you could put in front of a paying customer. Five concerns in the order they bite.

One — prompt caching and answer caching

Your system prompt is stable. Your tool definitions are stable. Your retrieved context within a session is stable. Claude's prompt caching API lets you mark prefix segments cacheable — the next request that repeats the prefix hits the cache at roughly 10% of normal input token cost.

Module 6 makes you configure this properly. Place cache_control breakpoints on:

Set TTL to 1 hour for stable prefixes, 5 minutes (default) for session-level retrieved context. Track cache hit rate. Most well-configured systems show 60-80% prompt cache hit rates in steady state.

Add an answer cache too. Key it on (normalized_question, corpus, config). A one-line dict cache with a 5-minute TTL catches the near-duplicate queries that always arrive in bursts (a user asking "what's the default TTL" five times because they didn't screenshot the first answer). Bust the cache when the underlying corpus changes.

Measure: your cost-per-query line should drop by 30-50% at moderate traffic. The dashboard in piece three plots this.

Two — observability

Every request leaves a trace. You cannot debug production failures from memory. You need the trace.

Trace schema, append-only:

{
    "trace_id": "2026-04-18T06:00:00-abc123",
    "timestamp": "2026-04-18T06:00:00Z",
    "query": "...",
    "transform": "hyde",
    "hypothetical": "...",
    "vector_candidates": [{"id": "...", "score": 0.82}, ...],
    "bm25_candidates": [...],
    "fused_top_20": [...],
    "reranked_top_5": [...],
    "prompt_length_chars": 4812,
    "model": "claude-sonnet-4-6",
    "prompt_version": "v3",
    "cache_hit": true,
    "generation_tokens_in": 1477,
    "generation_tokens_out": 184,
    "latencies_ms": {"embed": 412, "vector": 33, "bm25": 8, "rerank": 97, "generate": 1900, "total": 2450},
    "cost_usd": 0.0051,
    "final_answer": "..."
}

Write one JSON line per trace to /data/traces/YYYY-MM-DD.jsonl. Rotate daily. Also write a summary row to Postgres so you can SELECT against it.

When a user reports "the system gave me a wrong answer on Tuesday," you grep /data/traces/2026-04-*.jsonl for their query and reproduce the exact pipeline that produced it. That is the only way production debugging scales.

Three — cost dashboard

app/routers/admin.py gets a /admin/costs endpoint that reads your cost log and renders:

Render as an Astro page at /progress/costs for the operator and /admin/costs.json for machine consumption.

Add budget alerts. When daily spend crosses a threshold you configure in .env (DAILY_BUDGET_USD=5.00), write a WARN line to your trace log. Optional — wire this to n8n for an alert you actually notice.

The cost dashboard is itself a teaching moment. By end of this module, you know exactly which queries cost what, and which modules justify their per-query spend. That is the discipline you take with you out of this track.

Four — prompt versioning

Your generation prompt is a file. It has a version string. When you edit it, you bump the version and commit. Every generation call logs which version produced it.

GEN_PROMPT_V3 = """..."""  # bumped from v2 on 2026-04-15, added source-id format hint
GEN_PROMPT_VERSION = "v3"

def generate_answer(question, sources):
    # ... uses GEN_PROMPT_V3 ...
    return {"answer": ..., "prompt_version": GEN_PROMPT_VERSION, ...}

Module 5's eval runs pick up the prompt version as a dimension. Suddenly you can diff "Module 4 best with prompt v2" vs "Module 4 best with prompt v3" and see whether your prompt edit helped or hurt, independent of any other change.

This is the single highest-leverage discipline in all of production LLM work. Prompts drift. Prompts get tweaked. Prompts silently regress. Versioning them turns "did my prompt edit help?" from a vibe-check into a number.

Five — failure modes and fallbacks

Five things can break in production. Each gets a documented fallback and a test.

Voyage rate-limited. Fallback to OpenAI. Or a cached query response if the question matches recent. Tests: synthetic rate-limit error raised; verify fallback kicks in.

Anthropic down. Return a graceful "service unavailable, please try again" with a 503. Log the outage timestamp. Do not silently fall through to a lesser model — users should know when they are getting degraded output.

Chroma unreachable. Retry with exponential backoff up to 3 times. If still failing, 503. Do not fall through to BM25-only — your answers would be low quality without the user knowing.

Cohere rerank API error. Skip reranking; return RRF-fused results. Log the degradation. Users get slightly worse answers but the system works.

Generation timeout. Set a 15-second timeout on the Claude call. On timeout, cancel and return a fallback "I could not generate an answer in time, here are the retrieved sources for reference." Users can scan the sources themselves.

Each of these is a 20-line try/except with a clear log line. Each gets a unit test that simulates the failure. The point is not that the code is complex — it isn't — but that the behavior is defined. When the system degrades, you know exactly how.

Your build task

Upgrade the Module 4 best pipeline to the production version. Specifically:

Module 5's eval harness continues to work; you run the 30-question set against production-mode pipeline and expect final numbers to match (or modestly exceed) Module 4 best.

Ship gate

Module 6's ship gate has the longest pass bar in the track. Pass when:

Completion artifact

On Module 6 ship gate pass, the platform generates a signed completion artifact:

You share it if you want. The PDF is your credential — a thing you built, measured, and shipped.

What ends, what starts

The track ends. Your RAG system is yours — not the platform's. Extend it, rewrite it, rebuild it on a different stack. It is code; it is yours.

The platform keeps running. When BlueWave School's next track ships (Agents & Tool Use, say), your completion of this track is a prerequisite.

Mostly, though, what ends is the excuse. You can't say "I don't know RAG" anymore. You built one. You measured it. You shipped it. Go use it.