Query transformation

Every module so far has worked on the corpus side. Better chunks in Module 2. Better retrieval with BM25 and reranking in Module 3. Your boss-challenge scores in those modules improved because the system got smarter about what it had already indexed.

Module 4 turns around and looks at the query. Same index, same retriever, same reranker — but we change what you ask before you ask it. On the right question set, this is the largest single win in the whole track.

Why the query needs work

Three kinds of questions break retrieval no matter how good your index is.

Short queries land in bad neighborhoods

A user types "HyDE." That's a single short word. On its own, the embedding of "HyDE" tends to land near other short, similar-looking strings such as "hybrid" and "hype." The chunk that actually defines HyDE has the literal acronym in a sentence surrounded by other content, so its embedding lives near other technical-doc embeddings, not near "HyDE" in isolation.

Short queries usually lose. The embedding model was trained on full passages of text, not on fragments.

Vocabulary mismatch

A user asks about "prompt routing." The corpus uses the term "model selection." The embedding model knows these are related, but not as related as "prompt routing" is to another chunk that literally says "prompt routing" while talking about something else entirely. Module 3's reranker catches some of this. Most of it slips through.

Multi-hop

"Which chapter of Hawaii statutes governs HIePRO solicitations?" needs two pieces of information — what HIePRO is, and the chapter that governs that thing. No single chunk contains both. Retrieval gives you one or the other.

Three transformations, in order of impact

HyDE — Hypothetical Document Embeddings

The cheapest and often the largest win. Before retrieval, ask a small model (Haiku 4.5) to write what the answer would look like if you already knew it. Embed that imagined answer. Retrieve against the answer's embedding.

Why this works: queries are short and questiony. Answers are long and declarative. The embedding of a hypothetical answer lands near the embeddings of real answer chunks. The embedding of the question lands near other questions. Your corpus does not contain questions; it contains answers.

The prompt is roughly:

Write a one-paragraph answer to this question as if you already knew
the answer. Invent plausible details if needed. The answer does not
need to be correct — it will be used to find relevant documents.

Question: <user query>

Haiku 4.5 gives you back about 100 words. You embed that. You retrieve. The hypothetical answer's embedding lands on the right chunks roughly 15-25 percentage points more often than the question's own embedding did.

The caveat: you paid one LLM call before retrieval. That is pure latency. On a fast model it's roughly 500 ms. For high-value queries that's cheap. For a search-as-you-type autocomplete it's too slow.
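A minimal sketch of the HyDE flow described above. The names hyde_retrieve, complete, embed, and search are hypothetical: they stand in for your Haiku call, your embedding model, and your Module 3 vector search, none of which are specified here. The fallback to the raw question on an empty model response is also an assumption, not part of the technique.

```python
from typing import Callable, Sequence

# Prompt text follows the one given above; {question} is filled in per query.
HYDE_PROMPT = (
    "Write a one-paragraph answer to this question as if you already knew\n"
    "the answer. Invent plausible details if needed. The answer does not\n"
    "need to be correct; it will be used to find relevant documents.\n\n"
    "Question: {question}"
)


def hyde_retrieve(
    question: str,
    complete: Callable[[str], str],                        # hypothetical: prompt -> model text
    embed: Callable[[str], Sequence[float]],               # hypothetical: text -> vector
    search: Callable[[Sequence[float], int], list[str]],   # hypothetical: vector, k -> chunk ids
    top_k: int = 20,
) -> list[str]:
    """Retrieve with the embedding of a hypothetical answer, not the question."""
    hypothetical = complete(HYDE_PROMPT.format(question=question))
    # Assumption: if the model returns nothing usable, fall back to the raw question.
    text = hypothetical.strip() or question
    return search(embed(text), top_k)
```

Passing the three dependencies in as callables keeps the transform testable without a live model, and makes the single extra LLM call easy to time on its own.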

Multi-query expansion

One query becomes three (or five) rewrites. Each rewrite retrieves its own top-20. You dedupe and merge (RRF from Module 3 fuses the lists).

Why this works: different phrasings surface different chunks. A single vector query explores a single neighborhood of the embedding space. Three queries explore three neighborhoods. The union is at least as large as any single list, and usually larger.

User asked: "How do I turn on prompt caching?"
Rewrite 1: "Enabling cache_control breakpoints in Anthropic API requests"
Rewrite 2: "Setting up prefix caching for repeated system prompts"
Rewrite 3: "Configuring prompt cache with TTL in Claude API"

Now you have three queries. Three vector top-20 lists. One fused list. Your reranker has more good candidates to choose from.

Multi-query costs three LLM calls' worth of rewrites (fast, cheap — Haiku at small context) plus three retrieval calls (cheap) plus three embedding calls (cheap). The wall-clock hit is real but parallelizable. Run them concurrently.
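A rough sketch of the retrieve-concurrently-then-fuse step. The retrieve callable is a hypothetical stand-in for your Module 3 retriever, and k=60 is the constant commonly used for reciprocal rank fusion; your Module 3 code may use a different value. The tie-break on chunk id is an assumption added to keep the output deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def multi_query_retrieve(
    rewrites: list[str],
    retrieve: Callable[[str, int], list[str]],  # hypothetical: query, k -> ranked chunk ids
    top_k: int = 20,
) -> list[str]:
    """Run one retrieval per rewrite concurrently, then fuse into one deduped list."""
    with ThreadPoolExecutor(max_workers=max(1, len(rewrites))) as pool:
        ranked_lists = list(pool.map(lambda q: retrieve(q, top_k), rewrites))
    return rrf_fuse(ranked_lists)
```

Threads are enough here because each retrieval is I/O-bound. Whether to include the original question alongside the rewrites is a design choice this sketch leaves to the caller.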

Decomposition

Multi-hop questions need multi-hop retrieval. Given "Which chapter of Hawaii statutes governs HIePRO solicitations?" a cheap model decomposes into:

1. What is HIePRO?
2. Which chapter of Hawaii Revised Statutes governs [the answer to 1]?

You retrieve on (1), extract a fact from the retrieved chunks ("HIePRO is Hawaii's procurement e-portal, operated under chapter 103D"), substitute into (2), retrieve on (2), combine results.

This is the closest Module 4 gets to an agent. You're calling a model, using its output, calling it again. There is no loop — you decompose once, run two retrievals, done.

Decomposition has the highest cost of the three transforms (at least one extra LLM call per hop, plus extra retrieval) and the highest variance (if the decomposition is bad, both retrievals are bad). Use it only when the question is multi-hop.
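The two-hop flow can be sketched as below. Everything named here is hypothetical: retrieve stands in for your Module 3 pipeline, extract_fact for the model call that pulls a fact out of the first hop's chunks, and the "[the answer to 1]" placeholder simply mirrors the example sub-question above. The sketch handles at most two hops, matching "decompose once, run two retrievals."

```python
from typing import Callable


def decompose_retrieve(
    sub_questions: list[str],
    retrieve: Callable[[str, int], list[str]],        # hypothetical: query, k -> chunk ids
    extract_fact: Callable[[str, list[str]], str],    # hypothetical: question, chunks -> fact
    top_k: int = 5,
    placeholder: str = "[the answer to 1]",           # assumed placeholder convention
) -> list[str]:
    """Retrieve on hop 1, substitute the extracted fact into hop 2, retrieve again."""
    if not sub_questions:
        return []
    first_chunks = retrieve(sub_questions[0], top_k)
    if len(sub_questions) == 1:
        return first_chunks
    fact = extract_fact(sub_questions[0], first_chunks)
    second_question = sub_questions[1].replace(placeholder, fact)
    second_chunks = retrieve(second_question, top_k)
    # Combine results, keeping first-seen order and dropping duplicates.
    return list(dict.fromkeys(first_chunks + second_chunks))
```

There is no loop and no decision about whether to retrieve again, which is what keeps this on the transform side of the line rather than the agent side.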

How to combine them

The production pattern is not "turn them all on for every query." It is: classify the query, then apply the right transform.

A cheap classifier call (Haiku 4.5, single-label response) categorizes every incoming query:

simple — one fact, one chunk answers it. Run vanilla Module 3 pipeline.
ambiguous or short — apply HyDE.
broad — apply multi-query expansion.
multi-hop — apply decomposition.

This adds one LLM call per query. It saves the cost of running expensive transforms on cheap questions.
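One way the routing layer might look, as a sketch. The label strings, the ROUTES table, and the classify callable are all assumptions made for illustration; "ambiguous" is used here as the single label for the ambiguous-or-short category. Falling back to the vanilla pipeline on an unrecognized label is also an assumption, chosen because it is the cheapest safe default.

```python
from typing import Callable

# Hypothetical label -> transform mapping, following the four categories above.
ROUTES = {
    "simple": "none",
    "ambiguous": "hyde",        # covers ambiguous or short queries
    "broad": "multi_query",
    "multi-hop": "decompose",
}


def route(question: str, classify: Callable[[str], str]) -> str:
    """Map a classifier label to a transform name; unknown labels get no transform."""
    label = classify(question).strip().lower()
    return ROUTES.get(label, "none")
```

Normalizing the label before lookup matters in practice, since a model asked for a one-word answer will still sometimes return stray whitespace or capitalization.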

You don't need this routing layer for the ship gate. The ship gate only asks you to implement the three transforms and A/B them against the Module 3 baseline. But once you have it working, the routing layer is one prompt, one classifier, and a real production pattern.

What about agentic retrieval

A related technique — ReAct-style loops where the model retrieves, reads, decides it needs more context, retrieves again — sits right at the edge of what Module 4 does. That pattern is the core of the future Agents & Tool Use track. Module 4 stops at one-shot transformation. Any time the model is making a decision about whether to retrieve again, you've crossed into agent territory.

The distinction matters because agent loops are much harder to make fast and predictable. Module 4's transforms are bounded: HyDE is one extra call, multi-query is N, decomposition is usually two hops. Agent loops are unbounded and need retry logic, cost caps, and observability you haven't built yet. One track at a time.

Your build task

Write learner/module_4/. Expose:

def query(question: str, top_k: int = 5, transform: str = "none", config: str = "full") -> dict:
    ...

transform ∈ {"none", "hyde", "multi_query", "decompose"}. config is the Module 3 pipeline config (keep the knob so you can isolate what the transform is doing vs. what the pipeline is doing).

You'll implement three transform helpers:

hyde(question) -> str — Haiku call, returns the hypothetical answer text
multi_query_rewrite(question, n=3) -> list[str] — returns N query rewrites
decompose(question) -> list[str] — returns an ordered list of sub-questions (empty if not decomposable)
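A possible starting point for the dispatch inside query(), sketched under assumptions. The function name apply_transform is hypothetical, and the three helpers are passed in as arguments only so the sketch runs on its own; in your module they would be the helpers listed above. For "decompose" the returned sub-questions are ordered and still need sequential handling, since the second depends on the first hop's answer.

```python
from typing import Callable

VALID_TRANSFORMS = ("none", "hyde", "multi_query", "decompose")


def apply_transform(
    question: str,
    transform: str,
    hyde: Callable[[str], str],
    multi_query_rewrite: Callable[[str], list[str]],
    decompose: Callable[[str], list[str]],
) -> list[str]:
    """Return the text(s) the chosen transform produces for this question."""
    if transform not in VALID_TRANSFORMS:
        raise ValueError(f"unknown transform {transform!r}; expected one of {VALID_TRANSFORMS}")
    if transform == "hyde":
        return [hyde(question)]
    if transform == "multi_query":
        return multi_query_rewrite(question)
    if transform == "decompose":
        # Assumption: an empty list means "not decomposable", so use the raw question.
        return decompose(question) or [question]
    return [question]
```

Rejecting unknown transform names early keeps a typo in an A/B config from silently running the baseline.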

Ship gate

scripts/ship-gate.sh module_4 runs four transform configs against the boss set plus the Module 1/2/3 sets. Reports Recall@5, judge score, latency, cost for each. Pass when:

All four transform settings ("none" plus the three transforms) produce valid results
At least one transform beats the Module 3 full config by 10 points on Recall@5 or by 15 points on judge score
Latency cost is reported per transform (the learner must know what they bought)
Cost per query is reported per transform
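To sanity-check your own numbers before running the script, the improvement criterion can be sketched as follows. This is not the logic of scripts/ship-gate.sh, which is not shown here. The metric keys, the assumption that both metrics are on a 0-100 point scale, and the reading of "beats by 10 points" as "at least 10 points higher" are all assumptions.

```python
def beats_baseline(
    baseline: dict[str, float],
    candidate: dict[str, float],
    recall_margin: float = 10.0,   # points on Recall@5
    judge_margin: float = 15.0,    # points on judge score
) -> bool:
    """True if the candidate clears either margin over the baseline."""
    recall_gain = candidate["recall_at_5"] - baseline["recall_at_5"]
    judge_gain = candidate["judge_score"] - baseline["judge_score"]
    return recall_gain >= recall_margin or judge_gain >= judge_margin


def gate_passes(baseline: dict[str, float], results: dict[str, dict[str, float]]) -> bool:
    """True if at least one real transform (anything but "none") beats the baseline."""
    return any(
        beats_baseline(baseline, metrics)
        for name, metrics in results.items()
        if name != "none"
    )
```

A gain on either metric is enough, so a transform that lifts judge score without moving Recall@5 can still clear the bar.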

Boss challenge

Ten questions. Four short/ambiguous (HyDE targets). Three phrased differently from corpus vocabulary (multi-query targets). Three multi-hop (decomposition targets).

Expected results: Module 3 full is the baseline. HyDE fixes the short-query category. Multi-query helps across categories. Decomposition fixes the multi-hop category specifically. Best combination beats Module 3 by 15+ points.

What comes next

Module 5 is the evaluation framework — you build the scorer that every module so far has been leaning on. It is where the track's measurement discipline becomes a durable asset. After that, Module 6 puts the whole thing in production.