Hybrid search and reranking

Module 2 let you pick the embedding model and the chunking strategy. Your Recall@5 went up. Your boss challenge score went up. Certain question types stayed flat: the ones with specific keywords, the ones with near-duplicates, the ones with negation. Module 3 goes after those.

You will add a second retrieval system that does not use embeddings at all, fuse its results with your vector results, then pass the fused set through a dedicated reranking model that rescores each candidate for the specific query. By the end of the module, you will have a retrieval pipeline that looks a lot like what a production RAG system at a well-run company actually runs.

The two failures pure vector cannot solve

You have already felt both.

Keyword-specific queries

A user asks "what's the default TTL?" and they mean the word TTL. A good embedding model will softly understand "TTL" as "time to live" and "cache expiration" and the retrieved top-k will include chunks that never say the letters T-T-L. That's semantically smart. It is also sometimes wrong — the right chunk is the one that actually contains the acronym and its number.

Pure vector is a continuous signal. It rewards chunks that are near the query in meaning. BM25 is a discrete signal. It rewards chunks that literally contain the query's rarer words. When the question is "what's the default TTL?" you want both signals. Vector to catch paraphrased answers. BM25 to catch the sentence that contains the literal phrase.

Near-duplicates and distractors

A corpus about prompt caching mentions "prompt caching" in ten different chunks. The chunk that answers "what does cache_control do?" is one of them. Pure vector retrieval scores them all roughly the same, because they are all about the same topic. The chunk with the answer wins by a thin margin, or loses.

A cross-encoder reranker is a small transformer trained to score query-chunk pairs for relevance. It is slower than a bi-encoder (which is what your embedding model is — it encodes query and doc separately, then compares with cosine). But slower is fine when you are only reranking your top 20 or 50 candidates, not the whole corpus. The cross-encoder looks at the query and the candidate together, and its score reflects whether this candidate specifically answers this query. It pushes the right chunk to position one far more reliably than cosine alone.

The three pieces you build

Piece one — BM25 index

BM25 is a keyword ranker with a fifty-year pedigree (okay, thirty — Robertson and Walker, 1994). It scores a document against a query based on term frequency and inverse document frequency, with saturation terms so documents don't get rewarded for repeating a word ten times. The rank-bm25 Python library does the work in four lines.

You tokenize your chunks (lowercase, split on whitespace and punctuation, drop stopwords). You build a BM25Okapi index over the tokenized chunks. At query time, you tokenize the query the same way and call get_top_n. You get back the top-k chunks by BM25 score.
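To make the scoring concrete, here is a from-scratch sketch of a textbook Okapi BM25 score. It is not the rank-bm25 library's code, and the IDF variant and the defaults k1=1.5, b=0.75 are assumptions that may differ from what the library uses internally. The function name bm25_scores and the three toy chunks are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with a textbook Okapi BM25 formula."""
    n_chunks = len(tokenized_chunks)
    avg_len = sum(len(c) for c in tokenized_chunks) / n_chunks
    # Document frequency: in how many chunks does each term appear?
    doc_freq = Counter()
    for chunk in tokenized_chunks:
        doc_freq.update(set(chunk))

    scores = []
    for chunk in tokenized_chunks:
        term_freq = Counter(chunk)
        # Longer-than-average chunks are penalized, shorter ones are boosted.
        length_norm = 1 - b + b * len(chunk) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a larger IDF; the 1 + inside the log keeps it positive.
            idf = math.log(1 + (n_chunks - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Saturation: this ratio approaches (k1 + 1) as tf grows,
            # so the tenth repeat of a word adds far less than the first.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy chunks, already tokenized by a plain whitespace split.
chunks = [
    "the default ttl is 300 seconds".split(),
    "cache entries expire after a while".split(),
    "ttl ttl ttl ttl ttl ttl ttl ttl ttl ttl".split(),
]
scores = bm25_scores("default ttl".split(), chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
```

The first chunk wins: it matches both query terms, and "default" is the rarer one. The third chunk repeats "ttl" ten times and still loses, which is the saturation term doing its job.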

A few practical notes:

Tokenization matters. Underscore-separated identifiers (cache_control) should be split on the underscore so they match user queries. A naive whitespace split misses them.
BM25 does not know about phrases. If a user asks "cache control," BM25 scores chunks that contain "cache" and chunks that contain "control" independently. That is usually fine. Phrase-aware retrieval is a later problem.
BM25 is fast and cheap to rebuild. Building the index for a thousand chunks takes milliseconds. Do not cache it to disk unless your corpus is huge.
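A minimal tokenizer sketch that follows the notes above. The STOPWORDS set is a hypothetical, deliberately tiny list, and the regex assumes ASCII English text; adjust both for your corpus.

```python
import re

# Hypothetical stopword list, kept tiny for the example; a real one is longer.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "what", "s", "do", "does"}

def tokenize(text):
    """Lowercase, split on anything that is not a letter or digit, drop stopwords."""
    # The underscore is not in [a-z0-9], so cache_control becomes cache + control.
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOPWORDS]

identifier_query = tokenize("What does cache_control do?")
acronym_query = tokenize("what's the default TTL?")
```

Use the same function for chunks and for queries. If the two sides are tokenized differently, BM25 silently matches nothing.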

Piece two — Reciprocal Rank Fusion

You have two candidate lists now. Vector retrieval gives you one. BM25 gives you another. The natural question: how do you merge them?

The wrong answer is to take each list's raw scores, weight them, and sum them up. This breaks because the scores live on different scales: cosine similarity is bounded (between -1 and 1 in general, and mostly between 0 and 1 for text embeddings in practice), while BM25 scores are unbounded positive floats. Tuning the weight is hand-work that does not generalize.

The right answer is RRF. For each candidate, look at its rank in each list (1, 2, 3, ...). Compute 1 / (k + rank) for each list where it appears, where k is a small constant like 60. Sum across lists. The candidate that ranks high in both lists wins because you add two large reciprocals. A candidate that ranks first in one list and does not appear in the other still does well. A candidate that ranks low in both gets very little.

RRF is rank-only — it ignores the raw scores and only cares about position. That is the feature, not the bug. It is one line of code that works better than any score-based fusion you will tune in an afternoon.
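A short sketch of RRF as described above. The function name rrf_fuse matches the label in this module's pipeline diagram, but the signature, the top_n parameter, and the chunk ids are assumptions made for the example.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=20):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1; a chunk missing from a list simply adds nothing.
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    ordered = sorted(fused, key=fused.get, reverse=True)
    return ordered[:top_n]

vector_hits = ["c7", "c2", "c9", "c4"]
bm25_hits = ["c2", "c5", "c7", "c1"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

Here c2 (ranks 2 and 1) edges out c7 (ranks 1 and 3), and both land ahead of every chunk that appears in only one list.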

Piece three — Cross-encoder reranker

You have a fused list now. Top 20, say. Not top 5, because you want the reranker to have candidates to work with.

Cohere Rerank 3 is the pragmatic choice. It's a hosted API, $2 per 1000 searches, low latency. You send it the query and the list of candidate texts; it returns the same candidates with new scores. Keep the top 5.

If you want to stay off a paid API, a local cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 runs on CPU in 100ms or so for 20 candidates. Quality is lower than Cohere Rerank 3 but not by a lot on English documents.

Either way, the pattern is:

candidates = fused_top_20(query)
reranked = rerank(query, candidates)
top_5 = reranked[:5]
prompt = build(query, top_5)

Five minutes of code. Measurable impact on quality.
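The rerank step itself has a simple shape, sketched here with a pluggable scorer. The toy_overlap_score function is a stand-in so the example runs without a model or an API key; it is not a cross-encoder and says nothing about how Cohere Rerank or a MiniLM cross-encoder score a pair. In practice you would pass in a function that calls one of those.

```python
from typing import Callable, List

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], keep: int = 5) -> List[str]:
    """Rescore each (query, candidate) pair and keep the best `keep` candidates."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their fused order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]

def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer, NOT a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

candidates = [
    "prompt caching reduces latency",
    "cache_control marks a block as cacheable",
    "caching is billed per token",
]
top = rerank("what does cache_control do", candidates, toy_overlap_score, keep=2)
```

Whatever scorer you plug in, it sees the query and the candidate together, and it runs once per candidate. That is why you hand it 20 candidates and not the whole corpus.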

What the pipeline looks like end to end

Query
  ├── vector_retrieve(query, top_k=20)   ──┐
  └── bm25_retrieve(query, top_k=20)     ──┤
                                           ▼
                                    rrf_fuse(k=60)
                                           │
                                           ▼
                                    rerank(cohere-rerank-3)
                                           │
                                           ▼
                                  prompt.build(top_5)
                                           │
                                           ▼
                                 claude.messages.create

Compare to Module 1:

Query
  └── vector_retrieve(query, top_k=5)
                │
                ▼
           prompt.build(top_5)
                │
                ▼
      claude.messages.create

You added three steps. You added some latency (BM25 and RRF are effectively free; the reranker adds 50-150ms). You added some cost (the reranker API). In exchange, you should see recall and judge score climb meaningfully on the boss set.

What to measure

The Module 2 ship gate asked for Recall@5 and cost per config. Module 3 asks for the same metrics across four pipeline configs:

1. Vector-only (Module 2 best): your baseline for this module
2. BM25-only: useful sanity check; usually worse than vector on most queries, better on keyword-heavy queries
3. RRF fused: vector + BM25 without the reranker
4. Full hybrid + rerank: the pipeline above

The gate passes when full hybrid + rerank (4) beats vector-only (1) on Recall@5 by at least 15 percentage points on the Module 1 + Module 2 + Module 3 combined boss set.
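A sketch of the gate arithmetic. It assumes Recall@5 means "the fraction of questions whose top 5 contains at least one expected chunk"; if the ship gate script defines it differently, follow the script. The function names and the toy ids are hypothetical.

```python
def recall_at_k(retrieved, expected, k=5):
    """Fraction of questions whose top-k retrieved ids include at least one expected id."""
    hits = sum(
        1 for got, want in zip(retrieved, expected)
        if set(got[:k]) & set(want)
    )
    return hits / len(expected)

def gate_passes(full_recall, baseline_recall, min_gain_pts=15.0):
    """True when the full pipeline beats the baseline by at least min_gain_pts points."""
    # Round to absorb float noise such as 0.15000000000000002.
    gain_pts = round((full_recall - baseline_recall) * 100, 6)
    return gain_pts >= min_gain_pts

# One entry per question: expected chunk ids, then what each config retrieved.
expected = [["c1"], ["c2"], ["c3"], ["c4"]]
vector_only = [["c9", "c1"], ["c8"], ["c7"], ["c4"]]
full = [["c1"], ["c2"], ["c6"], ["c4"]]

baseline_recall = recall_at_k(vector_only, expected)
full_recall = recall_at_k(full, expected)
passed = gate_passes(full_recall, baseline_recall)
```

In this toy run the baseline hits 2 of 4 questions and the full pipeline hits 3 of 4, a 25-point gain, so the gate passes. Note the unit: percentage points, not percent improvement.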

Why a small reranker beats a big embedding model

A fair question: if the reranker is so much better than cosine similarity, why not use it for retrieval and skip the embedding step entirely?

Because reranking every chunk against every query is O(corpus_size) per query. A thousand-chunk corpus runs a thousand reranker calls per question. That doesn't scale and doesn't fit in latency budgets.

The structure of the pipeline is: use a cheap approximation (embedding + BM25) to find the 20 most-likely chunks, then use an expensive exact-ish method (cross-encoder) to reorder those 20. You pay for exactness only where it matters.

This is the same shape that shows up in ad-tech, in search engines, in recommendation systems. Retrieve-then-rerank is one of the most durable patterns in ranking.

Your build task

Write learner/module_3/. Expose:

def query(question: str, top_k: int = 5, config: str = "full") -> dict:
    ...

where config selects a pipeline: vector, bm25, rrf, or full. Each config runs against the same underlying Chroma collection — you don't rebuild the index, you just turn pipeline pieces on and off.

You'll also need a build_bm25_index() function that scans your corpus, tokenizes, and keeps the index in memory (or on disk — it's small). The ship gate script calls this before querying.
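One possible way to wire the config switch, sketched with injected stages so it runs on its own. Everything here beyond the query signature is an assumption: the make_query factory, the Retriever type, the pool size of 20, and especially the keys of the returned dict, which the task above does not specify. The stub stages return canned ids and stand in for your real retrievers.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]  # (question, n) -> ranked chunk ids

def make_query(vector: Retriever, bm25: Retriever,
               fuse: Callable[[List[List[str]]], List[str]],
               rerank: Callable[[str, List[str]], List[str]],
               pool: int = 20):
    """Build a query() whose config argument switches pipeline stages on and off."""
    def query(question: str, top_k: int = 5, config: str = "full") -> dict:
        if config == "vector":
            ids = vector(question, top_k)
        elif config == "bm25":
            ids = bm25(question, top_k)
        elif config in ("rrf", "full"):
            # Retrieve a wider pool from both systems, then fuse.
            ids = fuse([vector(question, pool), bm25(question, pool)])
            if config == "full":
                ids = rerank(question, ids)
        else:
            raise ValueError(f"unknown config: {config!r}")
        # Return shape is an assumption; match whatever the ship gate expects.
        return {"config": config, "chunk_ids": ids[:top_k]}
    return query

# Stub stages with canned results; swap in the real ones.
query = make_query(
    vector=lambda q, n: ["c1", "c2", "c3"][:n],
    bm25=lambda q, n: ["c3", "c4"][:n],
    fuse=lambda lists: list(dict.fromkeys(x for lst in lists for x in lst)),
    rerank=lambda q, ids: list(reversed(ids)),
)
result = query("what's the default TTL?", config="full")
```

All four configs share the same retrievers, so switching config never rebuilds anything. An unknown config fails loudly instead of quietly falling back to one of the four.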

Ship gate

scripts/ship-gate.sh module_3 runs four pipeline configs against the combined boss set and writes a report. Pass criteria:

All four configs produce valid results
Full hybrid + rerank beats vector-only by at least 15 pts on Recall@5
Reranker latency is reported per query (the cross-encoder adds to your latency budget; you document how much)
Reranker cost per 1000 queries is reported
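A sketch of how the last two criteria could be measured. The helper names and report keys are hypothetical, the fake_rerank function is a stand-in so the example runs offline, and one rerank search per query is an assumption. The 2.0 is the per-1000-searches price quoted earlier in this module for the hosted reranker.

```python
import statistics
import time

def timed_rerank(rerank_fn, query, candidates):
    """Run one rerank call and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = rerank_fn(query, candidates)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def cost_per_1000_queries(price_per_1000_searches, searches_per_query=1):
    """Reranker cost in dollars for 1000 queries, given the per-1000-search price."""
    return price_per_1000_searches * searches_per_query

def fake_rerank(query, candidates):
    """Stand-in reranker; replace with the real API or model call."""
    return sorted(candidates)

# Time the rerank step alone, once per query, then summarize.
latencies = [timed_rerank(fake_rerank, "q", ["b", "a"])[1] for _ in range(10)]
report = {
    "rerank_ms_median": statistics.median(latencies),
    "rerank_ms_max": max(latencies),
    "rerank_usd_per_1000_queries": cost_per_1000_queries(2.0),
}
```

Reporting the median next to the max is a design choice, not a requirement: a hosted API's slow calls tend to hide behind an average.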

Boss challenge

Ten questions in fixtures-boss.jsonl. Five are keyword-heavy (BM25 is expected to help). Three are near-duplicate distractors (reranker is expected to help). Two target acronyms and identifiers that embeddings mangle.

Vector-only should score low on the keyword and acronym questions. Full hybrid + rerank should score high on all three classes.

What happens next

Module 4 attacks the query itself. Sometimes the right chunk is in your index and no amount of retrieval or reranking pulls it up, because the query is phrased wrong. HyDE, multi-query, and decomposition fix that. But first, beat this boss.