Embeddings and chunking strategy

Module 1 gave you a working RAG pipeline. It also gave you two numbers: your Recall@5 and your judge score on the boss challenge. Those numbers are low enough that you know something is wrong and high enough that you know something is right. This module makes both numbers move by swapping two pieces: how you cut your documents into chunks, and which model you use to turn those chunks into vectors.

Why these two levers, and only these two

There are a hundred knobs in a RAG system. Most of them matter less than you think. Retrieval succeeds or fails based on two questions:

Is the right chunk a sensible unit of text? (chunking)
Does the embedding model put that chunk near semantically relevant queries? (embedding)

Everything later — reranking, hybrid search, query transformation, agentic loops — is fixing failures that better chunks and better embeddings would have prevented. If you get these two right, the rest of the track is refinement. If you get them wrong, nothing downstream can fully save you.

Chunking strategy, in four flavors

You're going to implement three beyond the Module 1 fixed-size baseline, and you're going to run them against the boss set.

Fixed-size (baseline)

Take the text, cut every N characters, allow some overlap between adjacent chunks. That is Module 1. It ignores sentences, paragraphs, sections, tables. It cuts mid-word. It separates a definition from the sentence that uses it.

Why this is the baseline: it works well enough for surprisingly many corpora. Uniform chunk size suits embedding models that were trained on roughly uniform inputs. And it's trivial to implement.

Why this breaks: when an answer needs to span a chunk boundary. When a heading belongs with its section body. When code blocks should not be halved. When tables lose their captions.
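The fixed-size baseline fits in a dozen lines. This is a minimal sketch, not the Module 1 code itself: the function name `fixed_size_chunks` and the default `size` and `overlap` values are assumptions chosen for illustration.

```python
def fixed_size_chunks(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Cut text every `size` characters, repeating `overlap` characters between neighbors."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text; small sizes make the mid-word cuts easy to see.
sample = "Prompt caching stores a prefix. A cache breakpoint marks where the prefix ends."
for chunk in fixed_size_chunks(sample, size=30, overlap=5):
    print(repr(chunk))
```

Run it on a paragraph from your own corpus and read the boundaries: the failure modes listed above show up immediately.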

Structural chunking

Cut on the document's own structure. Markdown headings split the document. HTML <section> tags split the document. PDF page breaks split the document. Each leaf is a chunk.

This works when your corpus has structure you can trust. A developer documentation site has clean h2/h3 hierarchy. A scanned PDF of a 1980s contract does not.

Structural chunks vary in size — some are 200 characters, some are 3000. That variance can hurt retrieval (embedding quality drops at both tails) and can overflow context at generation time. Mitigate with a max-size fallback that further splits oversized chunks.
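Here is one way the heading split and the max-size fallback could fit together for Markdown input. It is a sketch under assumptions: the names `structural_chunks` and `split_on_paragraphs` and the `max_chars` default are hypothetical, and the heading regex is naive (a `#` comment inside a code block would be treated as a heading).

```python
import re

# Matches ATX-style Markdown headings such as "## Pricing".
HEADING = re.compile(r"^#{1,6}\s+.*$", re.MULTILINE)


def split_on_paragraphs(section: str, max_chars: int) -> list[str]:
    """Fallback: pack paragraphs greedily; hard-cut a paragraph that is itself too long."""
    pieces, current = [], ""
    for para in section.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            pieces.append(current)
        # A single paragraph over the limit gets cut at fixed offsets.
        while len(para) > max_chars:
            pieces.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        pieces.append(current)
    return pieces


def structural_chunks(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split on Markdown headings; further split any section longer than max_chars."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    chunks = []
    for begin, end in zip(bounds, bounds[1:]):
        section = markdown[begin:end].strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks.extend(split_on_paragraphs(section, max_chars))
    return chunks
```

Each heading stays attached to the body that follows it, which is the property fixed-size chunking cannot give you.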

Semantic chunking

Embed every sentence. Walk through the document adding sentences to a current chunk until a sentence comes along whose embedding is more than some threshold away from the running chunk centroid. Start a new chunk.

The theory: chunks end at semantic breakpoints. A paragraph about caching pricing becomes one chunk; when the document shifts to breakpoint syntax, a new chunk starts.

The cost: you embed every sentence during ingest, not just every chunk. For a corpus of thousands of pages, that is a real bill. Most of the time the gains in retrieval quality justify it; sometimes they don't. You'll measure and see.
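The centroid walk can be sketched without committing to a provider. Everything named here is an assumption for illustration: `semantic_chunks`, the `threshold` default, the regex sentence splitter, and `toy_embed`, which is a keyword counter standing in for a real embedding call. The sketch hands the sentences to `embed` as a list so a provider that accepts batches can be used.

```python
import math
import re
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(
    text: str,
    embed: Callable[[list[str]], list[Vector]],
    threshold: float = 0.35,  # assumed value; tune it on your corpus
) -> list[str]:
    """Start a new chunk when a sentence drifts too far from the running centroid."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks = []
    current = [sentences[0]]
    total = list(vectors[0])  # running sum; the centroid is total / len(current)
    for sentence, vec in zip(sentences[1:], vectors[1:]):
        centroid = [t / len(current) for t in total]
        if cosine_distance(vec, centroid) > threshold:
            chunks.append(" ".join(current))
            current, total = [sentence], list(vec)
        else:
            current.append(sentence)
            total = [t + v for t, v in zip(total, vec)]
    chunks.append(" ".join(current))
    return chunks


def toy_embed(sentences: list[str]) -> list[Vector]:
    # Stand-in for a real embedding API: counts two hand-picked keywords.
    return [
        [s.lower().count("price") + 0.01, s.lower().count("syntax") + 0.01]
        for s in sentences
    ]


text = (
    "Cached reads have a lower price. The price depends on the model. "
    "Breakpoint syntax uses a marker. The syntax is a JSON field."
)
for chunk in semantic_chunks(text, toy_embed):
    print(repr(chunk))
```

Swap `toy_embed` for a function that calls your provider and the rest stays the same. The threshold is the knob you will spend the most time on.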

Hierarchical chunking

Small chunks for retrieval, larger chunks for generation. Index the small chunks. When one scores well, pull its parent (the section it belongs to) into the prompt instead of the leaf.

This decouples two conflicting pressures. Small chunks retrieve precisely — a one-paragraph chunk has a focused meaning. Large chunks generate well — Claude answers better when it has context around the retrieved snippet, not just the snippet.

Implementation: store both. Leaf chunks go into Chroma with a parent-id metadata field. When you retrieve, you fetch leaves and then dereference to their parents for the prompt. The extra lookup is cheap.
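The dereference step is small enough to show with plain dictionaries. This sketch leaves Chroma out entirely: `Leaf`, `parents_for_prompt`, and `max_parents` are hypothetical names, and in your build the `parent_id` would come from the leaf's metadata rather than a dataclass field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    leaf_id: str
    parent_id: str  # in a vector store this would live in the chunk's metadata
    text: str


def parents_for_prompt(
    ranked_leaves: list[Leaf],
    parents: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Map ranked leaf hits to their parent sections, deduplicated, best hit first."""
    seen: set[str] = set()
    sections: list[str] = []
    for leaf in ranked_leaves:
        if leaf.parent_id in seen:
            continue  # two leaves from the same section add the section only once
        seen.add(leaf.parent_id)
        sections.append(parents[leaf.parent_id])
        if len(sections) == max_parents:
            break
    return sections
```

Deduplicating matters: when the top five leaves all come from one section, you want that section in the prompt once, not five times.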

Embedding models, in two families

You will test at least two, and you will feel the difference on your corpus.

Voyage `voyage-3`

Voyage is a small provider (now owned by MongoDB) that ships embedding models tuned specifically for retrieval. In independent evaluations, voyage-3 leads or ties best-in-class models on retrieval benchmarks while costing less than OpenAI's `text-embedding-3-large`.

One thing to watch — Voyage's free tier caps you at 3 requests per minute until you add a payment method. You will hit this while ingesting anything larger than your first few seed documents. Add a card; the free tier token count stays intact.
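If you want to keep experimenting before you add a card, one option is to pace requests on the client side. This is a minimal sketch assuming the 3-requests-per-minute cap described above; `throttled` and its parameters are hypothetical names, and the `sleep` and `clock` arguments exist only so the pacing can be tested without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    per_minute: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items no faster than `per_minute`, sleeping between them as needed."""
    interval = 60.0 / per_minute
    next_allowed = clock()
    for item in items:
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        next_allowed = max(clock(), next_allowed) + interval
        yield item
```

Wrap your list of request batches in `throttled(...)` and send one request per yielded batch. Put as many chunks into each batch as the provider allows, since the cap counts requests, not chunks.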

OpenAI `text-embedding-3-small` and `text-embedding-3-large`

OpenAI's small model is the cheap option. The large model is closer in quality to Voyage but costs more. The small model produces 1536-dimensional vectors and the large model 3072-dimensional vectors. Both are trained with Matryoshka Representation Learning, so you can shorten the vectors with no major quality loss: pass the `dimensions` parameter in the request, or slice the output yourself and re-normalize it.
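If you shorten a vector yourself rather than asking the API for fewer dimensions, the sliced vector is no longer unit length, so normalize it again before storing or comparing it. A minimal sketch, with `truncate_embedding` as a hypothetical helper name:

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]
```

Use the same `dims` for documents and queries. A collection indexed at one length cannot be queried with vectors of another.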

If you already have an OpenAI key for something else, the small model is a zero-friction baseline.

Other models you could try (out of scope for ship gate)

nomic-embed-text — open-source, can run locally on CPU. No rate limit because it's your CPU. Slower at query time; fine for ingest.
bge-large-en-v1.5 — strong open-source model, requires a GPU or CPU patience.
Cohere embed-v3 — good, but we save Cohere for Module 3's reranker.

How to compare them honestly

Fix the question set. Fix the metric. Vary one thing at a time.

Question set

Reuse the Module 1 boss set, plus five new questions you add in content/module-2-embeddings/fixtures-boss.jsonl. The new questions should target chunking specifically — answers that span boundaries, tables with captions, section-dependent terminology.

Metrics

Recall@5: did the expected source appear in the top five retrieved chunks?
Recall@10: same, but the top ten. If Recall@10 is high but Recall@5 is low, your reranker has room to work in Module 3.
MRR (Mean Reciprocal Rank): 1/rank of the expected source, averaged across the question set, counting 0 when the source is not retrieved at all. Captures "did the expected source come up at all, and how high?"
Judge score: ask Haiku 4.5 to score the generated answer for faithfulness (0-5). Not perfect, but reproducible.
Cost per 1000 queries: embedding and generation combined.
Ingest cost: one-time, but not zero. Fixed-size chunking with your cheapest embedding model is the low end; OpenAI-large + semantic is the high end. The factor between them is often 20x.
Ingest time: how long it takes to reindex the whole corpus. Matters when you iterate.
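The two ranking metrics are a few lines each. This sketch assumes every question has exactly one expected source id and that your retriever returns source ids in rank order; the function names and the sample data are made up for illustration.

```python
def recall_at_k(results: list[list[str]], expected: list[str], k: int) -> float:
    """Fraction of questions whose expected source is in the top k results."""
    if not expected:
        raise ValueError("need at least one question")
    hits = sum(1 for ranked, want in zip(results, expected) if want in ranked[:k])
    return hits / len(expected)


def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average of 1/rank of the expected source; a miss contributes 0."""
    if not expected:
        raise ValueError("need at least one question")
    total = 0.0
    for ranked, want in zip(results, expected):
        if want in ranked:
            total += 1.0 / (ranked.index(want) + 1)
    return total / len(expected)


# Made-up retrieval output for three questions.
results = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
expected = ["a", "z", "q"]
print(recall_at_k(results, expected, k=2))      # 1 of 3 questions hit
print(mean_reciprocal_rank(results, expected))  # (1 + 1/3 + 0) / 3
```

Compute both on the same retrieval run so a change in one can be read against the other.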

Matrix

| | voyage-3 | oai-small | oai-large |
|---|---|---|---|
| fixed-size | Module 1 baseline | | |
| structural | | | |
| semantic | | | |
| hierarchical | | | |

You don't have to fill every cell. Start with the four that matter: the Module 1 baseline (top-left), the best chunking with voyage-3, the best chunking with OpenAI, and one cell where you expect the combination to win. The ship gate wants six cells, so add two more of your choosing.

What you will probably find

Three things tend to happen on real corpora. Your results may differ — that's why you measure.

Voyage wins on cosine retrieval for technical documentation. OpenAI wins for conversational or narrative documents.
Semantic chunking beats fixed-size on Recall@5 by 5-15 points, and the gap grows for questions that span traditional chunk boundaries.
Hierarchical chunking improves judge score more than it improves retrieval metrics. The retrieval is similar; the generation has more context and hallucinates less.

None of these is universal. The whole point of the module is to stop treating them as universal.

Your build task

Write learner/module_2/. Expose:

def query(question: str, top_k: int = 5, config: str = "default") -> dict:
    ...

where config selects which of your ingested indexes to query. Populate your configs with names like voyage-fixed, oai-small-semantic, voyage-hierarchical.

Also write learner/module_2/ingest.py that takes a chunking strategy and an embedding provider and builds a Chroma collection named accordingly. Module 1 had one collection; Module 2 has six (or however many you fill in).
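One way to keep config names and collection names in sync is to derive both from the same two parts. Treat this as a sketch, not the required layout: `PROVIDERS`, `STRATEGIES`, `ALIASES`, the `module2-` prefix, and the choice of what `"default"` points to are all assumptions.

```python
# Short provider keys as used in config names, mapped to model identifiers.
PROVIDERS = {
    "voyage": "voyage-3",
    "oai-small": "text-embedding-3-small",
    "oai-large": "text-embedding-3-large",
}
STRATEGIES = ("fixed", "structural", "semantic", "hierarchical")
ALIASES = {"default": "voyage-fixed"}  # assumed: "default" means the Module 1 setup


def parse_config(config: str) -> tuple[str, str]:
    """Split a name like 'oai-small-semantic' into ('oai-small', 'semantic')."""
    config = ALIASES.get(config, config)
    provider, _, strategy = config.rpartition("-")
    if provider not in PROVIDERS or strategy not in STRATEGIES:
        raise ValueError(f"unknown config: {config!r}")
    return provider, strategy


def collection_name(config: str) -> str:
    """Name of the collection that ingest builds and query() reads for this config."""
    provider, strategy = parse_config(config)
    return f"module2-{provider}-{strategy}"
```

With this in place, `ingest.py` and `query()` can both call `collection_name(config)`, so a typo in a config name fails loudly instead of silently querying the wrong index.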

Ship gate

scripts/ship-gate.sh module_2 runs your six configs against the Module 1 fixtures and the new Module 2 fixtures, produces a markdown table under data/reports/module-2.md, and checks:

All six configs produce a query() result
At least one config beats the Module 1 baseline on Recall@5 by at least 10 percentage points
The report contains cost per config (not just quality)
You can point to the winning config and articulate why it won

Pass the gate and Module 3 unlocks. If the boss sees Recall@5 worse than Module 1, something is wrong with your implementation, not the theory.

Boss challenge

Eight questions in fixtures-boss.jsonl. Three target chunking (answers that cross boundaries, table-with-caption, section-dependent terminology). Three target the embedding provider (technical jargon, narrative synonyms, Hawaiian-pidgin contractor terms). Two test hierarchical chunking specifically: questions whose leaf chunk is in your top-1 but whose parent chunk is what Claude actually needs to answer well.

Expected baseline: your best Module 1 config scores roughly 50% on this set. Your best Module 2 config should clear 70%. If you can't make that gap happen, keep iterating — the answer is in there.

Why this matters for the rest of the track

Module 3 adds hybrid search and reranking. Those techniques compound with good chunking and good embeddings; they don't replace them. If you enter Module 3 with a weak Module 2 foundation, Module 3's gains will be smaller than they should be. Spend the time here.