Why an approximate ColBERT pipeline can quietly miss the right answer, even when it's already in the candidate pool.
Our late-interaction pipeline has two layers of Top-K truncation. Each one is a place where the right answer can disappear without warning. This page walks through the second, sneakier one: per-query-token truncation during the rerank step.
```python
ns.query(rank_by=("vector", "ANN", q), top_k=100)
# → returns 100 candidate doc IDs from a corpus of 10,000 (or 10M)
```
If the true best doc lands at dense rank 101, it's already gone. ColBERT never sees it. This is the well-known limitation of any retrieve-then-rerank pipeline.
```python
for q_tok in query_tokens:  # all 32 query tokens
    token_ns.query(
        rank_by=("vector", "ANN", q_tok),
        filters=("doc_id", "In", candidate_ids),  # the 100 candidate docs
        top_k=1500,  # per query token!
    )
```
Each query token's ANN search returns only the top 1500 doc tokens. If a candidate doc's best match for some query token sits at rank 1501, that doc gets zero contribution from that query token. Its MaxSim score is undercounted. It can be ranked below docs that scored "luckier" matches across the board.
The math behind the default: 100 candidate docs × ~15 tokens per Quora question = ~1500 doc tokens total inside the filter. Set top_k=1500 and you cover everything, every time.
That math holds for short questions. It breaks the moment your docs get longer or your candidate pool grows:
| scenario | candidates | tokens / doc | tokens in filter | top_k=1500 covers? |
|---|---|---|---|---|
| Quora questions (our default) | 100 | ~15 | ~1,500 | yes |
| Wider candidate pool | 500 | ~15 | ~7,500 | no, only 20% |
| Support articles | 100 | ~150 | ~15,000 | no, only 10% |
| Long documents at scale | 500 | ~150 | ~75,000 | no, only 2% |
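The coverage column follows from a single ratio. A toy helper (not part of the pipeline) makes the arithmetic explicit:

```python
def coverage(candidates: int, tokens_per_doc: int, top_k: int = 1500) -> float:
    """Fraction of the tokens inside the filter that one per-query-token
    ANN call can possibly return."""
    tokens_in_filter = candidates * tokens_per_doc
    return min(1.0, top_k / tokens_in_filter)

print(f"{coverage(100, 15):.0%}")    # Quora default: full coverage
print(f"{coverage(500, 150):.0%}")   # long documents at scale
```

Anything below 100% means some doc tokens are guaranteed to be invisible to that query token's ANN call.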
When the filter contains more tokens than top_k can return, every per-query-token ANN call gets a truncated view. Some docs' tokens fall off the cliff.
The query has 4 tokens; there are 5 candidate docs (X, A, B, C, D) and top_k=4 per query token. Each cell is the best similarity that query token found in that doc. Red cells fell just past the cutoff and became 0.

With no truncation, doc X wins. This is the ranking MaxSim is supposed to produce.

With top_k=4 truncation, doc X's "how" token landed at rank 5 in the ANN results, and so did its "online" token. Both fell past the cutoff and were treated as 0.
Doc X was the true best match. Two of its per-token similarities silently became 0 because they ranked just past the cutoff. Doc A wins, even though its true score (3.60) is lower than doc X's true score (3.70).
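The flip is easy to reproduce in a few lines. The similarity numbers below are invented to match the shape of the example (doc X truly sums to 3.70, doc A to 3.60, and X's "how" and "online" matches sit at rank 5); they are not the original figure's values:

```python
QUERY_TOKENS = ["how", "do", "sell", "online"]
# sims[doc][i] = best similarity query token i found in that doc
# (each doc reduced to one best match per query token for simplicity)
sims = {
    "X": [0.85, 0.98, 0.99, 0.88],   # true sum 3.70 -- the real best match
    "A": [0.90, 0.90, 0.90, 0.90],   # true sum 3.60
    "B": [0.91, 0.50, 0.50, 0.92],
    "C": [0.92, 0.50, 0.50, 0.91],
    "D": [0.93, 0.50, 0.50, 0.93],
}

def maxsim(top_k=None):
    """MaxSim over the candidates; docs ranked past top_k for a query
    token contribute 0 for that token."""
    scores = {doc: 0.0 for doc in sims}
    for i, _tok in enumerate(QUERY_TOKENS):
        ranked = sorted(sims, key=lambda d: sims[d][i], reverse=True)
        survivors = ranked if top_k is None else ranked[:top_k]
        for doc in survivors:
            scores[doc] += sims[doc][i]
    return max(scores, key=scores.get), scores

print(maxsim())        # exact: X wins
print(maxsim(top_k=4)) # truncated: X's two rank-5 tokens become 0, A wins
```

Running it shows X winning the exact ranking with 3.70 and losing the truncated one with 1.97, exactly the silent failure described above.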
Any match past the cutoff contributes 0 for that query token.

| lever | effect | cost |
|---|---|---|
| Increase top_k per query token | fewer truncation misses | larger response payload, more compute server-side; capped at 10,000 |
| Decrease the candidate pool | fewer total tokens competing for the cap | stage 1 might miss the right answer entirely (the other Top-K problem) |
| Use exact KNN for stage 2 | no truncation possible | scales poorly past a few thousand tokens; defeats the purpose of ANN |
| Server-side MaxSim operator | server walks every candidate doc's full token list, no per-query-token truncation needed | none, if turbopuffer ships it (see main walkthrough → section 7) |
| Multi-vector storage with row-level reads | fetch each candidate's full token list directly, then compute MaxSim locally | larger response payload, since every candidate's full token list comes back |
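The last row's client-side computation is straightforward once full token lists are in hand. A minimal sketch, where `fetch_doc_tokens()` is a hypothetical row-level read and all embeddings are assumed normalized:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_local(query_tokens, doc_tokens):
    """Exact MaxSim: best doc-token similarity per query token, summed.
    Both arguments are lists of (normalized) embedding vectors.
    No per-query-token ANN call, so nothing can fall past a cutoff."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Usage sketch over the stage-1 candidates (fetch_doc_tokens is assumed):
# scores = {doc_id: maxsim_local(query_tokens, fetch_doc_tokens(doc_id))
#           for doc_id in candidate_ids}
```

This trades network payload for exactness: stage 1 can still miss a doc, but no doc that reaches stage 2 is ever undercounted.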
The current ColBERT rerank function is an approximation with recall loss, not a faithful implementation of MaxSim. Two truncation steps can hide the right answer: the stage-1 candidate cutoff (top_k=100) and the stage-2 per-query-token cutoff (top_k=1500).
For Quora-length questions with top_k=1500, the second truncation almost never bites. For longer docs or wider candidate pools, it does, and the failure mode is silent. A faithful MaxSim would require either exact KNN or a server-side operator that walks each candidate's full token list, which is exactly the multi-vector pitch in the main guide.
Companion to the late interaction guide · back to main walkthrough →