The Top-K problem in late interaction

Why an approximate ColBERT pipeline can quietly miss the right answer, even when it's already in the candidate pool.

The setup

Our late-interaction pipeline has two layers of Top-K truncation. Each one is a place where the right answer can disappear without warning. This page walks through the second, sneakier one: per-query-token truncation during the rerank step.

1. Where Top-K shows up in the pipeline

Stage 1: dense ANN over the full corpus

    ns.query(rank_by=("vector", "ANN", q), top_k=100)
    # → returns 100 candidate doc IDs from a corpus of 10,000 (or 10M)

First truncation: the candidate pool.

If the true best doc lands at dense rank 101, it's already gone. ColBERT never sees it. This is the well-known limitation of any retrieve-then-rerank pipeline.

Stage 2: per-query-token ANN over the token namespace

    for q_tok in query_tokens:  # 32 query tokens
        token_ns.query(
            rank_by=("vector", "ANN", q_tok),
            filters=("doc_id", "In", candidate_ids),  # 100 docs
            top_k=1500,  # per query token!
        )

Second truncation: the per-query-token cap.

Each query token's ANN search returns only the top 1500 doc tokens. If a candidate doc's best match for some query token sits at rank 1501, that doc gets zero contribution from that query token. Its MaxSim score is undercounted. It can be ranked below docs that scored "luckier" matches across the board.
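The aggregation step can be sketched in a few lines. This assumes each per-query-token ANN call returns (doc_id, similarity) hits; the helper name and data shapes are illustrative, not the actual pipeline code:

```python
from collections import defaultdict

def maxsim_from_ann_results(per_token_hits):
    """per_token_hits: one list of (doc_id, similarity) hits per query token,
    as returned by the truncated top_k ANN calls."""
    scores = defaultdict(float)
    for hits in per_token_hits:
        best = {}  # best similarity per doc for THIS query token
        for doc_id, sim in hits:
            best[doc_id] = max(sim, best.get(doc_id, sim))
        for doc_id, sim in best.items():
            scores[doc_id] += sim
        # Docs absent from `hits` add nothing here: that is the silent 0.
    return dict(scores)

hits = [[("A", 0.9), ("B", 0.8)],  # query token 1
        [("B", 0.7)]]              # query token 2: doc A fell past top_k
scores = maxsim_from_ann_results(hits)  # {"A": 0.9, "B": 1.5}
```

Note that doc A's undercounted score (0.9 instead of 0.9 + whatever its token-2 match was) looks like any other score; nothing marks it as truncated.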

2. Why 1500 isn't always enough

The math behind the default: 100 candidate docs × ~15 tokens per Quora question = ~1500 doc tokens total inside the filter. Set top_k=1500 and you cover everything, every time.

That math holds for short questions. It breaks the moment your docs get longer or your candidate pool grows:

scenario                        candidates   tokens/doc   tokens in filter   top_k=1500 covers?
Quora questions (our default)   100          ~15          ~1,500             yes
Wider candidate pool            500          ~15          ~7,500             no, only 20%
Support articles                100          ~150         ~15,000            no, only 10%
Long documents at scale         500          ~150         ~75,000            no, only 2%

When the filter contains more tokens than top_k can return, every per-query-token ANN call gets a truncated view. Some docs' tokens fall off the cliff.
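The coverage percentages in the table are just the ratio of top_k to tokens inside the filter. A quick sanity check (this assumes doc tokens are spread evenly across candidates, which is a simplification, not how ANN results actually distribute):

```python
def coverage(candidates, tokens_per_doc, top_k=1500):
    # Fraction of the filtered token pool that a single ANN call can return.
    tokens_in_filter = candidates * tokens_per_doc
    return min(1.0, top_k / tokens_in_filter)

print(coverage(100, 15))   # 1.0  -> fully covered
print(coverage(500, 15))   # 0.2  -> only 20% of filtered tokens can come back
print(coverage(100, 150))  # 0.1
print(coverage(500, 150))  # 0.02
```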

3. Concrete example

The query has 4 tokens, there are 5 candidate docs (X, A, B, C, D), and top_k=4 per query token. Each cell is the best similarity that query token found in that doc. In the observed table below, cells that fell just past the cutoff became 0.00.

Truth (no truncation)

         how    make   money  online  score
doc X    0.83   0.94   0.97   0.96    3.70
doc A    0.91   0.95   0.81   0.93    3.60
doc B    0.89   0.71   0.74   0.85    3.19
doc C    0.87   0.68   0.69   0.82    3.06
doc D    0.85   0.55   0.55   0.78    2.73

doc X wins. This is the ranking MaxSim is supposed to produce.

Observed (with top_k=4 truncation)

Doc X's best "how" token landed at rank 5 in that query token's ANN results, and its best "online" token did as well (the ANN ranks individual doc tokens, so other docs' non-best tokens can fill the top slots). Both fell past the top_k=4 cutoff and got treated as 0.

         how    make   money  online  score
doc X    0.00   0.94   0.97   0.00    1.91
doc A    0.91   0.95   0.81   0.93    3.60
doc B    0.89   0.71   0.74   0.85    3.19
doc C    0.87   0.68   0.69   0.82    3.06
doc D    0.85   0.55   0.55   0.78    2.73

What happened

Doc X was the true best match. Two of its per-token similarities silently became 0 because they ranked just past the cutoff. Doc A wins, even though its true score (3.60) is lower than doc X's true score (3.70).
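The two tables can be replayed in a few lines of plain Python to confirm the flip; the similarity grids are copied from above:

```python
# Per-query-token similarities for (how, make, money, online), per doc.
truth = {
    "X": [0.83, 0.94, 0.97, 0.96],
    "A": [0.91, 0.95, 0.81, 0.93],
    "B": [0.89, 0.71, 0.74, 0.85],
    "C": [0.87, 0.68, 0.69, 0.82],
    "D": [0.85, 0.55, 0.55, 0.78],
}
observed = {doc: sims[:] for doc, sims in truth.items()}
observed["X"][0] = 0.0  # "how" fell past the top_k=4 cutoff
observed["X"][3] = 0.0  # "online" fell past the cutoff too

true_winner = max(truth, key=lambda d: sum(truth[d]))         # "X" (3.70)
observed_winner = max(observed, key=lambda d: sum(observed[d]))  # "A" (3.60)
```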

4. Why this is hard to spot

Nothing fails loudly. Every query still returns a full, plausible-looking ranking; truncated docs don't disappear, they just score lower than they should. The undercounting happens inside scores you never see broken down per token, so the only symptom is statistical: recall against an exact-MaxSim baseline drifts down as docs get longer or the candidate pool widens.

5. The knobs you can turn

Increase top_k per query token
  effect: fewer truncation misses
  cost: larger response payload, more compute server-side; capped at 10,000

Decrease the candidate pool
  effect: fewer total tokens competing for the cap
  cost: stage 1 might miss the right answer entirely (the other Top-K problem)

Use exact KNN for stage 2
  effect: no truncation possible
  cost: scales poorly past a few thousand tokens; defeats the purpose of ANN

Server-side MaxSim operator
  effect: the server walks every candidate doc's full token list, so no per-query-token truncation is needed
  cost: none, if turbopuffer ships it (see main walkthrough → section 7)

Multi-vector storage with row-level reads
  effect: fetch each candidate's full token list directly, then compute MaxSim locally
  cost: larger response payload than Option B above
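The last option is simple to sketch client-side. `fetch_doc_tokens` below is a hypothetical helper standing in for whatever row-level read the store provides, and vectors are plain Python lists; this illustrates exact client-side MaxSim, not turbopuffer API code:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    # Best match over ALL of the doc's tokens for each query token:
    # no per-query-token cutoff anywhere, so nothing can be undercounted.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def rerank(query_tokens, candidate_ids, fetch_doc_tokens):
    scored = [(maxsim(query_tokens, fetch_doc_tokens(doc_id)), doc_id)
              for doc_id in candidate_ids]
    return sorted(scored, reverse=True)

# Toy usage with an in-memory "store" playing the row-level read:
store = {"a": [[1.0, 0.0]],
         "b": [[0.6, 0.0], [0.0, 0.9]]}
ranked = rerank([[1.0, 0.0], [0.0, 1.0]], ["a", "b"], store.get)
# doc "b" outranks doc "a" (scores 1.5 vs 1.0)
```

The trade-off is exactly the one in the table: every candidate's full token list crosses the wire, so payload grows with document length.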

6. The takeaway for the assignment

What this proves about the implementation

The current ColBERT rerank function is an approximation with recall loss, not a faithful implementation of MaxSim. Two truncation steps can hide the right answer:

  1. Stage 1 dense top_k decides which 100 docs even reach the rerank.
  2. Stage 2 per-query-token top_k decides which doc tokens contribute to each doc's score.

For Quora-length questions with top_k=1500, the second truncation almost never bites. For longer docs or wider candidate pools, it does, and the failure mode is silent. A faithful MaxSim would require either exact KNN or a server-side operator that walks each candidate's full token list, which is exactly the multi-vector pitch in the main guide.

Companion to the late interaction guide · back to main walkthrough →