The Top-K problem in late interaction

Why an approximate ColBERT pipeline can quietly miss the right answer, even when it's already in the candidate pool.

The setup

Our late-interaction pipeline has two layers of Top-K truncation. Each one is a place where the right answer can disappear without warning. This page walks through the second, sneakier one: per-query-token truncation during the rerank step.

1. Where Top-K shows up in the pipeline

Stage 1: dense ANN over the full corpus

    ns.query(rank_by=("vector", "ANN", q), top_k=100)
    # → returns 100 candidate doc IDs from a corpus of 10,000 (or 10M)

First truncation: the candidate pool.

If the true best doc lands at dense rank 101, it's already gone. ColBERT never sees it. This is the well-known limitation of any retrieve-then-rerank pipeline.

Stage 2: per-query-token ANN over the token namespace

    for q_tok in query_tokens:  # 32 query tokens
        token_ns.query(
            rank_by=("vector", "ANN", q_tok),
            filters=("doc_id", "In", candidate_ids),  # 100 docs
            top_k=1500,  # per query token!
        )

Second truncation: the per-query-token cap.

Each query token's ANN search returns only the top 1500 doc tokens. If a candidate doc's best match for some query token sits at rank 1501, that doc gets zero contribution from that query token. Its MaxSim score is undercounted. It can be ranked below docs that scored "luckier" matches across the board.
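The aggregation step can be sketched in a few lines. This assumes each per-query-token ANN call returns (doc_id, similarity) hits; the helper name and data shapes are illustrative, not the actual pipeline code:

```python
from collections import defaultdict

def maxsim_from_ann_results(per_token_hits):
    """per_token_hits: one list of (doc_id, similarity) hits per query token,
    as returned by the truncated top_k ANN calls."""
    scores = defaultdict(float)
    for hits in per_token_hits:
        best = {}  # best similarity per doc for THIS query token
        for doc_id, sim in hits:
            best[doc_id] = max(sim, best.get(doc_id, sim))
        for doc_id, sim in best.items():
            scores[doc_id] += sim
        # Docs absent from `hits` add nothing here: that is the silent 0.
    return dict(scores)

hits = [[("A", 0.9), ("B", 0.8)],  # query token 1
        [("B", 0.7)]]              # query token 2: doc A fell past top_k
scores = maxsim_from_ann_results(hits)  # {"A": 0.9, "B": 1.5}
```

Note that doc A's undercounted score (0.9 instead of 0.9 + whatever its token-2 match was) looks like any other score; nothing marks it as truncated.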

2. Why 1500 isn't always enough

The math behind the default: 100 candidate docs × ~15 tokens per Quora question = ~1500 doc tokens total inside the filter. Set top_k=1500 and you cover everything, every time.

That math holds for short questions. It breaks the moment your docs get longer or your candidate pool grows:

scenario                        candidates   tokens/doc   tokens in filter   top_k=1500 covers?
Quora questions (our default)   100          ~15          ~1,500             yes
Wider candidate pool            500          ~15          ~7,500             no, only 20%
Support articles                100          ~150         ~15,000            no, only 10%
Long documents at scale         500          ~150         ~75,000            no, only 2%

When the filter contains more tokens than top_k can return, every per-query-token ANN call gets a truncated view. Some docs' tokens fall off the cliff.
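The coverage percentages in the table are just the ratio of top_k to tokens inside the filter. A quick sanity check (this assumes doc tokens are spread evenly across candidates, which is a simplification, not how ANN results actually distribute):

```python
def coverage(candidates, tokens_per_doc, top_k=1500):
    # Fraction of the filtered token pool that a single ANN call can return.
    tokens_in_filter = candidates * tokens_per_doc
    return min(1.0, top_k / tokens_in_filter)

print(coverage(100, 15))   # 1.0  -> fully covered
print(coverage(500, 15))   # 0.2  -> only 20% of filtered tokens can come back
print(coverage(100, 150))  # 0.1
print(coverage(500, 150))  # 0.02
```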

3. Concrete example

The query has 4 tokens, there are 5 candidate docs (X, A, B, C, D), and top_k=4 per query token. Each cell is the best similarity that query token found in that doc. In the observed table below, cells that fell just past the cutoff became 0.00.

Truth (no truncation)

         how    make   money  online  score
doc X    0.83   0.94   0.97   0.96    3.70
doc A    0.91   0.95   0.81   0.93    3.60
doc B    0.89   0.71   0.74   0.85    3.19
doc C    0.87   0.68   0.69   0.82    3.06
doc D    0.85   0.55   0.55   0.78    2.73

doc X wins. This is the ranking MaxSim is supposed to produce.

Observed (with top_k=4 truncation)

Doc X's best "how" token landed at rank 5 in that query token's ANN results, and its best "online" token did as well (the ANN ranks individual doc tokens, so other docs' non-best tokens can fill the top slots). Both fell past the top_k=4 cutoff and got treated as 0.

         how    make   money  online  score
doc X    0.00   0.94   0.97   0.00    1.91
doc A    0.91   0.95   0.81   0.93    3.60
doc B    0.89   0.71   0.74   0.85    3.19
doc C    0.87   0.68   0.69   0.82    3.06
doc D    0.85   0.55   0.55   0.78    2.73

What happened

Doc X was the true best match. Two of its per-token similarities silently became 0 because they ranked just past the cutoff. Doc A wins, even though its true score (3.60) is lower than doc X's true score (3.70).
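The two tables can be replayed in a few lines of plain Python to confirm the flip; the similarity grids are copied from above:

```python
# Per-query-token similarities for (how, make, money, online), per doc.
truth = {
    "X": [0.83, 0.94, 0.97, 0.96],
    "A": [0.91, 0.95, 0.81, 0.93],
    "B": [0.89, 0.71, 0.74, 0.85],
    "C": [0.87, 0.68, 0.69, 0.82],
    "D": [0.85, 0.55, 0.55, 0.78],
}
observed = {doc: sims[:] for doc, sims in truth.items()}
observed["X"][0] = 0.0  # "how" fell past the top_k=4 cutoff
observed["X"][3] = 0.0  # "online" fell past the cutoff too

true_winner = max(truth, key=lambda d: sum(truth[d]))         # "X" (3.70)
observed_winner = max(observed, key=lambda d: sum(observed[d]))  # "A" (3.60)
```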

4. Why this is hard to spot

Nothing fails loudly. Every query still returns a full, plausible-looking ranking; truncated docs don't disappear, they just score lower than they should. The undercounting happens inside scores you never see broken down per token, so the only symptom is statistical: recall against an exact-MaxSim baseline drifts down as docs get longer or the candidate pool widens.

5. The knobs you can turn

Increase top_k per query token
  effect: fewer truncation misses
  cost: larger response payload, more compute server-side; capped at 10,000

Decrease the candidate pool
  effect: fewer total tokens competing for the cap
  cost: stage 1 might miss the right answer entirely (the other Top-K problem)

Use exact KNN for stage 2
  effect: no truncation possible
  cost: scales poorly past a few thousand tokens; defeats the purpose of ANN

Server-side MaxSim operator
  effect: the server walks every candidate doc's full token list, so no per-query-token truncation is needed
  cost: none, if turbopuffer ships it (see main walkthrough → section 7)

Multi-vector storage with row-level reads
  effect: fetch each candidate's full token list directly, then compute MaxSim locally
  cost: larger response payload than Option B above
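The last option is simple to sketch client-side. `fetch_doc_tokens` below is a hypothetical helper standing in for whatever row-level read the store provides, and vectors are plain Python lists; this illustrates exact client-side MaxSim, not turbopuffer API code:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    # Best match over ALL of the doc's tokens for each query token:
    # no per-query-token cutoff anywhere, so nothing can be undercounted.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def rerank(query_tokens, candidate_ids, fetch_doc_tokens):
    scored = [(maxsim(query_tokens, fetch_doc_tokens(doc_id)), doc_id)
              for doc_id in candidate_ids]
    return sorted(scored, reverse=True)

# Toy usage with an in-memory "store" playing the row-level read:
store = {"a": [[1.0, 0.0]],
         "b": [[0.6, 0.0], [0.0, 0.9]]}
ranked = rerank([[1.0, 0.0], [0.0, 1.0]], ["a", "b"], store.get)
# doc "b" outranks doc "a" (scores 1.5 vs 1.0)
```

The trade-off is exactly the one in the table: every candidate's full token list crosses the wire, so payload grows with document length.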

6. The takeaway for the assignment

What this proves about the implementation

The current ColBERT rerank function is an approximation with recall loss, not a faithful implementation of MaxSim. Two truncation steps can hide the right answer:

  1. Stage 1 dense top_k decides which 100 docs even reach the rerank.
  2. Stage 2 per-query-token top_k decides which doc tokens contribute to each doc's score.

For Quora-length questions with top_k=1500, the second truncation almost never bites. For longer docs or wider candidate pools, it does, and the failure mode is silent. A faithful MaxSim would require either exact KNN or a server-side operator that walks each candidate's full token list, which is exactly the multi-vector pitch in the main guide.

Companion to the late interaction guide · back to main walkthrough →