Late Interaction on turbopuffer

What the namespaces hold, and how the query flows through them.

Companion page: the Top-K problem in late interaction →

1. What lives in the database

Two namespaces. One dense vector per document. One ColBERT vector per token.

Namespace 1 late-interaction-test (dense)

One row per document · 10,000 rows · 62 MB · 1536-dim OpenAI embeddings

  id    vector (1536 dims)                        text
  ────  ────────────────────────────────────────  ────────────────────────────────────────────────
  0     [ 0.013, -0.421,  0.087, ...,  0.204 ]    How do I make money online?
  1     [ 0.092,  0.155, -0.301, ..., -0.088 ]    What are the best ways to earn money from home?
  2     [-0.044,  0.288,  0.117, ...,  0.331 ]    How can I start a successful online business?
  3     [ 0.211, -0.067,  0.392, ...,  0.018 ]    What programming languages should I learn first?
  ...   ... 9,996 more rows ...
  9999  [-0.173,  0.252, -0.041, ...,  0.110 ]    ...

Namespace 2 late-interaction-tokens-test (ColBERT tokens)

One row per token · 157,736 rows · 83 MB · 128-dim ColBERT embeddings · doc_id is filterable

  id     vector (128 dims)                     doc_id   token (not stored)
  ─────  ────────────────────────────────────  ──────   ──────────────────
  0      [ 0.18, -0.04,  0.22, ...,  0.09 ]    0        [CLS]
  1      [ 0.09,  0.31, -0.12, ...,  0.27 ]    0        [D]
  2      [-0.21,  0.08,  0.45, ..., -0.03 ]    0        how
  3      [ 0.33, -0.17,  0.02, ...,  0.14 ]    0        do
  4      [ 0.05,  0.29, -0.31, ...,  0.18 ]    0        i
  5      [ 0.41,  0.13,  0.07, ..., -0.22 ]    0        make
  6      [-0.08,  0.36,  0.19, ...,  0.31 ]    0        money
  7      [ 0.22, -0.11,  0.28, ...,  0.06 ]    0        online
  8      [ 0.14,  0.07, -0.05, ...,  0.19 ]    0        ?
  9      [-0.18,  0.24,  0.31, ...,  0.02 ]    0        [SEP]
  ...    ... IDs 10–999 unused (doc 0 only has 10 tokens) ...
  1000   [ 0.27, -0.09,  0.14, ...,  0.33 ]    1        [CLS]
  1001   [ 0.11,  0.33, -0.21, ...,  0.04 ]    1        [D]
  1002   [-0.14,  0.19,  0.42, ..., -0.11 ]    1        what
  ...    ... 12 more tokens for doc 1 ...
  2000   [ 0.31,  0.05, -0.17, ...,  0.22 ]    2        [CLS]
  ...    ... 157,722 more rows ...
ID SCHEME
token_id = doc_id × 1000 + tok_idx
doc 0 takes IDs 0–9, doc 1 takes 1000–1014, doc 42 takes 42000–42007, etc.
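The scheme is invertible, which is what lets the client map token hits back to their documents. A minimal sketch (the helper names are illustrative, not from the guide's code):

```python
MAX_TOKENS_PER_DOC = 1000  # the multiplier in the ID scheme

def token_id(doc_id: int, tok_idx: int) -> int:
    """Pack (doc_id, tok_idx) into a single token-namespace row ID."""
    return doc_id * MAX_TOKENS_PER_DOC + tok_idx

def unpack_token_id(tid: int) -> tuple[int, int]:
    """Recover (doc_id, tok_idx) from a token row ID."""
    return divmod(tid, MAX_TOKENS_PER_DOC)

# doc 42's 8 tokens occupy IDs 42000–42007
assert token_id(42, 0) == 42000
assert token_id(42, 7) == 42007
assert unpack_token_id(42007) == (42, 7)
```

The multiplier caps documents at 1000 tokens each, comfortably above the ~180-token maximum in this corpus.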

2. Query flow (the funnel)

You can't ColBERT-score every doc. So narrow the field with cheap dense ANN first, then apply ColBERT to the survivors.

Full corpus · 10,000 docs
↓ Stage 1: dense ANN against 1536-dim vectors · ~15 ms server-side
Top 100 candidates (rough ordering)
↓ Stage 2: ColBERT MaxSim against 128-dim token vectors · ~130 ms server-side
Top 10 reranked (precise ordering)

Stage 1 is fast but blurry (one similarity per doc), so the right answer might land at rank 47 instead of rank 1.
Stage 2 is slow but sharp (~480 similarities per doc), and can only fit if we first narrow to 100 candidates.
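The cost asymmetry is easy to sanity-check with the guide's own numbers (32 query tokens, ~15 doc tokens per Quora question, 100 candidates):

```python
query_tokens = 32     # fixed ColBERT query length
doc_tokens = 15       # typical Quora question
candidates = 100      # stage-1 pool width

dots_per_doc = query_tokens * doc_tokens   # ~480 similarities per candidate
stage2_dots = dots_per_doc * candidates    # total stage-2 work

assert dots_per_doc == 480
assert stage2_dots == 48_000  # vs. one similarity per doc scored in stage 1
```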


3. MaxSim, in tiny numbers

The ColBERT score for one (query, doc) pair is the sum of per-query-token best matches. Here it is with 3-dim unit vectors so the arithmetic is doable in your head.

Set up the vectors

Query: "make money"                      Doc A: "earn cash"
q1 = "make"  = [0.6, 0.8, 0.0]           d1 = "earn"  = [0.5, 0.7, 0.1]
q2 = "money" = [0.0, 0.5, 0.9]           d2 = "cash"  = [0.1, 0.4, 0.9]

Compute every pairwise dot product

Dot product of two unit vectors is a similarity in [–1, 1]. (These example vectors are only approximately unit-length, so one value slips slightly above 1.)

              d1 = earn   d2 = cash
  q1 = make     0.86        0.38
  q2 = money    0.44        1.01

make · earn  = 0.6×0.5 + 0.8×0.7 + 0.0×0.1 = 0.86
make · cash  = 0.6×0.1 + 0.8×0.4 + 0.0×0.9 = 0.38
money · earn = 0.0×0.5 + 0.5×0.7 + 0.9×0.1 = 0.44
money · cash = 0.0×0.1 + 0.5×0.4 + 0.9×0.9 = 1.01

"Max": each query token keeps its best doc match

q1 = "make"  → max(0.86, 0.38) = 0.86   (best: "earn")
q2 = "money" → max(0.44, 1.01) = 1.01   (best: "cash")

"Sum": add up across query tokens

Doc A score = 0.86 + 1.01 = 1.87

Compare to another doc

Doc A: "earn cash"
make  → best match "earn"  = 0.86
money → best match "cash"  = 1.01
                       ────
              score = 1.87
Doc B: "buy shoes"
make  → best match "buy"   = 0.21
money → best match "shoes" = 0.18
                       ────
              score = 0.39

Doc A wins because every query word found a strong match, even though "earn" and "cash" aren't the same words as "make" and "money". Dense retrieval averages this signal away; MaxSim keeps each word independent.
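The whole max-then-sum computation for Doc A fits in a few lines of numpy (a sketch for checking the arithmetic above, not production code):

```python
import numpy as np

Q = np.array([[0.6, 0.8, 0.0],    # q1 = "make"
              [0.0, 0.5, 0.9]])   # q2 = "money"
D = np.array([[0.5, 0.7, 0.1],    # d1 = "earn"
              [0.1, 0.4, 0.9]])   # d2 = "cash"

sim = Q @ D.T              # (2, 2) table of pairwise dot products
best = sim.max(axis=1)     # "max": best doc match per query token
score = best.sum()         # "sum": add across query tokens → 1.87
```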

What real ColBERT does

  vector dims      128 instead of 3
  query tokens     32 (with [MASK] padding for "query expansion")
  doc tokens       ~15 per Quora question, up to 180 per longer doc
  pairwise table   32 × 15 = 480 dot products per candidate
  score per doc    one number (sum of 32 row-wise maxes)
Who computes what

The dot products above show what the score means. You never multiply vectors yourself. Since ColBERT vectors are unit-normalized, dot product = cosine similarity, and turbopuffer computes those server-side during the ANN search on the token namespace. You send query vectors; turbopuffer returns $dist per match. Your only jobs are (1) flip distance back to similarity with 1 − $dist, (2) keep the max per doc, (3) sum across query tokens.


4. How turbopuffer executes stage 2

Naive path: fetch every doc's token vectors to the client, compute MaxSim locally. Lots of network, lots of math.

Server-side path (what the guide uses): for each of the 32 query token vectors, send an ANN query into the token namespace, filtered to the 100 candidate doc IDs. turbopuffer runs the dot products in its ANN engine and returns the nearest doc tokens with distances attached. The client never multiplies vectors; it just aggregates scores.

from collections import defaultdict

scores = defaultdict(float)                      # doc_id → running ColBERT score

for q in query_tokens:                           # 32 query token vectors
    hits = token_ns.query(
        rank_by=("vector", "ANN", q),            # turbopuffer computes q · v
        filters=("doc_id", "In", candidate_doc_ids),
        top_k=1500,
    )

    best_per_doc = defaultdict(float)            # reset per query token: max is per-token
    for row in hits:                             # each hit carries (doc_id, $dist)
        sim = 1.0 - row["$dist"]                 # $dist = 1 − (q · v); recover similarity
        if sim > best_per_doc[row["doc_id"]]:
            best_per_doc[row["doc_id"]] = sim    # max per doc

    for doc_id, sim in best_per_doc.items():
        scores[doc_id] += sim                    # sum across query tokens

The only multiplication in your code is not a multiplication at all. It's the 1 − $dist subtraction to convert turbopuffer's distance back into similarity. All actual vector math (the dot products that define similarity) runs inside turbopuffer's ANN engine.

Those 32 queries are batched 16 at a time into multi_query, so stage 2 is 2 API calls regardless of candidate count. No raw doc vectors cross the network.
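The batching is plain chunking: 32 token queries in groups of 16 gives two calls, whatever the candidate count. A sketch, with a hypothetical `chunked` helper:

```python
def chunked(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

query_token_vecs = [[0.0] * 128 for _ in range(32)]  # placeholder vectors

batches = chunked(query_token_vecs, 16)  # one multi_query call per batch
assert len(batches) == 2                 # stage 2 is 2 API calls
assert all(len(b) == 16 for b in batches)
```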


5. Putting it together for one query

Query: "How can I make money online free of cost?"

Stage 1: dense returns top 100, ordered by cosine similarity:
  rank  doc_id  text
  ──────────────────────────────────────────────────────────
  1     273     "How do I earn money online without investment?"
  2     891     "Best ways to make passive income"
  3     42      "Free online income opportunities"
  ...
  10    0       "How do I make money online?"          ← actual best match
  ...
  47    9921    "Online business ideas for beginners"
  ...
  100   5043    "How to budget your monthly expenses"

Stage 2: ColBERT rescores those 100:
  rank  doc_id  colbert_score
  ──────────────────────────────────────────────────────────
  1     0       14.7   "How do I make money online?"   ← promoted from rank 10
  2     273     13.9   "How do I earn money online..."
  3     42      13.2   "Free online income opportunities"
  ...

ColBERT promoted doc 0 because every query word (make, money, online) found an exact token match. Dense ranked it #10 because the extra query words (free, of, cost) diluted the averaged similarity. MaxSim doesn't care about averages. Each query word picks its own best match, independently.


6. How we measure quality: MRR@10

One query is an anecdote. To compare dense vs. ColBERT across a whole dataset, we need a score that summarizes "did search rank the right answer near the top?" across many queries. The standard metric is MRR@10, Mean Reciprocal Rank at 10.

6a. The scoring rule

For each test query you know the correct answer. Find where it lands in your results. The closer to position 1, the more credit you get.

score for one query = 1 / rank of the correct answer
(or 0 if it's not in the top 10)
MRR@10 = average of that score across all test queries
  rank of correct answer     1     2     3     4     5     6     7     8     9     10    not top-10
  reciprocal rank (1/rank)   1.00  0.50  0.33  0.25  0.20  0.17  0.14  0.13  0.11  0.10  0.00

The "@10" just means "look at the top 10 results only, anything past that scores zero." MRR@5 or MRR@100 work the same way with different cutoffs.

6b. Worked example: 5 test queries

Imagine you have 5 test queries, and for each one you know the correct answer. Run them all, find the rank of the right answer, compute 1/rank, average.

  query                             correct answer landed at   reciprocal rank
  ────────────────────────────────  ─────────────────────────  ───────────────
  "make money online"               rank 1                     1.00
  "budget apps for students"        rank 3                     0.33
  "how to learn Python"             rank 2                     0.50
  "remote jobs paying six figures"  not in top 10              0.00
  "credit score tips"               rank 1                     1.00

  MRR@10 = (1.00 + 0.33 + 0.50 + 0.00 + 1.00) / 5 ≈ 0.567

Higher is better. 1.0 is perfect (every correct answer was rank 1). 0.0 means the correct answer was never in the top 10.
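The rule is a one-liner. A sketch (ranks are 1-based; None means the correct answer missed the results entirely):

```python
def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank: average of 1/rank, scoring 0 when rank is None or > k."""
    return sum(1.0 / r if r is not None and r <= k else 0.0 for r in ranks) / len(ranks)

# The five test queries above: ranks 1, 3, 2, miss, 1
assert abs(mrr_at_k([1, 3, 2, None, 1]) - 0.5667) < 1e-3
```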

6c. What the numbers feel like

  MRR@10   what it feels like
  ──────   ─────────────────────────────────────────────
  1.00     Perfect. Correct answer always at rank 1.
  0.80     Usually rank 1, sometimes rank 2. Really good.
  0.50     Half the time rank 1, or always rank 2.
  0.20     Correct answer buried around rank 5.
  0.00     Correct answer never in top 10.

6d. The actual result on Quora

Running MRR@10 on 100 known duplicate pairs from the Quora dataset:

Dense alone
0.845

OpenAI text-embedding-3-small. Correct answer almost always at rank 1 or 2.

Dense + ColBERT rerank
0.814

Same corpus, same queries, reranked by MaxSim over ColBERT tokens.

Both scores are high because Quora questions are short, and a single OpenAI vector captures short text well. ColBERT's per-token precision doesn't get a chance to shine when there's no lost detail to recover. On longer documents (support articles, contracts, product catalogs) the gap flips in ColBERT's favor.


7. Looking ahead: multi-vector support

The current implementation works, but it leaks complexity into user code: two namespaces, an ID-hack to link them, a doc_id filter, 32 token-level ANN searches batched into 2 multi_query calls, and client-side aggregation. All of that exists because turbopuffer stores one vector per row today. If rows could hold a list of vectors, most of that complexity collapses.

7a. The data model

Today · 2 namespaces, 11 rows for 1 doc

  Namespace 1 (dense):
    id   vector              text
    ───  ──────────────────  ──────────────────
    0    [0.01, -0.42, ...]  "How do I make..."

  Namespace 2 (tokens, 10 more rows for this doc):
    id   vector               doc_id
    ───  ───────────────────  ──────
    0    [ 0.18, -0.04, ...]  0
    1    [ 0.09,  0.31, ...]  0
    2    [-0.21,  0.08, ...]  0
    ...  7 more rows, all with doc_id=0 ...

To gather everything about doc 0: read 1 row from namespace 1, plus 10 rows from namespace 2, linked via the ID scheme doc_id × 1000 + tok_idx and a filterable doc_id.

Tomorrow · 1 namespace, 1 row for 1 doc

  Namespace: unified
    id   dense (1536)        colbert_tokens (N × 128)   text
    ───  ──────────────────  ─────────────────────────  ──────────────────
    0    [0.01, -0.42, ...]  [[ 0.18, -0.04, ...],      "How do I make..."
                              [ 0.09,  0.31, ...],
                              [-0.21,  0.08, ...],
                              ... 7 more ...]

Everything about doc 0 is one row. The colbert_tokens cell is a variable-length list of 128-dim vectors. No ID scheme, no filter, no cross-namespace joins.
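In client terms, the unified record is just one dict with a ragged token list (shapes from this corpus; the vector values are placeholders):

```python
row = {
    "id": 0,
    "dense": [0.0] * 1536,                               # one vector per doc
    "colbert_tokens": [[0.0] * 128 for _ in range(10)],  # one vector per token; length varies by doc
    "text": "How do I make money online?",
}

assert len(row["dense"]) == 1536
assert len(row["colbert_tokens"]) == 10                  # doc 0 has 10 tokens
assert all(len(t) == 128 for t in row["colbert_tokens"])
```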

7b. The schema that makes it work

The server can only score multi-vector columns if it knows the column holds a list of vectors. That's what the schema declares:

ns.write(
    upsert_rows=[...],
    distance_metric={
        "dense":          "cosine_distance",
        "colbert_tokens": "cosine_distance",
    },
    schema={
        "dense":          {"type": "vector[1536]"},     # single vector per row
        "colbert_tokens": {"type": "vector[128][]"},    # ← list of 128-dim vectors
        "text":           {"type": "string"},
    },
)

vector[128][] tells the server two things: (1) build an ANN index across all tokens from all rows, with back-references to the row they came from; (2) this column is eligible to appear as the first argument to a multi-vector operator like MaxSim.

7c. The query: one call, two stages

The entire late-interaction pipeline
results = ns.query(
    rank_by=[
        ("dense", "ANN", dense_vec(query)),           # stage 1
        ("colbert_tokens", "MaxSim", q_tokens),       # stage 2
    ],
    candidates=100,        # width of stage-1 pool
    top_k=10,              # number of stage-2 results to return
    include_attributes=["text"],
)

One HTTP request. One response. No multi_query, no client-side aggregation, no 1 − $dist arithmetic, no per-token bookkeeping.

7d. How the server executes it

1. Parse rank_by as a pipeline
The server reads the rank_by list as stage 1 → stage 2. Each tuple is (column, operator, argument). The schema tells it dense is a vector[1536] and colbert_tokens is a vector[128][]. The operator names are dispatched to built-in implementations: ANN for regular vectors, MaxSim for multi-vector columns.
2. Run stage 1: dense ANN
Execute an ANN index lookup on the dense column using the dense query vector.
Result: [(row_id, dist)] × 100 candidates

  row_id    dist
  ────────  ──────
  273       0.12
  891       0.14
  42        0.16
  ...
  0         0.31   ← actual best, buried at rank ~10
  ...
  5043      0.42
3. Load colbert_tokens for the 100 candidates

This is a plain column read, not an ANN search. The token list lives on the same row as the dense vector that stage 1 already located.

for row_id in candidates:
    doc_tokens[row_id] = storage.get(row_id, "colbert_tokens")
    # e.g. doc_tokens[0] = [[0.18,-0.04,...], [0.09,0.31,...], ... 10 vecs]
4. Run stage 2: MaxSim on those 100 rows
Q = q_tokens   # shape (32, 128)

for row_id in candidates:
    V = doc_tokens[row_id]                    # (N, 128)
    sim_matrix = Q @ V.T                      # (32, N), one matmul
    row.score = sim_matrix.max(axis=1).sum()  # row-wise max, then sum

sort candidates by score desc
return top 10 with requested attributes
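Step 4 can be written as runnable numpy. A sketch with two synthetic candidates: one doc's tokens are exact copies of the query tokens, so with unit vectors its 32 row-wise maxes are all 1.0 and its score hits the 32.0 ceiling:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(a):
    """Normalize each row to unit length (dot product = cosine similarity)."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)

Q = unit_rows(rng.normal(size=(32, 128)))   # query token matrix, shape (32, 128)
doc_tokens = {
    "exact": Q.copy(),                       # doc whose tokens equal the query tokens
    "random": unit_rows(rng.normal(size=(15, 128))),
}

scores = {}
for row_id, V in doc_tokens.items():
    sim = Q @ V.T                            # (32, N) pairwise similarities, one matmul
    scores[row_id] = sim.max(axis=1).sum()   # row-wise max, then sum

assert abs(scores["exact"] - 32.0) < 1e-6   # every query token finds a perfect match
assert scores["random"] < scores["exact"]
```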

7e. The response shape

Because stage 2 is the final ranker, each row comes back with a single $score, the aggregated MaxSim value. No per-query-token distances, no client math:

[
  { "id": 0,   "$score": 14.72, "text": "How do I make money online?" },
  { "id": 273, "$score": 13.90, "text": "How do I earn money online..." },
  { "id": 42,  "$score": 13.21, "text": "Free online income opportunities" },
  ...
]

For debugging or research use cases (like "why did this doc rank here?"), an opt-in explain mode could return the 32 per-query-token alignments per doc:

{
  "id": 0, "$score": 14.72,
  "$breakdown": [
    { "q_idx": 5, "best_doc_tok": 5, "sim": 0.96 },  # "make" → "make"
    { "q_idx": 6, "best_doc_tok": 6, "sim": 0.97 },  # "money" → "money"
    { "q_idx": 7, "best_doc_tok": 7, "sim": 0.95 },  # "online" → "online"
    ... 29 more ...
  ]
}

7f. Side-by-side

                              Today (2 namespaces)          Multi-vector storage only     Multi-vector + MaxSim operator
  ──────────────────────────  ────────────────────────────  ────────────────────────────  ──────────────────────────────
  Namespaces                  2                             1                             1
  API calls per query         1 dense + 2 multi_query = 3   1 (plus larger response)      1
  ID scheme / doc_id filter   required                      not needed                    not needed
  Data downloaded per query   ~40 KB                        ~800 KB (raw token vectors)   ~2 KB
  Client-side math            flip + max + sum loop         one matmul per candidate      none
  Scales to 1000 candidates?  slow                          ~8 MB/query                   yes, cheap
  Server can parallelize?     partially (per multi_query)   no (client computes)          fully
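The "~800 KB" and "~8 MB" figures in the middle column follow from simple arithmetic on float32 token vectors (a back-of-envelope sketch using this corpus's shapes):

```python
TOKEN_DIMS = 128
BYTES_PER_FLOAT = 4      # float32
tokens_per_doc = 15      # typical Quora question

def raw_token_bytes(candidates):
    """Bytes the client must download to compute MaxSim locally."""
    return candidates * tokens_per_doc * TOKEN_DIMS * BYTES_PER_FLOAT

assert raw_token_bytes(100) == 768_000       # ≈ 800 KB at 100 candidates
assert raw_token_bytes(1000) == 7_680_000    # ≈ 8 MB at 1000 candidates
```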
The shift

Today, the client orchestrates late interaction: it decides the ordering of API calls, carries intermediate distances around, and does the final MaxSim arithmetic. With multi-vector storage and a MaxSim operator, the client simply declares the scoring recipe in one rank_by expression. turbopuffer owns the pipeline: it can fuse the ANN hop with the MaxSim scoring, parallelize across cores, and cache hot candidates. None of those optimizations are possible when the client is in the driver's seat.

Built as part of the turbopuffer late interaction guide · full guide →