What the namespaces hold, and how the query flows through them.
Companion page: the Top-K problem in late interaction →
Two namespaces. One dense vector per document. One ColBERT vector per token.
late-interaction-test (dense) — one row per document · 10,000 rows · 62 MB · 1536-dim OpenAI embeddings
| id | vector (1536 dims) | text |
|---|---|---|
| 0 | [ 0.013, -0.421, 0.087, ..., 0.204 ] | How do I make money online? |
| 1 | [ 0.092, 0.155, -0.301, ..., -0.088 ] | What are the best ways to earn money from home? |
| 2 | [-0.044, 0.288, 0.117, ..., 0.331 ] | How can I start a successful online business? |
| 3 | [ 0.211, -0.067, 0.392, ..., 0.018 ] | What programming languages should I learn first? |
| ... 9,996 more rows ... | ||
| 9999 | [-0.173, 0.252, -0.041, ..., 0.110 ] | ... |
late-interaction-tokens-test (ColBERT tokens) — one row per token · 157,736 rows · 83 MB · 128-dim ColBERT embeddings · doc_id is filterable
| id | vector (128 dims) | doc_id | token (not stored) |
|---|---|---|---|
| 0 | [ 0.18, -0.04, 0.22, ..., 0.09 ] | 0 | [CLS] |
| 1 | [ 0.09, 0.31, -0.12, ..., 0.27 ] | 0 | [D] |
| 2 | [-0.21, 0.08, 0.45, ..., -0.03 ] | 0 | how |
| 3 | [ 0.33, -0.17, 0.02, ..., 0.14 ] | 0 | do |
| 4 | [ 0.05, 0.29, -0.31, ..., 0.18 ] | 0 | i |
| 5 | [ 0.41, 0.13, 0.07, ..., -0.22 ] | 0 | make |
| 6 | [-0.08, 0.36, 0.19, ..., 0.31 ] | 0 | money |
| 7 | [ 0.22, -0.11, 0.28, ..., 0.06 ] | 0 | online |
| 8 | [ 0.14, 0.07, -0.05, ..., 0.19 ] | 0 | ? |
| 9 | [-0.18, 0.24, 0.31, ..., 0.02 ] | 0 | [SEP] |
| ... IDs 10–999 unused (doc 0 has only 10 tokens) ... | |||
| 1000 | [ 0.27, -0.09, 0.14, ..., 0.33 ] | 1 | [CLS] |
| 1001 | [ 0.11, 0.33, -0.21, ..., 0.04 ] | 1 | [D] |
| 1002 | [-0.14, 0.19, 0.42, ..., -0.11 ] | 1 | what |
| ... 12 more tokens for doc 1 ... | |||
| 2000 | [ 0.31, 0.05, -0.17, ..., 0.22 ] | 2 | [CLS] |
| ... 157,722 more rows ... | |||
You can't ColBERT-score every doc. So narrow the field with cheap dense ANN first, then apply ColBERT to the survivors.
Stage 1 is fast but blurry (one similarity per doc), so the right answer might land at rank 47 instead of rank 1.
Stage 2 is slow but sharp (~480 similarities per doc), and is affordable only because stage 1 narrowed the field to 100 candidates.
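The narrowing is worth quantifying. A back-of-envelope sketch, using the dataset's figures (32 query tokens, and ~15 doc tokens on average, which is where "~480 similarities per doc" comes from):

```python
n_docs, n_candidates = 10_000, 100
q_tokens, d_tokens = 32, 15            # 32 × 15 ≈ 480 similarities per (query, doc) pair

score_everything = n_docs * q_tokens * d_tokens        # ColBERT over the whole corpus
score_survivors = n_candidates * q_tokens * d_tokens   # ColBERT over stage-1 survivors

print(score_everything)                                # 4,800,000 dot products
print(score_survivors)                                 # 48,000 dot products
print(score_everything // score_survivors)             # 100x less stage-2 work
```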
The ColBERT score for one (query, doc) pair is the sum of per-query-token best matches. Here it is with 3-dim vectors so the arithmetic is doable in your head.

Query: "make money"
q1 = "make"  = [0.6, 0.8, 0.0]
q2 = "money" = [0.0, 0.5, 0.9]

Doc A: "earn cash"
d1 = "earn" = [0.5, 0.7, 0.1]
d2 = "cash" = [0.1, 0.4, 0.9]

The dot product of two unit vectors is a similarity in [–1, 1]. (These toy vectors are only approximately unit length, so one entry lands just above 1.)

| | d1 = earn | d2 = cash |
|---|---|---|
| q1 = make | 0.86 | 0.38 |
| q2 = money | 0.44 | 1.01 |

make · earn  = 0.6×0.5 + 0.8×0.7 + 0.0×0.1 = 0.86
make · cash  = 0.6×0.1 + 0.8×0.4 + 0.0×0.9 = 0.38
money · earn = 0.0×0.5 + 0.5×0.7 + 0.9×0.1 = 0.44
money · cash = 0.0×0.1 + 0.5×0.4 + 0.9×0.9 = 1.01

q1 = "make"  → max(0.86, 0.38) = 0.86 (best: "earn")
q2 = "money" → max(0.44, 1.01) = 1.01 (best: "cash")

Doc A score = 0.86 + 1.01 = 1.87

make  → best match "earn" = 0.86
money → best match "cash" = 1.01
────
score = 1.87
make → best match "buy" = 0.21
money → best match "shoes" = 0.18
────
score = 0.39
Doc A wins because every query word found a strong match, even though "earn" and "cash" aren't the same words as "make" and "money". Dense retrieval averages this signal away; MaxSim keeps each word independent.
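The same MaxSim arithmetic in a few lines of numpy, with the vectors from the worked example (note the toy vectors are only approximately unit length, so one similarity lands just above 1):

```python
import numpy as np

Q = np.array([[0.6, 0.8, 0.0],   # q1 = "make"
              [0.0, 0.5, 0.9]])  # q2 = "money"
D = np.array([[0.5, 0.7, 0.1],   # d1 = "earn"
              [0.1, 0.4, 0.9]])  # d2 = "cash"

sim = Q @ D.T                  # (2, 2) similarity matrix, one dot product per cell
score = sim.max(axis=1).sum()  # best doc token per query token, then sum
# score = 0.86 + 1.01 = 1.87
```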
The dot products above show what the score means. You never multiply vectors yourself. Since ColBERT vectors are unit-normalized, dot product = cosine similarity, and turbopuffer computes those server-side during the ANN search on the token namespace. You send query vectors; turbopuffer returns $dist per match. Your only jobs are (1) flip distance back to similarity with 1 − $dist, (2) keep the max per doc, (3) sum across query tokens.
Naive path: fetch every doc's token vectors to the client, compute MaxSim locally. Lots of network, lots of math.
Server-side path (what the guide uses): for each of the 32 query token vectors, send an ANN query into the token namespace, filtered to the 100 candidate doc IDs. turbopuffer runs the dot products in its ANN engine and returns the nearest doc tokens with distances attached. The client never multiplies vectors; it just aggregates scores.
from collections import defaultdict

scores = defaultdict(float)
for q in query_tokens:  # 32 total
    hits = token_ns.query(
        rank_by=("vector", "ANN", q),                 # turbopuffer computes q · v
        filters=("doc_id", "In", candidate_doc_ids),
        top_k=1500,
    )
    # Each hit comes back as (doc_id, $dist), where $dist = 1 − (q · v).
    best_per_doc = {}                                 # reset for every query token
    for row in hits:
        sim = 1.0 - row["$dist"]                      # flip distance back to similarity
        if sim > best_per_doc.get(row["doc_id"], -1.0):
            best_per_doc[row["doc_id"]] = sim         # max per doc
    for doc_id, sim in best_per_doc.items():
        scores[doc_id] += sim                         # sum across query tokens
The only multiplication in your code is not a multiplication at all. It's the 1 − $dist subtraction to convert turbopuffer's distance back into similarity. All actual vector math (the dot products that define similarity) runs inside turbopuffer's ANN engine.
Those 32 queries are batched 16 at a time into multi_query, so stage 2 is 2 API calls regardless of candidate count. No raw doc vectors cross the network.
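The batching itself is just chunking. A minimal sketch (the helper name and the exact multi_query request shape are ours, not the client library's; treat the commented call as illustrative):

```python
def batches(items, size=16):
    """Split work into consecutive chunks of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 32 per-token ANN searches → 2 batches → 2 multi_query API calls
token_batches = batches(list(range(32)))
# for batch in token_batches:
#     results = ns.multi_query(...)  # one request carrying 16 sub-queries
```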
Query: "How can I make money online free of cost?"

Stage 1: dense returns top 100, ordered by cosine similarity:

rank  doc_id  text
──────────────────────────────────────────────────────────
   1     273  "How do I earn money online without investment?"
   2     891  "Best ways to make passive income"
   3      42  "Free online income opportunities"
 ...
  10       0  "How do I make money online?"   ← actual best match
 ...
  47    9921  "Online business ideas for beginners"
 ...
 100    5043  "How to budget your monthly expenses"

Stage 2: ColBERT rescores those 100:

rank  doc_id  colbert_score
──────────────────────────────────────────────────────────
   1       0  14.7  "How do I make money online?"       ← promoted from rank 10
   2     273  13.9  "How do I earn money online..."
   3      42  13.2  "Free online income opportunities"
 ...
ColBERT promoted doc 0 because every query word (make, money, online) found an exact token match. Dense ranked it #10 because the extra query words (free, of, cost) diluted the averaged similarity. MaxSim doesn't care about averages. Each query word picks its own best match, independently.
One query is an anecdote. To compare dense vs. ColBERT across a whole dataset, we need a score that summarizes "did search rank the right answer near the top?" across many queries. The standard metric is MRR@10, Mean Reciprocal Rank at 10.
For each test query you know the correct answer. Find where it lands in your results. The closer to position 1, the more credit you get.
The "@10" just means "look at the top 10 results only, anything past that scores zero." MRR@5 or MRR@100 work the same way with different cutoffs.
Imagine you have 5 test queries, and for each one you know the correct answer. Run them all, find the rank of the right answer, compute 1/rank, average.
| query | correct answer landed at | reciprocal rank |
|---|---|---|
| "make money online" | rank 1 | 1.00 |
| "budget apps for students" | rank 3 | 0.33 |
| "how to learn Python" | rank 2 | 0.50 |
| "remote jobs paying six figures" | not in top 10 | 0.00 |
| "credit score tips" | rank 1 | 1.00 |
| MRR@10 = (1.00 + 0.33 + 0.50 + 0.00 + 1.00) / 5 | | 0.566 |
Higher is better. 1.0 is perfect (every correct answer was rank 1). 0.0 means the correct answer was never in the top 10.
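The computation above as a small helper (the function name is ours, not part of any library):

```python
def mrr_at_k(ranks, k=10):
    """Mean Reciprocal Rank at k. `ranks` holds the rank of the correct
    answer for each query, or None if it wasn't found at all."""
    rr = [1.0 / r if r is not None and r <= k else 0.0 for r in ranks]
    return sum(rr) / len(rr)

# The five example queries: ranks 1, 3, 2, not-in-top-10, 1
mrr_at_k([1, 3, 2, None, 1])  # ≈ 0.567 (the table rounds 1/3 to 0.33, giving 0.566)
```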
| MRR@10 | what it feels like |
|---|---|
| 1.00 | Perfect. Correct answer always at rank 1. |
| 0.80 | Usually rank 1, sometimes rank 2. Really good. |
| 0.50 | Always rank 2, or rank 1 half the time and missing the rest. |
| 0.20 | Correct answer buried around rank 5. |
| 0.00 | Correct answer never in top 10. |
Running MRR@10 on 100 known duplicate pairs from the Quora dataset:
Dense only: OpenAI text-embedding-3-small. Correct answer almost always at rank 1 or 2.
Dense + rerank: same corpus, same queries, reranked by MaxSim over ColBERT tokens.
Both scores are high because Quora questions are short, and a single OpenAI vector captures short text well. ColBERT's per-token precision doesn't get a chance to shine when there's no lost detail to recover. On longer documents (support articles, contracts, product catalogs) the gap flips in ColBERT's favor.
The current implementation works, but it leaks complexity into user code: two namespaces, an ID-hack to link them, a doc_id filter, 32 token-level ANN searches batched into 2 multi_query calls, and client-side aggregation. All of that exists because turbopuffer stores one vector per row today. If rows could hold a list of vectors, most of that complexity collapses.
| id | vector | text |
|---|---|---|
| 0 | [0.01,-0.42,...] | "How do I make..." |
| id | vector | doc_id |
|---|---|---|
| 0 | [0.18,-0.04,...] | 0 |
| 1 | [0.09, 0.31,...] | 0 |
| 2 | [-0.21,0.08,...] | 0 |
| ... 7 more rows, all with doc_id=0 ... | ||
To gather everything about doc 0: read 1 row from namespace 1, plus 10 rows from namespace 2, linked via the ID scheme doc_id × 1000 + tok_idx and a filterable doc_id.
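The ID scheme is simple arithmetic. A sketch (the stride of 1000 is the guide's choice and caps each doc at 1000 tokens):

```python
def token_row_id(doc_id, tok_idx, stride=1000):
    # ID scheme from the guide: doc_id × 1000 + tok_idx
    return doc_id * stride + tok_idx

token_row_id(0, 9)  # → 9     (doc 0's [SEP] token in the table above)
token_row_id(1, 2)  # → 1002  (doc 1's "what" token)
```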
| id | dense (1536) | colbert_tokens (N × 128) | text |
|---|---|---|---|
| 0 | [0.01,-0.42,...] | [ [0.18,-0.04,...], [0.09, 0.31,...], [-0.21,0.08,...], ... 7 more ... ] | "How do I make..." |
Everything about doc 0 is one row. The colbert_tokens cell is a variable-length list of 128-dim vectors. No ID scheme, no filter, no cross-namespace joins.
The server can only score multi-vector columns if it knows the column holds a list of vectors. That's what the schema declares:
ns.write(
upsert_rows=[...],
distance_metric={
"dense": "cosine_distance",
"colbert_tokens": "cosine_distance",
},
schema={
"dense": {"type": "vector[1536]"}, # single vector per row
"colbert_tokens": {"type": "vector[128][]"}, # ← list of 128-dim vectors
"text": {"type": "string"},
},
)
vector[128][] tells the server two things: (1) build an ANN index across all tokens from all rows, with back-references to the row they came from; (2) this column is eligible to appear as the first argument to a multi-vector operator like MaxSim.
results = ns.query(
rank_by=[
("dense", "ANN", dense_vec(query)), # stage 1
("colbert_tokens", "MaxSim", q_tokens), # stage 2
],
candidates=100, # width of stage-1 pool
top_k=10, # number of stage-2 results to return
include_attributes=["text"],
)
One HTTP request. One response. No multi_query, no client-side aggregation, no 1 − $dist arithmetic, no per-token bookkeeping.
The server reads the rank_by list as stage 1 → stage 2. Each tuple is (column, operator, argument). The schema tells it that dense is a vector[1536] and colbert_tokens is a vector[128][]. The operator names dispatch to built-in implementations: ANN for regular vectors, MaxSim for multi-vector columns.
Execute ANN index lookup on the `dense` column with query D.
Result: [(row_id, dist)] × 100 candidates

row_id   dist
──────   ────
   273   0.12
   891   0.14
    42   0.16
 ...
     0   0.31   ← actual best, buried at rank ~10
 ...
  5043   0.42
Stage 2 then fetches each candidate's token list. This is a plain column read, not an ANN search: the token list lives on the same row as the dense vector that stage 1 already located.
for row_id in candidates:
doc_tokens[row_id] = storage.get(row_id, "colbert_tokens")
# e.g. doc_tokens[0] = [[0.18,-0.04,...], [0.09,0.31,...], ... 10 vecs]
Q = q_tokens                                  # shape (32, 128)
for row_id in candidates:
    V = doc_tokens[row_id]                    # (N, 128)
    sim_matrix = Q @ V.T                      # (32, N), one matmul
    row.score = sim_matrix.max(axis=1).sum()  # row-wise max, then sum
sort candidates by score desc
return top 10 with requested attributes
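That pseudocode maps directly onto numpy. A runnable sketch, with random unit vectors standing in for real embeddings and hypothetical candidate doc IDs:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(shape):
    """Random row vectors, normalized to unit length like ColBERT embeddings."""
    v = rng.normal(size=shape)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

Q = unit_rows((32, 128))                            # query tokens

scores = {}
for doc_id, n_tokens in [(0, 10), (1, 14), (2, 9)]:  # toy candidates
    V = unit_rows((n_tokens, 128))                   # doc tokens, (N, 128)
    sim = Q @ V.T                                    # (32, N), one matmul
    scores[doc_id] = sim.max(axis=1).sum()           # MaxSim per doc

ranked = sorted(scores, key=scores.get, reverse=True)
```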
Because stage 2 is the final ranker, each row comes back with a single $score, the aggregated MaxSim value. No per-query-token distances, no client math:
[
{ "id": 0, "$score": 14.72, "text": "How do I make money online?" },
{ "id": 273, "$score": 13.90, "text": "How do I earn money online..." },
{ "id": 42, "$score": 13.21, "text": "Free online income opportunities" },
...
]
For debugging or research use cases (like "why did this doc rank here?"), an opt-in explain mode could return the 32 per-query-token alignments per doc:
{
"id": 0, "$score": 14.72,
"$breakdown": [
{ "q_idx": 5, "best_doc_tok": 5, "sim": 0.96 }, # "make" → "make"
{ "q_idx": 6, "best_doc_tok": 6, "sim": 0.97 }, # "money" → "money"
{ "q_idx": 7, "best_doc_tok": 7, "sim": 0.95 }, # "online" → "online"
... 29 more ...
]
}
| | Today (2 namespaces) | Multi-vector storage only | Multi-vector + MaxSim operator |
|---|---|---|---|
| Namespaces | 2 | 1 | 1 |
| API calls per query | 1 dense + 2 multi_query = 3 | 1 (plus larger response) | 1 |
| ID scheme / doc_id filter | required | not needed | not needed |
| Data downloaded per query | ~40 KB | ~800 KB (raw token vectors) | ~2 KB |
| Client-side math | flip + max + sum loop | one matmul per candidate | none |
| Scales to 1000 candidates? | slow | ~8 MB/query | yes, cheap |
| Server can parallelize? | partially (per multi_query) | no (client computes) | fully |
Today, the client orchestrates late interaction: it decides the ordering of API calls, carries intermediate distances around, and does the final MaxSim arithmetic. With multi-vector storage and a MaxSim operator, the client simply declares the scoring recipe in one rank_by expression. turbopuffer owns the pipeline: it can fuse the ANN hop with the MaxSim scoring, parallelize across cores, and cache hot candidates. None of those optimizations are possible when the client is in the driver's seat.
Built as part of the turbopuffer late interaction guide · full guide →