AI & LLMs · June 16, 2026 · 6 min read

Reranking in RAG: When a Cross-Encoder Earns Its Latency

Rerankers fix the recall-precision gap in RAG, but they cost latency and money. Here's when a cross-encoder actually pays off, and when you should tune retrieval instead.

Every RAG system we've shipped eventually hits the same wall: retrieval returns the right document at rank 7, the LLM only sees the top 5, and the answer comes back wrong. Reranking is the obvious fix — but a cross-encoder isn't free, and we've seen teams bolt one on when the real problem was a sloppy embedding pipeline.

This is a breakdown of when reranking actually earns its place in your pipeline, when it's masking a different bug, and how to measure the tradeoff without guessing.

What a reranker actually does

Your first-stage retriever — usually a dense vector search, sometimes hybrid with BM25 — is optimized for recall. It scores every chunk independently against the query and returns the top N by similarity. That's fast (sub-100ms for millions of vectors) but coarse. Embeddings are a lossy compression of meaning, and similarity in vector space doesn't always equal relevance to the actual question.

A reranker is a second-stage model that scores each (query, chunk) pair together. Cross-encoders concatenate the query and the candidate document, run them through a transformer, and produce a single relevance score. Because the model sees both inputs in full context, it catches nuances that bi-encoders miss — negation, entity matching, temporal qualifiers.

The tradeoff is brutal on paper: a bi-encoder embeds your corpus once and does cheap cosine similarity at query time. A cross-encoder has to run a full forward pass per candidate, every query. Rerank 50 candidates and you're scoring 50 (query, chunk) pairs through a transformer before the LLM even sees anything. Batching helps throughput, but the work is still proportional to the candidate count.

The usual suspects

In 2026 the practical options haven't changed much:

Cohere Rerank 3.5 — hosted API, multilingual, ~100-300ms for 50 docs in our experience. See the Cohere Rerank docs for the current model list.
bge-reranker-v2-m3 and bge-reranker-v2-gemma from BAAI — open weights, run them yourself on a GPU. Solid quality, you own the latency budget.
Jina Reranker v2 — hosted or self-hosted, decent multilingual support.
Voyage rerank-2 — strong on technical and code-heavy corpora.

LLM-as-reranker (asking GPT-4 or Claude to score candidates) is a fifth option. It's expensive and slow but useful as an evaluation baseline.

When reranking is the right fix

We reach for a reranker when three conditions hold:

Recall@50 is high but Precision@5 is low. Your retriever is finding the answer, just not ranking it. This is the textbook reranker scenario.
The corpus has semantically similar but factually distinct chunks. Product specs, legal clauses, API versions, dated policy documents. Embeddings cluster them; a cross-encoder can separate them.
Queries are short and underspecified. When users type "refund policy 2025" and you have refund policies for five years, the cross-encoder's full-attention view of the query matters.

The diagnostic that matters: pull 100 production queries, manually label the correct chunks, and measure where they land in your retriever's top 50. If the right answer is in the top 50 more than 90% of the time but in the top 5 less than 60% of the time, reranking will move the needle. If it's missing from the top 50 entirely, no reranker can save you — fix retrieval first.
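The diagnostic above is easy to script once the labels exist. Here's a minimal sketch, assuming you have each query's ranked chunk ids from the first-stage retriever and a set of labeled-correct chunk ids per query. The function names, the data shapes, and the handling of the in-between cases are our own illustration, not a fixed recipe.

```python
from typing import Dict, List, Set


def hit_rate_at_k(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one labeled-correct chunk in the top k."""
    if not ranked:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1
        for query_id, chunk_ids in ranked.items()
        if gold.get(query_id, set()) & set(chunk_ids[:k])
    )
    return hits / len(ranked)


def diagnose(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    wide_k: int = 50,
    narrow_k: int = 5,
    wide_min: float = 0.90,   # thresholds taken from the rule of thumb above
    narrow_max: float = 0.60,
) -> str:
    wide = hit_rate_at_k(ranked, gold, wide_k)
    narrow = hit_rate_at_k(ranked, gold, narrow_k)
    if wide <= wide_min:
        # The answer often isn't in the candidate pool, so reordering can't help.
        return "fix retrieval first"
    if narrow < narrow_max:
        # The answer is retrieved but ranked too low: the reranker's job.
        return "reranking should help"
    return "ranking already adequate"
```

Run it against the retriever alone, before any reranker is wired in, so the numbers describe the first stage and nothing else.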

When reranking is masking a real bug

We've audited pipelines where the team added Cohere Rerank, saw a quality bump, and shipped it. Six months later they're paying for rerank calls that are papering over fixable problems:

Chunking is wrong. 200-token chunks that split a sentence in half will never embed well. The reranker can sometimes recover, but you're paying twice — once in storage, once in latency — for bad preprocessing.
Embedding model mismatch. Using a general-purpose model on a legal or medical corpus. A domain-tuned embedding model (or fine-tuned bi-encoder) can close most of the gap without a second stage.
No hybrid search. If you're doing pure dense retrieval and your queries contain product codes, error messages, or proper nouns, BM25 fusion will help more than reranking. We covered this in Hybrid Search for RAG.
Metadata filters are missing. A reranker shouldn't be doing the work of a WHERE tenant_id = ? clause.

The rule we follow: cheap fixes before expensive ones. Reranking is expensive.
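As an example of a cheap fix, here's a rough sketch of sentence-aware chunking that never cuts a sentence in half. It counts whitespace-separated words as a stand-in for tokens and splits sentences with a naive regex; both are simplifying assumptions, and a real pipeline would likely use the embedding model's tokenizer and a proper sentence splitter.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_words: int = 150) -> List[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk
    rather than being split mid-sentence.
    """
    # Naive split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```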

A concrete pipeline

Here's the shape of a two-stage retrieval setup we've shipped, with knobs you'll actually want to tune.

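from typing import List
import cohere

# Async client so the rerank call doesn't block the event loop.
# With no arguments the SDK reads the API key from the CO_API_KEY env var.
co = cohere.AsyncClientV2()

# vector_store, bm25, embed, and rrf_fuse are application-specific helpers
# (not shown here); swap in your own retrieval stack.

async def retrieve(query: str, tenant_id: str, k_final: int = 5) -> List[dict]:
    # Stage 1: hybrid retrieval, wide net. Tenant filtering happens here.
    dense_hits = await vector_store.search(
        query_embedding=embed(query),
        filter={"tenant_id": tenant_id},
        top_k=40,
    )
    sparse_hits = await bm25.search(query, tenant_id=tenant_id, top_k=40)
    # k=60 is the usual RRF smoothing constant, not a result count.
    candidates = rrf_fuse(dense_hits, sparse_hits, k=60)[:50]

    # Short-circuit: nothing worth reordering if we have k_final or fewer hits.
    if len(candidates) <= k_final:
        return candidates

    # Stage 2: rerank
    docs = [c["text"] for c in candidates]
    result = await co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=docs,
        top_n=k_final,
    )

    # Results come back sorted by relevance; r.index points back into docs.
    return [candidates[r.index] for r in result.results]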

A few things worth noting:

Cap candidates at 50. Rerank latency scales roughly linearly with candidate count. We rarely see quality gains past 50, and the 95th percentile latency gets ugly fast above 100.
Short-circuit on small result sets. If retrieval returned 3 hits, don't pay for a rerank call.
RRF before rerank, not after. Reciprocal Rank Fusion is cheap and fixes a lot of dense/sparse disagreement before the expensive stage.
Pass tenant filters at the retrieval layer. Never rely on the reranker to enforce access control.
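The pipeline above calls an rrf_fuse helper without showing it. A minimal sketch of Reciprocal Rank Fusion follows, assuming each hit is a dict carrying a unique "id" key (that key is our assumption; the pipeline only shows "text"). Each document scores the sum of 1 / (k + rank) across the lists it appears in.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(*ranked_lists: List[dict], k: int = 60) -> List[dict]:
    """Fuse ranked hit lists with Reciprocal Rank Fusion.

    score(doc) = sum over lists containing doc of 1 / (k + rank), rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    by_id: Dict[str, dict] = {}
    for hits in ranked_lists:
        for rank, hit in enumerate(hits, start=1):
            scores[hit["id"]] += 1.0 / (k + rank)
            by_id.setdefault(hit["id"], hit)  # keep the first copy we saw
    ordered_ids = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return [by_id[doc_id] for doc_id in ordered_ids]
```

Because RRF only looks at ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.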

Latency budget math

Here's the napkin math we run before adding a reranker. Say your end-to-end budget is 3 seconds and the LLM call eats 1.8s (streaming first token). That leaves 1.2s for everything else: query rewriting, retrieval, reranking, prompt assembly, network overhead.

In our experience, hosted reranker p95 falls in the 150-400ms range for 50 candidates depending on document length and provider. Self-hosted bge-reranker-v2-m3 on an A10G is roughly 200-500ms for the same. That's a meaningful chunk of the budget — and it's purely additive to retrieval time.
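The napkin math fits in a few lines. In this sketch the 3-second budget and the 1.8s LLM time come from the example above; the per-stage numbers for everything else are hypothetical placeholders, so substitute your own measured p95s.

```python
from typing import Dict


def rerank_headroom_ms(
    total_budget_ms: int,
    llm_ms: int,
    other_stages_ms: Dict[str, int],
    rerank_p95_ms: int,
) -> int:
    """Milliseconds left after the LLM, the other stages, and the reranker.

    A negative result means the reranker does not fit the budget.
    """
    remaining = total_budget_ms - llm_ms - sum(other_stages_ms.values())
    return remaining - rerank_p95_ms


# Hypothetical stage timings, for illustration only.
other_stages = {
    "query_rewrite": 300,
    "retrieval": 150,
    "prompt_assembly": 50,
    "network": 100,
}

headroom = rerank_headroom_ms(
    total_budget_ms=3000,
    llm_ms=1800,
    other_stages_ms=other_stages,
    rerank_p95_ms=400,  # upper end of the hosted range quoted above
)
print(f"headroom after reranking: {headroom} ms")
```

Budget against p95, not the median, since tail latency is what users notice.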

If you can't absorb 300ms, you have two options: smaller candidate sets (rerank 20 instead of 50, accept some quality loss), or skip reranking for queries where retrieval confidence is already high. We've seen the second approach — gating reranking on the score gap between top-1 and top-10 — cut reranker calls by 40% with negligible quality impact.
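One way to sketch that gate, assuming the first-stage scores are comparable similarity values such as cosine (fused RRF scores sit on a different scale and would need their own threshold). The 0.15 threshold is a hypothetical starting point; tune it against your eval set.

```python
from typing import List


def should_rerank(
    scores: List[float],
    gap_threshold: float = 0.15,  # hypothetical; tune on labeled queries
    compare_rank: int = 10,
) -> bool:
    """Return True when the first-stage scores are too flat to trust the order.

    A large gap between the top hit and the hit at compare_rank is read as
    retrieval already being confident, so the rerank call is skipped.
    """
    if len(scores) < 2:
        return False  # nothing to reorder
    ordered = sorted(scores, reverse=True)
    tail = ordered[min(compare_rank, len(ordered)) - 1]
    return (ordered[0] - tail) < gap_threshold
```

Log the gate's decision alongside the query so you can check later whether the skipped queries really were the easy ones.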

Measuring whether it's working

Don't ship a reranker on vibes. Build a small eval set — 100-300 labeled queries with known-correct chunks — and measure:

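Recall@50 (recall at the candidate pool size) before and after reranking. The reranker can only reorder candidates, so recall over the full pool shouldn't change; if it does, something in the plumbing is dropping documents. This is a sanity check. Recall@5 is the number that should rise.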
NDCG@5 or MRR. These reward putting the right answer near the top, which is what the reranker is for.
End-task accuracy. The metric that actually matters: does the final LLM answer get better? Sometimes reranking improves NDCG but the LLM was already finding the answer in the top 10. In that case you're paying for nothing.
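Both ranking metrics are short enough to write by hand. This sketch assumes binary relevance labels (a chunk is either correct or not); graded labels would need a gain term in the DCG sum. Average each metric over the eval set to get MRR and mean NDCG@5.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was returned."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 5) -> float:
    """Binary-relevance NDCG@k: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked_ids[:k], start=1)
        if chunk_id in relevant
    )
    # Ideal ranking puts every relevant chunk (up to k of them) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0
```

For a single query where the correct chunk moves from rank 4 to rank 1, the reciprocal rank goes from 0.25 to 1.0.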

We run this as part of CI on any retrieval change. If reranker quality drops more than 2 points on NDCG@5, the PR doesn't merge.

Cost guardrails

Reranker pricing is usually per-search-unit (Cohere) or per-token (some hosted providers). At ~$2 per 1k searches and 50 candidates each, a chatbot doing 500k queries a month is $1000 in rerank costs alone. Not catastrophic, but worth designing around:

Cache rerank results keyed on (normalized_query, candidate_ids_hash). High-traffic queries hit the cache.
Skip reranking on follow-up turns if the retrieval set is unchanged from the previous turn.
Use a smaller reranker for the common case and reserve the big one for low-confidence queries.
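The cache key from the first item might look like the sketch below. It assumes the wrapped rerank function returns chunk ids in relevance order (not positions), which is what lets us sort the ids before hashing. RerankCache and its in-memory dict are illustrative; a production version would want a TTL, eviction, and probably a shared store.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(query.lower().split())


def cache_key(query: str, candidate_ids: List[str]) -> Tuple[str, str]:
    # Sort ids so the same candidate set in a different first-stage order hits the cache.
    joined = "\x1f".join(sorted(candidate_ids))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return normalize_query(query), digest


class RerankCache:
    """Wraps a rerank function that maps (query, ids) to ids in relevance order."""

    def __init__(self, rerank_fn: Callable[[str, List[str]], List[str]]) -> None:
        self._rerank_fn = rerank_fn
        self._store: Dict[Tuple[str, str], List[str]] = {}
        self.misses = 0  # number of real (billable) rerank calls

    def rerank(self, query: str, candidate_ids: List[str]) -> List[str]:
        key = cache_key(query, candidate_ids)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._rerank_fn(query, candidate_ids)
        return self._store[key]
```

Hashing the candidate ids matters: if the index changes and retrieval returns a different set, the key changes with it and stale rankings are never served.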

Self-hosting a bge-reranker on a shared GPU is often cheaper at scale, but you're now operating a model server. Worth it above ~1M queries/month in our experience; rarely worth it below.

Where we'd start

If you're adding reranking to an existing RAG system, do this in order: label 100 production queries, measure Recall@50 and Precision@5, fix chunking and hybrid search first, then drop in a hosted reranker behind a feature flag and A/B it against your eval set. Only self-host once you have volume and a clear quality ceiling on the API option. The reranker is a precision tool — don't use it as a recall crutch.

If you want a hand wiring evals and retrieval into your stack, that's most of what we do on our AI engineering work.