AI & LLMs | June 3, 2026 | 7 min read

Hybrid Search for RAG: When BM25 Beats Your Vector Database

Pure vector search loses on acronyms, product codes, and rare names. Here's how we mix BM25 with embeddings to fix recall without rewriting the stack.

Every team that ships a RAG product eventually hits the same wall: the embeddings are great at "what does this mean" and terrible at "find me the document that literally says SKU-99421-B". The fix is older than transformers, and it's BM25.

This is a working guide to hybrid search — why it matters in 2026, how to wire it up, and the failure modes that bite in production.

Why Pure Vector Search Quietly Underperforms

Dense embeddings compress meaning. That's the feature and the bug. When a user types "error E_AUTH_4413 on staging", what they want is the one runbook that contains that exact string. What they get from cosine similarity is five tangentially related auth troubleshooting docs, ranked by vibe.

We see this most often with:

Product codes, error codes, SKUs — tokens the embedding model has never seen as a unit.
Person and company names — especially non-English or recently coined.
Code identifiers — getUserByIdV2 and getUserById mean very different things in a codebase.
Negations and exact phrases — "not eligible for refund" vs "eligible for refund" can embed too close together.
Long-tail jargon — internal acronyms that show up in three docs and nowhere on the public internet.

BM25, the default lexical scoring function in Lucene, Elasticsearch, and OpenSearch, doesn't care about meaning. It cares about whether the token is there, how rare it is across the corpus, and how often it appears in the candidate document relative to that document's length. For exact-match queries, that's exactly what you want.
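To make those three signals concrete, here is a minimal sketch of Okapi BM25 over pre-tokenised documents. The function name and the toy corpus are ours, k1=1.2 and b=0.75 are the common defaults, and the IDF uses the non-negative form that Lucene uses. A real engine adds an inverted index and an analyser on top of this.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenised doc against the tokenised query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # signal 1: is the token there at all?
            # Signal 2: how rare is the term across the corpus?
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Signal 3: term frequency, damped and normalised by doc length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "reset password for staging login".split(),
    "runbook for error e_auth_4413 on staging".split(),
    "auth troubleshooting overview and general login tips".split(),
]
scores = bm25_scores("error e_auth_4413 staging".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the runbook
```

The generic auth overview scores zero because it shares no token with the query, which is the behaviour you want for an error-code lookup.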

The pitch for hybrid is simple: run both, combine the rankings, and let the LLM see a candidate set that doesn't miss the obvious answer.

A Minimal Hybrid Pipeline

Here's the shape of the pipeline we reach for. It works with Postgres + pgvector, OpenSearch, Weaviate, Qdrant, or a roll-your-own setup — the components are the same.

query
  ├── BM25 retriever ────► top 50 lexical hits
  └── dense retriever ───► top 50 semantic hits
                │
                ▼
        fusion (RRF or weighted)
                │
                ▼
         cross-encoder reranker
                │
                ▼
         top 5–10 to the LLM

Two retrievers, one fusion step, one reranker. The reranker is optional but usually worth it — more on that below.
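As a rough sketch of how the diagram maps to code, the orchestration can be a single function that takes the retrievers, the fusion step, and the optional reranker as plain callables. The type aliases and the name hybrid_search are our own placeholders, not any library's API. The defaults mirror the diagram: 50 hits per retriever, up to 10 chunks out.

```python
from typing import Callable, Optional

Retriever = Callable[[str, int], list[str]]        # (query, limit) -> ranked chunk ids
Fuser = Callable[[list[list[str]]], list[str]]     # rankings -> one fused ranking
Reranker = Callable[[str, list[str]], list[str]]   # (query, ids) -> reordered ids


def hybrid_search(query: str, retrievers: list[Retriever], fuse: Fuser,
                  rerank: Optional[Reranker] = None,
                  per_retriever: int = 50, final_k: int = 10) -> list[str]:
    # 1. Each retriever returns its own top-N ranking for the same query.
    rankings = [retrieve(query, per_retriever) for retrieve in retrievers]
    # 2. Fusion merges the rankings into one candidate list.
    candidates = fuse(rankings)
    # 3. The reranker is optional; without it, the fused order is final.
    if rerank is not None:
        candidates = rerank(query, candidates)
    # 4. Only the head of the list goes to the LLM.
    return candidates[:final_k]
```

Keeping each stage behind a callable makes it easy to run the BM25-only, vector-only, and hybrid configurations through the same code path when you evaluate them.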

Reciprocal Rank Fusion: the boring choice that wins

Reciprocal Rank Fusion (RRF) is the default we recommend. It ignores raw scores (which are not comparable between BM25 and cosine similarity) and combines purely on rank position:

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """
    rankings: list of ranked doc_id lists, one per retriever
    k: smoothing constant (60 is the value from the original paper)
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

Why RRF over weighted score fusion?

BM25 scores are unbounded and corpus-dependent. Cosine similarity is bounded but not calibrated. Normalising them is a tuning rabbit hole.
RRF needs zero training data and no hyperparameter sweep beyond k.
It degrades gracefully when one retriever returns garbage — the other still anchors the ranking.

Weighted fusion can beat RRF if you have labelled eval data and you're willing to maintain a calibration pipeline. Most teams don't, and the gap is small.
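For comparison, here is one common way to do weighted fusion, sketched under the assumption that each retriever hands back a dict of raw scores. Min-max normalisation is only one of several options, and alpha is the weight you would have to tune on labelled data. The function names are ours.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one retriever's scores to [0, 1]; a constant list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(bm25: dict[str, float], dense: dict[str, float],
                    alpha: float = 0.5) -> list[tuple[str, float]]:
    """alpha weights the dense side; (1 - alpha) weights BM25."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * bm25_n.get(doc_id, 0.0)
        for doc_id in bm25_n.keys() | dense_n.keys()
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Note how much the result depends on alpha and on the min and max of each result list. A single outlier score reshapes the whole normalised range, which is the fragility RRF avoids.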

Where Hybrid Actually Moves the Needle

We've shipped hybrid retrieval on internal knowledge bases, e-commerce catalogues, legal corpora, and developer docs. The recall lift over pure vector search varies, but in our experience it's biggest in three places:

Support and ops knowledge bases. Tickets reference error codes and version numbers that embeddings smear together.
Code and API search. Identifiers, file paths, and CLI flags need exact match.
Catalogues with structured identifiers. SKUs, ISBNs, part numbers, model names.

Where the lift is smaller: conversational FAQ corpora, marketing content, and anything where users ask in full natural language and the docs are written the same way. Pure dense retrieval is already close to ceiling there.

The reranker tax is usually worth paying

Fusion gives you a candidate set of maybe 50–100 documents. A cross-encoder reranker — Cohere Rerank, Voyage rerank-2, or an open model like BGE reranker — scores each (query, doc) pair jointly and reorders. This is where you recover the precision that fusion alone leaves on the table.

The tradeoffs to know, per the vendor docs:

Latency: rerankers add 100–400ms for ~50 candidates depending on model and provider.
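Cost: hosted rerankers charge per search or per token depending on the provider, typically a fraction of a cent per query. Cheaper than letting the LLM thrash on bad context.
Context length: every reranker truncates the (query, doc) pair at a model-specific limit, as low as 512 tokens for older open cross-encoders and several thousand for newer hosted models. Check the limit for your model, and chunk long documents the same way they were chunked for embedding.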

If you're already paying for vector search and an LLM call, a reranker is the highest-ROI addition you can make. We'd add it before tuning anything else.
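The rerank step itself is small. The sketch below assumes a score_pairs callable that takes (query, text) pairs and returns one relevance score per pair. That is our own abstraction, although the predict method of a sentence-transformers CrossEncoder has roughly this shape. The max_chars cut is a crude stand-in for the model's real token limit.

```python
from typing import Callable

# Stand-in for whatever cross-encoder you use: pairs in, one score per pair out.
PairScorer = Callable[[list[tuple[str, str]]], list[float]]


def rerank(query: str, candidates: dict[str, str], score_pairs: PairScorer,
           top_k: int = 10, max_chars: int = 2000) -> list[str]:
    """candidates maps chunk_id -> chunk text, in fused order."""
    ids = list(candidates)
    # Truncate explicitly so we know what the model actually saw.
    pairs = [(query, candidates[cid][:max_chars]) for cid in ids]
    scores = score_pairs(pairs)
    # sorted() is stable, so ties keep their fused order.
    ranked = sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)
    return [cid for cid, _ in ranked[:top_k]]
```

Scoring all pairs in one call matters for latency, since most hosted and local rerankers batch internally.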

Implementation Notes That Will Save You a Week

Chunk the same text for both retrievers

Mismatched chunking is the most common bug we see. If your BM25 index has full pages and your vector index has 400-token chunks, fusion will compare apples and oranges and your reranker will choke on inconsistent context. Pick one chunking strategy, index both ways from the same chunks, and store a stable chunk_id on each.
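One way to enforce this is to chunk once and feed the same Chunk objects to both indexers. The sketch below splits on whitespace, so sizes are in words rather than model tokens, and derives the chunk_id from the document id and chunk text so that re-chunking unchanged content gives the same ids. The Chunk class and chunk_document are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    text: str


def chunk_document(doc_id: str, text: str, size: int = 400, overlap: int = 50) -> list[Chunk]:
    """Split into windows of `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        body = " ".join(words[start:start + size])
        # The id depends only on the document id and the chunk content.
        digest = hashlib.sha256(f"{doc_id}\n{body}".encode()).hexdigest()[:16]
        chunks.append(Chunk(chunk_id=digest, doc_id=doc_id, text=body))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

Both the BM25 index and the vector index then store chunk_id alongside the text, so fusion compares like with like.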

Preprocess for BM25, not for embeddings

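BM25 benefits from lowercasing, stemming or lemmatising, and stopword removal, with two caveats: make sure the analyser keeps identifiers such as SKU-99421-B searchable as whole tokens, and check that the stopword list doesn't drop "not" if negation matters in your corpus. Embedding models prefer raw text, because modern tokenisers handle casing and morphology themselves. Run two preprocessing paths. Don't share one.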
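A minimal sketch of the two paths, assuming a regex tokeniser of our own design rather than any search engine's analyser. The stopword set is deliberately tiny and illustrative, stemming is left out, and the token pattern keeps hyphen-, underscore-, and dot-joined identifiers in one piece.

```python
import re

# Illustrative only; note that "not" is deliberately absent.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}
# Letters and digits joined by -, _ or . stay together, so SKU-99421-B is one token.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*")


def for_bm25(text: str) -> list[str]:
    """Lexical path: tokenise, lowercase, drop stopwords."""
    return [t for t in (m.lower() for m in TOKEN.findall(text)) if t not in STOPWORDS]


def for_embedding(text: str) -> str:
    """Dense path: only collapse whitespace; the model's tokeniser does the rest."""
    return " ".join(text.split())
```

Whatever you run on documents for the lexical path must also run on queries, otherwise the tokens won't line up at search time.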

Filter before you fuse

Metadata filters (tenant ID, language, document type, ACL) should apply to both retrievers before fusion. Filtering after fusion can leave you with empty result sets or leak data across tenants. Most vector DBs and search engines support pre-filtering natively — use it.
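The ordering matters more than the mechanism. This toy in-memory sketch (all names are ours, and a real system would push the filter into the engine's own query) shows one filter object applied inside a retriever before the top-N cut.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filters:
    tenant_id: str
    language: str

    def matches(self, meta: dict) -> bool:
        return meta.get("tenant_id") == self.tenant_id and meta.get("language") == self.language


def top_n(scored: list[tuple[str, float, dict]], filters: Filters, limit: int) -> list[str]:
    """Toy retriever back end: filter first, then cut to the top `limit`."""
    eligible = [(cid, score) for cid, score, meta in scored if filters.matches(meta)]
    eligible.sort(key=lambda x: x[1], reverse=True)
    return [cid for cid, _ in eligible[:limit]]


scored = [
    ("x1", 0.9, {"tenant_id": "t2", "language": "en"}),  # another tenant's doc
    ("a1", 0.8, {"tenant_id": "t1", "language": "en"}),
    ("a2", 0.7, {"tenant_id": "t1", "language": "de"}),
    ("a3", 0.6, {"tenant_id": "t1", "language": "en"}),
]
hits = top_n(scored, Filters(tenant_id="t1", language="en"), limit=2)
```

Here hits contains a1 and a3. Cutting to the top 2 first and filtering afterwards would have returned only a1, and the other tenant's x1 would have sat in the candidate list until the filter ran. Passing the same Filters object to both retrievers keeps the two sides consistent.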

Cache the embedding, not the BM25 query

Embedding the query is the expensive part of dense retrieval. Cache it by query string hash with a short TTL. BM25 is fast enough that caching adds complexity without meaningful savings.
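A small in-process cache is often enough to start with. The sketch below assumes an embed callable you supply and adds the model name to the cache key, so that swapping embedding models never serves vectors from the old one. The class name, the 300-second default, and the injectable clock are our choices.

```python
import hashlib
import time
from typing import Callable


class QueryEmbeddingCache:
    def __init__(self, embed: Callable[[str], list[float]], model_name: str,
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._embed = embed
        self._model_name = model_name
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the TTL can be tested without sleeping
        self._store: dict[str, tuple[float, list[float]]] = {}

    def get(self, query: str) -> list[float]:
        normalised = " ".join(query.split())  # whitespace only; casing is kept
        key = hashlib.sha256(f"{self._model_name}\n{normalised}".encode()).hexdigest()
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        vector = self._embed(normalised)
        self._store[key] = (now, vector)
        return vector
```

This version never evicts expired entries, so a long-running service would want a size bound or a shared cache such as Redis instead.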

Watch for the "both retrievers agree on the wrong doc" failure

Fusion makes you confident when both retrievers surface the same doc. If your corpus has near-duplicates (old and new versions of the same policy, for example), both will agree on the wrong one. A content hash only catches exact copies, so dedupe on a canonical URL or document ID as well, do it before fusion, and prefer the newer version on a tie.
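Here is a sketch of that dedupe step for a single ranking, assuming each candidate carries a canonical_key (for example a canonical URL plus a section anchor, so different chunks of one page stay distinct) and an updated date. Both fields are hypothetical metadata you would need to store at index time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    canonical_key: str
    updated: date


def dedupe_ranking(ranking: list[Candidate]) -> list[Candidate]:
    """Keep one candidate per canonical_key: the newest, at the best rank any version had."""
    newest: dict[str, Candidate] = {}
    for cand in ranking:
        current = newest.get(cand.canonical_key)
        if current is None or cand.updated > current.updated:
            newest[cand.canonical_key] = cand
    seen: set[str] = set()
    result = []
    for cand in ranking:
        if cand.canonical_key not in seen:
            seen.add(cand.canonical_key)
            result.append(newest[cand.canonical_key])
    return result
```

Run it on each retriever's ranking before fusion so that the old and new versions don't both collect rank credit.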

Measuring Whether It Actually Helps

Don't ship hybrid retrieval because a blog post said to. Measure it. The cheap eval setup:

Build a set of 100–300 real queries from your logs, with the correct chunk ID labelled (human-labelled, or LLM-labelled then spot-checked).
Run three configurations: BM25-only, vector-only, hybrid + rerank.
Report recall@10 (did we get the right chunk in the top 10?) and MRR (mean reciprocal rank of the correct chunk).

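If hybrid doesn't beat the better of the two single retrievers on your eval set by a clear margin, check chunking, preprocessing, and the labels first. If those are sound, your corpus is probably one where a single retriever is already near ceiling, and the extra moving parts aren't earning their keep. Either way, don't ship hybrid until the numbers justify it.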
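Both metrics fit in a few lines. This sketch assumes one labelled chunk per query, with results mapping each query to its ranked chunk ids and gold mapping each query to the correct chunk id. The function names and data shapes are ours.

```python
def recall_at_k(results: dict[str, list[str]], gold: dict[str, str], k: int = 10) -> float:
    """Fraction of queries whose labelled chunk appears in the top k."""
    hits = sum(gold[q] in results.get(q, [])[:k] for q in gold)
    return hits / len(gold)


def mrr(results: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Mean of 1/rank of the labelled chunk; a miss contributes 0."""
    total = 0.0
    for q, chunk_id in gold.items():
        ranking = results.get(q, [])
        if chunk_id in ranking:
            total += 1.0 / (ranking.index(chunk_id) + 1)
    return total / len(gold)
```

Run the same two functions over the output of each configuration and compare the numbers side by side. With only 100 to 300 queries, treat differences of a point or two as noise.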

For anything more involved, we've written about building LLM evals that catch regressions — the same harness works for retrieval.

The Cost Picture

A rough sketch of where the money goes in a hybrid pipeline per query:

BM25 search: effectively free if you already run OpenSearch or Elasticsearch. Postgres full-text search is similarly cheap, though its built-in ts_rank is not BM25.
Dense retrieval: query embedding (cheap, often cached) + ANN search (cheap).
Reranker: small per-call fee, scales with candidate count.
LLM generation: the dominant cost, and the one hybrid retrieval reduces by sending smaller, better context.

That last point matters. Better retrieval means you can drop from sending 15 chunks to sending 5, which shrinks input tokens on every generation call. In our experience that saving more than pays for the reranker.
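The arithmetic is easy to check against your own numbers. In the sketch below the function names are ours, the 400-token chunk size is borrowed from the chunking example earlier in this post, and both prices are inputs you must fill in from your providers' current pricing.

```python
def input_tokens_saved(chunks_before: int, chunks_after: int, tokens_per_chunk: int) -> int:
    """Input tokens removed from each generation call by sending fewer chunks."""
    return (chunks_before - chunks_after) * tokens_per_chunk


def net_saving_per_query(tokens_saved: int, llm_price_per_million_input: float,
                         reranker_price_per_query: float) -> float:
    """Positive means the smaller context more than pays for the reranker."""
    return tokens_saved * llm_price_per_million_input / 1_000_000 - reranker_price_per_query


# 15 -> 5 chunks is the drop described above; 400 tokens per chunk is an assumption.
saved = input_tokens_saved(chunks_before=15, chunks_after=5, tokens_per_chunk=400)
```

With those inputs, saved is 4,000 input tokens per generation call. Whether that covers the reranker depends entirely on the two prices, so a cheap generation model can flip the sign.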

Where We'd Start

If you're running pure vector RAG today and recall feels off, do this in order:

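Add a lexical retriever over the same chunks. OpenSearch gives you BM25 out of the box. Postgres tsvector is fine as a first step even though its built-in ranking (ts_rank) is not BM25. You don't need a new vendor.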
Fuse with RRF, k=60, top 50 from each side.
Add a hosted reranker on the fused candidates. Pick whichever your vector DB integrates with already.
Build a 100-query eval set from real logs before you tune anything else.
Only after that, consider query rewriting, HyDE, or multi-vector retrieval.

Hybrid search isn't glamorous. It's a 1990s ranking function bolted onto a 2020s one, held together with a fusion formula from 2009. It also happens to be the single biggest retrieval quality win most RAG systems have left on the table. If you want help wiring it into a production stack, our AI engineering team does this work week in and week out.

Tags: RAG, Retrieval, Search, LLM Engineering, Vector Databases
