When building HolaMundo, my AI language tutor, I needed to help students find relevant information from articles. Should I use keyword search, semantic search, or both?
To find out, I built all three retrieval modes and tested them with real queries on a Wikipedia article about Spanish history. The results surprised me.
The two approaches
BM25: The keyword champion
BM25 is a scoring algorithm that evaluates how well a document matches a search query. It’s been the backbone of search engines for decades.
It considers a few inputs, sketched in code after this list:
Term frequency (TF): How often a query term appears in a document. If a document contains the query term multiple times, it's more relevant. But there's a saturation effect: beyond a certain point, additional occurrences don't help. This prevents documents that simply repeat a keyword from being unfairly favored.
Inverse document frequency (IDF): Importance of a term across the entire corpus. Rare terms are more informative than common ones. If everyone mentions “Spain,” but only one document mentions “Hispania,” that second term carries more weight.
Document length normalization: Scores are normalized so longer documents don’t automatically win.
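Putting those three inputs together, here's a minimal sketch of BM25 scoring in Python, using the common defaults k1 = 1.5 and b = 0.75. Production libraries tune these and add refinements, so treat this as illustrative:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query. corpus is a list of tokenized documents."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        # IDF: rare terms across the corpus carry more weight ("Hispania" beats "Spain")
        n_docs = sum(1 for d in corpus if term in d)
        idf = math.log((len(corpus) - n_docs + 0.5) / (n_docs + 0.5) + 1)
        # TF with saturation (controlled by k1) and document length normalization (b)
        tf = doc_terms.count(term)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score
```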
BM25 is robust and performs well across different datasets. It's computationally efficient and suitable for real-time search. The scoring mechanism is transparent: we can see exactly why a result ranked highly. And its parameters (k1 for saturation, b for length normalization) can be tuned for specific tasks.
Where BM25 struggles is semantic understanding. “Automobile” and “car” are completely different terms to the algorithm. It doesn’t incorporate external knowledge or context, and performance degrades on highly specialized or noisy datasets.
Vector embeddings: The semantic powerhouse
Vector embeddings represent text as arrays of real numbers in high-dimensional space. Similar concepts cluster together. For instance, “cat” and “kitty” have nearly identical representations.
As data is inserted into a vector database, each object is converted into an embedding using a machine learning model. These embeddings are placed into an index for fast search. When we query, the search gets converted to a vector using the same model, then the database finds the closest vectors. The quality of search depends entirely on the quality of the embedding model.
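As a minimal sketch of that pipeline, here's what embedding and querying chunks looks like with the sentence-transformers library. The all-MiniLM-L6-v2 model is my assumption for illustration, not necessarily what you'd use in production:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

chunks = [
    "Hispania was the Roman name for the Iberian Peninsula.",
    "Spain is a parliamentary constitutional monarchy.",
]
# Index time: convert every chunk to a vector with the same model used at query time.
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Query time: embed the query, then find the closest chunk vectors.
query_vector = model.encode(["When did the Romans arrive?"], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector  # with normalized vectors, dot product = cosine similarity
print(chunks[int(np.argmax(scores))])  # → the Hispania chunk
```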
Evolution of embedding models:
Early models like word2vec (2013) and GloVe created dense vectors for individual words using neural networks trained on massive text corpora. Google trained word2vec on 100 billion words. But these models had limitations: each word gets exactly one vector, so they can't handle words with multiple meanings, and they can't take context from nearby words, sentences, or documents into account.
Context-aware models changed everything. ELMo (LSTM-based) and transformer models like BERT and GPT consider the entire input text. They handle polysemy, representing a word differently depending on its context. And they're flexible: they can vectorize text of any length.
Every piece of technology has a trade-off. These models require much more compute and memory, and add latency. That said, what's cool is that models like CLIP can vectorize images, text, and video into the same vector space.
Similarity measurement:
Vector search typically uses cosine similarity: the cosine of the angle between two vectors. It ranges from -1 to 1, where 1 means identical direction, 0 means orthogonal (unrelated), and -1 means opposite. To handle scale, systems use approximate nearest neighbor (ANN) algorithms such as HNSW, via libraries like FAISS or ScaNN, to reduce computational cost.
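A brute-force scan over every vector gets expensive at scale, which is where an ANN index comes in. Here's a small sketch using the hnswlib package (my choice for the example; FAISS or ScaNN would look similar):

```python
import hnswlib
import numpy as np

dim, num_chunks = 384, 10_000  # 384 matches MiniLM-sized embeddings
vectors = np.random.rand(num_chunks, dim).astype(np.float32)  # stand-in for real chunk embeddings

# Build an HNSW graph index over cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_chunks, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_chunks))

# Query: ids and cosine *distances* (1 - similarity) of the 3 approximate nearest chunks.
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=3)
```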
Embeddings excel at semantic understanding. For instance, “automobile” and “car” are recognized as similar. They handle paraphrasing well since different phrasings of the same question work. They enable cross-lingual search across languages and are typo-robust because the attention mechanism makes them resilient to spelling errors.
Where embeddings struggle is with rare identifiers like specific case numbers, product IDs, or proper nouns. When exact term matching matters, embeddings fall short. There’s also limited interpretability. It’s hard to explain why a particular result was retrieved.
Hybrid Search: Best of both worlds?
Hybrid search combines BM25 and vector embeddings using a weighted average controlled by an alpha parameter (sketched in code after this list):
- Alpha = 0: Pure BM25 (keyword)
- Alpha = 1: Pure embedding (semantic)
- Alpha = 0.5: Equal weight
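Here's a minimal sketch of that blend. I'm using min-max normalization to put the two score scales on equal footing; real engines use fancier fusion methods like reciprocal rank fusion, so this is illustrative:

```python
def hybrid_score(bm25_scores, vector_scores, alpha=0.5):
    """Blend per-chunk BM25 and vector scores; alpha=0 is pure keyword, alpha=1 pure semantic."""
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]
    b, v = normalize(bm25_scores), normalize(vector_scores)
    return [(1 - alpha) * bs + alpha * vs for bs, vs in zip(b, v)]
```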
The theory is that we get keyword precision and semantic understanding. But does it work in practice?
The Experiment
I implemented all three retrieval modes and tested them on a Wikipedia article about Spanish history. I created three queries to stress-test each approach.
Query 1: Roman history in Spain
BM25: Won. Found “romana” (Roman) as an exact keyword match.
Embedding: Failed. Returned general Spain information instead of Roman-specific content.
Hybrid: Failed. Oddly returned snippets about the EU and United Nations. Totally not relevant.
Winner: BM25 ✓
Query 2: Democratic government
BM25: Gave two results with similar scores. The #2 result was most relevant, but it wasn’t ranked first.
Embedding: Won. Top result was most relevant.
Hybrid: Won. Top result was most relevant.
Winner: Embedding and Hybrid ✓
Query 3: Overseas territories
The exact phrase “overseas territories” didn’t appear, but “overseas” and “territory” appeared separately in a sentence.
BM25: Scored it highly. Top result was correct.
Embedding: Also scored it highly. Top result was correct.
Hybrid: Same top result.
Since the terms were so specific, all three approaches succeeded. Other chunks didn’t have similar concepts to confuse the embedding search.
Winner: It’s a tie! ✓
Results round 1
- BM25: 2/3 correct
- Embedding: 2/3 correct
- Hybrid: 2/3 correct
I wasn't expecting a tie! BM25 excels at keywords. Embedding excels at semantics. Hybrid tries to bridge the gap.
Testing the real use case
In HolaMundo, students often don't use the perfect keywords. Instead, they paraphrase. I created three semantic queries in Spanish:
- “¿Cuándo llegaron los romanos a España?” (When did Romans arrive in Spain?)
- “¿Cómo funciona el sistema político español?” (How does the Spanish political system work?)
- “¿Qué colonias tenía España?” (What colonies did Spain have?)
I expected embedding search to win given the paraphrasing, so I experimented with adjusting the alpha parameter in hybrid search (see the sweep sketched after this list):
- Alpha = 0.5: Balanced
- Alpha = 0.7: Favoring semantic search
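Concretely, the comparison amounts to a sweep like this, reusing the hybrid_score sketch from earlier. This is illustrative, not my exact test harness; bm25_scores and vector_scores are the per-chunk scores from the two retrievers:

```python
for alpha in (0.0, 0.5, 0.7, 1.0):  # pure BM25 → balanced → semantic-leaning → pure embedding
    blended = hybrid_score(bm25_scores, vector_scores, alpha=alpha)
    top_chunk = max(range(len(blended)), key=lambda i: blended[i])
    print(f"alpha={alpha}: top chunk is #{top_chunk}")
```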
Results round 2 (paraphrasing queries)
| Query | Alpha 0.5 | Alpha 0.7 | BM25 only | Embedding only |
| --- | --- | --- | --- | --- |
| Query 1 (Romans) | ✓ | ✓ (better) | ✗ | ✓ |
| Query 2 (Democracy) | ✗ | ✗ | ✗ | ✗ |
| Query 3 (Territories) | ~ partial | ✗ | ~ partial | ✗ |
Score:
- Alpha 0.5: 1/3
- Alpha 0.7: 1/3
- BM25 only: 0/3
- Embedding only: 1/3
All methods struggled!
This is definitely not what I expected. Every method performed worse on semantic queries. Ultimately, I attribute this to the small chunk size: the embeddings don't have enough words per chunk to build up context.
Key Learnings
1. There’s a time and place for search.
In the case of HolaMundo, most language learners will likely read small articles (<1,000 words). 1,000 words is approximately 1,300 tokens, and modern LLMs easily handle 8k-200k context windows. I could have skipped search entirely, sent all chunks to the LLM every time, and let the LLM figure out what's relevant. This is a much simpler design.
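A quick way to make that call at runtime, assuming OpenAI-style tokenization via the tiktoken package (other models' tokenizers will count somewhat differently):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(article: str, budget_tokens: int = 8_000) -> bool:
    """If the whole article fits the token budget, skip retrieval and send everything."""
    return len(enc.encode(article)) <= budget_tokens
```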
However, when more advanced learners send in 2,000-3,000 word articles, search becomes more relevant. Cost optimization is also top of mind for me: with a large user base, sending only the most relevant chunks will save significant money.
Another reason is response quality. More irrelevant context means more "noise" for the LLM, which can lead to hallucination and less focused answers.
2. My hypothesis was partially wrong
I thought semantic search would dominate for language learners who paraphrase. While embeddings do help with paraphrasing, they still need adequate chunk size and vocabulary overlap. Neither approach is a silver bullet.
3. Hybrid search is the safe default
In real-world applications, we can’t predict query types ahead of time. Hybrid search with alpha = 0.5 provides the best balance. Some queries need keyword precision, others need semantic understanding.
What I’m Implementing in HolaMundo
In v1 of HolaMundo, I implemented embedding search only. Given this experiment, I'll be switching to hybrid search with alpha = 0.5. Students ask questions in various ways — sometimes with specific keywords, sometimes paraphrasing the article. A balanced approach handles both patterns.
Closing thoughts
Search is not a one-size-fits-all problem. BM25 and vector embeddings each have strengths. Building all three systems taught me that testing with real queries beats reasoning from theory. If I were building this in a professional context, I'd run A/B tests to measure which method actually helps students learn, not just which ranks most accurately.
The best search system is the one that serves my users’ actual behavior, not the one with the most impressive algorithm.
Want to see hybrid search in action? Check out HolaMundo and try answering questions in Spanish.