Hybrid search is an often misused or misunderstood term. In this section, we use “hybrid search” to mean a combination of keyword-based and vector search. Because vector search operates in a dense embedding space while keyword-based search operates in a sparse one, their relevance scores cannot be directly compared, so combining results from the two searches requires a reranking step.
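To ground the rest of the section, here is a minimal sketch of a hybrid query using LanceDB's Python API. The database path, table name, and `text` column are placeholders, and we assume the table already has an embedding function registered so the query string can be embedded automatically; treat this as an illustration rather than a complete setup.

```python
import lancedb
from lancedb.rerankers import LinearCombinationReranker

db = lancedb.connect("~/.lancedb")   # placeholder path
table = db.open_table("docs")        # placeholder table with a vector column

# Hybrid search needs a keyword (full-text) index alongside the vectors.
table.create_fts_index("text")

# Score-based reranker: weight dense scores at 0.7, keyword scores at 0.3.
reranker = LinearCombinationReranker(weight=0.7)

results = (
    table.search("how do rerankers work?", query_type="hybrid")
    .rerank(reranker=reranker)
    .limit(10)
    .to_pandas()
)
```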

Reranking strategies

There are two common approaches for reranking search results from multiple sources.
  • Score-based: Compute the final relevance score as a weighted linear combination of the individual search algorithms’ scores. Example: a weighted linear combination of semantic and keyword-based search scores (sketched below).
  • Relevance-based: Discard the existing scores and compute a fresh relevance score for each query-result pair. Example: cross-encoder models (also sketched below).
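To make the two approaches concrete, here is a minimal sketch of each. The function names and the `alpha` weight are our own, and the cross-encoder example uses the sentence-transformers library with a commonly used MS MARCO checkpoint; neither is necessarily the implementation evaluated below.

```python
from sentence_transformers import CrossEncoder

def minmax(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    return [0.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]

def linear_combination(vector_hits, keyword_hits, alpha=0.7):
    """Score-based reranking: weighted sum of normalized scores.

    Each argument is a list of (doc_id, score) pairs. `alpha` weights the
    dense (vector) score and (1 - alpha) the sparse (keyword) score;
    documents returned by only one search get 0 for the missing score.
    """
    dense = dict(zip([d for d, _ in vector_hits], minmax([s for _, s in vector_hits])))
    sparse = dict(zip([d for d, _ in keyword_hits], minmax([s for _, s in keyword_hits])))
    fused = {
        doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
        for doc in set(dense) | set(sparse)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def cross_encoder_rerank(query, docs):
    """Relevance-based reranking: score each (query, doc) pair directly."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])
    return sorted(zip(docs, scores), key=lambda kv: kv[1], reverse=True)
```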
Even though there may be many more reranking strategies, there is no “universally best” one that works well for all cases, because rerankers tend to be dataset- and application-specific. Evaluating whether a given reranking strategy is a good one is also a challenge. In the next section, we discuss an example evaluation of different reranking strategies on a sample dataset.

Example evaluation

The tables below show our evaluation results from an experiment comparing multiple rerankers on ~800 hybrid search queries. This is a modified version of an evaluation script by LlamaIndex that measures hit-rate @ top-k.
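As a reference for the numbers that follow, hit-rate @ top-k counts a query as a hit if a ground-truth relevant document appears among the top k results. Here is a minimal sketch, assuming one relevant document per query; the helper name is our own, not taken from the LlamaIndex script.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose relevant document appears in the top-k results.

    retrieved: dict mapping query_id -> ranked list of retrieved doc ids
    relevant:  dict mapping query_id -> the single relevant doc id
    """
    hits = sum(1 for qid, doc in relevant.items() if doc in retrieved.get(qid, [])[:k])
    return hits / len(relevant)
```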

Using OpenAI text-embedding-ada-002

Vector Search baseline: 0.64
| Reranker           | Top-3 | Top-5 | Top-10 |
|--------------------|-------|-------|--------|
| Linear Combination | 0.73  | 0.74  | 0.85   |
| Cross Encoder      | 0.71  | 0.70  | 0.77   |
| Cohere             | 0.81  | 0.81  | 0.85   |
| ColBERT            | 0.68  | 0.68  | 0.73   |

Using OpenAI text-embedding-3-small

Vector Search baseline: 0.59
| Reranker           | Top-3 | Top-5 | Top-10 |
|--------------------|-------|-------|--------|
| Linear Combination | 0.68  | 0.70  | 0.84   |
| Cross Encoder      | 0.72  | 0.72  | 0.79   |
| Cohere             | 0.79  | 0.79  | 0.84   |
| ColBERT            | 0.70  | 0.70  | 0.76   |

Conclusion

The results show that reranking can significantly improve search relevance, but the improvement was not consistent across rerankers. In practice, the choice of reranker depends on the dataset and the application. It’s also important to note that reranking methods are not a replacement for the search methods they supplement; they are complementary, and you’ll likely have to tune them together to get the best results. The latency vs. recall tradeoff is another important factor to consider when choosing a reranker. Hopefully this evaluation gives you a starting point for your own experiments with hybrid search in LanceDB!