
Supported distance metrics
Distance metrics determine how LanceDB compares vectors to find similar matches. Euclidean (l2) is the default and is used for general-purpose similarity; use cosine for unnormalized embeddings, dot for normalized embeddings (best performance), and hamming for binary vectors.
The right metric improves both search accuracy and query performance. Currently, LanceDB supports the following metrics:
| Metric | Description | Default |
|---|---|---|
| l2 | Euclidean distance - measures the straight-line distance between two points in vector space. Calculated as the square root of the sum of squared differences between corresponding vector components. | ✓ |
| cosine | Cosine similarity - measures the cosine of the angle between two vectors, ranging from -1 to 1. Computed as the dot product divided by the product of vector magnitudes. Use for unnormalized vectors. | x |
| dot | Dot product - calculates the sum of products of corresponding vector components. Provides raw similarity scores without normalization, sensitive to vector magnitudes. Use for normalized vectors for best performance. | x |
| hamming | Hamming distance - counts the number of positions where corresponding bits differ between binary vectors. Only applicable to binary vectors stored as packed uint8 arrays. | x |
Configure Distance Metric
By default, l2 is used as the metric type. You can specify the metric type as cosine or dot if required.
Note: You can configure the distance metric during search only if there’s no vector index. If a vector index exists, the distance metric will always be the one you specified when creating the index.
For example, you can request cosine similarity instead of l2 distance. The result focuses on vector direction rather than absolute distance, which works better for embeddings that are not normalized.
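A minimal sketch of choosing the metric at query time with the Python sync API; the connection path, table name, data, and 128-dimensional vectors are placeholders for illustration:

```python
import lancedb
import numpy as np

# Hypothetical local database path and sample data.
db = lancedb.connect("data/sample-lancedb")
data = [{"vector": np.random.random(128).tolist(), "text": f"doc {i}"} for i in range(100)]
tbl = db.create_table("metric_demo", data=data)

# Request cosine distance for this query. This only takes effect because the
# table has no vector index; with an index, the index's metric is always used.
results = (
    tbl.search(np.random.random(128))
    .distance_type("cosine")
    .limit(5)
    .to_pandas()
)
```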
Vector Search With ANN Index
Instead of performing an exhaustive search on the entire database for every query, approximate nearest neighbour (ANN) algorithms use an index to narrow down the search space, which significantly reduces query latency. The trade-off is that the results are not guaranteed to be the true nearest neighbors of the query, but they are usually “good enough” for most use cases. Use ANN search for large-scale applications where speed matters more than perfect recall. LanceDB uses approximate nearest neighbor algorithms to deliver fast results without examining every vector in your dataset.
Tuning nprobes
nprobes controls how many partitions are searched at query time (see the sketch after this list).
- Higher nprobes typically improves recall but reduces performance.
- A common starting point is to choose nprobes in the range 10-20, for balanced recall and latency.
- After a certain threshold, increasing nprobes yields only marginal accuracy gains.
- LanceDB automatically chooses a sensible nprobes by default to maximize performance without noticeably affecting accuracy.
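A sketch of building an ANN index and overriding nprobes at query time; the index parameters, table contents, and 256-dimensional vectors are illustrative values, not recommendations:

```python
import lancedb
import numpy as np

# Hypothetical local database path and sample data.
db = lancedb.connect("data/sample-lancedb")
data = [{"vector": np.random.random(256).tolist(), "id": i} for i in range(10_000)]
tbl = db.create_table("ann_demo", data=data)

# Build a vector index; partition and sub-vector counts are illustrative.
tbl.create_index(metric="l2", num_partitions=64, num_sub_vectors=16)

# Probe more partitions than the default for higher recall at some latency cost.
results = (
    tbl.search(np.random.random(256))
    .nprobes(20)
    .limit(10)
    .to_pandas()
)
```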
Vector Search with Prefiltering
This is the default vector search setting. You can use prefiltering to boost query performance by reducing the search space before vector calculations begin. The system first applies your filter criteria to the dataset, then runs the vector search only on the remaining relevant subset. In the sketch below, .where("label > 2") applies a filter before the vector search, .select(["text", "keywords", "label"]) chooses specific columns to return, and .limit(5) restricts results to the top 5 most similar vectors.
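A minimal sketch of such a prefiltered query; the connection path, table name, and the vector, text, keywords, and label columns are assumed for illustration:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
tbl = db.open_table("my_table")  # assumes vector, text, keywords, and label columns

results = (
    tbl.search(np.random.random(128))  # query vector; dimension must match the table
    .where("label > 2")                # prefilter applied before the vector search
    .select(["text", "keywords", "label"])
    .limit(5)
    .to_pandas()
)
```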
As a result, you’ll see a pandas DataFrame with just the data you want from the most similar vectors.
Vector Search with Postfiltering
Use postfiltering to prioritize vector similarity by searching the full dataset first, then applying metadata filters to the top results. This approach ensures you get the most similar vectors before filtering, which can be crucial when similarity is more important than metadata constraints. The prefilter=False parameter tells LanceDB to apply the filter after the vector search instead of before, .where("label > 1") filters the top results by metadata, and .select() chooses which columns to include.
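A sketch of the same query with postfiltering; the table and column names are assumed, and the prefilter=False argument to .where() follows the description above:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
tbl = db.open_table("my_table")  # assumes vector, text, and label columns

results = (
    tbl.search(np.random.random(128))
    .where("label > 1", prefilter=False)  # postfilter: applied after the vector search
    .select(["text", "label"])
    .limit(5)
    .to_pandas()
)
```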
In the end, you receive a pandas DataFrame with the best matches that also meet your metadata requirements.
Post-filtering in LanceDB applies the filter condition after obtaining the nearest neighbors based on vector similarity.
Multivector Search
Use multivector search when your documents contain multiple embeddings and you need sophisticated matching between query and document vector pairs. The late interaction approach finds the most relevant combinations across all available embeddings and provides nuanced similarity scoring. Only cosine similarity is supported as the distance metric for multivector search operations.
np.random.random(size=(2, 256)) creates a 2×256 array with two random query vectors, .limit(5) returns the 5 best document-query combinations, and .to_pandas() provides the results as a DataFrame.
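A rough sketch of what a multivector table and query could look like; the schema, column names, row counts, and vector shapes are assumptions based on the description above:

```python
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path

# Assumed schema: each row stores several 256-dimensional vectors in one column.
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("vector", pa.list_(pa.list_(pa.float32(), 256))),
])
data = [{"id": i, "vector": np.random.random(size=(3, 256)).tolist()} for i in range(50)]
tbl = db.create_table("multivector_demo", data=data, schema=schema)

# Query with two vectors at once; late interaction scores each row against both.
# Cosine is the only metric supported for multivector search.
query = np.random.random(size=(2, 256))
results = tbl.search(query).distance_type("cosine").limit(5).to_pandas()
```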
Read more: Multivector search
Advanced Search Scenarios
Search With Distance Range
Use distance_range search when you need vectors within particular similarity bounds rather than just the closest neighbors. The system filters results to only include vectors that fall within your specified distance thresholds from the query.
The distance_range() method filters results by distance thresholds: the first example below finds vectors with distance between 0.1 and 0.5, the second finds vectors closer than 0.5, and the third finds vectors farther than 0.1.
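A sketch of the three variants; the table name, column names, and the keyword form of the bounds are assumptions for illustration:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
tbl = db.open_table("my_table")  # assumes a 128-dimensional vector column
query = np.random.random(128)

# Vectors whose distance from the query falls between 0.1 and 0.5.
in_range = tbl.search(query).distance_range(0.1, 0.5).to_arrow()

# Vectors closer than 0.5.
close = tbl.search(query).distance_range(upper_bound=0.5).to_arrow()

# Vectors farther than 0.1.
far = tbl.search(query).distance_range(lower_bound=0.1).to_arrow()
```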
Each approach returns Arrow tables with vectors that fall within your specified distance thresholds.
Search With Binary Vectors
Use binary vector search for scenarios involving binary embeddings, such as those produced by hashing algorithms. The system stores these efficiently as packed uint8 arrays and uses Hamming distance calculations to determine vector similarity. np.random.randint(0, 2, size=256) creates binary vectors, np.packbits() compresses them to bytes, and .distance_type("hamming") specifies Hamming distance for the similarity calculation.
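A sketch of storing and querying packed binary vectors; the schema, table name, and row count are assumptions for illustration:

```python
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path

# Assumed schema: 256-bit binary vectors packed into 32 uint8 values per row.
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("vector", pa.list_(pa.uint8(), 32)),
])
data = [
    {"id": i, "vector": np.packbits(np.random.randint(0, 2, size=256)).tolist()}
    for i in range(100)
]
tbl = db.create_table("binary_demo", data=data, schema=schema)

# Pack the binary query the same way and compare with Hamming distance.
query = np.packbits(np.random.randint(0, 2, size=256))
results = tbl.search(query).distance_type("hamming").limit(5).to_arrow()
```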
The search produces an Arrow table with binary vectors ranked by how many bits differ from the query.
Scaling Vector Search
Batch Search
Use batch search to handle multiple query vectors simultaneously. This gives you significant efficiency gains over individual queries. LanceDB processes all vectors in parallel and organizes results with a query_index field that maps each result set back to its originating query.
load_dataset() loads embeddings from a Hugging Face dataset, query_embeds contains 5 query vectors, and .search(query_embeds) processes all queries simultaneously.
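A simplified sketch that uses random query vectors in place of the Hugging Face embeddings; the table name, vector dimension, and the batch form of .search() follow the description above and are assumptions here:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
tbl = db.open_table("my_table")  # assumes a 128-dimensional vector column

# Five query vectors submitted as a single batch.
query_embeds = [np.random.random(128) for _ in range(5)]
results = tbl.search(query_embeds).limit(5).to_pandas()

# query_index maps each result row back to its originating query.
print(results["query_index"].value_counts())
```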
The final result is a pandas DataFrame with all results, including a query_index to tell you which query each result came from.
Search With Asynchronous Indexing
To optimize for speed over completeness, enable the fast_search flag in your query to skip searching unindexed data.
While vector indexing occurs asynchronously, newly added vectors are immediately
searchable through a fallback brute-force search mechanism. This ensures zero
latency between data insertion and searchability, though it may temporarily
increase query response times.
The fast_search=True parameter tells LanceDB to only search indexed vectors, skipping any recently added data that hasn’t been indexed yet.
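A sketch of a fast search query; the table name and vector dimension are assumptions, and fast_search is passed to .search() as described above:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
tbl = db.open_table("my_table")  # assumes an indexed 128-dimensional vector column

# Skip unindexed (recently added) rows in exchange for lower latency.
results = (
    tbl.search(np.random.random(128), fast_search=True)
    .limit(5)
    .to_pandas()
)
```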
You’ll obtain a pandas DataFrame with the top 5 matches from indexed vectors, but might miss data that was just added.
Brute Force Search
Search With No Index
The simplest way to perform vector search is a brute force search, without an index, where the distance between the query vector and every vector in the database is computed and the top-k closest vectors are returned. This is equivalent to a k-nearest neighbours (kNN) search in vector space. Choose brute force search when you need guaranteed 100% recall, typically with smaller datasets where query speed isn’t the primary concern. The system scans every vector in the table and calculates precise distances to find the exact nearest neighbors.
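A minimal sketch of a brute force kNN query; the data, table name, and 64-dimensional vectors are placeholders, and no index is created, so every row is scanned:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
data = [{"vector": np.random.random(64).tolist(), "id": i} for i in range(1_000)]
tbl = db.create_table("knn_demo", data=data)  # no vector index created

# With no index, this is an exhaustive kNN scan that returns the exact top 10.
results = tbl.search(np.random.random(64)).limit(10).to_pandas()
```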
Bypass the Vector Index
Use bypass_vector_index to get exact, ground-truth results by performing exhaustive searches across all vectors. Instead of relying on approximate methods, the system directly compares your query against every vector in the table, ensuring 100% recall at the cost of increased query time.
The .bypass_vector_index() method forces LanceDB to perform an exhaustive search through all vectors instead of using the approximate nearest neighbor index, ensuring exact results but at the cost of slower performance.
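A sketch of forcing an exhaustive search on an indexed table; the table name and vector dimension are assumptions for illustration:

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")  # hypothetical local path
tbl = db.open_table("my_table")  # assumes an indexed 128-dimensional vector column

# Ignore the ANN index and compare the query against every vector in the table.
exact = (
    tbl.search(np.random.random(128))
    .bypass_vector_index()
    .limit(5)
    .to_pandas()
)
```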
The outcome is a pandas DataFrame with the top 5 exact matches, guaranteeing 100% recall but taking longer to run.
This approach is particularly useful when:
- Evaluating ANN index quality
- Calculating recall metrics to tune index parameters
- Ensuring exact results for critical applications