Option 1: Self-Hosted Indexing
Manual, Sync or Async: If you use LanceDB Open Source, you have to build indexes manually, and you also have to reindex and tune indexing parameters yourself. The Python SDK lets you do this synchronously or asynchronously.
Option 2: Automated Indexing
Automatic and Async: Indexing is automatic in LanceDB Cloud/Enterprise. As soon as data is updated, our system automates index optimization. This is done asynchronously. Here is what happens in the background: when a table contains a single vector column named vector, LanceDB automatically:
- Infers the vector column from the schema
- Creates an optimized IVF_PQ index without manual configuration
- The default distance is l2 (Euclidean)
You can create a new index with different parameters using create_index; this replaces any existing index.
Although the create_index API returns immediately, the vector index is built asynchronously. To wait until all data is fully indexed, you can specify the wait_timeout parameter.
Example: Construct an IVF Index
In this example, we will create an index for a table containing 1536-dimensional vectors. The index will use IVF_PQ with L2 distance, which is well-suited for high-dimensional vector search. Make sure you have enough data in your table (at least a few thousand rows) for effective index training.
Index Configuration
Sometimes you need to configure the index beyond default parameters:
- Index Types:
  - IVF_PQ: Default index type, optimized for high-dimensional vectors
  - IVF_HNSW_SQ: Combines IVF clustering with HNSW graph for improved search quality
- metrics: default is l2, other available are cosine or dot
  - When using cosine similarity, distances range from 0 (identical vectors) to 2 (maximally dissimilar)
- num_partitions: The number of partitions in the IVF portion of the index. This number is usually chosen to target a particular number of vectors per partition. A common heuristic is num_rows / 8192. Larger values generally make index building take longer but use less memory, and they often improve accuracy at the cost of slower search because queries typically need a higher nprobes. LanceDB automatically selects a sensible default num_partitions based on the heuristic mentioned above.
- num_sub_vectors: The number of sub-vectors that will be created during Product Quantization (PQ). This number is typically chosen based on the desired recall and the dimensionality of the vector. Larger num_sub_vectors increases accuracy but can significantly slow queries; a good starting point is dimension / 8.
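The two rules of thumb above can be turned into a small calculation. This is a minimal sketch, not part of the LanceDB SDK: the helper name suggest_ivf_pq_params is hypothetical, and rounding num_sub_vectors down to a divisor of the dimension is an added assumption (PQ splits each vector into equal-sized sub-vectors).

```python
def suggest_ivf_pq_params(num_rows: int, dimension: int) -> dict:
    """Starting values only, derived from the heuristics in the text."""
    # Heuristic from the text: num_rows / 8192, with at least one partition.
    num_partitions = max(1, num_rows // 8192)

    # Heuristic from the text: dimension / 8.
    num_sub_vectors = max(1, dimension // 8)
    # Assumption: keep sub-vectors equal-sized by stepping down to a divisor.
    while dimension % num_sub_vectors != 0:
        num_sub_vectors -= 1

    return {"num_partitions": num_partitions, "num_sub_vectors": num_sub_vectors}


params = suggest_ivf_pq_params(num_rows=1_000_000, dimension=1536)
print(params)  # {'num_partitions': 122, 'num_sub_vectors': 192}
```

Treat the result as a starting point for tuning rather than a recommendation; the right values depend on your recall and latency targets.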
1. Setup
Connect to LanceDB and open the table you want to index.
2. Construct an IVF Index
Create an IVF_PQ index with cosine similarity. Specify vector_column_name if you use multiple vector columns or non-default names. By default LanceDB uses Product Quantization; switch to IVF_SQ for scalar quantization.
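A minimal sketch of this step follows. It assumes table is the object you opened during setup, that create_index accepts the keyword arguments metric, vector_column_name, index_type, and wait_timeout, and that wait_timeout takes a datetime.timedelta. The helper name build_ivf_pq_index and the column name in the usage comment are hypothetical.

```python
from datetime import timedelta


def build_ivf_pq_index(table, column="vector", wait_seconds=None):
    """Request an IVF_PQ index with cosine distance on `column`."""
    kwargs = {
        "metric": "cosine",
        "vector_column_name": column,
        "index_type": "IVF_PQ",
    }
    # Optional: block until indexing finishes or the timeout expires.
    # Assumption: wait_timeout is expressed as a timedelta.
    if wait_seconds is not None:
        kwargs["wait_timeout"] = timedelta(seconds=wait_seconds)
    table.create_index(**kwargs)


# Usage (hypothetical column name):
# build_ivf_pq_index(table, column="keywords_embeddings", wait_seconds=120)
```

Without wait_seconds the call returns right away and the index continues building in the background.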
3. Query the IVF Index
Search using a random 1,536-dimensional embedding.
Search Configuration
The previous query uses:
- limit: number of results to return
- nprobes: number of IVF partitions to scan; covering roughly 5–10% of partitions often balances recall and latency
- refine_factor: reads additional candidates and reranks in memory
- .to_pandas(): converts the results to a pandas DataFrame
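The search parameters above can be sketched as one chained query. This is an illustration under stated assumptions: table is an already opened LanceDB table, the helper names pick_nprobes and search_ivf are hypothetical, and the limit and refine_factor values are examples only. The nprobes value is derived from the 5–10% guideline, using 5% by default.

```python
import random


def pick_nprobes(num_partitions: int, percent: int = 5) -> int:
    """Scan about `percent` % of the IVF partitions, rounded up, at least 1."""
    return max(1, -(-num_partitions * percent // 100))  # ceiling division


def search_ivf(table, query_vector, num_partitions, limit=2, refine_factor=10):
    """Run a vector search and return the results as a pandas DataFrame."""
    return (
        table.search(query_vector)
        .limit(limit)                           # number of results to return
        .nprobes(pick_nprobes(num_partitions))  # partitions to scan
        .refine_factor(refine_factor)           # rerank extra candidates
        .to_pandas()
    )


# A random 1,536-dimensional query embedding, as in the example above.
query_vector = [random.random() for _ in range(1536)]
# results = search_ivf(table, query_vector, num_partitions=122)
```

Raising percent toward 10 usually trades latency for recall; measure both on your own data before settling on a value.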
Example: Construct an HNSW Index
Index Configuration
There are three key parameters to set when constructing an HNSW index:
- metric: The default is the l2 (Euclidean) distance metric. Other available metrics are dot and cosine.
- m: The number of neighbors to select for each vector in the HNSW graph.
- ef_construction: The number of candidates to evaluate during the construction of the HNSW graph.
1. Construct an HNSW Index
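A minimal sketch, assuming table is an already opened LanceDB table and that create_index accepts the keyword arguments metric, vector_column_name, and index_type. The helper name build_hnsw_index is hypothetical. The sketch leaves m and ef_construction at their defaults, because how they are passed may differ between SDK versions.

```python
def build_hnsw_index(table, column="vector", metric="l2"):
    """Request an IVF_HNSW_SQ index on `column` with the given metric."""
    table.create_index(
        metric=metric,               # "l2" (default), "dot", or "cosine"
        vector_column_name=column,
        index_type="IVF_HNSW_SQ",
    )


# Usage (hypothetical column name):
# build_hnsw_index(table, column="keywords_embeddings", metric="cosine")
```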
2. Query the HNSW Index
Example: Construct a Binary Vector Index
Binary vectors are useful for hash-based retrieval, fingerprinting, or any scenario where data can be represented as bits.
Index Configuration
- Store binary vectors as fixed-size binary data (uint8 arrays, with 8 bits per byte). For storage, pack binary vectors into bytes to save space.
- Index Type: IVF_FLAT is used for indexing binary vectors
- metric: the hamming distance is used for similarity search
- The dimension of binary vectors must be a multiple of 8. For example, a 128-dimensional vector is stored as a uint8 array of size 16.
1. Create Table and Schema
2. Generate and Add Data
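The sketch below generates random 128-bit vectors and packs them into 16 bytes each, following the packing rule from the index configuration. It uses only the standard library. The helper name pack_bits and the field names id and vector are hypothetical, and bits are packed most significant bit first, which is an assumption about the layout you want.

```python
import random


def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("dimension must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (bit & 1)
        packed.append(byte)
    return packed


random.seed(0)  # reproducible sample data
rows = [
    {"id": i, "vector": pack_bits([random.randint(0, 1) for _ in range(128)])}
    for i in range(100)
]
print(len(rows[0]["vector"]))  # 16 bytes for a 128-dimensional binary vector
```

Rows in this shape can then be added to the table created in the previous step, provided its vector column is a fixed-size list of 16 uint8 values.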
3. Construct the Binary Index
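A minimal sketch of this step, using the index type and metric listed in the configuration above. It assumes table is an already opened LanceDB table whose binary column holds packed uint8 values, and that create_index accepts the keyword arguments shown. The helper name build_binary_index is hypothetical.

```python
def build_binary_index(table, column="vector"):
    """Request an IVF_FLAT index with hamming distance on a binary column."""
    table.create_index(
        metric="hamming",
        vector_column_name=column,
        index_type="IVF_FLAT",
    )


# Usage:
# build_binary_index(table)
```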
4. Vector Search
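To make the hamming metric concrete, the sketch below ranks packed binary vectors by the number of differing bits. It is a brute-force reference in plain Python, not the indexed search that LanceDB performs. The helper names and sample rows are hypothetical.

```python
def hamming_distance(a, b):
    """Count differing bits between two equal-length packed byte sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of bytes")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def nearest_by_hamming(query, rows, limit=10):
    """Return the `limit` rows closest to `query` (exhaustive scan)."""
    ranked = sorted(rows, key=lambda row: hamming_distance(query, row["vector"]))
    return ranked[:limit]


rows = [
    {"id": 0, "vector": [0b11110000, 0b00000000]},
    {"id": 1, "vector": [0b11111111, 0b00000000]},
    {"id": 2, "vector": [0b00001111, 0b11111111]},
]
query = [0b11110000, 0b00000001]
top = nearest_by_hamming(query, rows, limit=2)
print([row["id"] for row in top])  # [0, 1]
```

A query against the real index must be packed the same way as the stored vectors, otherwise the bit positions will not line up.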
Check Index Status
Vector index creation is fast - typically a few minutes for 1 million vectors with 1536 dimensions. You can check index status in two ways:
Option 1: Check the UI
Navigate to your table page - the “Index” column shows index status. It remains blank if no index exists or if creation is in progress.
Option 2: Use the API
Use list_indices() and index_stats() to check index status. The index name is formed by appending “_idx” to the column name. Note that list_indices() only returns information after the index is fully built.
To wait until all data is fully indexed, you can specify the wait_timeout parameter on create_index() or call wait_for_index() on the table.
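The API route can be sketched as follows. It assumes table is an already opened LanceDB table, that wait_for_index takes a list of index names plus a timeout given as a datetime.timedelta, and that index_stats takes the index name. The helper names index_name_for and wait_and_report are hypothetical.

```python
from datetime import timedelta


def index_name_for(column: str) -> str:
    """Index names are formed by appending "_idx" to the column name."""
    return f"{column}_idx"


def wait_and_report(table, column, timeout_seconds=300):
    """Block until the index on `column` is built, then return its stats."""
    name = index_name_for(column)
    # Assumption: the timeout is passed as a timedelta.
    table.wait_for_index([name], timeout=timedelta(seconds=timeout_seconds))
    return table.index_stats(name)


# Usage (hypothetical column name):
# print(wait_and_report(table, "keywords_embeddings"))
```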