Build an index
LanceDB provides a comprehensive suite of indexing strategies to optimize query performance across diverse workloads:
- Vector Index: Optimized for searching high-dimensional data (like images, audio, or text embeddings) by efficiently finding the most similar vectors
- Scalar Index: Accelerates filtering and sorting of structured numeric or categorical data (e.g., timestamps, prices)
- Full-Text Search Index: Enables fast keyword-based searches by indexing words and phrases
Scalar indices serve as a foundational optimization layer, accelerating filtering across diverse search workloads. They can be combined with:
- Vector search (prefilter or post-filter results using metadata)
- Full-text search (combining keyword matching with structured filters)
- SQL scans (optimizing WHERE clauses on scalar columns)
- Key-value lookups (enabling rapid primary key-based retrievals)
Compared to our open source version, LanceDB Cloud/Enterprise automates data indexing with low-latency, asynchronous indexing performed in the cloud.
- Auto-Indexing: Automates index optimization for all types of indices as soon as data is updated
- Automatic Index Creation: When a table contains a single vector column named `vector`, LanceDB:
  - Infers the vector column from the table schema
  - Creates an optimized IVF-PQ index without manual configuration
- Parameter Autotuning: LanceDB analyzes your data distribution to automatically configure indexing parameters
Vector Index
LanceDB implements state-of-the-art indexing algorithms (IVF-PQ and HNSW) with acceleration from our optimized infrastructure. We support multiple distance metrics:
- L2 (default)
- Cosine
- Dot
- Hamming (for binary vectors only)
You can create multiple vector indices within a table.
The `create_index` API returns immediately, but the vector index is built asynchronously. To wait until all data is fully indexed, you can specify the `wait_timeout` parameter.
- If your vector column is named `vector` and contains more than 256 vectors, an IVF_PQ index with L2 distance is automatically created
- You can create a new index with different parameters using `create_index`; this replaces any existing index
- When using cosine similarity, distances range from 0 (identical vectors) to 2 (maximally dissimilar)
- Available index types:
  - `IVF_PQ`: Default index type, optimized for high-dimensional vectors
  - `IVF_HNSW_SQ`: Combines IVF clustering with HNSW graph for improved search quality
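To make the 0-to-2 range of cosine distance concrete, here is a minimal pure-Python sketch. It assumes the distance is defined as 1 minus the cosine similarity, which is consistent with the range stated above. The helper name `cosine_distance` is illustrative and not part of the LanceDB API.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

same = cosine_distance([1.0, 0.0], [1.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # perpendicular
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)  # prints: 0.0 1.0 2.0
```

Vectors pointing the same way land at 0, perpendicular vectors at 1, and opposite vectors at 2, regardless of their magnitudes.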
Check Index Status
Vector index creation is fast - typically a few minutes for 1 million vectors with 1536 dimensions. You can check index status in two ways:
Option 1: Dashboard
Navigate to your table page - the “Index” column shows index status. It remains blank if no index exists or if creation is in progress.
Option 2: API
Use `list_indices()` and `index_stats()` to check index status. The index name is formed by appending “_idx” to the column name. Note that `list_indices()` only returns information after the index is fully built.
To wait until all data is fully indexed, you can specify the `wait_timeout` parameter on `create_index()` or call `wait_for_index()` on the table.
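The naming rule and the `list_indices()` check can be sketched as two small helpers. This is a sketch under assumptions rather than official client code: the helper names are hypothetical, `table` stands for a table handle obtained from your LanceDB client, and each entry returned by `list_indices()` is assumed to expose a `name` attribute.

```python
def index_name_for(column):
    """Index name as described above: the column name plus "_idx"."""
    return f"{column}_idx"

def is_index_listed(table, column):
    """Return True once list_indices() reports an index for the column.

    Assumption: every entry returned by list_indices() has a `name`
    attribute. Entries without one are simply treated as non-matching.
    """
    target = index_name_for(column)
    return any(
        getattr(entry, "name", None) == target
        for entry in table.list_indices()
    )
```

Because `list_indices()` only reports fully built indices, `is_index_listed(table, "vector")` would stay `False` while the index is still being created.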
Index Binary Vectors
Binary vectors are useful for hash-based retrieval, fingerprinting, or any scenario where data can be represented as bits.
Key points for binary vectors:
- Store as fixed-size binary data (uint8 arrays, with 8 bits per byte)
- Use Hamming distance for similarity search
- Pack binary vectors into bytes to save space
The dimension of binary vectors must be a multiple of 8. For example, a 128-dimensional vector is stored as a uint8 array of size 16.
`IVF_FLAT` with Hamming distance is used for indexing binary vectors.
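The packing rule and the Hamming metric can be illustrated without any dependencies. The sketch below is a pure-Python illustration, not LanceDB code. The function names are hypothetical, and the most-significant-bit-first ordering is an assumption chosen for the example.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 bits per byte (MSB first)."""
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dimension must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        packed.append(byte)
    return bytes(packed)

def hamming_distance(a, b):
    """Count the bit positions in which two packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("packed vectors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

vec_a = pack_bits([1, 0] * 64)  # 128-dimensional binary vector
vec_b = pack_bits([0, 1] * 64)  # every bit flipped relative to vec_a
print(len(vec_a), hamming_distance(vec_a, vec_b))  # prints: 16 128
```

A 128-dimensional vector occupies 16 bytes, and two vectors that differ in every bit are at the maximum Hamming distance of 128.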
Scalar Index
LanceDB Cloud and Enterprise support several types of scalar indices to accelerate search over scalar columns:
- BTREE: The most common type, inspired by the btree data structure. Performs well for columns with many unique values and few rows per value.
- BITMAP: Stores a bitmap for each unique value. Ideal for columns with a finite number of unique values and many rows per value (e.g., categories, labels, tags).
- LABEL_LIST: Special index for `List<T>` columns, supporting `array_contains_all` and `array_contains_any` queries using an underlying bitmap index.
You can create multiple scalar indices within a table.
The `create_scalar_index` API returns immediately, but the scalar index is built asynchronously.
Check scalar index status using the methods above.
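To build intuition for how a bitmap-style index can answer `array_contains_any` and `array_contains_all`, here is a toy in-memory model. It is a conceptual sketch only and says nothing about LanceDB's actual implementation. The class `ToyLabelIndex` and its methods are hypothetical, and Python sets stand in for bitmaps.

```python
from collections import defaultdict

class ToyLabelIndex:
    """Maps each label to the set of row ids whose list contains it."""

    def __init__(self, rows):
        self.num_rows = len(rows)
        self.postings = defaultdict(set)
        for row_id, labels in enumerate(rows):
            for label in labels:
                self.postings[label].add(row_id)

    def contains_any(self, labels):
        """Rows holding at least one label: union of the per-label sets."""
        result = set()
        for label in labels:
            result |= self.postings.get(label, set())
        return sorted(result)

    def contains_all(self, labels):
        """Rows holding every label: intersection of the per-label sets."""
        result = set(range(self.num_rows))
        for label in labels:
            result &= self.postings.get(label, set())
        return sorted(result)

index = ToyLabelIndex([["red", "sale"], ["sale"], ["red", "new"]])
print(index.contains_any(["red", "new"]))   # prints: [0, 2]
print(index.contains_all(["red", "sale"]))  # prints: [0]
```

Each query touches only the per-label sets rather than scanning every row, which is the property that makes this kind of index attractive for tag-like columns.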
Build a Scalar Index on UUID Columns
LanceDB supports scalar indices on UUID columns (stored as `FixedSizeBinary(16)`), enabling efficient lookups and filtering on UUID-based primary keys.
To use FixedSizeBinary, ensure you have:
- Python SDK version 0.22.0-beta.4 or later
- TypeScript SDK version 0.19.0-beta.4 or later
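A UUID fits a 16-byte fixed-size binary value exactly, which is why `FixedSizeBinary(16)` is a natural storage type. The sketch below uses only Python's standard `uuid` module to show the round trip. The helper names are hypothetical, and how the bytes are then written to a table depends on your schema and SDK version.

```python
import uuid

def uuid_to_fixed_bytes(value):
    """Return the 16-byte big-endian representation of a UUID."""
    return value.bytes

def fixed_bytes_to_uuid(raw):
    """Rebuild a UUID from its 16-byte representation."""
    if len(raw) != 16:
        raise ValueError("a UUID requires exactly 16 bytes")
    return uuid.UUID(bytes=raw)

original = uuid.uuid4()
raw = uuid_to_fixed_bytes(original)
restored = fixed_bytes_to_uuid(raw)
print(len(raw), restored == original)  # prints: 16 True
```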
Full-Text Search Index
LanceDB Cloud and Enterprise provide performant full-text search based on BM25, allowing you to incorporate keyword-based search in your retrieval solutions.
The `create_fts_index` API returns immediately, but the FTS index is built asynchronously.
Check FTS index status using the methods above.
FTS Configuration Parameters
LanceDB supports the following configurable parameters for full-text search:
Parameter | Type | Default | Description |
---|---|---|---|
with_position | bool | True | Store token positions (required for phrase queries) |
base_tokenizer | str | “simple” | Text splitting method: “simple” (split by whitespace/punctuation), “whitespace” (split by whitespace only), “raw” (treat as single token) |
language | str | “English” | Language for tokenization (stemming/stop words) |
max_token_length | int | 40 | Maximum token size in bytes; tokens exceeding this length are omitted from the index |
lower_case | bool | True | Convert tokens to lowercase |
stem | bool | False | Apply stemming (e.g., “running” → “run”) |
remove_stop_words | bool | False | Remove common stop words |
ascii_folding | bool | False | Normalize accented characters |
- The `max_token_length` parameter helps optimize indexing performance by filtering out non-linguistic content like base64 data and long URLs
- When `with_position` is disabled, phrase queries will not work, but index size is reduced and indexing is faster
- `ascii_folding` is useful for handling international text (e.g., “café” → “cafe”)
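The effect of `lower_case`, `ascii_folding`, and `max_token_length` can be approximated with a toy tokenizer. This is a rough illustration of the behavior described in the table, not LanceDB's tokenizer. The function names are hypothetical, the regular expression only loosely mimics the “simple” splitter, and the folding step only strips combining accents.

```python
import re
import unicodedata

def fold_ascii(token):
    """Strip combining accents, e.g. an accented "e" becomes a plain "e"."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_tokenize(text, lower_case=True, ascii_folding=False, max_token_length=40):
    """Split on non-word characters, then apply the configured filters."""
    tokens = []
    for token in re.findall(r"\w+", text):
        # Tokens longer than max_token_length bytes are omitted entirely.
        if len(token.encode("utf-8")) > max_token_length:
            continue
        if lower_case:
            token = token.lower()
        if ascii_folding:
            token = fold_ascii(token)
        tokens.append(token)
    return tokens

sample = "Caf\u00e9 Menu " + "x" * 64  # the last token is 64 bytes long
print(toy_tokenize(sample))                      # accented token kept as-is
print(toy_tokenize(sample, ascii_folding=True))  # prints: ['cafe', 'menu']
```

With the defaults from the table, the 64-byte token is dropped in both calls, and the accent is removed only when `ascii_folding` is enabled.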
Full-Text Search on Array Fields
LanceDB supports full-text search on string array columns, enabling efficient keyword-based search across multiple values within a single field (e.g., tags, keywords).
Update an Index
When new data is added to a table, LanceDB Cloud automatically updates indices in the background.
To check index status, use `index_stats()` to view the number of unindexed rows. This will be zero when indices are fully up-to-date.
While indices are being updated, queries use brute force methods for unindexed rows, which may temporarily increase latency. To avoid this, set `fast_search=True` to search only indexed data.
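One way to act on the unindexed-row count is a small polling loop. The sketch below is written under assumptions: `wait_until_indexed` is a hypothetical helper, `table` is a table handle from your LanceDB client, and the object returned by `index_stats()` is assumed to expose a `num_unindexed_rows` attribute (check the name against your SDK version).

```python
import time

def wait_until_indexed(table, index_name, timeout_s=300.0, poll_s=5.0):
    """Poll index_stats() until no unindexed rows remain.

    Returns True when the index has caught up, or False on timeout.
    Assumption: the stats object has a `num_unindexed_rows` attribute,
    and index_stats() may return None while the index does not exist yet.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        stats = table.index_stats(index_name)
        if stats is not None and stats.num_unindexed_rows == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller could run `wait_until_indexed(table, "vector_idx")` after a bulk insert and fall back to `fast_search=True` if it returns `False`.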
GPU-based Indexing
This feature is currently only available in LanceDB Enterprise. Please contact us to enable GPU indexing for your deployment.
With GPU-powered indexing, LanceDB can build vector indices over billions of rows in a few hours.