Build an index
LanceDB provides a comprehensive suite of indexing strategies (vector, scalar, and full-text search) to optimize query performance across diverse workloads.
- Vector index: optimized for searching high-dimensional data (like images, audio, or text embeddings) by finding the most similar vectors efficiently.
- Scalar index: accelerates filtering and sorting of structured numeric or categorical data (e.g., timestamps, prices).
- Full-text search index: speeds up text searches by indexing words and phrases, making queries like keyword searches much faster.
Scalar indices serve as a foundational optimization layer, accelerating filtering across diverse search workloads. They can be combined with:
- vector search, to prefilter or post-filter results using metadata and narrow down candidate vectors,
- full-text search, to speed up text-based queries by combining keyword matching with structured filters,
- SQL scans, to optimize WHERE clauses on scalar columns,
- key-value lookups, to enable rapid primary key-based retrievals.
Compared to our open source version, LanceDB Cloud/Enterprise automatically optimizes every type of index whenever the data is updated. For vector indexing, our architecture offers zero-config automation and elastic scalability:
- Automatic Column Detection: When a table contains a single vector column (e.g., vector_embedding), LanceDB:
  - Infers the vector column from the table schema
  - Creates optimized IVF-PQ/HNSW indexes without manual configuration
- Parameter Autotuning: LanceDB analyzes your data distribution to automatically configure indexing parameters.
Vector Index
LanceDB implements state-of-the-art indexing algorithms (IVF-PQ and HNSW), accelerated by our optimized infrastructure. We currently support L2, Cosine, Dot, and Hamming (binary vectors only) as distance metrics. You can create multiple vector indices within a table.
The create_index API returns immediately, but the vector index is built asynchronously.
- If the vector column uses the default name vector and contains more than 256 vectors, a vector index using IVF-PQ with the L2 distance metric is generated automatically. You do not need to call create_index to have the vector index created in this case.
- You can always create a new index with a different index type or distance type using the create_index API. The new index replaces any existing index.
- When cosine is selected as the distance metric, search results report cosine distances, which range from 0 (identical vectors) to 2 (maximally dissimilar vectors).
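The 0-to-2 range follows from defining cosine distance as one minus cosine similarity. The sketch below is a plain-Python illustration of that definition, not LanceDB's implementation; the helper name cosine_distance is hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
# Same direction: similarity 1, so the distance is approximately 0.
same_direction = cosine_distance(v, [2.0, 4.0, 6.0])
# Perpendicular vectors: similarity 0, so the distance is approximately 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity -1, so the distance is approximately 2.
opposite = cosine_distance(v, [-1.0, -2.0, -3.0])
```

Because the measure depends only on direction, scaling a vector does not change its cosine distance to any other vector.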
Check index status
Creating a vector index is fast. It takes only a few minutes to index 1 million vectors with 1536 dimensions. Index status can be checked on the dashboard or retrieved by calling APIs.
Option 1: From the dashboard
After navigating to the desired table page, the corresponding column will display the index once it has been created. The “Index” column will remain blank if no index exists or if index creation is still in progress.
Option 2: Calling APIs
We provide the list_indices() and index_stats() APIs to check index status.
The index name is formed by appending “_idx” to the column name.
Note that list_indices() will not return any information until the index has fully ingested and indexed all available data.
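Since create_index returns before the build finishes, a common pattern is to poll list_indices() until the expected name appears. The following is a minimal sketch under a few assumptions: table is a table handle obtained from a LanceDB connection, each entry returned by list_indices() exposes a name attribute, and create_index accepts metric and vector_column_name keyword arguments. The helper names, poll interval, and timeout are illustrative.

```python
import time

def index_name_for(column):
    # Naming rule stated above: column name plus "_idx".
    return f"{column}_idx"

def wait_for_index(table, index_name, poll_interval=10.0, timeout=600.0):
    """Poll list_indices() until index_name appears.

    Returns True if the index was listed before the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Assumption: each listed index exposes a `.name` attribute.
        names = [idx.name for idx in table.list_indices()]
        if index_name in names:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)

def create_vector_index_and_wait(table, column="vector", metric="cosine", **wait_kwargs):
    # Assumption: create_index takes `metric` and `vector_column_name` keywords.
    table.create_index(metric=metric, vector_column_name=column)
    return wait_for_index(table, index_name_for(column), **wait_kwargs)
```

A bounded timeout keeps a stalled build from blocking the caller indefinitely; the caller can decide whether to retry or fall back to unindexed search.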
Index binary vectors
Binary vectors are useful for applications that use hash-based retrieval, fingerprinting, or any scenario where data can be represented as bits.
When using binary vectors:
- Store them as fixed-size uint8 arrays, with every 8 bits stored as one byte
- Use Hamming distance for similarity search
- Pack binary vectors into bytes to save space
The dimension of a binary vector must be a multiple of 8. For example, a vector of dimension 128 is stored as a uint8 array of size 16.
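The packing rule and the Hamming metric can be illustrated with the standard library alone. This is a sketch rather than the SDK's code path: the helper names are hypothetical, and the most-significant-bit-first ordering is an assumption chosen to match the default of numpy.packbits.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 bits per byte (MSB first)."""
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dimension must be a multiple of 8")
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | (bit & 1)
        packed.append(byte)
    return bytes(packed)

def hamming_distance(a, b):
    """Count the bit positions at which two equal-length packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same packed length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two 128-dimensional binary vectors that differ only in their last 8 bits.
bits_a = [1, 0] * 64
bits_b = [1, 0] * 60 + [0, 1] * 4

packed_a = pack_bits(bits_a)  # 128 bits become 16 bytes
packed_b = pack_bits(bits_b)
distance = hamming_distance(packed_a, packed_b)
```

XOR-ing the packed bytes and counting set bits gives the same result as comparing the unpacked bits one by one, while using one eighth of the storage.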
Currently, binary vector search is only supported in our Python SDK. Support for binary vectors in LanceDB’s TypeScript SDK is coming soon.
IVF_FLAT with Hamming distance is used for indexing binary vectors.
Scalar Index
LanceDB Cloud and LanceDB Enterprise support several types of scalar indices to accelerate search over scalar columns.
- BTREE: The most common type. This index is inspired by the B-tree data structure, although only the first few layers of the tree are cached in memory. It performs well on columns with a large number of unique values and few rows per value.
- BITMAP: This index stores a bitmap for each unique value in the column. It is useful for columns with a finite number of unique values and many rows per value, for example columns that represent “categories”, “labels”, or “tags”.
- LABEL_LIST: A special index that can be used on List<T> columns to support queries with array_contains_all and array_contains_any using an underlying bitmap index. For example, a column that contains lists of tags (e.g. [“tag1”, “tag2”, “tag3”]) can be indexed with a LABEL_LIST index.
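To make the bitmap idea concrete, here is a toy in-memory model of how one bitmap per label can answer array_contains_any (bitwise OR) and array_contains_all (bitwise AND). It is a conceptual sketch only: the class ToyLabelIndex is hypothetical and says nothing about how LanceDB lays out or compresses its actual indices.

```python
from collections import defaultdict

class ToyLabelIndex:
    """One integer bitmap per label; bit i is set when row i carries the label."""

    def __init__(self, rows):
        self.num_rows = len(rows)
        self.bitmaps = defaultdict(int)
        for row_id, labels in enumerate(rows):
            for label in labels:
                self.bitmaps[label] |= 1 << row_id

    def _to_row_ids(self, bitmap):
        return [i for i in range(self.num_rows) if (bitmap >> i) & 1]

    def contains_any(self, labels):
        # Union of the per-label bitmaps.
        result = 0
        for label in labels:
            result |= self.bitmaps.get(label, 0)
        return self._to_row_ids(result)

    def contains_all(self, labels):
        # Intersection of the per-label bitmaps, starting from "all rows".
        result = (1 << self.num_rows) - 1
        for label in labels:
            result &= self.bitmaps.get(label, 0)
        return self._to_row_ids(result)

rows = [["tag1", "tag2"], ["tag2"], ["tag1", "tag2", "tag3"], []]
index = ToyLabelIndex(rows)
any_match = index.contains_any(["tag1", "tag3"])  # rows with tag1 or tag3
all_match = index.contains_all(["tag2", "tag3"])  # rows with both tag2 and tag3
```

The same structure explains why bitmaps suit low-cardinality columns: the number of bitmaps equals the number of distinct values, so many rows per value means few, densely populated bitmaps.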
You can create multiple scalar indices within a table.
The create_scalar_index API returns immediately, but the scalar index is built asynchronously.
You can check the scalar index status using either of the two options described under Check index status.
Full-Text Search Index
We provide performant full-text search on LanceDB Cloud and LanceDB Enterprise, allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.
The create_fts_index API returns immediately, but the FTS index is built asynchronously.
FTS Configuration Parameters
LanceDB supports the following configurable parameters for creating a full-text search index:
Parameter | Type | Default | Description |
---|---|---|---|
with_position | bool | True | Whether to store token positions in the document. Disabling this reduces index size and speeds up indexing but disables phrase query support. |
base_tokenizer | str | “simple” | Base tokenizer for text splitting. Options: “simple” (split by whitespace and punctuation), “whitespace” (split by whitespace only), “raw” (treat the entire text as a single token). |
language | str | “English” | Language used for tokenization (e.g., for stemming or stop-word removal). |
max_token_length | int | 40 | Maximum token length to index. Tokens longer than this are ignored. |
lower_case | bool | True | Convert tokens to lowercase, enabling case-insensitive queries. |
stem | bool | False | Reduce tokens to their root form via stemming (e.g., “running” → “run”). |
remove_stop_words | bool | False | Remove common stop words (e.g., “the”, “and”) during tokenization. |
ascii_folding | bool | False | Normalize accented characters to ASCII equivalents (e.g., “café” → “cafe”). |
You can check the FTS index status using either of the two options described under Check index status.
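As an illustration of how these parameters might be passed, the sketch below merges overrides onto the defaults from the table and forwards them to create_fts_index as keyword arguments. It assumes that table is a LanceDB table handle and that create_fts_index accepts the parameter names listed above as keywords; the helpers fts_options and create_keyword_index are hypothetical.

```python
# Defaults copied from the parameter table above.
FTS_DEFAULTS = {
    "with_position": True,
    "base_tokenizer": "simple",
    "language": "English",
    "max_token_length": 40,
    "lower_case": True,
    "stem": False,
    "remove_stop_words": False,
    "ascii_folding": False,
}

def fts_options(**overrides):
    """Return the documented defaults with overrides applied, rejecting unknown names."""
    unknown = set(overrides) - set(FTS_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown FTS parameters: {sorted(unknown)}")
    return {**FTS_DEFAULTS, **overrides}

def create_keyword_index(table, column):
    """Build an FTS index tuned for keyword matching without phrase queries."""
    options = fts_options(
        with_position=False,     # smaller index, but no phrase query support
        stem=True,               # "running" matches "run"
        remove_stop_words=True,  # drop words such as "the" and "and"
        ascii_folding=True,      # "café" matches "cafe"
    )
    # Assumption: create_fts_index accepts these names as keyword arguments.
    table.create_fts_index(column, **options)
    return options
```

Keeping with_position at its default is the safer choice whenever phrase queries may be needed later, since switching it requires rebuilding the index.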
Update an Index
When new data is added to a table, LanceDB Cloud automatically updates indices in the background.
To see the status of an index, call the index_stats() API to view the number of rows that are currently unindexed. This will be zero when indices are fully up-to-date.
When there are unindexed rows in a table, queries will use brute force methods to search over the unindexed rows. This can mean a temporary increase in latency while the background indexing jobs are running. To avoid this, you can search only the data currently indexed by setting the fast_search query parameter to True.
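The two mechanisms above can be combined: read the unindexed-row count from index_stats(), and opt into fast_search when only indexed data should be searched. The sketch below rests on assumptions that should be checked against the SDK version in use: that the stats result exposes num_unindexed_rows (as an attribute or a dict key), that search accepts fast_search as a keyword argument, and that the returned query supports limit and to_list. The helper names are hypothetical.

```python
def unindexed_rows(table, index_name):
    """Return the number of rows not yet covered by the given index."""
    stats = table.index_stats(index_name)
    # Assumption: the count is exposed as `num_unindexed_rows`,
    # either as a dict key or as an attribute.
    if isinstance(stats, dict):
        return stats["num_unindexed_rows"]
    return stats.num_unindexed_rows

def is_index_up_to_date(table, index_name):
    # Zero unindexed rows means the index is fully up-to-date.
    return unindexed_rows(table, index_name) == 0

def search_indexed_only(table, query_vector, limit=10):
    """Search only indexed data, skipping the brute-force pass over unindexed rows."""
    # Assumption: `fast_search` is accepted as a keyword of search().
    return table.search(query_vector, fast_search=True).limit(limit).to_list()
```

The trade-off is recall on fresh data: with fast_search enabled, rows added since the last index update are not returned until the background indexing job has covered them.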
GPU based Indexing
This feature is currently available only in LanceDB Enterprise. Please contact us to enable GPU indexing for your deployment.
With GPU-powered indexing, LanceDB is able to create a vector index with billions of rows in a few hours.