LanceDB provides two powerful tools for query analysis and optimization:
explain_plan
: Reveals the logical query plan before execution,
helping you identify potential issues with query structure and index usage.
This tool is useful for:
analyze_plan
: Executes the query and provides detailed runtime metrics,
including:
Together, these tools offer a comprehensive view of query performance,
from planning to execution. Use explain_plan
to verify your query
structure and analyze_plan
to measure and optimize actual performance.
To demonstrate query performance analysis, we’ll use a table containing 1.2M rows sampled from the Wikipedia dataset. Initially, the table has no indices, allowing us to observe the impact of optimization.
Let’s examine a vector search query that:
identifier
is between 0 and 1,000,000chunk_index
, title
, and identifier
The execution plan reveals the sequence of operations performed to execute your query. Let’s examine each component of the plan:
Base Layer (LanceScan)
First Filter
identifier
column (will reduce the number of vectors that need
KNN computation)Vector Search
Results Processing
Data Retrieval
RemoteTake
is a key component of Lance’s I/O cachechunk_index
and title
) needed for the final outputFinal Output
This plan demonstrates a basic search without index optimizations: it performs a full scan and filter before vector search.
Let’s use analyze_plan
to run the query and analyze the query performance,
which will help us identify potential bottlenecks:
The analyze_plan
output reveals detailed performance metrics for each step of
the query execution:
Data Loading (LanceScan)
vector
and identifier
) for the initial processingFiltering & Search
identifier > 0 AND identifier < 1000000
) before vector searchResults Processing
chunk_index
and title
) were fetched for the final resultsKey observations:
Let’s create the vector index on the vector
column and the scalar index
on the identifier
columns where the filter is applied. explain_plan
and analyze_plan
now show
the following:
Let’s break down this optimized query execution plan from bottom to top:
The plan demonstrates a combined scalar index filter + vector similarity search,
optimized to first filter by identifier
before performing the ANN search.
identifier
column, only 2 indices (vector and scalar) and 562 scalar index parts were loadedWhen to Create Indices
merge_insert
Index Type Selection
Data Type | Recommended Index | Use Case |
---|---|---|
Vector | IVF_PQ/IVF_HNSW_SQ | Approximate nearest neighbor search |
Scalar | B-Tree | Range queries and sorting |
Categorical | Bitmap | Multi-value filters and set operations |
List | Label_list | Multi-label classification and filtering |
More details on indexing can be found here.
Use table.index_stats()
to monitor index coverage.
A well-optimized table should have num_unindexed_rows ~ 0
.
Common Patterns and Fixes:
Plan Pattern | Optimization |
---|---|
LanceScan with high bytes_read or iops | Add missing index |
Use select() to limit returned columns | |
Check whether the dataset has been compacted | |
Multiple sequential filters | Reorder filter conditions |
Regularly analyze your query plans to identify and address performance bottlenecks.
The analyze_plan
output provides detailed metrics to guide optimization efforts.
LanceDB provides two powerful tools for query analysis and optimization:
explain_plan
: Reveals the logical query plan before execution,
helping you identify potential issues with query structure and index usage.
This tool is useful for:
analyze_plan
: Executes the query and provides detailed runtime metrics,
including:
Together, these tools offer a comprehensive view of query performance,
from planning to execution. Use explain_plan
to verify your query
structure and analyze_plan
to measure and optimize actual performance.
To demonstrate query performance analysis, we’ll use a table containing 1.2M rows sampled from the Wikipedia dataset. Initially, the table has no indices, allowing us to observe the impact of optimization.
Let’s examine a vector search query that:
identifier
is between 0 and 1,000,000chunk_index
, title
, and identifier
The execution plan reveals the sequence of operations performed to execute your query. Let’s examine each component of the plan:
Base Layer (LanceScan)
First Filter
identifier
column (will reduce the number of vectors that need
KNN computation)Vector Search
Results Processing
Data Retrieval
RemoteTake
is a key component of Lance’s I/O cachechunk_index
and title
) needed for the final outputFinal Output
This plan demonstrates a basic search without index optimizations: it performs a full scan and filter before vector search.
Let’s use analyze_plan
to run the query and analyze the query performance,
which will help us identify potential bottlenecks:
The analyze_plan
output reveals detailed performance metrics for each step of
the query execution:
Data Loading (LanceScan)
vector
and identifier
) for the initial processingFiltering & Search
identifier > 0 AND identifier < 1000000
) before vector searchResults Processing
chunk_index
and title
) were fetched for the final resultsKey observations:
Let’s create the vector index on the vector
column and the scalar index
on the identifier
columns where the filter is applied. explain_plan
and analyze_plan
now show
the following:
Let’s break down this optimized query execution plan from bottom to top:
The plan demonstrates a combined scalar index filter + vector similarity search,
optimized to first filter by identifier
before performing the ANN search.
identifier
column, only 2 indices (vector and scalar) and 562 scalar index parts were loadedWhen to Create Indices
merge_insert
Index Type Selection
Data Type | Recommended Index | Use Case |
---|---|---|
Vector | IVF_PQ/IVF_HNSW_SQ | Approximate nearest neighbor search |
Scalar | B-Tree | Range queries and sorting |
Categorical | Bitmap | Multi-value filters and set operations |
List | Label_list | Multi-label classification and filtering |
More details on indexing can be found here.
Use table.index_stats()
to monitor index coverage.
A well-optimized table should have num_unindexed_rows ~ 0
.
Common Patterns and Fixes:
Plan Pattern | Optimization |
---|---|
LanceScan with high bytes_read or iops | Add missing index |
Use select() to limit returned columns | |
Check whether the dataset has been compacted | |
Multiple sequential filters | Reorder filter conditions |
Regularly analyze your query plans to identify and address performance bottlenecks.
The analyze_plan
output provides detailed metrics to guide optimization efforts.