Understanding Query Performance
LanceDB provides two powerful tools for query analysis and optimization:

- `explain_plan`: Reveals the logical query plan before execution, helping you identify potential issues with query structure and index usage. This tool is useful for:
  - Verifying query optimization strategies
  - Validating index selection
  - Understanding query execution order
  - Detecting missing indices
- `analyze_plan`: Executes the query and provides detailed runtime metrics, including:
  - Operation duration (`elapsed_compute`)
  - Data processing statistics (`output_rows`, `bytes_read`)
  - Index effectiveness (`index_comparisons`, `indices_loaded`)
  - Resource utilization (`iops`, `requests`)

Together, these tools offer a comprehensive view of query performance, from planning to execution. Use `explain_plan` to verify your query structure and `analyze_plan` to measure and optimize actual performance.
Reading the Execution Plan
To demonstrate query performance analysis, we’ll use a table containing 1.2M rows sampled from the Wikipedia dataset. Initially, the table has no indices, allowing us to observe the impact of optimization.
Let’s examine a vector search query that:

- Filters rows where `identifier` is between 0 and 1,000,000
- Returns the top 100 matches
- Projects specific columns: `chunk_index`, `title`, and `identifier`
The execution plan reveals the sequence of operations performed to execute your query. Let’s examine each component of the plan:
- Base Layer (LanceScan)
  - Initial data scan loading only the specified columns to minimize I/O
  - Unordered scan enabling parallel processing
- First Filter
  - Applies the requested filter on the `identifier` column, reducing the number of vectors that need KNN computation
- Vector Search
  - Computes L2 (Euclidean) distances between the query vector and all vectors that passed the filter
- Results Processing
  - Filters out null distance results, which occur when vectors are NULL or when using the cosine distance metric with zero vectors
  - Sorts by distance and takes the top 100 results
  - Processes in batches of 1024 for optimal memory usage
- Data Retrieval
  - `RemoteTake` is a key component of Lance’s I/O cache
  - Handles efficient data retrieval from remote storage locations
  - Fetches only the specific rows and columns (e.g., `chunk_index` and `title`) needed for the final output
  - Optimizes network bandwidth by retrieving only the required data
- Final Output
  - Returns only the requested columns and maintains column ordering
This plan demonstrates a basic search without index optimizations: it performs a full scan and filter before vector search.
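Conceptually, this unindexed path boils down to a scalar filter, a brute-force distance computation, and a top-k sort. A plain NumPy sketch of those steps (the data is random and the sizes are scaled down):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 8)).astype(np.float32)  # stand-in table
identifier = np.arange(10_000)
query = rng.normal(size=8).astype(np.float32)

# First Filter: apply the scalar predicate before any distance math
mask = (identifier > 0) & (identifier < 1_000_000)

# Vector Search: L2 distance to every vector that passed the filter
dists = np.linalg.norm(vectors[mask] - query, axis=1)

# Results Processing: sort by distance and keep the top 100
top = np.argsort(dists)[:100]
top_ids = identifier[mask][top]
print(len(top_ids))  # 100
```

Because every surviving vector is compared against the query, the cost grows linearly with the filtered row count, which is exactly why the vector index below helps.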
Performance Analysis
Let’s use `analyze_plan` to run the query and measure its performance, which will help us identify potential bottlenecks:
The `analyze_plan` output reveals detailed performance metrics for each step of the query execution:
- Data Loading (LanceScan)
  - Scanned 1,200,000 rows from the LanceDB table
  - Read 1.86GB of data in 78 I/O operations
  - Only loaded the necessary columns (`vector` and `identifier`) for the initial processing
  - The scan was unordered, allowing for parallel processing
- Filtering & Search
  - Applied a prefilter condition (`identifier > 0 AND identifier < 1000000`) before the vector search
  - This reduced the dataset from 1.2M to 1,099,508 rows, optimizing the subsequent KNN computation
  - The KNN search used the L2 (Euclidean) distance metric
  - Vector comparisons were processed in 1,076 batches for efficient memory and cache usage
- Results Processing
  - The KNN results were sorted by distance (TopK with fetch=100)
  - Null distances were filtered out to ensure result quality
  - Batches were coalesced to a target size of 1024 rows for optimal memory usage
  - Additional columns (`chunk_index` and `title`) were fetched for the final results
  - The remote take operation fetched these additional columns for each of the 100 results
  - The final projection selected only the required columns for output
Key observations:
- The vector search operation is the primary performance bottleneck, requiring KNN computation across 1,099,508 vectors
- Significant I/O overhead with 1.86GB of data read through multiple I/O requests
- Query execution involves a full table scan due to lack of indices
- Substantial optimization potential through proper index implementation
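These observations can be quantified directly from the reported metrics. A quick back-of-the-envelope check (assuming 1.86GB means binary gigabytes):

```python
# Figures reported by analyze_plan for the unindexed query
rows_scanned = 1_200_000
rows_after_filter = 1_099_508
bytes_read = 1.86 * 1024**3

selectivity = rows_after_filter / rows_scanned
bytes_per_row = bytes_read / rows_scanned

print(f"filter keeps {selectivity:.1%} of rows")   # ~91.6%
print(f"~{bytes_per_row:.0f} bytes read per row")  # ~1664 bytes
```

The filter only removes about 8% of the rows, so nearly the whole table is read and compared; indexing, not filtering, is where the savings are.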
Let’s create a vector index on the `vector` column and a scalar index on the `identifier` column where the filter is applied. `explain_plan` and `analyze_plan` now show the following:
Let’s break down this optimized query execution plan from bottom to top:
- Scalar Index Query (Bottom Layer):
  - Performs the range filter using the scalar index; only 2 index files and 562 scalar index parts are loaded
  - Made 2.3M index comparisons to find matches
- Vector Search:
  - Uses the IVF index with 20 probes, so only 20 index parts needed to be loaded
  - Made 25,893 vector comparisons
  - Found 2,000 matching vectors
- Results Processing:
  - Sorts results by distance
  - Limits to the top 100 results
  - Batches results into groups of 1024
- Data Fetching:
  - Consolidates into just 1 output batch, with one remote take per row
- Final Projection:
  - Returns only the specified columns: `chunk_index`, `title`, `identifier`, and distance
The plan demonstrates a combined scalar index filter + vector similarity search, optimized to first filter by `identifier` before performing the ANN search.
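The IVF mechanics can be sketched in NumPy: vectors are bucketed by their nearest centroid at index time, and at query time only the partitions closest to the query are probed. This is a simplified model of what the real index does, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(5_000, 8)).astype(np.float32)
centroids = vectors[rng.choice(5_000, size=50, replace=False)]  # 50 partitions

# Index time: assign each vector to its nearest centroid
assign = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

# Query time: probe only the 20 partitions closest to the query
query = rng.normal(size=8).astype(np.float32)
nprobes = 20
probed = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobes]
candidates = np.flatnonzero(np.isin(assign, probed))

# Exact L2 search restricted to the probed partitions
dists = np.linalg.norm(vectors[candidates] - query, axis=1)
top100 = candidates[np.argsort(dists)[:100]]
print(len(top100), "results from", len(candidates), "candidates")
```

Raising the probe count trades speed for recall: more partitions scanned means more comparisons but fewer missed neighbors.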
Key Improvements:
- Initial Data Access
  - Before optimization: Full table scan of 1.2M rows, reading 1.86GB of data
  - After optimization: With a scalar index on the `identifier` column, only 2 indices (vector and scalar) and 562 scalar index parts were loaded
  - Benefit: Eliminated the need for a table scan to compute the prefilter, significantly reducing I/O operations
- Vector Search Efficiency
  - Before optimization: L2 distance calculations performed on 1,099,508 vectors
  - After optimization:
    - Reduced vector comparisons by roughly 98% (from ~1.1M to ~26K)
    - Decreased output batches from 1,076 to 20
- Data Retrieval Optimization
  - The RemoteTake operation remains the same.
Performance Optimization Tips
- Index Implementation Guide
When to Create Indices
- Columns used in WHERE clauses
- Vector columns for similarity searches
- Join columns used in `merge_insert`
Index Type Selection
| Data Type | Recommended Index | Use Case |
|---|---|---|
| Vector | IVF_PQ/IVF_HNSW_SQ | Approximate nearest neighbor search |
| Scalar | B-Tree | Range queries and sorting |
| Categorical | Bitmap | Multi-value filters and set operations |
| List | Label_list | Multi-label classification and filtering |
More details on indexing can be found here.
Use `table.index_stats()` to monitor index coverage. A well-optimized table should have `num_unindexed_rows ~ 0`.
- Query Plan Optimization
Common Patterns and Fixes:

| Plan Pattern | Optimization |
|---|---|
| `LanceScan` with high `bytes_read` or `iops` | Add a missing index |
| | Use `select()` to limit returned columns |
| | Check whether the dataset has been compacted |
| Multiple sequential filters | Reorder filter conditions |
Regularly analyze your query plans to identify and address performance bottlenecks. The `analyze_plan` output provides detailed metrics to guide optimization efforts.