`explain_plan` and `analyze_plan`. Let’s take a closer look at how they work:
| Method | Purpose | Description |
|---|---|---|
| `explain_plan` | Query analysis | Prints the resolved query plan to understand how the query will be executed. Helpful for identifying slow queries or unexpected query results. |
| `analyze_plan` | Performance tuning | Executes the query and returns a physical execution plan annotated with runtime metrics, including execution time, number of rows processed, and I/O stats. Essential for performance tuning and debugging. |
Query Analysis Tools
explain_plan
Reveals the logical query plan before execution, helping you identify potential issues with query structure and index usage. This tool is useful for:
- Verifying query optimization strategies
- Validating index selection
- Understanding query execution order
- Detecting missing indices
analyze_plan
Executes the query and provides detailed runtime metrics, including:
- Operation duration (`elapsed_compute`)
- Data processing statistics (`output_rows`, `bytes_read`)
- Index effectiveness (`index_comparisons`, `indices_loaded`)
- Resource utilization (`iops`, `requests`)
Use `explain_plan` to verify your query structure and `analyze_plan` to measure and optimize actual performance.
Reading the Execution Plan
To demonstrate query performance analysis, we’ll use a table containing 1.2M rows sampled from the Wikipedia dataset. Initially, the table has no indices, allowing us to observe the impact of optimization. Let’s examine a vector search query that:
- Filters rows where `identifier` is between 0 and 1,000,000
- Returns the top 100 matches
- Projects specific columns: `chunk_index`, `title`, and `identifier`
Execution Plan Components
The execution plan reveals the sequence of operations performed to execute your query. Let’s examine each component:
1. Base Layer (LanceScan)
- Initial data scan loading only specified columns to minimize I/O
- Unordered scan enabling parallel processing
2. First Filter
- Applies the requested filter on the `identifier` column
- Reduces the number of vectors that need KNN computation
3. Vector Search
- Computes L2 (Euclidean) distances between query vector and all vectors that passed the filter
4. Results Processing
- Filters out null distance results
- Sorts by distance and takes top 100 results
- Processes in batches of 1024 for optimal memory usage
5. Data Retrieval
- `RemoteTake` is a key component of Lance’s I/O cache
- Handles efficient data retrieval from remote storage locations
- Fetches specific rows and columns needed for the final output
- Optimizes network bandwidth by only retrieving required data
6. Final Output
- Returns only requested columns and maintains column ordering
Performance Analysis
Let’s use `analyze_plan` to run the query and analyze its performance, which will help us identify potential bottlenecks:
Performance Metrics Analysis
1. Data Loading (LanceScan)
- Scanned 1,200,000 rows from the LanceDB table
- Read 1.86GB of data in 78 I/O operations
- Only loaded necessary columns (`vector` and `identifier`)
- Unordered scan for parallel processing
2. Filtering & Search
- Applied prefilter condition (`identifier > 0 AND identifier < 1000000`)
- Reduced dataset from 1.2M to 1,099,508 rows
- KNN search used L2 (Euclidean) distance metric
- Vector comparisons processed in 1076 batches
3. Results Processing
- KNN results sorted by distance (TopK with fetch=100)
- Null distances filtered out
- Batches coalesced to target size of 1024 rows
- Additional columns fetched for final results
- Remote take operation for 100 results
- Final projection of required columns
Key Observations
- Vector search is the primary bottleneck (1,099,508 vector comparisons)
- Significant I/O overhead (1.86GB data read)
- Full table scan due to lack of indices
- Substantial optimization potential through proper index implementation
Optimized Query Execution
After creating vector and scalar indices, the execution plan shows:

Optimized Plan Analysis
1. Scalar Index Query
- Range filter using scalar index
- Only 2 index files and 562 scalar index parts loaded
- 2.3M index comparisons for matches
2. Vector Search
- IVF index with 20 probes
- Only 20 index parts loaded
- 25,893 vector comparisons
- 2,000 matching vectors
3. Results Processing
- Sorts by distance
- Limits to top 100 results
- Batches into groups of 1024
4. Data Fetching
- Single output batch
- One remote take per row
5. Final Projection
- Returns specified columns: `chunk_index`, `title`, `identifier`, and distance
Performance Improvements
1. Initial Data Access
- Before: Full table scan of 1.2M rows, 1.86GB data
- After: Only 2 indices and 562 scalar index parts loaded
- Benefit: Eliminated table scans for prefilter
2. Vector Search Efficiency
- Before: L2 calculations on 1,099,508 vectors
- After:
- 99.8% reduction in vector comparisons
- Decreased output batches from 1,076 to 20
3. Data Retrieval Optimization
- RemoteTake operation remains consistent
Performance Optimization Guide
1. Index Implementation
When to Create Indices
- Columns used in WHERE clauses
- Vector columns for similarity searches
- Join columns used in `merge_insert`
Index Type Selection
| Data Type | Recommended Index | Use Case |
|---|---|---|
| Vector | IVF_PQ/IVF_HNSW_SQ | Approximate nearest neighbor search |
| Scalar | B-Tree | Range queries and sorting |
| Categorical | Bitmap | Multi-value filters and set operations |
| List | Label_list | Multi-label classification and filtering |
Index Coverage Monitoring
Use `table.index_stats()` to monitor index coverage.
A well-optimized table should have `num_unindexed_rows` near 0.

2. Query Plan Optimization
Common Patterns and Fixes
| Plan Pattern | Optimization |
|---|---|
| `LanceScan` with high `bytes_read` or `iops` | Add a missing index; use `select()` to limit returned columns; check whether the dataset has been compacted |
| Multiple sequential filters | Reorder filter conditions |
The `analyze_plan` output provides detailed metrics to guide optimization efforts.
3. Getting Started with Optimization
For vector search performance:
- Create an ANN index on your vector column(s) as described in the index guide
- If you often filter by metadata, create scalar indices on those columns