Ingesting Wikipedia into LanceDB for Scalable Vector Search
Focusing on LanceDB: Learn how to ingest large datasets like Wikipedia (41M rows), define schemas, add data, and create indexes for vector search.
➡️ Try the Live Wikipedia Search App Here! ⬅️
Introduction
This guide demonstrates the steps required to ingest a large text corpus, like the Wikipedia 41M sample dataset, for efficient semantic search and RAG. We will cover:
- Connecting to LanceDB
- Defining your data structure
- Adding data in batches
- Creating vector indexes for fast search
To use it yourself, you can visit this repo and follow the instructions.
Focus: This document details the crucial LanceDB setup and operations. While the full example uses Modal for massive parallelization, the LanceDB techniques shown are fundamental and can be applied within any infrastructure you choose for scaling (Ray, Spark, etc.).
Performance Metrics
When running this workflow on Modal with 50 GPUs:
- Ingestion: Complete 41M records in ~11 minutes
- Indexing: Vector index creation in ~30 minutes
By storing Wikipedia chunks and their embeddings in LanceDB, you create a powerful retriever for:
- Semantic search applications
- Retrieval-Augmented Generation (RAG) pipelines
- Knowledge discovery tools
Prerequisites
Before starting, ensure you have:
- LanceDB Python package
- Other required libraries
- Dataset: Access to the Wikipedia dataset.
- Embedding Model: A Sentence Transformer model.
The following sections walk through the implementation steps.
1. Connecting to LanceDB
Establish a connection to your LanceDB database. You’ll need a project on LanceDB Cloud: visit cloud.lancedb.com to create one and get the project URI and API key used to initialize the connection.
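A minimal connection sketch, assuming LanceDB Cloud (the URI, API key, and region below are placeholders for your own project’s values):

```python
import lancedb

# Placeholders: substitute the project URI and API key from cloud.lancedb.com.
db = lancedb.connect(
    uri="db://your-project-slug",
    api_key="sk_...",
    region="us-east-1",
)
```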
2. Defining the Schema
Define the table schema before creating the table. Two points matter most, as sketched below:
- vector: Stores the embedding. Ensure list_size matches your VECTOR_DIM.
- Metadata: Include fields useful for filtering searches or providing context later.
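Here is one way to express such a schema with PyArrow; everything except the vector column (the field names and the dimension) is illustrative and should match your own chunked dataset:

```python
import pyarrow as pa

VECTOR_DIM = 768  # must match the output dimension of your embedding model

# Field names other than "vector" are illustrative metadata columns.
schema = pa.schema(
    [
        pa.field("id", pa.string()),
        pa.field("title", pa.string()),
        pa.field("text", pa.string()),
        pa.field("vector", pa.list_(pa.float32(), VECTOR_DIM)),  # fixed-size list of floats
    ]
)
```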
3. Creating (or Opening) the Table
Use db.create_table to initialize your table with the defined schema. It’s good practice to handle the case where the table already exists, especially when the ingestion runs in a distributed fashion.
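A sketch of idempotent table creation, assuming the db connection and schema from the snippets above (the table name is a placeholder):

```python
TABLE_NAME = "wikipedia"  # placeholder table name

try:
    table = db.create_table(TABLE_NAME, schema=schema)
except Exception:
    # Another worker (or a previous run) may have created the table already.
    table = db.open_table(TABLE_NAME)
```

Depending on your LanceDB version, create_table may also accept an exist_ok flag that collapses this into a single call.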
4. Adding Data in Batches
Ingest data efficiently using table.add(). Prepare records that match the schema (including the computed embedding vector) and add them in reasonably sized batches; here we used batches of 200K rows at a time.
- Batching (table.add(list_of_dicts)) is much faster than adding records individually. Adjust BATCH_SIZE_LANCEDB based on memory and performance, as in the sketch below.
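A batching sketch along those lines; BATCH_SIZE_LANCEDB and the record layout are assumptions matching the earlier schema:

```python
BATCH_SIZE_LANCEDB = 200_000  # tune to your memory budget and throughput

def add_in_batches(table, records):
    """Add an iterable of schema-matching dicts to the table in batches."""
    batch = []
    for record in records:  # e.g. {"id": ..., "title": ..., "text": ..., "vector": [...]}
        batch.append(record)
        if len(batch) >= BATCH_SIZE_LANCEDB:
            table.add(batch)
            batch = []
    if batch:  # flush the final partial batch
        table.add(batch)
```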
5. Creating a Vector Index
Indexing is crucial for query performance. Create a vector index on the embedding column once all batches have been added. To wait until all data is fully indexed, you can specify the wait_timeout parameter on create_index() or call wait_for_index() on the table. Once index creation is done, you’ll see index labels appear on the indexed columns.
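A sketch of creating the index on the table from the earlier snippets; the metric and timeout are placeholder choices, and the exact index defaults depend on your LanceDB version:

```python
from datetime import timedelta

# Build an ANN index on the embedding column and block until it is ready.
table.create_index(
    metric="cosine",
    vector_column_name="vector",
    wait_timeout=timedelta(minutes=30),
)

# Alternatively, kick off indexing and poll for completion later, e.g.:
# table.wait_for_index(["vector_idx"])  # index name is typically "<column>_idx"
```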
The core pattern is: parallelize data loading, chunking, and embedding generation, then call table.add(batch) within each parallel worker to write to LanceDB. LanceDB’s design handles these concurrent additions efficiently. This example uses Modal to perform distributed embedding generation and ingestion.
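As a rough illustration of that fan-out pattern (not the repo’s actual code), a Modal sketch might look like the following; the app name, secret, GPU type, embedding model, and stubbed shard loader are all assumptions:

```python
import os
import modal

NUM_SHARDS = 50  # illustrative: one shard per GPU worker

app = modal.App("wikipedia-ingest")  # hypothetical app name
image = modal.Image.debian_slim().pip_install("lancedb", "sentence-transformers")


@app.function(image=image, gpu="A10G", secrets=[modal.Secret.from_name("lancedb-creds")])
def ingest_shard(shard_id: int) -> int:
    import lancedb
    from sentence_transformers import SentenceTransformer

    db = lancedb.connect(uri=os.environ["LANCEDB_URI"], api_key=os.environ["LANCEDB_API_KEY"])
    table = db.open_table("wikipedia")
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # placeholder model

    # Stand-in for loading and chunking this worker's slice of Wikipedia.
    rows = [{"id": f"{shard_id}-{i}", "title": "stub", "text": f"chunk {i}"} for i in range(1000)]

    vectors = model.encode([r["text"] for r in rows], normalize_embeddings=True)
    table.add([{**row, "vector": vec.tolist()} for row, vec in zip(rows, vectors)])
    return len(rows)


@app.local_entrypoint()
def main():
    # Fan out one container per shard; Modal runs these workers in parallel.
    total = sum(ingest_shard.map(range(NUM_SHARDS)))
    print(f"Ingested {total} rows")
```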
For brevity, other common steps such as embedding generation and vector normalization are not covered here; follow the steps in the GitHub repo for a complete implementation, including:
- Distributed embedding generation
- Data preprocessing
- Error handling
- Performance optimization