# Why LanceDB for Training

> Use LanceDB as the multimodal data layer for model training, fine-tuning, curation, and feature engineering workflows.

LanceDB is built for AI teams that need a practical data layer between raw multimodal datasets and model training. Instead of moving data through separate systems for curation, feature engineering, search, manifests, and training, you can keep the whole workflow attached to one versioned LanceDB table.

That table can hold images, video, audio, text, annotations, metadata, embeddings, tokenized fields, model outputs, quality signals, and training-ready tensors. As the dataset evolves, LanceDB lets you add new columns, filter rows, pin versions, and read batches without rewriting the original data.

Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training


LanceDB gives these stages one platform, so curation, feature engineering, retrieval, and training stay connected.

## A connected data lifecycle

Training pipelines usually need more than a pile of files. They need curation, derived features, reproducible splits, fast random access, and a clean path into frameworks such as PyTorch. LanceDB keeps these pieces connected through the same table model, whether you organize a workflow as one table or several related tables.

* Use filters, vector search, full-text search, and retrieval workflows to find the examples that matter: hard negatives, long-tail failure modes, duplicate clusters, low-quality samples, or targeted fine-tuning slices.
* Add embeddings, detections, OCR output, labels, token IDs, hidden states, deduplication flags, or quality scores as new columns. Lance's columnar layout and schema evolution avoid rewriting large raw media columns when you add features.
* Build filtered splits and materialized views from the table instead of exporting CSV manifests. Data versions and tags make it possible to tie a checkpoint back to the exact rows and features used for training.
* Use fast random access and column projection to read only the columns a training step needs. LanceDB tables can be read from local storage or object storage, and integrate with data loading patterns such as PyTorch datasets.

## Lance as the foundation

LanceDB is built on [Lance](https://lance.org/), an open-source lakehouse format designed for multimodal AI data. The table below highlights the Lance features that enable the multimodal lakehouse on top.

| Capability               | Why it matters for training                                                           |
| ------------------------ | ------------------------------------------------------------------------------------- |
| **Multimodal columns**   | Store raw bytes, annotations, metadata, embeddings, and features together.            |
| **Fast random access**   | Support shuffled and sampled reads without reshuffling the dataset on disk.           |
| **Column projection**    | Read only images, tokens, labels, embeddings, or hidden states needed by a given run. |
| **Schema evolution**     | Add new feature columns without rewriting existing media columns.                     |
| **Versioning**           | Reproduce experiments against the same table snapshot, even as the dataset evolves.   |
| **Search and filtering** | Find and materialize useful training slices directly from the table.                  |

## Search inside training workflows

Search is not limited to QA systems, agents, or production retrieval apps. It is also a practical way to inspect, curate, and improve training data:

* Find visually similar examples when debugging model failures.
* Retrieve hard negatives or near-duplicates for contrastive training.
* Combine vector search, full-text search, and metadata filters to build targeted fine-tuning slices.
* Reuse the same table for both offline curation and production retrieval.

In LanceDB, retrieval and training workflows can operate over the same multimodal tables instead of forcing teams to manage separate data systems for each stage.

## Projects using LanceDB for training workflows

* A platform for reproducible world-model research built on a LanceDB data layer, reporting faster data loading on Push-T workloads.
* A joint-embedding predictive world model from pixels, trained on the stable-worldmodel platform and its LanceDB data layer.
* A drop-in LanceDB backend for Hugging Face LeRobot datasets with faster loading across robotics datasets.

In the world-model ecosystem, [stable-worldmodel](https://github.com/galilai-group/stable-worldmodel) reports 3-4x faster data loading on Push-T versus HDF5 / MP4 at a fraction of the disk footprint.

Across these projects, LanceDB and Lance provide the multimodal data layer that keeps raw observations, annotations, features, and training access patterns in one format instead of scattering them across task-specific stores.
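
To make the access pattern behind filtered splits, shuffled reads, and column projection more concrete, here is a minimal, library-free Python sketch. It is only an illustration of the idea: the in-memory `table` dictionary and the helper names `select_rows`, `shuffled`, and `read_batches` are hypothetical and are not part of the LanceDB or Lance API. The point it tries to show is that a training read can be described as a list of row indices plus a list of column names, so the stored data never has to be reordered or copied.

```python
import random

# Toy column-oriented table: each column is a list and rows are addressed by index.
# This stands in for a real table; it is NOT the LanceDB API.
table = {
    "image": [f"img_{i}.jpg".encode() for i in range(10)],
    "label": [i % 2 for i in range(10)],
    "quality": [i / 10 for i in range(10)],
}


def select_rows(table, column, predicate):
    """Return the indices of rows whose value in `column` passes `predicate`."""
    return [i for i, value in enumerate(table[column]) if predicate(value)]


def shuffled(indices, seed):
    """Return a shuffled copy of `indices`; the stored columns are not reordered."""
    order = list(indices)
    random.Random(seed).shuffle(order)
    return order


def read_batches(table, indices, columns, batch_size):
    """Yield batches that contain only the requested columns, following `indices`."""
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield {name: [table[name][i] for i in chunk] for name in columns}


# 1. Filter: keep rows above a quality threshold (hypothetical curation rule).
keep = select_rows(table, "quality", lambda q: q >= 0.3)

# 2. Shuffle with a fixed seed so the same split can be rebuilt later.
order = shuffled(keep, seed=0)

# 3. Split the shuffled indices into train and validation sets.
split = int(0.8 * len(order))
train_idx, val_idx = order[:split], order[split:]

# 4. Read training batches, projecting only the columns this run needs.
train_batches = list(read_batches(table, train_idx, ["image", "label"], batch_size=2))
```

Because the split is just a seeded permutation of row indices, recording the seed and the filter (together with a pinned table version in a real system) is enough to rebuild the same training set later. The `quality` column is used for filtering but is never read into a batch.
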
## Next steps

* Learn how to use LanceDB permutations to select rows, project columns, split datasets, and shuffle training reads.
* Use LanceDB tables and permutations with `torch.utils.data.DataLoader`.
* Fine-tune an AV perception model on curated failure-mode slices backed by one LanceDB table.
* Fine-tune a VLM on TextVQA using LanceDB and Geneva to cache expensive training features.