The lance-format organization on Hugging Face publishes a growing
catalog of multimodal datasets in Lance format. Each one bundles the raw data (images, audio, video, or text),
pre-computed embeddings, and on-disk vector / full-text indices as first-class columns in the same dataset —
so vector search, full-text search, and filtered scans work directly via hf:// URIs without downloading.
This is powered under the hood by the Lance format’s native Hugging Face integration
(via the pylance library). LanceDB sits on top of Lance and gives you a
convenient table-style interface to query these datasets straight from the Hub:
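A minimal sketch of that flow, using the LAION-1M dataset and its `img_emb` column from the catalog below. It assumes that `lancedb.connect` accepts the `hf://` directory that contains `train.lance` and that the table is then opened by its name without the `.lance` suffix. The helpers `split_hub_uri`, `open_hub_table`, `nearest_rows`, and `demo` are hypothetical names introduced for illustration, not part of LanceDB.

```python
# Sketch only: how the hf:// URI is split between connect() and open_table()
# is an assumption based on the published URI for lance-format/laion-1m.
LAION_URI = "hf://datasets/lance-format/laion-1m/data/train.lance"


def split_hub_uri(uri: str) -> tuple[str, str]:
    """Split '<database>/<table>.lance' into (database URI, table name)."""
    base, _, leaf = uri.rstrip("/").rpartition("/")
    if not base or not leaf.endswith(".lance"):
        raise ValueError(f"expected a URI ending in '<table>.lance', got {uri!r}")
    return base, leaf[: -len(".lance")]


def open_hub_table(uri: str = LAION_URI):
    """Open a Lance dataset on the Hub as a LanceDB table (needs network)."""
    import lancedb  # imported lazily so split_hub_uri works without it

    db_uri, table_name = split_hub_uri(uri)
    return lancedb.connect(db_uri).open_table(table_name)


def nearest_rows(table, query_vector, k: int = 5, vector_column: str = "img_emb"):
    """Return the k rows whose embeddings are closest to query_vector."""
    return (
        table.search(query_vector, vector_column_name=vector_column)
        .limit(k)
        .to_list()
    )


def demo() -> None:
    """Hypothetical walkthrough; call it manually where network access exists."""
    table = open_hub_table()
    print(table.schema)
    # Reuse a stored embedding as the query so no embedding model is needed.
    seed = table.search().select(["img_emb"]).limit(1).to_list()[0]["img_emb"]
    for row in nearest_rows(table, seed, k=3):
        print(row["_distance"])
```

Calling `demo()` requires the `lancedb` package and network access to the Hub. In practice you would likely add `.select([...])` to the search so that the inline image bytes are not fetched for every hit; the column names to select depend on the dataset's schema.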
Image Classification
MNIST
lance-format/mnist-lance: A Lance-formatted version of the classic MNIST handwritten-digit dataset with 70,000 28×28 grayscale digits stored inline alongside CLIP image embeddings and a pre-built ANN index.
CIFAR-10
lance-format/cifar10-lance: A Lance-formatted version of CIFAR-10 with 60,000 32×32 RGB images across 10 classes, stored inline with CLIP embeddings and a pre-built IVF_PQ ANN index.
Fashion-MNIST
lance-format/fashion-mnist-lance: A Lance-formatted version of Fashion-MNIST with 70,000 28×28 grayscale clothing images stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index.
Food-101
lance-format/food101-lance: Lance-formatted version of Food-101 (101,000 food photographs across 101 classes), sourced from ethz/food101. JPEG bytes are stored inline with CLIP image embeddings and an IVF_PQ index.
Oxford-IIIT Pet
lance-format/oxford-pets-lance: Lance-formatted version of the Oxford-IIIT Pet dataset (7,390 cat and dog photos across 37 breeds), sourced from pcuenq/oxford-pets.
Stanford Cars
lance-format/stanford-cars-lance: Lance-formatted version of the Stanford Cars dataset (8,144 training images across 196 fine-grained car make/model/year classes), sourced from Multimodal-Fatima/StanfordCars_train.
ImageNet-1k Validation
lance-format/imagenet-1k-val-lance: A Lance-formatted version of the canonical 50,000-image ImageNet-1k validation split (also known as ILSVRC2012 val), sourced from benjamin-paine/imagenet-1k. All 50,000 JPEGs are stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index.
EuroSAT
lance-format/eurosat-lance: Lance-formatted version of EuroSAT, Sentinel-2 satellite imagery (RGB) covering 27,000 64×64 tiles across 10 land-cover classes, sourced from blanchon/EuroSAT_RGB.
Object Detection & Segmentation
COCO 2017 Detection
lance-format/coco-detection-2017-lance: Lance-formatted version of the COCO 2017 object detection benchmark, sourced from detection-datasets/coco, with 123,287 images and the full per-image list of bounding boxes, category labels, and CLIP image embeddings, all stored inline.
Pascal VOC 2012 Segmentation
lance-format/pascal-voc-2012-segmentation-lance: A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split (sourced from nateraw/pascal-voc-2012), comprising 2,913 image / mask pairs with CLIP image embeddings stored inline and a pre-built IVF_PQ ANN index.
ADE20K
lance-format/ade20k-lance: Lance-formatted version of the full ADE20K scene parsing benchmark (sourced from 1aurent/ADE20K), comprising 27,574 scene images with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline.
KITTI 2D Detection
lance-format/kitti-2d-detection-lance: Lance-formatted version of the KITTI 2D Object Detection benchmark, with 7,481 training images from the KITTI Vision Benchmark Suite with 2D bounding boxes plus the full 3D-box / observation-angle metadata. Sourced from nateraw/kitti so no manual…
Image Retrieval
COCO Captions 2017
lance-format/coco-captions-2017-lance: Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, a CLIP image embedding, and a CLIP text embedding of the canonical caption, all stored inline.
Flickr30k
lance-format/flickr30k-lance: Lance-formatted version of Flickr30k (redistributed via lmms-lab/flickr30k) with 31,783 images, each paired with 5 human-written captions, plus CLIP image and text embeddings stored inline and pre-built ANN indices on both.
LAION-1M
lance-format/laion-1m: A Lance dataset of the LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP embeddings (img_emb), and full metadata, available directly from the Hub at hf://datasets/lance-format/laion-1m/data/train.lance.
Visual Question Answering
ChartQA
lance-format/chartqa-lance: Lance-formatted version of ChartQA (VQA over scientific and business charts that combine logical and visual reasoning), sourced from lmms-lab/ChartQA.
DocVQA
lance-format/docvqa-lance: Lance-formatted version of DocVQA (VQA over document images: industry / government scans, multi-page reports, forms, receipts), sourced from lmms-lab/DocVQA (DocVQA config).
TextVQA
lance-format/textvqa-lance: Lance-formatted version of TextVQA (VQA where the question requires reading text in the image), sourced from lmms-lab/textvqa.
VQAv2
lance-format/vqav2-lance: Lance-formatted version of VQAv2 (Visual Question Answering on COCO images), sourced from lmms-lab/VQAv2. Each row is an (image, question, 10 answers) triple with two CLIP embeddings (image + question text) so the same dataset supports both visual…
GQA testdev-balanced
lance-format/gqa-testdev-balanced-lance: Lance-formatted version of the canonical GQA testdev_balanced slice (12,578 compositional VQA questions joined with the matching 398 images), sourced from lmms-lab/GQA.
Text QA
SQuAD v2
lance-format/squad-v2-lance: Lance-formatted version of SQuAD v2 (Stanford Question Answering Dataset, version 2) with MiniLM sentence embeddings stored inline alongside the questions, contexts, and answers.
TriviaQA
lance-format/trivia-qa-lance: Lance-formatted version of TriviaQA (rc.nocontext config), a question-answering dataset of trivia questions paired with answer aliases, with MiniLM sentence embeddings stored inline.
HotpotQA distractor
lance-format/hotpotqa-distractor-lance: Lance-formatted version of HotpotQA, multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, using the distractor config (10 candidate paragraphs per question, including gold + 8…
Natural Questions Validation
lance-format/natural-questions-val-lance: Lance-formatted version of the Natural Questions validation split, with 7,830 real Google search queries, their full Wikipedia articles, and 1–5 annotator labels per question. Sourced from google-research-datasets/natural_questions.
MS MARCO v2.1
lance-format/ms-marco-v2.1-lance: Lance-formatted version of MS MARCO v2.1, Microsoft’s machine reading comprehension benchmark, with MiniLM query embeddings stored inline alongside the candidate passages and human-written answers.
Text Corpora
FineWeb-Edu
lance-format/fineweb-edu: The FineWeb-Edu dataset with over 1.5 billion rows. Each passage ships with cleaned text, metadata, and 384-dim text embeddings for retrieval-heavy workloads.
Speech
LibriSpeech clean
lance-format/librispeech-clean-lance: Lance-formatted version of the LibriSpeech ASR clean configuration (sourced from openslr/librispeech_asr). Audio is stored inline as FLAC bytes (no re-encoding); transcripts are sentence-embedded so semantic transcript search works out of the box.
Video
OpenVid-1M
lance-format/openvid-lance: Lance-formatted version of the OpenVid dataset with 937,957 high-quality videos stored with inline video blobs, embeddings, and rich metadata.
Robotics
LeRobot PushT
lance-format/lerobot-pusht-lance: Lance-formatted version of lerobot/pusht, the canonical PushT benchmark from the Diffusion Policy paper, packaged using the same three-table layout as the existing lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without…
LeRobot X-VLA Soft-Fold
lance-format/lerobot-xvla-soft-fold: This dataset was created using LeRobot.
Share your own dataset
Got a multimodal dataset you want to publish? Convert it to Lance and push it to the Hub! Anyone who opens it gets vector search, full-text search, and filtered scans on the data out of the box, without recreating the embeddings or indexes on their end.
Upload Lance datasets to the Hugging Face Hub
A step-by-step walkthrough on the LanceDB blog covering CLI setup, packaging your dataset, pushing to your namespace, and writing a dataset card.
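For orientation, here is a rough Python sketch of the convert-and-push idea. It is not the walkthrough's procedure: it assumes the `pylance`, `pyarrow`, and `huggingface_hub` packages, a logged-in Hugging Face account, and a `data/<table>.lance` layout that mirrors the LAION-1M URI listed above. The function names, the `text` and `embedding` column names, and the repo id `your-name/your-dataset` are hypothetical. Index creation is omitted, so a dataset written this way ships embeddings but no pre-built index.

```python
from pathlib import Path


def hub_uri(repo_id: str, table: str = "train") -> str:
    """Build the hf:// URI readers would use, assuming a data/<table>.lance layout."""
    return f"hf://datasets/{repo_id}/data/{table}.lance"


def write_lance_table(texts, embeddings, out_dir, table: str = "train") -> Path:
    """Write texts and their same-length embedding vectors to a local Lance dataset."""
    import lance  # provided by the pylance package
    import pyarrow as pa

    dim = len(embeddings[0])
    flat = [value for vector in embeddings for value in vector]
    # Store vectors as a fixed-size list column of float32 values.
    vectors = pa.FixedSizeListArray.from_arrays(pa.array(flat, type=pa.float32()), dim)
    data = pa.table({"text": pa.array(texts, type=pa.string()), "embedding": vectors})

    path = Path(out_dir) / "data" / f"{table}.lance"
    path.parent.mkdir(parents=True, exist_ok=True)
    lance.write_dataset(data, str(path))
    return path


def push_to_hub(out_dir, repo_id: str) -> None:
    """Upload the local folder to a dataset repo (needs network and a valid token)."""
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=str(out_dir), repo_id=repo_id, repo_type="dataset")
```

A possible call sequence would be `write_lance_table(["a cat"], [[0.1, 0.2]], "out")` followed by `push_to_hub("out", "your-name/your-dataset")`, after which `hub_uri("your-name/your-dataset")` gives the URI to share.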