Documentation Index

Fetch the complete documentation index at: https://docs.lancedb.com/llms.txt

Use this file to discover all available pages before exploring further.

The lance-format organization on Hugging Face publishes a growing catalog of multimodal datasets in Lance format. Each one bundles the raw data (images, audio, video, or text), pre-computed embeddings, and on-disk vector / full-text indices as first-class columns in the same dataset — so vector search, full-text search, and filtered scans work directly via hf:// URIs without downloading anything locally. Under the hood this is powered by the Lance format’s native Hugging Face integration (via the pylance library). LanceDB sits on top of Lance and gives you a convenient table-style interface for querying these datasets straight from the Hub:
import lancedb

db = lancedb.connect("hf://datasets/lance-format/<dataset-name>/data")
tbl = db.open_table("train")

# Vector search, full-text search, or filtered scans — directly on the Hub.
# `query` may be a vector (list / NumPy array) or, where a full-text index
# exists, a plain text string.
results = tbl.search(query).limit(10).to_list()
Click any card below for usage examples, schema, and pre-built indices. For a complete walkthrough of the integration itself, see the Hugging Face Hub integration page.

Image Classification

MNIST

lance-format/mnist-lance — A Lance-formatted version of the classic MNIST handwritten-digit dataset with 70,000 28×28 grayscale digits stored inline alongside CLIP image embeddings and a pre-built ANN index.

CIFAR-10

lance-format/cifar10-lance — A Lance-formatted version of CIFAR-10 with 60,000 32×32 RGB images across 10 classes, stored inline with CLIP embeddings and a pre-built IVF_PQ ANN index.

Fashion-MNIST

lance-format/fashion-mnist-lance — A Lance-formatted version of Fashion-MNIST with 70,000 28×28 grayscale clothing images stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index.

Food-101

lance-format/food101-lance — Lance-formatted version of Food-101 — 101,000 food photographs across 101 classes — sourced from ethz/food101. Inline JPEG bytes + CLIP image embeddings + IVF_PQ.

Oxford-IIIT Pet

lance-format/oxford-pets-lance — Lance-formatted version of the Oxford-IIIT Pet dataset — 7,390 cat & dog photos across 37 breeds — sourced from pcuenq/oxford-pets.

Stanford Cars

lance-format/stanford-cars-lance — Lance-formatted version of the Stanford Cars dataset — 8,144 training images across 196 fine-grained car make/model/year classes — sourced from Multimodal-Fatima/StanfordCars_train.

ImageNet-1k Validation

lance-format/imagenet-1k-val-lance — A Lance-formatted version of the canonical 50,000-image ImageNet-1k validation split (also known as ILSVRC2012 val) sourced from benjamin-paine/imagenet-1k. All 50 k JPEGs are stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index.

EuroSAT

lance-format/eurosat-lance — Lance-formatted version of EuroSAT — Sentinel-2 satellite imagery (RGB) covering 27,000 64×64 tiles across 10 land-cover classes, sourced from blanchon/EuroSAT_RGB.

Object Detection & Segmentation

COCO 2017 Detection

lance-format/coco-detection-2017-lance — Lance-formatted version of the COCO 2017 object detection benchmark — sourced from detection-datasets/coco — with 123,287 images and the full per-image list of bounding boxes, category labels, and CLIP image embeddings, all stored inline.

Pascal VOC 2012 Segmentation

lance-format/pascal-voc-2012-segmentation-lance — A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split (sourced from nateraw/pascal-voc-2012) — 2,913 image / mask pairs with CLIP image embeddings stored inline and a pre-built IVF_PQ ANN index.

ADE20K

lance-format/ade20k-lance — Lance-formatted version of the full ADE20K scene parsing benchmark (sourced from 1aurent/ADE20K) — 27,574 scene images with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline.

KITTI 2D Detection

lance-format/kitti-2d-detection-lance — Lance-formatted version of the KITTI 2D Object Detection benchmark — 7,481 training images from the KITTI Vision Benchmark Suite with 2D bounding boxes plus the full 3D-box / observation-angle metadata. Sourced from nateraw/kitti.

Image Retrieval

COCO Captions 2017

lance-format/coco-captions-2017-lance — Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, CLIP image embedding, and CLIP text embedding of the canonical caption — all stored inline.

Flickr30k

lance-format/flickr30k-lance — Lance-formatted version of Flickr30k (re-distributed via lmms-lab/flickr30k) — 31,783 images, each paired with 5 human-written captions, with CLIP image and text embeddings stored inline and pre-built ANN indices on both.

LAION-1M

lance-format/laion-1m — A Lance-formatted subset of the LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP embeddings (img_emb), and full metadata available directly from the Hub: hf://datasets/lance-format/laion-1m/data/train.lance.

Visual Question Answering

ChartQA

lance-format/chartqa-lance — Lance-formatted version of ChartQA — VQA over scientific and business charts that combine logical and visual reasoning — sourced from lmms-lab/ChartQA.

DocVQA

lance-format/docvqa-lance — Lance-formatted version of DocVQA — VQA over document images (industry / government scans, multi-page reports, forms, receipts) — sourced from lmms-lab/DocVQA (DocVQA config).

TextVQA

lance-format/textvqa-lance — Lance-formatted version of TextVQA — VQA where the question requires reading text in the image — sourced from lmms-lab/textvqa.

VQAv2

lance-format/vqav2-lance — Lance-formatted version of VQAv2 — Visual Question Answering on COCO images, sourced from lmms-lab/VQAv2. Each row is an (image, question, 10 answers) triple with two CLIP embeddings (image + question text).

GQA testdev-balanced

lance-format/gqa-testdev-balanced-lance — Lance-formatted version of the canonical GQA testdev_balanced slice — 12,578 compositional VQA questions joined with the matching 398 images — sourced from lmms-lab/GQA.

Text QA

SQuAD v2

lance-format/squad-v2-lance — Lance-formatted version of SQuAD v2 — Stanford Question Answering Dataset, version 2 — with MiniLM sentence embeddings stored inline alongside the questions, contexts, and answers.

TriviaQA

lance-format/trivia-qa-lance — Lance-formatted version of TriviaQA (rc.nocontext config) — a question-answering dataset of trivia questions paired with answer aliases — with MiniLM sentence embeddings stored inline.

HotpotQA distractor

lance-format/hotpotqa-distractor-lance — Lance-formatted version of HotpotQA — multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs — using the distractor config (10 candidate paragraphs per question).

Natural Questions Validation

lance-format/natural-questions-val-lance — Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. Sourced from google-research-datasets/natural_questions.

MS MARCO v2.1

lance-format/ms-marco-v2.1-lance — Lance-formatted version of MS MARCO v2.1 — Microsoft’s machine reading comprehension benchmark — with MiniLM query embeddings stored inline alongside the candidate passages and human-written answers.

Text Corpora

FineWeb-Edu

lance-format/fineweb-edu — Lance-formatted version of FineWeb-Edu with over 1.5 billion rows. Each passage ships with cleaned text, metadata, and 384-dim text embeddings for retrieval-heavy workloads.

Speech

LibriSpeech clean

lance-format/librispeech-clean-lance — Lance-formatted version of the LibriSpeech ASR clean configuration (sourced from openslr/librispeech_asr). Audio is stored inline as FLAC bytes (no re-encoding); transcripts are sentence-embedded so semantic transcript search works out of the box.

Video

OpenVid-1M

lance-format/openvid-lance — Lance-formatted version of the OpenVid dataset with 937,957 high-quality videos stored with inline video blobs, embeddings, and rich metadata.

Robotics

LeRobot PushT

lance-format/lerobot-pusht-lance — Lance-formatted version of lerobot/pusht — the canonical PushT benchmark from the Diffusion Policy paper — packaged using the same three-table layout as the existing lance-format/lerobot-xvla-soft-fold.

LeRobot X-VLA Soft-Fold

lance-format/lerobot-xvla-soft-fold — This dataset was created using LeRobot.

Share your own dataset

Got a multimodal dataset you want to publish? Convert it to Lance and push it to the Hub! Anyone who opens it gets vector search, full-text search, and filtered scans on the data out of the box, without recreating the embeddings or indexes on their end.

Upload Lance datasets to the Hugging Face Hub

A step-by-step walkthrough on the LanceDB blog covering CLI setup, packaging your dataset, pushing to your namespace, and writing a dataset card.
Or browse the latest trending Lance datasets on Hugging Face.