# Datasets

> Browse Lance-format datasets ready to query directly from the Hugging Face Hub.

The [`lance-format`](https://huggingface.co/lance-format) organization on Hugging Face publishes a growing catalog of multimodal datasets in Lance format. Each one bundles the raw data (images, audio, video, or text), pre-computed embeddings, and on-disk vector / full-text indices as first-class columns in the same dataset, so vector search, full-text search, and filtered scans work directly via `hf://` URIs without downloading the dataset first.

This is powered under the hood by the [Lance format's native Hugging Face integration](https://lance.org/integrations/huggingface/) (via the [`pylance`](https://pypi.org/project/pylance/) library). LanceDB sits on top of Lance and gives you a convenient table-style interface to query these datasets straight from the Hub:

```python
import lancedb

# <dataset-name> is a placeholder for any dataset in the catalog below
db = lancedb.connect("hf://datasets/lance-format/<dataset-name>/data")
tbl = db.open_table("train")

# Vector search, full-text search, or filtered scans, directly on the Hub.
# `query` is a placeholder: a query vector or a full-text search string.
results = tbl.search(query).limit(10).to_list()
```

Click any card below for usage examples, schema, and pre-built indices. For a complete walkthrough of the integration itself, see the [Hugging Face Hub integration page](/integrations/ai/huggingface).

## Image Classification

`lance-format/mnist-lance`: A Lance-formatted version of the classic MNIST handwritten-digit dataset covering 70,000 28×28 grayscale digits across ten balanced classes. Each row carries inline PNG bytes, the digit label, the human-readable class name, and a cosine-normalized…

`lance-format/cifar10-lance`: A Lance-formatted version of CIFAR-10 covering 60,000 32×32 RGB images across ten balanced object classes.
Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image embedding, all backed…

`lance-format/fashion-mnist-lance`: A Lance-formatted version of Fashion-MNIST covering 70,000 28×28 grayscale clothing images across ten balanced apparel classes. Each row carries inline PNG bytes, the integer label, the human-readable class name, and a cosine-normalized CLIP image…

`lance-format/food101-lance`: A Lance-formatted version of Food-101, the fine-grained dish-classification benchmark of 101,000 photos spread evenly across 101 dish classes, sourced from ethz/food101. Each row carries the inline JPEG bytes, the integer label, the human-readable…

`lance-format/oxford-pets-lance`: A Lance-formatted version of the Oxford-IIIT Pet dataset (7,390 cat and dog photos across 37 breeds), sourced from pcuenq/oxford-pets. Each row carries the inline JPEG bytes, the breed name, a species flag distinguishing cats from dogs, and a…

`lance-format/stanford-cars-lance`: A Lance-formatted version of the Stanford Cars fine-grained benchmark (8,144 photographs across 196 make/model/year classes), sourced from Multimodal-Fatima/StanfordCars\_train. Each row carries the inline JPEG bytes, the integer class id, a…

`lance-format/imagenet-1k-val-lance`: A Lance-formatted version of the canonical 50,000-image ImageNet-1k (ILSVRC2012) validation split, sourced from benjamin-paine/imagenet-1k. Each row is one image with its integer class id, a string class name, and a cosine-normalized OpenCLIP image…

`lance-format/eurosat-lance`: A Lance-formatted version of EuroSAT, the canonical Sentinel-2 RGB land-cover benchmark, sourced from blanchon/EuroSAT\_RGB.
Each row is a single 64×64 RGB tile with its integer class id, the human-readable class name, and a cosine-normalized…

## Object Detection & Segmentation

`lance-format/coco-detection-2017-lance`: A Lance-formatted version of the COCO 2017 object detection benchmark, sourced from detection-datasets/coco. Each row is one image with its inline JPEG bytes, the full per-image list of bounding boxes, COCO 80-class category ids and names…

`lance-format/pascal-voc-2012-segmentation-lance`: A Lance-formatted version of the Pascal VOC 2012 semantic segmentation split, sourced from nateraw/pascal-voc-2012. Each row pairs an inline JPEG image with the per-pixel PNG segmentation mask and a cosine-normalized OpenCLIP ViT-B-32 image…

`lance-format/ade20k-lance`: A Lance-formatted version of the full ADE20K scene parsing benchmark, sourced from 1aurent/ADE20K. Each row is one scene image with its inline JPEG bytes, a per-pixel semantic segmentation map encoded as PNG bytes, an optional instance map, scene…

`lance-format/kitti-2d-detection-lance`: A Lance-formatted version of the KITTI 2D Object Detection benchmark, sourced from nateraw/kitti so no manual signup or download from cvlibs.net is required. Each row is a single driving frame with inline JPEG bytes, the full set of 2D and 3D…

## Image Retrieval

`lance-format/coco-captions-2017-lance`: A Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of…

`lance-format/flickr30k-lance`: A Lance-formatted version of Flickr30k, redistributed via lmms-lab/flickr30k.
Each row is one image with 5 human-written captions, a cosine-normalized CLIP image embedding, and a cosine-normalized CLIP text embedding of the canonical caption, all…

`lance-format/laion-1m`: A Lance-formatted slice of the LAION image-text corpus (\~1M rows) with inline JPEG bytes, CLIP image embeddings (img\_emb), full metadata, and a pre-built ANN index, all available directly from the Hub at…

## Visual Question Answering

`lance-format/chartqa-lance`: A Lance-formatted version of ChartQA, a benchmark for question answering over scientific and business charts that demands a mix of logical and visual reasoning, redistributed via lmms-lab/ChartQA. Each row carries the chart image as inline JPEG…

`lance-format/docvqa-lance`: A Lance-formatted version of DocVQA, a benchmark for visual question answering over document images such as industry and government scans, multi-page reports, forms, and receipts, redistributed via lmms-lab/DocVQA (DocVQA config). Each row carries…

`lance-format/textvqa-lance`: A Lance-formatted version of TextVQA, visual question answering where the question requires reading text in the image (street signs, product labels, screen captures), sourced from lmms-lab/textvqa. Each row carries the image bytes, the question…

`lance-format/vqav2-lance`: A Lance-formatted version of VQAv2 (open-ended visual question answering on COCO images), sourced from lmms-lab/VQAv2. Each row is one (image, question, 10 annotator answers) triple with paired CLIP image and question embeddings drawn from the…

`lance-format/gqa-testdev-balanced-lance`: A Lance-formatted version of the canonical GQA testdev\_balanced slice (12,578 compositional VQA questions joined against the matching 398 images), sourced from lmms-lab/GQA.
The original redistribution ships instructions and images as separate…

## Text QA

`lance-format/squad-v2-lance`: A Lance-formatted version of SQuAD v2 (the Stanford Question Answering Dataset with both answerable and deliberately unanswerable questions over Wikipedia passages), with MiniLM question embeddings stored inline and ready for retrieval at…

`lance-format/trivia-qa-lance`: A Lance-formatted version of TriviaQA (rc.nocontext config), a large reading-comprehension dataset of trivia questions paired with a canonical answer, accepted aliases, and entity-type metadata, with MiniLM question embeddings stored inline and…

`lance-format/hotpotqa-distractor-lance`: A Lance-formatted version of HotpotQA using the distractor config: multi-hop reading-comprehension questions where each answer requires combining facts from two Wikipedia paragraphs, with 10 candidate paragraphs per question (gold + 8…

`lance-format/natural-questions-val-lance`: A Lance-formatted version of the Natural Questions validation split: 7,830 real Google search queries paired with the full Wikipedia article a human used to answer them, plus 1–5 annotator labels per question. MiniLM question embeddings are stored…

`lance-format/ms-marco-v2.1-lance`: A Lance-formatted version of MS MARCO v2.1, Microsoft's machine-reading-comprehension benchmark built from anonymized Bing query logs. Each row is one user query, the up-to-10 candidate passages Bing retrieved for it with relevance flags, and the…

## Text Corpora

`lance-format/fineweb-edu`: A Lance-formatted version of FineWeb-Edu (over 1.5 billion educational web passages with cleaned text, source metadata, language detection signals, and 384-dim text embeddings), available directly from the Hub at…

## Speech

`lance-format/librispeech-clean-lance`: A Lance-formatted version of the LibriSpeech ASR clean configuration, sourced from openslr/librispeech\_asr.
Each row is one utterance with inline FLAC audio bytes, the reference transcript, a sentence-transformers embedding of that transcript, and…

## Video

`lance-format/openvid-lance`: A Lance-formatted version of the OpenVid-1M corpus (937,957 high-quality clips with inline MP4 bytes, 1024-dim video embeddings, captions, and rich per-clip quality signals), available directly from the Hub at…

## Robotics

`lance-format/lerobot-pusht-lance`: A Lance-formatted version of lerobot/pusht (the canonical PushT benchmark from the Diffusion Policy paper), packaged using the same three-table layout as lance-format/lerobot-xvla-soft-fold so consumers can flip between datasets without changing…

`lance-format/lerobot-xvla-soft-fold`: A Lance-formatted version of lerobot/xvla-soft-fold (a multi-camera robotics dataset from the X-VLA project), packaged as three Lance tables for efficient frame-level training, episode-level trajectory loading, and direct access to the original…

## Share your own dataset

Got a multimodal dataset you want to publish? Convert it to Lance and push it to the Hub! Anyone who opens it gets vector search, full-text search, and filtered scans on the data out of the box, without recreating the embeddings or indexes on their end.

A step-by-step walkthrough on the LanceDB blog covers CLI setup, packaging your dataset, pushing to your namespace, and writing a dataset card.

Or browse the [latest trending Lance datasets](https://huggingface.co/datasets?format=format:lance\&sort=trending) on Hugging Face.
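
To make the "convert it to Lance and push it to the Hub" step more concrete, here is a minimal sketch, not the workflow from the blog walkthrough. It assumes `pyarrow`, `pylance`, and `huggingface_hub` are installed, that the table lives at `data/train.lance` in the repo (mirroring the `hf://…/data` plus `open_table("train")` pattern used on this page), and it uses a hypothetical `toy_embed` function and a hypothetical repo id `your-username/my-dataset-lance` purely for illustration.

```python
import math
from pathlib import Path

DIM = 4  # toy embedding width; real models use hundreds of dimensions


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for a real embedding model: a unit-length character histogram."""
    counts = [0.0] * dim
    for ch in text:
        counts[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def build_rows(texts, embed=toy_embed):
    """Keep the raw text and its embedding side by side in the same row."""
    return [{"id": i, "text": t, "vector": embed(t)} for i, t in enumerate(texts)]


def write_lance(rows, out_dir, table_name="train", dim=DIM):
    """Write rows to <out_dir>/<table_name>.lance (requires pyarrow and pylance)."""
    import lance
    import pyarrow as pa

    schema = pa.schema([
        ("id", pa.int64()),
        ("text", pa.string()),
        ("vector", pa.list_(pa.float32(), dim)),  # fixed-size list column for the embedding
    ])
    table = pa.Table.from_pylist(rows, schema=schema)
    path = Path(out_dir) / f"{table_name}.lance"
    lance.write_dataset(table, str(path), mode="overwrite")
    return path


def push_to_hub(out_dir, repo_id):
    """Upload the local folder to a Hub dataset repo (requires huggingface_hub and a login)."""
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    # Assumption: files go under data/ to mirror the hf://datasets/<org>/<name>/data URIs
    api.upload_folder(folder_path=str(out_dir), path_in_repo="data",
                      repo_id=repo_id, repo_type="dataset")


rows = build_rows(["a photo of a cat", "a photo of a dog"])
# write_lance(rows, "my-dataset")                               # needs pyarrow + pylance
# push_to_hub("my-dataset", "your-username/my-dataset-lance")   # hypothetical repo id
```

The sketch stops short of building vector or full-text indices and writing a dataset card; the walkthrough mentioned above covers CLI setup, packaging, pushing, and the dataset card.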