A Lance-formatted version of the classic MNIST handwritten-digit dataset with 70,000 28×28 grayscale digits stored inline alongside CLIP image embeddings and a pre-built ANN index.
## Key features

- All multimodal data (image bytes + embeddings) stored inline in the same Lance dataset — no sidecar files, no external image folders.
- Pre-computed CLIP embeddings (OpenCLIP ViT-B-32 / laion2b_s34b_b79k, 512-dim, L2-normalized) shipped with an IVF_PQ index for instant similarity search.
- BTREE index on `label` and BITMAP index on `label_name` for sub-millisecond filtering.
- Standard train/test splits, ready to use with `lance.dataset(...)` or `datasets.load_dataset(...)`.
## Splits

| Split | Rows |
|---|---|
| train | 60,000 |
| test | 10,000 |
## Schema

| Column | Type | Notes |
|---|---|---|
| id | int64 | Row index within the split |
| image | large_binary | Inline PNG bytes (28×28 grayscale) |
| label | int32 | Digit class id (0-9) |
| label_name | string | Human-readable class ("0".."9") |
| image_emb | fixed_size_list&lt;float32, 512&gt; | CLIP image embedding (cosine-normalized) |
## Pre-built indices

- IVF_PQ on `image_emb` — vector similarity search (metric=cosine)
- BTREE on `label` — fast equality / range filters
- BITMAP on `label_name` — fast filters on the 10 class names
## Load with `datasets.load_dataset`

```python
import datasets

hf_ds = datasets.load_dataset("lance-format/mnist-lance", split="train", streaming=True)
for row in hf_ds.take(3):
    print(row["label"], row["label_name"])
```
## Load directly with Lance (recommended)

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance")
print(ds.count_rows(), ds.schema.names)
print(ds.list_indices())
```
## Load with LanceDB

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data")
tbl = db.open_table("train")
print(len(tbl))
```
Tip — for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy:

```shell
hf download lance-format/mnist-lance --repo-type dataset --local-dir ./mnist-lance
```

Then `lance.dataset("./mnist-lance/data/train.lance")`.
## Vector search example

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance")

# Use the first row's stored embedding as the query vector (a list of 512 floats).
ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"]

neighbors = ds.scanner(
    nearest={"column": "image_emb", "q": ref, "k": 5, "nprobes": 16, "refine_factor": 30},
    columns=["id", "label", "label_name"],
).to_table().to_pylist()
print(neighbors)
```
## LanceDB vector search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data")
tbl = db.open_table("train")

# Grab one stored embedding to use as the query vector.
ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0]
query_embedding = ref["image_emb"]

results = (
    tbl.search(query_embedding)
    .metric("cosine")
    .select(["id", "label", "label_name"])
    .limit(5)
    .to_list()
)
for row in results:
    print(row["id"], row["label"], row["label_name"])
```
## Filter by class

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance")
sevens = ds.scanner(filter="label = 7", columns=["id"], limit=10).to_table()
print(sevens)
```
## Filter by class with LanceDB

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/mnist-lance/data")
tbl = db.open_table("train")
sevens = (
    tbl.search()
    .where("label = 7")
    .select(["id"])
    .limit(10)
    .to_list()
)
print(sevens)
```
## Working with images

```python
from pathlib import Path

import lance

ds = lance.dataset("hf://datasets/lance-format/mnist-lance/data/train.lance")
row = ds.take([0], columns=["image", "label"]).to_pylist()[0]
Path("digit_0.png").write_bytes(row["image"])
print("label =", row["label"])
```
Images are stored inline as PNG bytes; scanning columns like `label` does not pay the I/O cost of loading image bytes.
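When you do need pixels, the inline bytes decode with any PNG library. A small helper for turning them into an array, assuming Pillow and NumPy are installed (neither is required by Lance itself):

```python
import io

import numpy as np
from PIL import Image


def png_to_array(png_bytes: bytes) -> np.ndarray:
    """Decode inline PNG bytes into a (28, 28) uint8 grayscale array."""
    with Image.open(io.BytesIO(png_bytes)) as img:
        return np.asarray(img.convert("L"))


# Usage against the dataset (same row as the example above):
#   row = ds.take([0], columns=["image"]).to_pylist()[0]
#   arr = png_to_array(row["image"])   # shape (28, 28), dtype uint8
```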
## Why Lance?

- One dataset for images + embeddings + indices + metadata — no sidecar files to manage.
- On-disk vector and full-text indices live next to the data, so search works on both local copies and the Hub.
- Schema evolution lets you add new columns (fresh embeddings, augmentations, model predictions) without rewriting the data (docs).
## Source & license

Converted from ylecun/mnist. MNIST is released under the MIT license. The original dataset is by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges.
## Citation

```bibtex
@article{lecun1998mnist,
  title={The MNIST database of handwritten digits},
  author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
  url={http://yann.lecun.com/exdb/mnist/},
  year={1998}
}
```