Source dataset card and downloadable files for lance-format/fashion-mnist-lance.
A Lance-formatted version of Fashion-MNIST with 70,000 28×28 grayscale clothing images stored inline alongside CLIP embeddings and a pre-built IVF_PQ ANN index.

Key features

  • All multimodal data (image bytes + embeddings) stored inline in the same Lance dataset.
  • Pre-computed CLIP embeddings (OpenCLIP ViT-B-32 / laion2b_s34b_b79k, 512-dim, L2-normalized) with an IVF_PQ index.
  • BTREE on label and BITMAP on label_name for fast filtered scans.
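
Because the embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The following is a minimal sketch in pure Python, using toy 3-dimensional vectors in place of the real 512-dim embeddings. The helper names are illustrative and are not part of Lance or LanceDB.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity for arbitrary (not necessarily normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two image embeddings (assumption: real rows are 512-dim).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# For unit vectors the two quantities agree up to floating-point error.
assert math.isclose(dot(a, b), cosine_similarity(a, b), rel_tol=1e-9)
```

The practical consequence is that ranking by dot product and ranking by cosine similarity give the same order on this column.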

Splits

| Split | Rows |
| --- | --- |
| train | 60,000 |
| test | 10,000 |

Schema

| Column | Type | Notes |
| --- | --- | --- |
| id | int64 | Row index within the split |
| image | large_binary | Inline PNG bytes (28×28 grayscale) |
| label | int32 | Class id (0-9) |
| label_name | string | One of T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle_boot |
| image_emb | fixed_size_list<float32, 512> | CLIP image embedding (L2-normalized) |
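
Since the image column holds raw PNG bytes, a row can be sanity-checked without an imaging library by reading the PNG header. The sketch below uses only the standard library. It builds a synthetic 28×28 grayscale PNG as a stand-in for a real `image` value, and the 8-bit depth is an assumption, since the schema only states 28×28 grayscale. The function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_header(data):
    """Return (width, height, bit_depth, color_type) from a PNG's IHDR chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the 8-byte signature: 4-byte chunk length, 4-byte chunk type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("malformed PNG: IHDR chunk expected first")
    return struct.unpack(">IIBB", data[16:26])


def _chunk(chunk_type, payload):
    """Serialize one PNG chunk: length, type, payload, CRC."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


def make_gray_png(width, height):
    """Build an all-black 8-bit grayscale PNG (stand-in for a dataset row)."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0 = none) followed by one byte per pixel.
    raw = (b"\x00" + b"\x00" * width) * height
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


sample = make_gray_png(28, 28)
print(png_header(sample))  # (28, 28, 8, 0); color type 0 means grayscale
```

With a real row, `sample` would be replaced by the bytes read from the `image` column.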

Pre-built indices

  • IVF_PQ on image_emb (metric=cosine)
  • BTREE on label
  • BITMAP on label_name

Load with Lance

import lance

ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

The same tables can be opened with LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, which simplifies vector search and other queries.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data")
tbl = db.open_table("train")
print(f"LanceDB table opened with {len(tbl)} images")

Load with datasets.load_dataset

import datasets

hf_ds = datasets.load_dataset("lance-format/fashion-mnist-lance", split="train", streaming=True)
for row in hf_ds.take(3):
    print(row["label_name"])

Tip: for production use, download the dataset locally first to avoid Hub rate limits:

hf download lance-format/fashion-mnist-lance --repo-type dataset --local-dir ./fashion-mnist-lance

Vector search example

import lance
import pyarrow as pa

ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance")
emb_field = ds.schema.field("image_emb")
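# Use the embedding of row 0 as the query vector.
ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"]
query = pa.array([ref], type=emb_field.type)

# ANN search over the IVF_PQ index: probe 16 partitions, then re-rank
# k * refine_factor candidates using the full-precision vectors.
neighbors = ds.scanner(
    nearest={"column": "image_emb", "q": query[0], "k": 5, "nprobes": 16, "refine_factor": 30},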
    columns=["id", "label_name"],
).to_table().to_pylist()

Vector search with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data")
tbl = db.open_table("train")

ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0]
query_embedding = ref["image_emb"]

results = (
    tbl.search(query_embedding)
    .metric("cosine")
    .select(["id", "label_name"])
    .limit(5)
    .to_list()
)
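
IVF_PQ search is approximate, so it can be worth comparing its results against an exact (brute-force) search on a small sample of queries. The sketch below computes recall@k from two lists of row ids. The function name and the sample ids are hypothetical and are not real query results from this dataset.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical ids, for illustration only.
approx = [0, 17, 42, 99, 512]   # ids returned by the indexed search
exact = [0, 17, 42, 99, 640]    # ids returned by an exact search
print(recall_at_k(approx, exact))  # 0.8
```

If recall is lower than needed, raising `nprobes` or `refine_factor` in the Lance query is the usual lever, at the cost of slower queries.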

Filter by class

import lance

ds = lance.dataset("hf://datasets/lance-format/fashion-mnist-lance/data/train.lance")
sneakers = ds.scanner(filter="label_name = 'Sneaker'", columns=["id"], limit=5).to_table()

Filter by class with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/fashion-mnist-lance/data")
tbl = db.open_table("train")
sneakers = tbl.search().where("label_name = 'Sneaker'").select(["id"]).limit(5).to_list()
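
The same class can be selected through either indexed column: `label_name` (BITMAP) or the integer `label` (BTREE). The sketch below builds both filter strings. It assumes that class ids follow the order in which the schema lists the `label_name` values, which matches the original Fashion-MNIST ids, so it is worth verifying against a few rows. The helper names are hypothetical.

```python
# Assumed id-to-name mapping: the order in which the schema lists label_name values.
LABEL_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle_boot",
]


def label_filter(name):
    """Build a filter on the integer label column for a given class name."""
    return f"label = {LABEL_NAMES.index(name)}"


def label_name_filter(name):
    """Build a filter on label_name, doubling single quotes as SQL strings require."""
    escaped = name.replace("'", "''")
    return f"label_name = '{escaped}'"


print(label_filter("Sneaker"))       # label = 7
print(label_name_filter("Sneaker"))  # label_name = 'Sneaker'
```

Either string can be passed as `filter=` to `ds.scanner(...)` or to `.where(...)` in LanceDB, as in the examples above.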

Why Lance?

  • One dataset holds images, embeddings, indices, and metadata, with no sidecar files.
  • On-disk vector and FTS indices live next to the data, so search works both on local copies and directly against the Hub.
  • Schema evolution: add new columns (model predictions, fresh embeddings, augmentations) without rewriting the data.

Source & license

Converted from zalando-datasets/fashion_mnist. Released under the MIT license.

Citation

@online{xiao2017fashionmnist,
  title={Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
  author={Xiao, Han and Rasul, Kashif and Vollgraf, Roland},
  year={2017},
  eprint={1708.07747},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}