
Dataset card for lance-format/cifar10-lance.
A Lance-formatted version of CIFAR-10 with 60,000 32×32 RGB images across 10 classes, stored inline with CLIP embeddings and a pre-built IVF_PQ ANN index.

Key features

  • All multimodal data (image bytes + embeddings) stored inline in the same Lance dataset.
  • Pre-computed CLIP embeddings (OpenCLIP ViT-B-32 / laion2b_s34b_b79k, 512-dim, L2-normalized) with an IVF_PQ index.
  • BTREE on label and BITMAP on label_name for fast filtered scans.
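Because the stored embeddings are L2-normalized and the index uses the cosine metric, query vectors should be normalized the same way before searching. A minimal numpy sketch of that convention (random stand-in vectors, not real OpenCLIP embeddings):

```python
import numpy as np

# Stand-in 512-dim embeddings (random; real ones come from OpenCLIP ViT-B-32).
rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 512)).astype(np.float32)

# L2-normalize, matching the dataset's embedding convention.
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

# For unit vectors, cosine distance reduces to 1 - dot product.
cos_dist = 1.0 - float(a @ b)
print(float(np.linalg.norm(a)), cos_dist)
```

For unit-length vectors, cosine distance and (squared) L2 distance rank neighbors identically, which is why normalizing once at write time is a common design choice.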

Splits

Split    Rows
train    50,000
test     10,000

Schema

Column      Type                             Notes
id          int64                            Row index within the split
image       large_binary                     Inline PNG bytes (32×32 RGB)
label       int32                            Class id (0-9)
label_name  string                           One of airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
image_emb   fixed_size_list<float32, 512>    CLIP image embedding (cosine-normalized)

Pre-built indices

  • IVF_PQ on image_emb (metric=cosine)
  • BTREE on label
  • BITMAP on label_name

Load with datasets.load_dataset

import datasets

hf_ds = datasets.load_dataset("lance-format/cifar10-lance", split="train", streaming=True)
for row in hf_ds.take(3):
    print(row["label_name"])
Load with Lance

import lance

ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data")
tbl = db.open_table("train")
print(len(tbl))
Tip: for production use, download the dataset locally first:

hf download lance-format/cifar10-lance --repo-type dataset --local-dir ./cifar10-lance

Vector search example

import lance

ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance")

# Use the first row's embedding as the query vector (a plain list of floats).
ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"]

neighbors = ds.scanner(
    nearest={"column": "image_emb", "q": ref, "k": 5, "nprobes": 16, "refine_factor": 30},
    columns=["id", "label_name"],
).to_table().to_pylist()
print(neighbors)
Vector search with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data")
tbl = db.open_table("train")

ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0]
query_embedding = ref["image_emb"]

results = (
    tbl.search(query_embedding)
    .metric("cosine")
    .select(["id", "label_name"])
    .limit(5)
    .to_list()
)
for row in results:
    print(row["id"], row["label_name"])

Filter by class

import lance
ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance")
ships = ds.scanner(filter="label_name = 'ship'", columns=["id"], limit=5).to_table()

Filter by class with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/cifar10-lance/data")
tbl = db.open_table("train")
ships = (
    tbl.search()
    .where("label_name = 'ship'")
    .select(["id"])
    .limit(5)
    .to_list()
)

Working with images

from pathlib import Path
import lance

ds = lance.dataset("hf://datasets/lance-format/cifar10-lance/data/train.lance")
row = ds.take([0], columns=["image", "label_name"]).to_pylist()[0]
Path(f"sample_{row['label_name']}.png").write_bytes(row["image"])
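Since the image column holds raw PNG bytes, any image library can decode them in memory instead of writing to disk. A sketch with Pillow, using a locally generated 32×32 PNG as a stand-in for row["image"] so the snippet runs without network access:

```python
from io import BytesIO

from PIL import Image

# Stand-in for row["image"]: encode a blank 32x32 RGB image to PNG bytes.
buf = BytesIO()
Image.new("RGB", (32, 32)).save(buf, format="PNG")
png_bytes = buf.getvalue()

# Decode the inline PNG bytes back into a PIL image.
img = Image.open(BytesIO(png_bytes))
print(img.size, img.mode)  # (32, 32) RGB
```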

Why Lance?

  • One dataset for images + embeddings + indices + metadata — no sidecar files.
  • On-disk vector and FTS indices live next to the data, so search works on both local copies and the Hub.
  • Schema evolution: add new columns (model predictions, fresh embeddings, augmentations) without rewriting the data.

Source & license

Converted from uoft-cs/cifar10. CIFAR-10 was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the University of Toronto.

Citation

@techreport{krizhevsky2009cifar10,
  title={Learning multiple layers of features from tiny images},
  author={Krizhevsky, Alex and Hinton, Geoffrey},
  year={2009},
  institution={University of Toronto}
}