View on Hugging Face
Source dataset card and downloadable files for lance-format/laion-1m.
A Lance dataset of the LAION image-text corpus (~1M rows) with inline JPEG bytes, CLIP embeddings (img_emb), and full metadata, available directly from the Hub at hf://datasets/lance-format/laion-1m/data/train.lance.
Key Features
- Images stored inline – the image column is binary data, so sampling/exporting images never leaves Lance.
- Prebuilt ANN index – img_emb ships with IVF_PQ for instant similarity search.
- Metadata rich – captions, URLs, NSFW flags, EXIF, dimensions, similarity scores, etc.
- Lance<>HF integration – access via datasets or connect with Lance for ANN search, image export, and any operation that needs the vector index or binary blobs.
Load with datasets.load_dataset
import datasets
hf_ds = datasets.load_dataset(
    "lance-format/laion-1m",
    split="train",
    streaming=True,
)
# Take first three rows and print captions
for row in hf_ds.take(3):
    print(row["caption"])
Load with Lance
Use Lance for ANN search, image export, and any operation that needs the vector index or binary blobs:
import lance
ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance")
print(ds.count_rows())
These tables can also be consumed by LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.
import lancedb
db = lancedb.connect("hf://datasets/lance-format/laion-1m/data")
tbl = db.open_table("train")
print(f"LanceDB table opened with {len(tbl)} image-text pairs")
⚠️ HuggingFace Streaming Note
- For heavy usage, download the dataset locally (huggingface-cli download lance-format/laion-1m --repo-type dataset --local-dir ./laion), then point Lance at the local copy (./laion/data/train.lance) to use the IVF_PQ index.
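As a minimal sketch of that workflow, the helper below prefers a local copy when one exists and otherwise falls back to the Hub URI. The helper name resolve_dataset_uri is hypothetical, and the data/train.lance sub-path is an assumption that the download mirrors the Hub layout shown earlier.

```python
from pathlib import Path

HUB_URI = "hf://datasets/lance-format/laion-1m/data/train.lance"


def resolve_dataset_uri(local_dir: str = "./laion") -> str:
    """Return the local dataset path if it has been downloaded, else the Hub URI.

    Assumes the download preserves the repository layout, so the Lance
    dataset lives at <local_dir>/data/train.lance (a directory on disk).
    """
    local_path = Path(local_dir) / "data" / "train.lance"
    return str(local_path) if local_path.is_dir() else HUB_URI
```

The returned string can then be passed to lance.dataset(...), so the same script works before and after the download.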
Why Lance?
- Optimized for AI workloads: Lance keeps multimodal data and vector search-ready storage in the same columnar format designed for accelerator-era retrieval (see lance.org).
- Images + embeddings + metadata travel as one tabular dataset.
- On-disk, scalable ANN index.
- Schema evolution lets you add new features/columns (moderation tags, embeddings, etc.) without rewriting the raw data.
Quick Start (Lance)
Inspecting Existing Indices
This dataset ships with a built-in vector (IVF_PQ) index on the image embeddings. You can inspect the prebuilt indices on the dataset:
import lance
dataset = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance")
# List all indices
indices = dataset.list_indices()
print(indices)
While this dataset comes with pre-built indices, you can also create your own custom indices if needed. For example:
# ds is a local Lance dataset
ds.create_index(
    "img_emb",
    index_type="IVF_PQ",
    num_partitions=256,
    num_sub_vectors=96,
    replace=True,
)
# ds is a local Lance dataset
ds.create_fts_index("caption")
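To make the index parameters more concrete, here is a back-of-the-envelope sketch of what num_sub_vectors=96 implies for a 768-dimensional embedding. It assumes float32 storage and 8-bit PQ codes (one byte per sub-vector); neither is stated in this card, and the function name pq_footprint is illustrative.

```python
def pq_footprint(dim: int, num_sub_vectors: int, bits_per_code: int = 8) -> dict:
    """Estimate per-vector sizes under product quantization (PQ)."""
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    raw_bytes = dim * 4  # assumption: float32, 4 bytes per dimension
    code_bytes = num_sub_vectors * bits_per_code // 8
    return {
        "dims_per_sub_vector": dim // num_sub_vectors,
        "raw_bytes": raw_bytes,
        "pq_code_bytes": code_bytes,
        "compression_ratio": raw_bytes / code_bytes,
    }


stats = pq_footprint(dim=768, num_sub_vectors=96)
print(stats)
```

Under these assumptions each vector is split into 96 sub-vectors of 8 dimensions, and the compressed code is far smaller than the raw vector, which is why the index can stay on disk and still be scanned quickly.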
Vector Search (Lance)
import lance
import pyarrow as pa
lance_ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance")
# Vector search via img_emb IVF_PQ index
emb_field = lance_ds.schema.field("img_emb")
# Placeholder query: the 768 values are wrapped in an outer list so that pyarrow
# builds a single fixed-size-list row; replace them with a real CLIP embedding.
query = pa.array([list(range(768))], type=emb_field.type)
neighbors = lance_ds.scanner(
    nearest={
        "column": emb_field.name,
        "q": query[0],
        "k": 6,
        "nprobes": 16,
        "refine_factor": 30,
    },
    columns=["caption", "url", "similarity"],
).to_table().to_pylist()
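The nearest parameters trade recall for latency. As a rough sketch, assuming the usual IVF_PQ behaviour in which nprobes partitions are scanned and k * refine_factor candidates are re-ranked against the full vectors, the arithmetic for the query above looks like this. The 256-partition figure is borrowed from the create_index example and may not match the prebuilt index; search_budget is a hypothetical helper.

```python
def search_budget(k: int, nprobes: int, refine_factor: int, num_partitions: int) -> dict:
    """Summarise how much work an IVF_PQ query is asked to do."""
    return {
        # Fraction of IVF partitions scanned for candidates.
        "partition_fraction": nprobes / num_partitions,
        # Candidates pulled from the compressed index before re-ranking.
        "candidates_reranked": k * refine_factor,
        # Rows finally returned to the caller.
        "results_returned": k,
    }


budget = search_budget(k=6, nprobes=16, refine_factor=30, num_partitions=256)
print(budget)
```

Raising nprobes or refine_factor generally improves recall at the cost of more I/O and compute, so these are the first knobs to adjust if results look poor.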
Storing & Retrieving Multimodal Data
from pathlib import Path
rows = lance_ds.take([0, 1], columns=["image", "caption"]).to_pylist()
Path("samples").mkdir(exist_ok=True)  # create the output directory once
for idx, row in enumerate(rows):
    with open(f"samples/{idx}.jpg", "wb") as f:
        f.write(row["image"])
Images are stored inline as binary columns (regular Lance binary, not the special blob handle used in OpenVid). They behave like any other column: scan captions without touching image, then call take() when you want the bytes.
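Because the image column holds raw bytes, a cheap sanity check before writing files can be worthwhile. The following is a minimal sketch, assuming the bytes are JPEG as this card describes; save_jpeg is a hypothetical helper and not part of Lance.

```python
from pathlib import Path

# JPEG files start with the SOI marker (FF D8) followed by another marker byte (FF).
JPEG_MAGIC = b"\xff\xd8\xff"


def save_jpeg(data: bytes, out_path: str) -> bool:
    """Write data to out_path only if it starts with the JPEG signature."""
    if not data.startswith(JPEG_MAGIC):
        return False
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

In the export loop, f.write(row["image"]) could be swapped for save_jpeg(row["image"], f"samples/{idx}.jpg") to skip rows whose bytes are not valid JPEG data.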
Dataset Schema
Core fields:
image_path, image
caption, url
NSFW (uppercase), similarity, LICENSE, key, status, error_message
width, height, original_width, original_height
exif, md5
img_emb
Usage Examples
1. Browse metadata
scanner = ds.scanner(columns=["caption", "url", "similarity"], limit=5)
for row in scanner.to_table().to_pylist():
    print(row)
2. Export images
rows = ds.take(list(range(3)), columns=["image", "caption"]).to_pylist()
for i, row in enumerate(rows):
    with open(f"sample_{i}.jpg", "wb") as f:
        f.write(row["image"])
3. Vector similarity search
import pyarrow as pa
emb_field = ds.schema.field("img_emb")
# Use the embedding of an existing row (index 123) as the query vector
ref = ds.take([123], columns=["img_emb"]).to_pylist()[0]
query = pa.array([ref["img_emb"]], type=emb_field.type)
neighbors = ds.scanner(
    nearest={
        "column": emb_field.name,
        "q": query[0],
        "k": 6,
        "nprobes": 16,
        "refine_factor": 30,
    },
    columns=["caption", "url", "similarity"],
).to_table().to_pylist()
LanceDB Vector Similarity Search
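import lancedb
db = lancedb.connect("hf://datasets/lance-format/laion-1m/data")
tbl = db.open_table("train")
# Placeholder query; replace with a real 768-dimensional CLIP embedding
query_embedding = list(range(768))
results = tbl.search(query_embedding) \
    .limit(5) \
    .to_list()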
LanceDB Full-Text Search
import lancedb
db = lancedb.connect("hf://datasets/lance-format/laion-1m/data")
tbl = db.open_table("train")
results = tbl.search("dog running") \
    .select(["caption", "url", "similarity"]) \
    .limit(10) \
    .to_list()
Dataset Evolution
Lance supports flexible schema and data evolution (docs). You can add/drop columns, backfill with SQL or Python, rename fields, or change data types without rewriting the whole dataset. In practice this lets you:
- Introduce fresh metadata (moderation labels, embeddings, quality scores) as new signals become available.
- Add new columns to existing datasets without re-exporting the image data.
- Adjust column names or shrink storage (e.g., cast embeddings to float16) while keeping previous snapshots queryable for reproducibility.
import lance
import pyarrow as pa
import numpy as np
# Assumes a local, writable copy of the dataset (the path is an example)
dataset = lance.dataset("./laion_1m_local")
# 1. Add a schema-only column (data to be added later)
dataset.add_columns(pa.field("quality_bucket", pa.string()))
# 2. Add a column with data backfill using a SQL expression
# (assumes NSFW holds a numeric score; adapt the predicate if it is a string label)
dataset.add_columns(
    {
        "moderation_label": "CASE WHEN \"NSFW\" > 0.5 THEN 'review' ELSE 'ok' END"
    }
)
# 3. Generate rich columns via Python batch UDFs
@lance.batch_udf()
def random_embedding(batch):
    arr = np.random.rand(batch.num_rows, 128).astype("float32")
    return pa.RecordBatch.from_arrays(
        [pa.FixedSizeListArray.from_arrays(pa.array(arr.ravel()), 128)],
        names=["embedding"],
    )
dataset.add_columns(random_embedding)
# 4. Bring in offline annotations with merge
# (the join column must exist in both tables; "id" is illustrative)
labels = pa.table({
    "id": pa.array([1, 2, 3]),
    "label": pa.array(["horse", "rabbit", "cat"]),
})
dataset.merge(labels, "id")
# 5. Rename or cast columns as needs change
dataset.alter_columns({"path": "quality_bucket", "name": "quality_tier"})
dataset.alter_columns({"path": "embedding", "data_type": pa.list_(pa.float16(), 128)})
These operations are automatically versioned, so prior experiments can still point to earlier versions while the dataset keeps evolving.
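The float16 cast mentioned earlier trades precision for storage. The sketch below estimates that trade-off with the standard library only; the row count (~1M) and the 768-dimensional size are taken from this card's description and query examples, float32 is assumed as the original type, and both function names are illustrative.

```python
import struct


def float16_roundtrip(value: float) -> float:
    """Encode a value as IEEE 754 half precision and decode it again."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def embedding_column_bytes(rows: int, dim: int, bytes_per_value: int) -> int:
    """Raw size of a fixed-size embedding column, ignoring file overhead."""
    return rows * dim * bytes_per_value


rows, dim = 1_000_000, 768  # approximate row count; dim assumed from the query examples
print(embedding_column_bytes(rows, dim, 4))  # float32 footprint in bytes
print(embedding_column_bytes(rows, dim, 2))  # float16 footprint in bytes
print(abs(float16_roundtrip(0.1) - 0.1))     # rounding error introduced by the cast
```

Halving the column size costs a small rounding error per value, so it is worth checking retrieval quality on a held-out query set before committing to the cast.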
Citation
@article{schuhmann2022laion5b,
  title={LAION-5B: An open large-scale dataset for training next generation image-text models},
  author={Schuhmann, Christoph and others},
  journal={NeurIPS Datasets and Benchmarks Track},
  year={2022}
}
License
Content inherits LAION’s original licensing and safety guidelines. Review LAION policy before downstream use.