Lance-formatted version of the COCO Captions 2017 corpus, redistributed via lmms-lab/COCO-Caption2017. Each row is one image with 5–7 human-written captions, a CLIP image embedding, and a CLIP text embedding of the canonical caption, all stored inline.
Splits
| Split | Rows |
|---|---|
| val.lance | 5,000 (canonical COCO 2017 val set) |
| test.lance | 40,700 |
The 2017 train split (118k images, ~18 GB of source JPEGs) is intentionally not bundled here because the lmms-lab/COCO-Caption2017 redistribution does not include it. To extend with train, run coco_captions_2017/dataprep.py against your local COCO 2017 train mirror.
Schema
| Column | Type | Notes |
|---|---|---|
| id | int64 | Row index within split |
| image | large_binary | Inline JPEG bytes |
| image_id | string | COCO image id |
| filename | string | Original filename (e.g. 000000179765.jpg) |
| captions | list&lt;string&gt; | All 5–7 captions |
| caption | string | First caption; used as canonical text for FTS |
| image_emb | fixed_size_list&lt;float32, 512&gt; | CLIP image embedding (cosine-normalized) |
| text_emb | fixed_size_list&lt;float32, 512&gt; | CLIP text embedding of the canonical caption |
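
To make the layout concrete, here is a minimal sketch that reads one row and decodes the inline JPEG (assumes Pillow is installed):

```python
import io

import lance
from PIL import Image

ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance")
row = ds.take([0]).to_pylist()[0]  # fetch the first row by index

img = Image.open(io.BytesIO(row["image"]))  # inline JPEG bytes -> PIL image
print(row["image_id"], row["filename"], img.size)
print(len(row["captions"]), "captions; canonical:", row["caption"])
print(len(row["image_emb"]), len(row["text_emb"]))  # 512 / 512
```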
Pre-built indices
- IVF_PQ on image_emb and text_emb (metric: cosine)
- INVERTED (full-text) on caption
- BTREE on image_id
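
The BTREE index serves scalar filters on image_id. A minimal sketch of a point lookup (the id value below is a hypothetical placeholder):

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance")
hits = ds.scanner(
    filter="image_id = '179765'",  # hypothetical id; lookup is served by the BTREE index
    columns=["image_id", "filename", "caption"],
).to_table().to_pylist()
```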
Quick start
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance")
print(ds.count_rows(), ds.schema.names)  # row count and column names
print(ds.list_indices())                 # pre-built vector/FTS/scalar indices
```
Load with LanceDB
These tables can also be opened with LanceDB, the embedded search library and multimodal lakehouse built on top of Lance, which provides a higher-level API for vector search and other queries.
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data")
tbl = db.open_table("val")
print(f"LanceDB table opened with {len(tbl)} image-caption pairs")
```
Tip: for production use, download the dataset locally first.

```bash
hf download lance-format/coco-captions-2017-lance --repo-type dataset --local-dir ./coco-captions-2017-lance
```
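
Then open the local copy by path instead of the hf:// URI:

```python
import lance

# Same dataset, read from the local download instead of the Hub.
ds = lance.dataset("./coco-captions-2017-lance/data/val.lance")
```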
Vector search examples
Cross-modal text→image:
```python
import lance
import open_clip
import pyarrow as pa
import torch

# Encode the text query with the same CLIP model used to build image_emb.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.eval().cuda().half()

with torch.no_grad():
    q = model.encode_text(tokenizer(["a giraffe eating leaves"]).cuda())
    q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0]  # cosine-normalize

ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance")
emb_field = ds.schema.field("image_emb")
hits = ds.scanner(
    nearest={"column": "image_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 10},
    columns=["image_id", "caption"],
).to_table().to_pylist()
```
LanceDB cross-modal text→image search:
```python
import lancedb
import open_clip
import torch

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.eval().cuda().half()

with torch.no_grad():
    q = model.encode_text(tokenizer(["a giraffe eating leaves"]).cuda())
    q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0]

db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data")
tbl = db.open_table("val")
results = (
    tbl.search(q.tolist(), vector_column_name="image_emb")
    .metric("cosine")
    .select(["image_id", "caption"])
    .limit(10)
    .to_list()
)
```
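
Because the embeddings are stored in the table, they can also be reused as queries, e.g. for image→image similarity. A hedged sketch reusing the first row's stored embedding as the query (tbl as opened above; the seed row is an arbitrary choice):

```python
# Image→image: use a stored CLIP image embedding as the query vector.
seed = tbl.head(1).to_pylist()[0]["image_emb"]  # first row as the seed image
similar = (
    tbl.search(seed, vector_column_name="image_emb")
    .metric("cosine")
    .select(["image_id", "caption"])
    .limit(10)
    .to_list()
)
```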
Full-text search:
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/coco-captions-2017-lance/data/val.lance")
hits = ds.scanner(
    full_text_query="surfer riding a wave",  # served by the INVERTED index on caption
    columns=["image_id", "caption"],
    limit=10,
).to_table().to_pylist()
```
LanceDB full-text search:
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/coco-captions-2017-lance/data")
tbl = db.open_table("val")
results = (
    tbl.search("surfer riding a wave", query_type="fts")  # explicit full-text search
    .select(["image_id", "caption"])
    .limit(10)
    .to_list()
)
```
Why Lance?
- One dataset carries images + image embeddings + text embeddings + indices, with no sidecar files.
- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub.
- Schema evolution: add columns (new captions, alternate embeddings, model predictions) without rewriting the data; see the sketch below.
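
As a concrete illustration of the last point, a minimal sketch of attaching one new column keyed by id using pylance's merge; the pred_caption column and its values are hypothetical placeholders:

```python
import lance
import pyarrow as pa

ds = lance.dataset("./coco-captions-2017-lance/data/val.lance")

# Hypothetical sidecar table: one value per existing row, joined on id.
new_col = pa.table({
    "id": pa.array(range(ds.count_rows()), type=pa.int64()),
    "pred_caption": ["placeholder caption"] * ds.count_rows(),
})
ds.merge(new_col, left_on="id")  # adds the column without rewriting image bytes
```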
Source & license
Converted from lmms-lab/COCO-Caption2017. Original COCO 2017 annotations are released under CC BY 4.0; the underlying images are subject to Flickr terms of service. Please review the COCO Terms of Use before redistribution.
Citation
```bibtex
@inproceedings{lin2014microsoft,
  title={Microsoft COCO: Common objects in context},
  author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2014},
}
```