lance-format/gqa-testdev-balanced-lance
Lance-formatted version of the canonical GQA testdev_balanced slice: 12,578 compositional VQA questions joined with their 398 source images, converted from lmms-lab/GQA. Upstream, lmms-lab/GQA exposes the instructions and the images as separate parquet configs; this Lance dataset joins them on imageId, so each row carries the question, the answer, the GQA reasoning-program tags, and the inline image bytes.
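
Because the JPEG bytes live inline, a single row decodes with no side files. A minimal sketch, assuming Pillow is installed:

import io

import lance
from PIL import Image

ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance")

# Take one row; the image bytes are stored inline, so no separate image files are needed.
row = ds.take([0], columns=["question", "answer", "image"]).to_pylist()[0]
img = Image.open(io.BytesIO(row["image"]))
print(row["question"], "->", row["answer"], img.size)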

Splits

| Split | Rows | Distinct images |
| --- | --- | --- |
| testdev.lance | 12,578 | 398 |
The train split (train_balanced_instructions × train_balanced_images, ~943k questions × ~72k images, ~10 GB of images) and the val split are not bundled by default; pass --instr-config/--images-config to gqa/dataprep.py to build them.

Schema

| Column | Type | Notes |
| --- | --- | --- |
| id | int64 | Row index |
| image | large_binary | Inline JPEG bytes (the image is duplicated across rows that share an image_id) |
| image_id | string | GQA scene-graph image id |
| question_id | string | GQA question id |
| question | string | Compositional natural-language question |
| answers | list<string> | One-element list (the GQA short answer) |
| answer | string | Same short answer (canonical / FTS target) |
| full_answer | string? | Full-sentence answer |
| structural | string? | One of verify, query, compare, choose, logical |
| semantic | string? | One of attr, cat, global, obj, rel |
| detailed | string? | Fine-grained question type (e.g. weatherVerifyC) |
| is_balanced | bool | GQA balanced-subset flag |
| group_global / group_local | string? | GQA reasoning-group ids |
| semantic_str | string? | Compact description of the reasoning program |
| image_emb | fixed_size_list<float32, 512> | CLIP image embedding (cosine-normalized) |
| question_emb | fixed_size_list<float32, 512> | CLIP text embedding of the question |
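
Both embedding columns are 512-d CLIP vectors. Since they are cosine-normalized, a dot product gives cosine similarity directly; the sketch below assumes the image and question embeddings come from the same CLIP model, as the shared dimensionality suggests:

import lance
import numpy as np

ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance")

row = ds.take([0], columns=["image_emb", "question_emb"]).to_pylist()[0]
img_vec = np.asarray(row["image_emb"], dtype=np.float32)
txt_vec = np.asarray(row["question_emb"], dtype=np.float32)

# Cosine-normalized vectors: the dot product equals the cosine similarity.
print(img_vec.shape, float(img_vec @ txt_vec))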

Pre-built indices

  • IVF_PQ on image_emb and question_emb (metric=cosine)
  • INVERTED (FTS) on question and answer
  • BITMAP on structural, semantic, detailed
  • BTREE on image_id, question_id
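
These indices are usable straight from the Lance API. A minimal sketch running a BTREE-backed equality filter and an IVF_PQ-backed nearest-neighbor query (the image_id value is hypothetical; the query vector is borrowed from row 0):

import lance

ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance")

# BTREE-backed point lookup (the image_id value here is illustrative only).
hits = ds.to_table(filter="image_id = 'n12345'", columns=["question", "answer"])

# IVF_PQ-backed ANN search on question_emb, reusing a stored vector as the query.
q = ds.take([0], columns=["question_emb"]).to_pylist()[0]["question_emb"]
nn = ds.to_table(nearest={"column": "question_emb", "q": q, "k": 5},
                 columns=["question", "answer"])
print(hits.num_rows, nn.num_rows)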

Quick start

import lance

# Open the Lance dataset straight from the Hugging Face Hub.
ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
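
Column projections scan straight into Arrow, and from there into pandas, so the text fields can be pulled without touching the image bytes:

import lance

ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance")
df = ds.to_table(columns=["question", "answer"], limit=5).to_pandas()
print(df)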

Load with LanceDB

These tables can also be opened with LanceDB, the embedded search library and multimodal lakehouse built on top of Lance, which simplifies vector search, full-text search, and filtered queries.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data")
tbl = db.open_table("testdev")
print(f"LanceDB table opened with {len(tbl)} image-question pairs")

Vector search

import lancedb

db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data")
tbl = db.open_table("testdev")

# Reuse a stored question embedding as the query vector.
ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0]
query_embedding = ref["question_emb"]

results = (
    tbl.search(query_embedding, vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "answer"])
    .limit(5)
    .to_list()
)

Full-text search

import lancedb

db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data")
tbl = db.open_table("testdev")

results = (
    tbl.search("color of the car")
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)

Filter by reasoning type

import lance
ds = lance.dataset("hf://datasets/lance-format/gqa-testdev-balanced-lance/data/testdev.lance")
# The BITMAP index on structural accelerates this equality filter.
verify_qs = ds.scanner(filter="structural = 'verify'", columns=["question", "answer"], limit=5).to_table()

Filter with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data")
tbl = db.open_table("testdev")
verify_qs = (
    tbl.search()
    .where("structural = 'verify'")
    .select(["question", "answer"])
    .limit(5)
    .to_list()
)
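
Filters also compose with vector search. The sketch below restricts an ANN query over question_emb to rows matching the BITMAP-indexed semantic column; the prefilter flag asks LanceDB to filter before the ANN stage (default behavior varies by version):

import lancedb

db = lancedb.connect("hf://datasets/lance-format/gqa-testdev-balanced-lance/data")
tbl = db.open_table("testdev")

# Borrow a stored question embedding as the query vector.
q = tbl.search().limit(1).select(["question_emb"]).to_list()[0]["question_emb"]

results = (
    tbl.search(q, vector_column_name="question_emb")
    .metric("cosine")
    .where("semantic = 'rel'", prefilter=True)  # BITMAP-indexed column
    .select(["question", "answer", "semantic"])
    .limit(5)
    .to_list()
)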

Why Lance?

  • One dataset for the joined image + question + answer + reasoning-program metadata + dual embeddings + indices; no instructions/images parquet split to keep in sync.
  • Schema evolution: add columns (alternate scene graphs, model predictions) without rewriting the data, as sketched below.
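
As a concrete sketch of that second point, pylance's add_columns can backfill a new column from a SQL expression without rewriting existing column data (the answer_len column is hypothetical, and writes need a local copy rather than the read-only hf:// path):

import lance

# Schema evolution needs write access: open a writable local copy.
ds = lance.dataset("./testdev.lance")

# Backfill a hypothetical derived column from existing data;
# existing column files are left untouched.
ds.add_columns({"answer_len": "length(answer)"})
print(ds.schema.names)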

Source & license

Converted from lmms-lab/GQA. GQA is released under CC BY 4.0 by Hudson and Manning (Stanford NLP).

Citation

@inproceedings{hudson2019gqa,
  title={GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering},
  author={Hudson, Drew A. and Manning, Christopher D.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}