ChartQA - LanceDB

View on Hugging Face

Source dataset card and downloadable files for lance-format/chartqa-lance.

A Lance-formatted version of ChartQA, a benchmark for question answering over scientific and business charts that demands a mix of logical and visual reasoning, redistributed via lmms-lab/ChartQA. Each row carries the chart image as inline JPEG bytes, the natural-language question and reference answer(s), a question-type tag (human vs augmented), and paired CLIP embeddings for the image and the question. Everything is available directly from the Hub at hf://datasets/lance-format/chartqa-lance/data.

Key features

Inline chart image bytes in the image column — no sidecar files, no image folders.
Paired CLIP embeddings in the same row — image_emb and question_emb (ViT-B/32, 512-dim, cosine-normalized) — so visual and textual retrieval are one indexed lookup.
All reference answers preserved in answers alongside a canonical answer string used for full-text search.
Pre-built ANN, FTS, and scalar indices covering both embedding columns, the question and answer strings, and the type tag.

Splits

Split	Rows	Notes
`test.lance`	2,500	Public test slice from `lmms-lab/ChartQA`

The lmms-lab/ChartQA redistribution exposes the test split only. Train and validation live in the original ChartQA release; extend chartqa/dataprep.py with additional sources to add them.

Schema

Column	Type	Notes
`id`	`int64`	Row index within split (natural join key)
`image`	`large_binary`	Inline JPEG bytes
`image_id`	`string?`	Source does not assign explicit ids — null
`question_id`	`string?`	Source does not assign explicit ids — null
`question`	`string`	Natural-language question
`answers`	`list<string>`	Reference answer(s), typically a single string
`answer`	`string`	First reference answer — canonical, used for FTS
`type`	`string?`	Question type (`human` vs `augmented`)
`image_emb`	`fixed_size_list<float32, 512>`	CLIP image embedding (cosine-normalized)
`question_emb`	`fixed_size_list<float32, 512>`	CLIP text embedding of the question

Pre-built indices

IVF_PQ on image_emb — image-side vector search (cosine)
IVF_PQ on question_emb — text-side vector search (cosine)
INVERTED (FTS) on question and answer — keyword and hybrid search
BITMAP on type — fast filtering by question type

Why Lance?

Blazing Fast Random Access: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
Native Multimodal Support: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
Native Index Support: Lance comes with fast, on-disk, scalable vector and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
Efficient Data Evolution: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
Versatile Querying: Supports combining vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
Data Versioning: Every mutation commits a new version; previous versions remain intact on disk. Tags pin a snapshot by name, so retrieval systems and training runs can reproduce against an exact slice of history.

Load with `datasets.load_dataset`

You can load Lance datasets through the standard Hugging Face datasets interface. This route suits pipelines that already work with Dataset / IterableDataset objects, or cases where you only want a quick streaming sample.

import datasets

hf_ds = datasets.load_dataset("lance-format/chartqa-lance", split="test", streaming=True)
for row in hf_ds.take(3):
    print(row["question"], "->", row["answer"])

Load with LanceDB

LanceDB is the embedded retrieval library built on top of the Lance format (docs), and is the interface most users interact with. It wraps the dataset as a queryable table with search and filter builders, and is the entry point used by the Search, Curate, Evolve, Train, Versioning, and Materialize-a-subset sections below.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")
print(len(tbl))

Load with Lance

pylance is the Python binding for the Lance format and works directly with the format’s lower-level APIs. Reach for it when you want to inspect dataset internals — schema, scanner, fragments, and the list of pre-built indices.

import lance

ds = lance.dataset("hf://datasets/lance-format/chartqa-lance/data/test.lance")
print(ds.count_rows(), ds.schema.names)
print(ds.list_indices())

Tip — for production use, download locally first. Streaming from the Hub works for exploration, but heavy random access and ANN search are far faster against a local copy:
hf download lance-format/chartqa-lance --repo-type dataset --local-dir ./chartqa-lance
Then point Lance or LanceDB at ./chartqa-lance/data.
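
One way to make the local-versus-Hub choice explicit is a small helper that prefers the local copy when it exists and falls back to the Hub URI otherwise. This is only a sketch: the helper name resolve_chartqa_uri and its default directory are illustrative assumptions, and it assumes the layout produced by the hf download command above.

```python
from pathlib import Path

HUB_URI = "hf://datasets/lance-format/chartqa-lance/data"


def resolve_chartqa_uri(local_dir="./chartqa-lance"):
    """Return the local data directory if it was downloaded, else the Hub URI.

    Hypothetical helper: assumes <local_dir>/data/test.lance exists
    after a full download of the dataset repository.
    """
    data_dir = Path(local_dir) / "data"
    if (data_dir / "test.lance").exists():
        return str(data_dir)
    return HUB_URI
```

The returned string can be passed to lancedb.connect(...) in any of the snippets on this card.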

Search

The bundled IVF_PQ index on question_emb makes question-to-question retrieval a single call: encode a query with the same CLIP model used at ingest (ViT-B/32, cosine-normalized) and pass the resulting 512-d vector to tbl.search(...). The example below uses the question_emb already stored in row 42 as a runnable stand-in, so the snippet works without any model loaded.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")

seed = (
    tbl.search()
    .select(["question_emb", "question"])
    .limit(1)
    .offset(42)
    .to_list()[0]
)

hits = (
    tbl.search(seed["question_emb"], vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "answer", "type"])
    .limit(10)
    .to_list()
)
print("query:", seed["question"])
for r in hits:
    print(f"  [{r['type']}] {r['question'][:70]}  ->  {r['answer']}")
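
The stand-in above reuses a stored vector, which is already cosine-normalized. When a fresh query is encoded instead, it needs the same normalization so that its scores are comparable with the stored embeddings. The sketch below shows the normalization step in plain Python on toy 3-d vectors; the helper names are illustrative, and a real query would be the 512-d output of the CLIP text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-d vectors stand in for 512-d CLIP outputs.
query = l2_normalize([3.0, 4.0, 0.0])
stored = l2_normalize([6.0, 8.0, 0.0])

# Both vectors point the same way, so the similarity is (numerically) 1.0.
similarity = dot(query, stored)
```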

Set vector_column_name="image_emb" instead to run question-to-chart retrieval against the visual embedding. CLIP places text and images in a shared space, so a question vector can be compared with chart vectors directly, which is useful for finding charts that sit close to a given prompt encoding.

Because the dataset also ships an INVERTED index on question and answer, the same query can be issued as a hybrid search that combines the dense vector with a keyword query. LanceDB merges the two result lists and reranks them in a single call. This is useful when a phrase like “percentage” or “highest bar” must literally appear in the question but you still want CLIP to do the heavy lifting on semantic similarity.

hybrid_hits = (
    tbl.search(query_type="hybrid", vector_column_name="question_emb")
    .vector(seed["question_emb"])
    .text("percentage")
    .select(["question", "answer", "type"])
    .limit(10)
    .to_list()
)
for r in hybrid_hits:
    print(f"  [{r['type']}] {r['question'][:70]}  ->  {r['answer']}")
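
To build intuition for how two ranked lists can be merged, the sketch below implements reciprocal rank fusion (RRF), a common fusion rule. Whether this matches the reranker LanceDB applies by default is an assumption here, and the row ids are made up for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = [7, 3, 9]   # hypothetical row ids from the dense search
keyword_hits = [3, 5, 7]  # hypothetical row ids from the keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == [3, 7, 5, 9]
```

Row 3 comes first because it sits near the top of both lists, while rows that appear in only one list fall behind.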

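Tune nprobes and refine_factor on the vector side to trade recall against latency: a higher nprobes scans more IVF partitions, and a higher refine_factor reranks more candidates using the full vectors. Both raise recall at the cost of slower queries. Keep metric set to cosine so that it matches the pre-built indices.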
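
As a rough sketch, the tuning knobs can be wrapped in a helper that takes an open table and a query vector. The function name and the default values (nprobes=20, refine_factor=10) are illustrative starting points, not settings tuned for this dataset.

```python
def tuned_question_search(tbl, query_vec, nprobes=20, refine_factor=10, k=10):
    """Run a question-side vector search with explicit recall/latency knobs.

    tbl is expected to be an open LanceDB table; the defaults are
    illustrative assumptions and should be adjusted after measuring.
    """
    return (
        tbl.search(query_vec, vector_column_name="question_emb")
        .metric("cosine")              # match the metric of the pre-built index
        .nprobes(nprobes)              # number of IVF partitions to scan
        .refine_factor(refine_factor)  # rerank k * refine_factor candidates
        .select(["question", "answer", "type"])
        .limit(k)
        .to_list()
    )
```

Calling tuned_question_search(tbl, seed["question_emb"], nprobes=50) against the table opened earlier would scan more partitions than the default call, usually improving recall at some cost in latency.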

Curate

A typical curation pass combines a content predicate on the question text with a structural predicate on the question-type tag. Stacking both inside a single filtered scan keeps the result small and explicit, and the bounded .limit(500) makes it cheap to inspect before committing the subset to anything downstream. The example below collects human-authored questions that mention a percentage, which is a common slice for evaluating numeric-reasoning behaviour.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")

candidates = (
    tbl.search("percentage OR percent")
    .where("type = 'human'", prefilter=True)
    .select(["id", "question", "answer", "type"])
    .limit(500)
    .to_list()
)
print(f"{len(candidates)} candidates")
if candidates:  # guard against an empty result before indexing
    print("first:", candidates[0]["question"][:80])

The result is a plain list of dictionaries, ready to inspect, persist as a manifest of row ids, or feed into the Evolve and Train workflows below. The image column is never read, so the network traffic for a 500-row candidate scan is dominated by question and answer text rather than chart JPEGs.
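
A manifest of row ids can be as simple as a JSON file. The sketch below uses only the standard library; the function names, the manifest layout, and the file path are illustrative assumptions, and the only thing it relies on from this card is that each row dictionary carries an integer id.

```python
import json
from pathlib import Path


def write_id_manifest(rows, path, note=""):
    """Persist the ids of a candidate list as a small JSON manifest."""
    ids = sorted({int(r["id"]) for r in rows})  # de-duplicate and order
    manifest = {"note": note, "count": len(ids), "ids": ids}
    Path(path).write_text(json.dumps(manifest, indent=2))
    return manifest


def ids_to_filter(ids):
    """Build a SQL-style predicate string selecting the given ids."""
    if not ids:
        raise ValueError("need at least one id to build a filter")
    return "id IN ({})".format(", ".join(str(i) for i in ids))
```

A later session could reload the manifest and pass ids_to_filter(manifest["ids"]) to a where clause to recover the same rows.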

Evolve

Lance stores each column independently, so a new column can be appended without rewriting the existing data. The lightest form is a SQL expression: derive the new column from columns that already exist, and Lance computes it once and persists it. The example below adds answer_length, an is_yes_no flag, and an is_numeric flag, any of which can then be used directly in where clauses without recomputing the predicate on every query.

Note: Mutations require a local copy of the dataset, since the Hub mount is read-only. See the Materialize-a-subset section at the end of this card for a streaming pattern that downloads only the rows and columns you need, or use hf download to pull the full split first.

import lancedb

db = lancedb.connect("./chartqa-lance/data")  # local copy required for writes
tbl = db.open_table("test")

tbl.add_columns({
    "answer_length": "length(answer)",
    "is_yes_no": "lower(answer) IN ('yes', 'no')",
    "is_numeric": "regexp_match(answer, '^-?[0-9]+(\\.[0-9]+)?%?$') IS NOT NULL",
})
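
Before persisting derived columns, it can help to sanity-check the predicates on a few sample answers in plain Python. The sketch below mirrors the three SQL expressions above with standard-library equivalents. The helper name derived_flags is hypothetical, and Python's re module may differ from the SQL engine's regex dialect in edge cases.

```python
import re

# Same pattern as the SQL expression: optional sign, digits,
# optional decimal part, optional trailing percent sign.
NUMERIC = re.compile(r"^-?[0-9]+(\.[0-9]+)?%?$")


def derived_flags(answer):
    """Compute Python analogues of the three derived columns for one answer."""
    return {
        "answer_length": len(answer),
        "is_yes_no": answer.lower() in ("yes", "no"),
        "is_numeric": NUMERIC.match(answer) is not None,
    }


samples = ["12%", "Yes", "-3.5", "1,200"]
flags = {s: derived_flags(s) for s in samples}
```

Note that an answer with a thousands separator such as "1,200" is not flagged as numeric by this pattern, which may or may not be the behaviour you want.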

If the values you want to attach already live in another table (model predictions on the test set, reasoning-chain annotations, a difficulty score), merge them in by joining on the id column:

import pyarrow as pa

predictions = pa.table({
    "id": pa.array([0, 1, 2]),
    "pred_answer": pa.array(["12%", "Yes", "34"]),
    "is_correct": pa.array([True, True, False]),
})
tbl.merge(predictions, on="id")

The original columns and indices are untouched, so existing code that does not reference the new columns continues to work unchanged. New columns become visible to every reader as soon as the operation commits. For column values that require a Python computation (e.g., running a chart-OCR model over the image bytes), Lance provides a batch-UDF API — see the Lance data evolution docs.
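
The card does not say how a column like is_correct would be computed. One plausible option, sketched below, is a relaxed match in the spirit of the ChartQA paper's relaxed accuracy: numeric answers count as correct within a 5% relative tolerance, and everything else needs a case-insensitive exact match. The function names and the parsing rules (stripping a trailing percent sign and thousands separators) are assumptions for illustration.

```python
def _to_float(text):
    """Parse a numeric answer, tolerating a trailing % and commas; else None."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(pred, target, tolerance=0.05):
    """Return True if pred matches target under a relaxed numeric rule."""
    p, t = _to_float(pred), _to_float(target)
    if p is not None and t is not None:
        if t == 0.0:
            return p == 0.0  # avoid dividing by zero
        return abs(p - t) / abs(t) <= tolerance
    # Non-numeric answers: case-insensitive exact match.
    return pred.strip().lower() == target.strip().lower()
```

A column computed this way could then be merged on id exactly like the predictions table above.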

Train

Projection lets a training loop read only the columns each step actually needs. LanceDB tables expose this through Permutation.identity(tbl).select_columns([...]), which plugs straight into the standard torch.utils.data.DataLoader so prefetching, shuffling, and batching behave as in any PyTorch pipeline. For fine-tuning a VLM on chart QA, project the chart bytes plus the question and answer; columns added in the Evolve section above cost nothing per batch until they are explicitly projected.

import lancedb
from lancedb.permutation import Permutation
from torch.utils.data import DataLoader

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")

train_ds = Permutation.identity(tbl).select_columns(["image", "question", "answer"])
loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    # batch carries only the projected columns; decode the JPEG bytes,
    # tokenize the question/answer pair, forward, backward...
    ...

Switching feature sets is a configuration change: passing ["image_emb", "question_emb", "answer"] to select_columns(...) on the next run skips JPEG decoding entirely and reads only the cached 512-d vectors, which is the right shape for training a lightweight answer-classifier or a linear probe on top of frozen features.

Versioning

Every mutation to a Lance dataset, whether it adds a column, merges predictions, or builds an index, commits a new version. Previous versions remain intact on disk. You can list versions and inspect the history directly from the Hub copy; creating new tags requires a local copy since tags are writes.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")

print("Current version:", tbl.version)
print("History:", tbl.list_versions())
print("Tags:", tbl.tags.list())

Once you have a local copy, tag a version for reproducibility:

local_db = lancedb.connect("./chartqa-lance/data")
local_tbl = local_db.open_table("test")
local_tbl.tags.create("eval-v1", local_tbl.version)

A tagged version can be opened by name, or any version reopened by its number, against either the Hub copy or a local one:

tbl_v1 = db.open_table("test", version="eval-v1")
tbl_v5 = db.open_table("test", version=5)

Pinning supports two workflows. An evaluation harness locked to eval-v1 keeps producing comparable scores while the dataset evolves in parallel — newly added prediction columns or labels do not change what the tag resolves to. A training experiment pinned to the same tag can be rerun later against the exact same charts and questions, so changes in metrics reflect model changes rather than data drift. Neither workflow needs shadow copies or external manifest tracking.

Materialize a subset

Reads from the Hub are lazy, so exploratory queries only transfer the columns and row groups they touch. Mutating operations (Evolve, tag creation) need a writable backing store, and a training loop benefits from a local copy with fast random access. Both can be served by a subset of the dataset rather than the full split. The pattern is to stream a filtered query through .to_batches() into a new local table; only the projected columns and matching row groups cross the wire, and the bytes never fully materialize in Python memory.

import lancedb

remote_db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
remote_tbl = remote_db.open_table("test")

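batches = (
    remote_tbl.search("percentage OR percent")
    .where("type = 'human'")
    .select(["id", "image", "question", "answer", "type", "image_emb", "question_emb"])
    # Assumption: search queries apply a small default limit unless one is set,
    # so cap explicitly at the split size (2,500 rows) to keep every match.
    .limit(2500)
    .to_batches()
)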

local_db = lancedb.connect("./chartqa-human-subset")
local_db.create_table("test", batches)

The resulting ./chartqa-human-subset is a first-class LanceDB database. Every snippet in the Evolve, Train, and Versioning sections above works against it by swapping hf://datasets/lance-format/chartqa-lance/data for ./chartqa-human-subset.

Source & license

Converted from lmms-lab/ChartQA. The original ChartQA dataset is released under the GNU GPL-3.0 license by Masry et al.

Citation

@inproceedings{masry2022chartqa,
  title={ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning},
  author={Masry, Ahmed and Long, Do Xuan and Tan, Jia Qing and Joty, Shafiq and Hoque, Enamul},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  year={2022}
}
