
lance-format/docvqa-lance

Lance-formatted version of DocVQA — VQA over document images (industry / government scans, multi-page reports, forms, receipts) — sourced from lmms-lab/DocVQA (DocVQA config).

Splits

| Split | Rows |
|---|---|
| validation.lance | 5,349 |
| test.lance | 5,188 |

Schema

| Column | Type | Notes |
|---|---|---|
| id | int64 | Row index within split |
| image | large_binary | Inline JPEG bytes (page image) |
| image_id | string? | DocVQA docId (alias) |
| question_id | string? | DocVQA questionId |
| question | string | Natural-language question |
| answers | list&lt;string&gt; | Reference answer span(s) |
| answer | string | First reference answer (FTS target) |
| doc_id | string? | DocVQA document id |
| ucsf_document_id | string? | UCSF Industry Documents Library id |
| ucsf_document_page_no | string? | Page number within the source document |
| data_split | string? | Original split label from the source |
| question_types | list&lt;string&gt; | DocVQA question-type tags (form, figure, table, …) |
| image_emb | fixed_size_list&lt;float32, 512&gt; | CLIP image embedding (cosine-normalized) |
| question_emb | fixed_size_list&lt;float32, 512&gt; | CLIP text embedding of the question |

Pre-built indices

  • IVF_PQ on image_emb and question_emb (metric=cosine)
  • INVERTED (FTS) on question and answer
  • BTREE on image_id, question_id, doc_id
  • LABEL_LIST on question_types
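
Because the stored CLIP embeddings are cosine-normalized, a plain dot product between two of them equals their cosine similarity, which is why the vector indices use `metric=cosine`. A minimal pure-Python sketch of that identity (the vectors here are made up for illustration, not taken from the dataset):

```python
import math

def normalize(v):
    """Scale a vector to unit length, as the stored CLIP embeddings are."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(a, b):
    """Dot product divided by the product of the two norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a = normalize([0.3, 0.4, 0.5])
b = normalize([0.1, 0.9, 0.2])

# For unit-length vectors the plain dot product is the cosine similarity.
dot = sum(x * y for x, y in zip(a, b))
assert abs(dot - cosine_similarity(a, b)) < 1e-9
```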

Quick start

import lance
ds = lance.dataset("hf://datasets/lance-format/docvqa-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
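
The image column holds raw JPEG bytes inline, so a page image can be decoded straight from a row without fetching any original files. A small stdlib sketch of a sanity check on those bytes (the `looks_like_jpeg` helper and the stub bytes are illustrative, not part of the dataset API):

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker

def looks_like_jpeg(data: bytes) -> bool:
    """Cheap sanity check that inline bytes are a JPEG stream."""
    return data[:2] == JPEG_SOI

# In practice `data` would come from a row of the dataset; here we use a
# stub header so the sketch is self-contained.
data = b"\xff\xd8\xff\xe0" + b"\x00" * 16
assert looks_like_jpeg(data)

# The bytes can then be handed to any image decoder,
# e.g. PIL's Image.open(io.BytesIO(data)).
```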

Load with LanceDB

These tables can also be opened with LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, which simplifies vector search and other queries.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {tbl.count_rows()} document-question pairs")
Vector search

import lancedb

db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data")
tbl = db.open_table("validation")

# Use one row's question embedding as the query vector
ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0]
query_embedding = ref["question_emb"]

results = (
    tbl.search(query_embedding, vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "answer"])
    .limit(5)
    .to_list()
)
Full-text search

import lancedb

db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search("invoice total", query_type="fts")
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)

Filter by question type

import lance
ds = lance.dataset("hf://datasets/lance-format/docvqa-lance/data/validation.lance")
forms = ds.scanner(
    filter="array_has_any(question_types, ['form'])",
    columns=["question", "answer"],
    limit=5,
).to_table()

Filter with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/docvqa-lance/data")
tbl = db.open_table("validation")
forms = (
    tbl.search()
    .where("array_has_any(question_types, ['form'])")
    .select(["question", "answer"])
    .limit(5)
    .to_list()
)
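
The `array_has_any` predicate keeps a row when its list column shares at least one element with the supplied list. A pure-Python equivalent of the filter above, to make the semantics concrete (the rows here are mocked for illustration):

```python
def array_has_any(values, candidates):
    """True when the two lists share at least one element."""
    return not set(values).isdisjoint(candidates)

rows = [
    {"question": "What is the total?", "question_types": ["form", "table"]},
    {"question": "What does the chart show?", "question_types": ["figure"]},
]

# Same selection the SQL filter performs on the question_types column.
forms = [r for r in rows if array_has_any(r["question_types"], ["form"])]
assert [r["question"] for r in forms] == ["What is the total?"]
```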

Source & license

Converted from lmms-lab/DocVQA. DocVQA is released under the MIT license; the underlying documents come from the UCSF Industry Documents Library — review their access conditions before redistribution.

Citation

@inproceedings{mathew2021docvqa,
  title={DocVQA: A Dataset for VQA on Document Images},
  author={Mathew, Minesh and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2021}
}