
lance-format/natural-questions-val-lance

Lance-formatted version of the Natural Questions validation split: 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. Sourced from google-research-datasets/natural_questions.
The NQ train split is 143 GB (307,373 rows) and is intentionally not bundled here; add it by running natural_questions/dataprep.py --splits train once disk and bandwidth allow.

Splits

Split             Rows
validation.lance  7,830

Schema

Column             Type                            Notes
id                 string                          NQ example id
question           string                          Original Google search query
document_title     string                          Wikipedia article title
document_url       string                          Wikipedia article URL
document_html      large_binary                    Full HTML of the article (inline; UTF-8 bytes)
short_answers      list<string>                    Deduped short-answer spans across all annotators
num_short_answers  int32                           Total annotator spans (incl. duplicates)
has_short_answer   bool                            At least one annotator provided a short-answer span
has_long_answer    bool                            At least one annotator selected a long-answer candidate
yes_no_answer      string                          YES / NO / NONE (majority vote across annotators)
question_emb       fixed_size_list<float32, 384>   sentence-transformers all-MiniLM-L6-v2 (cosine-normalized)
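
Because question_emb is cosine-normalized, cosine similarity between two embeddings reduces to a plain dot product, and cosine distance to one minus it. A minimal pure-Python sketch of that identity (toy 3-d vectors, not real MiniLM embeddings):

```python
import math

def normalize(v):
    # Scale a vector to unit length, as the card says question_emb already is.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a, b):
    # For unit vectors, cosine distance = 1 - dot(a, b).
    return 1.0 - sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])
print(round(cosine_distance(a, b), 4))  # → 0.1111
```

This is why the vector index below can use metric=cosine cheaply: on pre-normalized vectors it is equivalent to a dot-product search.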

Pre-built indices

  • IVF_PQ on question_emb (metric=cosine)
  • INVERTED (FTS) on question
  • BTREE on id, document_title
  • BITMAP on yes_no_answer, has_short_answer, has_long_answer
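
The IVF part of IVF_PQ partitions vectors by nearest centroid so a query scans only a few partitions rather than all 7,830 rows. A toy pure-Python sketch of the inverted-file idea (the real index also product-quantizes vectors inside each partition, which is omitted here; centroids and data are made up):

```python
import math
from collections import defaultdict

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    # Inverted file: map each centroid to the row ids assigned to it.
    parts = defaultdict(list)
    for i, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda j: dist(v, centroids[j]))
        parts[c].append(i)
    return parts

def search(query, vectors, centroids, parts, nprobe=1):
    # Probe only the nprobe closest partitions instead of the whole dataset.
    order = sorted(range(len(centroids)), key=lambda j: dist(query, centroids[j]))
    candidates = [i for j in order[:nprobe] for i in parts[j]]
    return min(candidates, key=lambda i: dist(query, vectors[i]))

vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
parts = build_ivf(vectors, centroids)
print(search([5.1, 5.0], vectors, centroids, parts))  # → 2
```

The scalar indices (BTREE, BITMAP, INVERTED) play the same role for the filter and full-text queries shown below: they let the engine skip rows before any vectors are compared.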

Quick start

import lance
ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

These tables can also be consumed by LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {tbl.count_rows()} questions")

Vector search: embed a query with the same MiniLM model and search the question_emb column.
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["who wrote the declaration of independence"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search(q.tolist(), vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "short_answers", "document_title"])
    .limit(5)
    .to_list()
)

Full-text search using the INVERTED index on question:
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search("declaration of independence", query_type="fts")
    .select(["question", "document_title"])
    .limit(10)
    .to_list()
)

Get only questions with short-answer spans

import lance
ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
short = ds.scanner(
    filter="has_short_answer = true",
    columns=["question", "short_answers", "document_title"],
    limit=10,
).to_table().to_pylist()

Filter with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
short = (
    tbl.search()
    .where("has_short_answer = true")
    .select(["question", "short_answers", "document_title"])
    .limit(10)
    .to_list()
)

Read the full Wikipedia HTML for one question

import lance
ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
row = ds.take([0], columns=["question", "document_html", "document_url"]).to_pylist()[0]
print(row["question"], "->", row["document_url"])
print(row["document_html"][:500].decode("utf-8", errors="replace"))
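
To pull something structured out of document_html without extra dependencies, Python's stdlib html.parser is enough. A sketch that extracts the <title> element from HTML bytes, shown on a toy document rather than a fetched row:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    # Collect the text that appears inside the <title> element.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for row["document_html"], which is raw UTF-8 bytes.
html_bytes = b"<html><head><title>Declaration of Independence</title></head><body>...</body></html>"
parser = TitleParser()
parser.feed(html_bytes.decode("utf-8", errors="replace"))
print(parser.title)  # → Declaration of Independence
```

For real Wikipedia pages a dedicated HTML library (e.g. an external parser) will be more robust, but the same decode-then-parse pattern applies.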

Source & license

Converted from google-research-datasets/natural_questions. NQ is released under CC BY-SA 3.0 (matching the Wikipedia source).

Citation

@article{kwiatkowski2019natural,
  title={Natural Questions: A Benchmark for Question Answering Research},
  author={Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
  journal={Transactions of the Association for Computational Linguistics},
  year={2019}
}