
lance-format/natural-questions-val-lance

Lance-formatted version of the Natural Questions validation split: 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. Sourced from google-research-datasets/natural_questions.
The NQ train split is 143 GB (307,373 rows) and is intentionally not bundled here; add it by running natural_questions/dataprep.py --splits train once disk and bandwidth allow.

Splits

Split             Rows
validation.lance  7,830

Schema

Column             Type                            Notes
id                 string                          NQ example id
question           string                          Original Google search query
document_title     string                          Wikipedia article title
document_url       string                          Wikipedia article URL
document_html      large_binary                    Full HTML of the article (inline; UTF-8 bytes)
short_answers      list<string>                    Deduped short-answer spans across all annotators
num_short_answers  int32                           Total annotator spans (incl. duplicates)
has_short_answer   bool                            At least one annotator provided a short-answer span
has_long_answer    bool                            At least one annotator selected a long-answer candidate
yes_no_answer      string                          YES / NO / NONE (majority vote across annotators)
question_emb       fixed_size_list<float32, 384>   sentence-transformers all-MiniLM-L6-v2 (cosine-normalized)
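
Because question_emb is cosine-normalized, cosine similarity between two embeddings reduces to a plain dot product, and cosine distance to one minus it. A minimal pure-Python sketch of that identity (toy 3-d vectors, not real MiniLM embeddings):

```python
import math

def normalize(v):
    # Scale a vector to unit length, as the card says question_emb already is.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a, b):
    # For unit vectors, cosine distance = 1 - dot(a, b).
    return 1.0 - sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])
print(round(cosine_distance(a, b), 4))  # → 0.1111
```

This is why the vector index below can use metric=cosine cheaply: on pre-normalized vectors it is equivalent to a dot-product search.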

Pre-built indices

  • IVF_PQ on question_emb (metric=cosine)
  • INVERTED (FTS) on question
  • BTREE on id, document_title
  • BITMAP on yes_no_answer, has_short_answer, has_long_answer
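
The IVF part of IVF_PQ partitions vectors by nearest centroid so a query scans only a few partitions rather than all 7,830 rows. A toy pure-Python sketch of the inverted-file idea (the real index also product-quantizes vectors inside each partition, which is omitted here; centroids and data are made up):

```python
import math
from collections import defaultdict

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    # Inverted file: map each centroid to the row ids assigned to it.
    parts = defaultdict(list)
    for i, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda j: dist(v, centroids[j]))
        parts[c].append(i)
    return parts

def search(query, vectors, centroids, parts, nprobe=1):
    # Probe only the nprobe closest partitions instead of the whole dataset.
    order = sorted(range(len(centroids)), key=lambda j: dist(query, centroids[j]))
    candidates = [i for j in order[:nprobe] for i in parts[j]]
    return min(candidates, key=lambda i: dist(query, vectors[i]))

vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
parts = build_ivf(vectors, centroids)
print(search([5.1, 5.0], vectors, centroids, parts))  # → 2
```

The scalar indices (BTREE, BITMAP, INVERTED) play the same role for the filter and full-text queries shown below: they let the engine skip rows before any vectors are compared.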

Quick start

import lance
ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

These tables can also be consumed by LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {tbl.count_rows()} questions")

Vector search: embed a query with the same MiniLM model and search the question_emb column.
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["who wrote the declaration of independence"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search(q.tolist(), vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "short_answers", "document_title"])
    .limit(5)
    .to_list()
)

Full-text search using the INVERTED index on question:
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search("declaration of independence", query_type="fts")
    .select(["question", "document_title"])
    .limit(10)
    .to_list()
)

Get only questions with short-answer spans

import lance
ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
short = ds.scanner(
    filter="has_short_answer = true",
    columns=["question", "short_answers", "document_title"],
    limit=10,
).to_table().to_pylist()

Filter with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
short = (
    tbl.search()
    .where("has_short_answer = true")
    .select(["question", "short_answers", "document_title"])
    .limit(10)
    .to_list()
)

Read the full Wikipedia HTML for one question

import lance
ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
row = ds.take([0], columns=["question", "document_html", "document_url"]).to_pylist()[0]
print(row["question"], "->", row["document_url"])
print(row["document_html"][:500].decode("utf-8", errors="replace"))
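
To pull something structured out of document_html without extra dependencies, Python's stdlib html.parser is enough. A sketch that extracts the <title> element from HTML bytes, shown on a toy document rather than a fetched row:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    # Collect the text that appears inside the <title> element.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for row["document_html"], which is raw UTF-8 bytes.
html_bytes = b"<html><head><title>Declaration of Independence</title></head><body>...</body></html>"
parser = TitleParser()
parser.feed(html_bytes.decode("utf-8", errors="replace"))
print(parser.title)  # → Declaration of Independence
```

For real Wikipedia pages a dedicated HTML library (e.g. an external parser) will be more robust, but the same decode-then-parse pattern applies.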

Source & license

Converted from google-research-datasets/natural_questions. NQ is released under CC BY-SA 3.0 (matching the Wikipedia source).

Citation

@article{kwiatkowski2019natural,
  title={Natural Questions: A Benchmark for Question Answering Research},
  author={Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
  journal={Transactions of the Association for Computational Linguistics},
  year={2019}
}