Lance-formatted version of the Natural Questions validation split — 7,830 real Google search queries with their full Wikipedia articles and 1–5 annotator labels per question. Sourced from google-research-datasets/natural_questions.
The NQ train split is 143 GB (307,373 rows); it is intentionally not bundled here. Add it via `natural_questions/dataprep.py --splits train` once disk and bandwidth allow.
Splits
| Split | Rows |
|---|---|
| validation.lance | 7,830 |
Schema
| Column | Type | Notes |
|---|---|---|
| id | string | NQ example id |
| question | string | Original Google search query |
| document_title | string | Wikipedia article title |
| document_url | string | Wikipedia article URL |
| document_html | large_binary | Full HTML of the article (inline; UTF-8 bytes) |
| short_answers | list&lt;string&gt; | Deduped short-answer spans across all annotators |
| num_short_answers | int32 | Total annotator spans (incl. duplicates) |
| has_short_answer | bool | At least one annotator provided a short-answer span |
| has_long_answer | bool | At least one annotator selected a long-answer candidate |
| yes_no_answer | string | YES / NO / NONE (majority vote across annotators) |
| question_emb | fixed_size_list&lt;float32, 384&gt; | sentence-transformers all-MiniLM-L6-v2 (cosine-normalized) |
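The relationship between `short_answers` and `num_short_answers` can be illustrated with a small sketch. The spans and the helper below are hypothetical, not real NQ data; the point is that `short_answers` is deduplicated while `num_short_answers` counts every annotator span, duplicates included:

```python
def summarize_spans(annotator_spans):
    """Mimic the card's schema: dedupe spans for `short_answers`,
    count every span (including duplicates) for `num_short_answers`."""
    deduped = list(dict.fromkeys(annotator_spans))  # order-preserving dedup
    return {
        "short_answers": deduped,
        "num_short_answers": len(annotator_spans),
        "has_short_answer": len(annotator_spans) > 0,
    }

# Three annotators, two of whom gave the same span.
row = summarize_spans(["Thomas Jefferson", "Jefferson", "Thomas Jefferson"])
print(row)
# {'short_answers': ['Thomas Jefferson', 'Jefferson'], 'num_short_answers': 3, 'has_short_answer': True}
```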
Pre-built indices
- IVF_PQ on `question_emb` (metric=cosine)
- INVERTED (FTS) on `question`
- BTREE on `id`, `document_title`
- BITMAP on `yes_no_answer`, `has_short_answer`, `has_long_answer`
Quick start
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
```
Load with LanceDB
These tables can also be opened with LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, which simplifies vector, full-text, and filtered queries.
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {len(tbl)} questions")
```
LanceDB semantic question search
```python
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["who wrote the declaration of independence"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
results = (
    tbl.search(q.tolist(), vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "short_answers", "document_title"])
    .limit(5)
    .to_list()
)
```
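Because `question_emb` vectors are cosine-normalized at build time, the cosine distance this query computes reduces to `1 - dot(a, b)`; the norms in the denominator of the full cosine formula are already 1. A small numpy sketch with synthetic 384-d vectors (not real embeddings) illustrates the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384)).astype(np.float32)
a /= np.linalg.norm(a)  # normalize, as the stored embeddings are
b /= np.linalg.norm(b)

# What metric="cosine" computes for unit vectors:
cosine_distance = 1.0 - float(a @ b)
# The general cosine-distance formula:
full_formula = 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
assert abs(cosine_distance - full_formula) < 1e-6
```

This is also why the query vector is encoded with `normalize_embeddings=True` above: both sides of the comparison must be unit-length for the shortcut to hold.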
LanceDB full-text search
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
results = (
    tbl.search("declaration of independence", query_type="fts")
    .select(["question", "document_title"])
    .limit(10)
    .to_list()
)
```
Get only questions with short-answer spans
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
short = ds.scanner(
    filter="has_short_answer = true",
    columns=["question", "short_answers", "document_title"],
    limit=10,
).to_table().to_pylist()
```
Filter with LanceDB
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/natural-questions-val-lance/data")
tbl = db.open_table("validation")
short = (
    tbl.search()
    .where("has_short_answer = true")
    .select(["question", "short_answers", "document_title"])
    .limit(10)
    .to_list()
)
```
Read the full Wikipedia HTML for one question
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/natural-questions-val-lance/data/validation.lance")
row = ds.take([0], columns=["question", "document_html", "document_url"]).to_pylist()[0]
print(row["question"], "->", row["document_url"])
print(row["document_html"][:500].decode("utf-8", errors="replace"))
```
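`document_html` holds raw article bytes, so getting clean text requires an HTML parser. A minimal stdlib sketch (the `TextExtractor` class and the sample bytes are illustrative; a real pipeline would likely use a dedicated extractor such as BeautifulSoup):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html_bytes: bytes) -> str:
    parser = TextExtractor()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    return " ".join(parser.parts)

# Stand-in for row["document_html"]:
sample = b"<html><head><style>p{}</style></head><body><p>Hello <b>NQ</b></p></body></html>"
print(html_to_text(sample))  # Hello NQ
```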
Source & license
Converted from google-research-datasets/natural_questions. NQ is released under CC BY-SA 3.0 (matching the Wikipedia source).
Citation
```bibtex
@article{kwiatkowski2019natural,
  title={Natural Questions: A Benchmark for Question Answering Research},
  author={Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
  journal={Transactions of the Association for Computational Linguistics},
  year={2019}
}
```