Lance-formatted version of HotpotQA, a multi-hop reading-comprehension benchmark in which each answer requires combining facts from two Wikipedia paragraphs. This release uses the distractor config: every question comes with 10 candidate paragraphs (the 2 gold paragraphs plus 8 distractors). Sourced from hotpot_qa.
Splits
| Split | Rows |
|---|---|
| train.lance | 90,447 |
| validation.lance | 7,405 |
Schema
| Column | Type | Notes |
|---|---|---|
| id | string | HotpotQA question id |
| question | string | The question |
| answer | string | Reference short answer (yes / no / span) |
| type | string? | bridge or comparison |
| level | string? | easy / medium / hard |
| supporting_titles | list<string> | Wikipedia titles that contain gold facts |
| supporting_sent_ids | list<int32> | Sentence indices into those titles |
| context_titles | list<string> | All 10 paragraph titles (gold + distractors) |
| context_sentences | list<list<string>> | Sentences per paragraph |
| context_text | string | Flattened paragraphs; feeds the FTS index |
| num_supporting_facts | int32 | Number of gold supporting facts |
| question_emb | fixed_size_list<float32, 384> | sentence-transformers all-MiniLM-L6-v2 (cosine-normalized) |
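The supporting-fact columns are parallel lists: supporting_titles[i] names a paragraph and supporting_sent_ids[i] indexes a sentence within it. A minimal sketch of joining them back to context_sentences, using a fabricated row that mimics the schema (not real HotpotQA data):

```python
# Illustrative row shaped like the schema above (fabricated, not from the dataset)
row = {
    "context_titles": ["Inception", "Dunkirk (2017 film)"],
    "context_sentences": [
        ["Inception is a 2010 film.", "It stars Leonardo DiCaprio."],
        ["Dunkirk is a 2017 war film.", "It was directed by Christopher Nolan."],
    ],
    "supporting_titles": ["Inception", "Dunkirk (2017 film)"],
    "supporting_sent_ids": [1, 1],
}

# Map each paragraph title to its sentence list, then resolve (title, sent_id) pairs
by_title = dict(zip(row["context_titles"], row["context_sentences"]))
gold = [
    by_title[title][sent_id]
    for title, sent_id in zip(row["supporting_titles"], row["supporting_sent_ids"])
]
print(gold)
```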
Pre-built indices
- IVF_PQ on question_emb (metric=cosine)
- INVERTED (FTS) on question and context_text
- BTREE on id, answer
- BITMAP on type, level
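Because question_emb is stored cosine-normalized, query vectors should be unit-length too. Encoders that don't normalize on their own can be handled with a plain L2 normalization; a numpy sketch (the 2-d vector is a toy stand-in for a 384-d embedding):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    # Scale to unit length so cosine distance matches the stored vectors
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

raw = np.array([3.0, 4.0], dtype=np.float32)  # toy 2-d stand-in for a 384-d embedding
unit = l2_normalize(raw)
print(unit)  # unit vector: norm is 1.0
```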
Quick start
```python
import lance

# Open the validation split directly from the Hugging Face Hub
ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
```
Load with LanceDB
These tables can also be opened with LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, which simplifies vector search, full-text search, and filtered queries.
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {len(tbl)} questions")
```
Multi-hop semantic search
```python
import lance
import pyarrow as pa
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["which actor played in both inception and dunkirk"], normalize_embeddings=True)[0]

ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/train.lance")
emb_field = ds.schema.field("question_emb")
hits = ds.scanner(
    nearest={"column": "question_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5},
    columns=["question", "answer", "supporting_titles"],
).to_table().to_pylist()
```
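Nearest-neighbor scans also attach a _distance column to each hit (cosine distance here, i.e. 1 - cosine similarity, so smaller is closer). A small post-processing sketch that keeps only sufficiently close matches, shown on a hypothetical hit list rather than live query results:

```python
# Hypothetical hit list shaped like the scanner output above
hits = [
    {"question": "q1", "_distance": 0.12},
    {"question": "q2", "_distance": 0.35},
    {"question": "q3", "_distance": 0.80},
]

MAX_COSINE_DISTANCE = 0.5  # illustrative threshold; tune for your application
close = [h for h in hits if h["_distance"] <= MAX_COSINE_DISTANCE]
print([h["question"] for h in close])
```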
LanceDB semantic search
```python
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["which actor played in both inception and dunkirk"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("train")
results = (
    tbl.search(q.tolist(), vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "answer", "supporting_titles"])
    .limit(5)
    .to_list()
)
```
LanceDB full-text search
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("train")
results = (
    tbl.search("inception dunkirk", query_type="fts")  # target the INVERTED index explicitly
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)
```
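With both an FTS index and a vector index available, a common way to merge the two hit lists is reciprocal rank fusion (RRF). A minimal sketch over hypothetical document-id lists (the ids and k=60 default are illustrative, not part of this dataset):

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank)
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fts_ids = ["a", "b", "c"]     # hypothetical FTS hit ids, best first
vector_ids = ["b", "d", "a"]  # hypothetical vector hit ids, best first
print(rrf([fts_ids, vector_ids]))
```

Ids ranked highly by both lists (here "b" and "a") float to the top even though neither list alone agrees on the winner.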
Filter by question type
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance")
hard_compare = ds.scanner(
    filter="type = 'comparison' AND level = 'hard'",
    columns=["question", "answer"],
    limit=10,
).to_table()
```
Filter with LanceDB
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("validation")
hard_compare = (
    tbl.search()
    .where("type = 'comparison' AND level = 'hard'")
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)
```
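Answers produced against this dataset are usually scored with the standard HotpotQA answer metrics: exact match and token-level F1 after light normalization. A sketch below follows the common SQuAD-style normalization recipe (lowercase, strip punctuation and articles); this is an assumption about the intended scoring, not an official scorer shipped with this card:

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))       # article/case differences are ignored
print(round(f1("tower in Paris", "eiffel tower"), 3))
```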
Source & license
Converted from hotpot_qa (distractor config). HotpotQA is released under CC BY-SA 4.0.
Citation
```bibtex
@inproceedings{yang2018hotpotqa,
  title={HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}
```