
Source dataset card and downloadable files for lance-format/hotpotqa-distractor-lance.
Lance-formatted version of HotpotQA, a multi-hop reading-comprehension benchmark in which each answer requires combining facts from two Wikipedia paragraphs. This release uses the distractor config: 10 candidate paragraphs per question (the 2 gold paragraphs plus 8 distractors). Sourced from hotpot_qa.

Splits

| Split            | Rows   |
|------------------|--------|
| train.lance      | 90,447 |
| validation.lance | 7,405  |

Schema

| Column               | Type                           | Notes                                                    |
|----------------------|--------------------------------|----------------------------------------------------------|
| id                   | string                         | HotpotQA question id                                     |
| question             | string                         | The question                                             |
| answer               | string                         | Reference short answer (yes / no / span)                 |
| type                 | string?                        | bridge or comparison                                     |
| level                | string?                        | easy / medium / hard                                     |
| supporting_titles    | list<string>                   | Wikipedia titles that contain gold facts                 |
| supporting_sent_ids  | list<int32>                    | Sentence indices into those titles                       |
| context_titles       | list<string>                   | All 10 paragraph titles (gold + distractors)             |
| context_sentences    | list<list<string>>             | Sentences per paragraph                                  |
| context_text         | string                         | Flattened paragraphs; feeds the FTS index                |
| num_supporting_facts | int32                          | Number of gold supporting facts                          |
| question_emb         | fixed_size_list<float32, 384>  | sentence-transformers all-MiniLM-L6-v2 (cosine-normalized) |

Pre-built indices

  • IVF_PQ on question_emb (metric=cosine)
  • INVERTED (FTS) on question and context_text
  • BTREE on id, answer
  • BITMAP on type, level

Quick start

import lance

# Open the validation split directly from the Hugging Face Hub
ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

The same tables can be opened with LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, which simplifies vector search, full-text search, and filtered queries.
import lancedb

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {len(tbl)} questions")
Vector search with Lance

import lance, pyarrow as pa
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["which actor played in both inception and dunkirk"], normalize_embeddings=True)[0]

ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/train.lance")
emb_field = ds.schema.field("question_emb")
hits = ds.scanner(
    nearest={"column": "question_emb", "q": pa.array([q.tolist()], type=emb_field.type)[0], "k": 5},
    columns=["question", "answer", "supporting_titles"],
).to_table().to_pylist()
Vector search with LanceDB

import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["which actor played in both inception and dunkirk"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("train")

results = (
    tbl.search(q.tolist(), vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "answer", "supporting_titles"])
    .limit(5)
    .to_list()
)
Full-text search with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("train")

results = (
    tbl.search("inception dunkirk")
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)

Filter by question type

import lance
ds = lance.dataset("hf://datasets/lance-format/hotpotqa-distractor-lance/data/validation.lance")
hard_compare = ds.scanner(
    filter="type = 'comparison' AND level = 'hard'",
    columns=["question", "answer"],
    limit=10,
).to_table()

Filter with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/hotpotqa-distractor-lance/data")
tbl = db.open_table("validation")
hard_compare = (
    tbl.search()
    .where("type = 'comparison' AND level = 'hard'")
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)

Source & license

Converted from hotpot_qa (distractor config). HotpotQA is released under CC BY-SA 4.0.

Citation

@inproceedings{yang2018hotpotqa,
  title={HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}