Lance-formatted version of SQuAD v2 (the Stanford Question Answering Dataset, version 2), with MiniLM sentence embeddings stored inline alongside the questions, contexts, and answers. The source dataset card and downloadable files are on Hugging Face under lance-format/squad-v2-lance.

Why this version?

  • One self-contained Lance dataset with 130k+ Wikipedia-grounded questions and reference answers.
  • Pre-computed text embeddings of the question column (sentence-transformers/all-MiniLM-L6-v2, 384-dim, L2-normalized), stored in question_emb with an IVF_PQ index for fast semantic question retrieval.
  • Full-text inverted indices on both question and context for keyword search.
  • BITMAP index on is_impossible for fast filtering between answerable and unanswerable questions.

Splits

| Split | Rows |
| --- | --- |
| train.lance | 130,319 |
| validation.lance | 11,873 |

Schema

| Column | Type | Notes |
| --- | --- | --- |
| id | string | SQuAD question id |
| title | string | Wikipedia article title |
| context | string | Paragraph the question was generated from |
| question | string | The question text |
| answers | list<string> | Accepted answer spans (empty for impossible questions) |
| answer_starts | list<int32> | Character offsets of each answer within context |
| is_impossible | bool | true for SQuAD 2.0 unanswerable questions |
| question_emb | fixed_size_list<float32, 384> | MiniLM embedding of question (L2-normalized, for cosine similarity) |
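
The answers and answer_starts columns are parallel lists: each offset says where the matching answer begins inside context. The sketch below shows one way to recover and sanity-check the spans. The row is made up for illustration (it is not taken from the dataset), the helper name answer_spans is hypothetical, and the offsets are assumed to be Python string indices.

```python
def answer_spans(row):
    """Return (start, end, text) for each answer, verified against context."""
    spans = []
    for answer, start in zip(row["answers"], row["answer_starts"]):
        end = start + len(answer)
        # Slicing context at the stored offset should reproduce the answer.
        if row["context"][start:end] != answer:
            raise ValueError(f"offset {start} does not match answer {answer!r}")
        spans.append((start, end, answer))
    return spans


# Hypothetical row shaped like the schema above (not real dataset content).
row = {
    "context": "The tower was completed in 1889 for the World's Fair.",
    "question": "When was the tower completed?",
    "answers": ["1889", "in 1889"],
    "answer_starts": [27, 24],
    "is_impossible": False,
}

spans = answer_spans(row)

# An unanswerable question has empty lists, so it yields no spans.
impossible_row = {"context": row["context"], "answers": [], "answer_starts": []}
no_spans = answer_spans(impossible_row)
```

A check like this is useful after any preprocessing that alters context (whitespace stripping, Unicode normalization), since such changes can silently shift the offsets.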

Pre-built indices

  • IVF_PQ on question_emb, metric=cosine
  • INVERTED on question and context
  • BTREE on id and title
  • BITMAP on is_impossible
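
Because the stored embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The sketch below illustrates that identity in pure Python with toy 3-dimensional vectors (the real column is 384-dimensional). The helper names are illustrative, and the "distance = 1 - similarity" convention is the common definition of cosine distance, assumed here rather than taken from this page.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from scratch, with no normalization assumed."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

sim_full = cosine_similarity(a, b)                # general formula
sim_dot = dot(l2_normalize(a), l2_normalize(b))   # dot product of unit vectors
cosine_distance = 1.0 - sim_dot                   # assumed distance convention
```

This is why the query vector should also be normalized before searching: with unit vectors on both sides, ranking by dot product and ranking by cosine similarity agree.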

Quick start

import lance

ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

These tables can also be consumed by LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {len(tbl)} questions")

Semantic question retrieval

import lance
import pyarrow as pa
from sentence_transformers import SentenceTransformer

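# Device is left to auto-detection (GPU if available, otherwise CPU).
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Normalize the query so it matches the L2-normalized stored embeddings.
q_vec = encoder.encode(["what year was the eiffel tower built?"], normalize_embeddings=True)[0]

ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/train.lance")
# Cast the query to the column's fixed_size_list<float32, 384> type.
emb_field = ds.schema.field("question_emb")
query = pa.array([q_vec.tolist()], type=emb_field.type)

# nprobes sets how many IVF partitions are searched; refine_factor re-ranks
# k * refine_factor candidates using exact distances.
hits = ds.scanner(
    nearest={"column": "question_emb", "q": query[0], "k": 10, "nprobes": 16, "refine_factor": 30},
    columns=["id", "title", "question", "answers"],
).to_table().to_pylist()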
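
The returned hits are plain Python dicts, so they can be post-processed without any further Lance calls. The sketch below converts distances to similarities and drops weak matches. It assumes each hit carries a `_distance` field holding the cosine distance (1 - similarity). The sample hits and the helper name rank_hits are invented for illustration and are not real query output.

```python
def rank_hits(hits, min_similarity=0.0):
    """Attach a similarity score to each hit and keep those above a threshold."""
    ranked = []
    for hit in hits:
        # Assumption: _distance is a cosine distance, so similarity = 1 - distance.
        similarity = 1.0 - hit["_distance"]
        if similarity >= min_similarity:
            ranked.append({**hit, "similarity": similarity})
    # Highest similarity first.
    ranked.sort(key=lambda h: h["similarity"], reverse=True)
    return ranked


# Hypothetical hits shaped like the scanner output (not real results).
sample_hits = [
    {"id": "a1", "question": "When was the tower finished?", "_distance": 0.5},
    {"id": "b2", "question": "What year was the tower built?", "_distance": 0.125},
    {"id": "c3", "question": "Who painted the ceiling?", "_distance": 0.75},
]

top = rank_hits(sample_hits, min_similarity=0.5)
top_ids = [h["id"] for h in top]
```

A threshold like this is a design choice rather than a property of the dataset: a sensible cut-off depends on the embedding model and should be tuned on a few known queries.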

LanceDB semantic question retrieval

import lancedb
from sentence_transformers import SentenceTransformer

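# Device is left to auto-detection (GPU if available, otherwise CPU).
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Normalize the query so it matches the L2-normalized stored embeddings.
q_vec = encoder.encode(["what year was the eiffel tower built?"], normalize_embeddings=True)[0]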

db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data")
tbl = db.open_table("train")

results = (
    tbl.search(q_vec.tolist(), vector_column_name="question_emb")
    .metric("cosine")
    .select(["id", "title", "question", "answers"])
    .limit(10)
    .to_list()
)

Full-text search on contexts

import lance

ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/train.lance")
hits = ds.scanner(
    full_text_query="great pyramid of giza",
    columns=["title", "question", "context"],
    limit=5,
).to_table().to_pylist()

LanceDB full-text search

import lancedb

db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data")
tbl = db.open_table("train")

results = (
    tbl.search("great pyramid of giza")
    .select(["title", "question", "context"])
    .limit(5)
    .to_list()
)

Filter answerable vs impossible questions

import lance

ds = lance.dataset("hf://datasets/lance-format/squad-v2-lance/data/validation.lance")
impossible = ds.scanner(filter="is_impossible = true", columns=["question"], limit=5).to_table()

Filter with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/squad-v2-lance/data")
tbl = db.open_table("validation")
impossible = (
    tbl.search()
    .where("is_impossible = true")
    .select(["question"])
    .limit(5)
    .to_list()
)
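
Filter strings such as "is_impossible = true" are SQL-style expressions, so several conditions can be combined with AND. The sketch below builds such a string from Python values. The helper names build_filter and sql_quote are hypothetical, the title value is only an example, and doubling single quotes is the standard SQL escaping rule, assumed here to apply to these filters.

```python
def sql_quote(value):
    """Quote a string literal for a SQL-style filter, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_filter(is_impossible=None, title=None):
    """Combine optional conditions into one filter string, or None if empty."""
    clauses = []
    if is_impossible is not None:
        flag = "true" if is_impossible else "false"
        clauses.append(f"is_impossible = {flag}")
    if title is not None:
        clauses.append(f"title = {sql_quote(title)}")
    return " AND ".join(clauses) if clauses else None


# Example values only; the title is illustrative.
answerable_in_article = build_filter(is_impossible=False, title="Normans")
only_impossible = build_filter(is_impossible=True)
tricky_title = build_filter(title="Hundred_Years'_War")
```

The resulting string can be passed wherever the snippets above pass a literal filter, for example as the filter argument of the scanner or the argument to where.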

Why Lance?

  • One dataset carries questions, contexts, answers, embeddings, and indices, with no sidecar files.
  • On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub.
  • Schema evolution: add columns (alternate embeddings, model predictions, task labels) without rewriting the data.

Source & license

Converted from rajpurkar/squad_v2. SQuAD v2 is released under CC BY-SA 4.0.

Citation

@inproceedings{rajpurkar2018know,
  title={Know What You Don't Know: Unanswerable Questions for SQuAD},
  author={Rajpurkar, Pranav and Jia, Robin and Liang, Percy},
  booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers)},
  year={2018},
}