Source dataset card and downloadable files for lance-format/ms-marco-v2.1-lance.
Lance-formatted version of MS MARCO v2.1 — Microsoft’s machine reading comprehension benchmark — with MiniLM query embeddings stored inline alongside the candidate passages and human-written answers.

Why this version?

  • One self-contained Lance dataset with ~910k queries (808,731 train, 101,093 validation); each row holds a query, up to 10 candidate passages retrieved by Bing, their relevance flags, and the human-written reference answers.
  • Pre-computed query embeddings (sentence-transformers/all-MiniLM-L6-v2, 384-dim, L2-normalized) with an IVF_PQ index — semantic query lookup without re-embedding.
  • Full-text inverted indices on the query and the first selected passage.
  • Designed for both retrieval research (use the index) and RAG / answer eval (use the passage list + answers).

Splits

| Split | Rows |
|---|---|
| train.lance | 808,731 |
| validation.lance | 101,093 |

Schema

| Column | Type | Notes |
|---|---|---|
| query_id | int64 | MS MARCO query id |
| query | string | The user’s natural-language query |
| query_type | string | One of DESCRIPTION, NUMERIC, ENTITY, LOCATION, PERSON |
| answers | list<string> | Human-written reference answers |
| well_formed_answers | list<string> | Reference answers re-written as full sentences |
| passage_text | list<string> | Up to 10 candidate passages |
| passage_url | list<string> | Source URLs for each candidate |
| passage_is_selected | list<int8> | 1 if Bing labelled the passage relevant |
| selected_passage | string (nullable) | First relevant passage (null if none) |
| query_emb | fixed_size_list<float32, 384> | MiniLM embedding of query (L2-normalized, for cosine search) |

Pre-built indices

  • IVF_PQ on query_emb (metric=cosine)
  • INVERTED on query and selected_passage
  • BTREE on query_id
  • BITMAP on query_type
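
Because query_emb is stored L2-normalized, the cosine similarity between two embeddings equals their plain dot product, and a cosine distance is 1 minus that value. The sketch below checks this in pure Python on two made-up 3-dimensional vectors (the real embeddings are 384-dimensional); the helper names are illustrative and not part of Lance.

```python
import math

def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Full cosine formula; works for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for two query embeddings (not real data).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

similarity = dot(a, b)              # shortcut is valid only because a and b are unit length
cosine_distance = 1.0 - similarity  # smaller distance means more similar

print(round(similarity, 4), round(cosine_distance, 4))
```

If you embed your own queries without normalize_embeddings=True, the dot-product shortcut no longer holds and the full cosine formula is needed.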

Quick start

import lance

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())

Load with LanceDB

These tables can also be opened with LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, which simplifies vector search and other queries.

import lancedb

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {len(tbl)} queries")

Semantic query lookup

import lance
import pyarrow as pa
from sentence_transformers import SentenceTransformer

# No device argument: sentence-transformers uses a GPU when one is available, else the CPU.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q = encoder.encode(["how to compute determinant of a 3x3 matrix"], normalize_embeddings=True)[0]

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance")
emb_field = ds.schema.field("query_emb")
hits = ds.scanner(
    nearest={
        "column": "query_emb",
        "q": pa.array([q.tolist()], type=emb_field.type)[0],
        "k": 5,
        "nprobes": 16,        # number of IVF partitions to search
        "refine_factor": 30,  # re-rank k * 30 candidates with full vectors
    },
    columns=["query_id", "query", "selected_passage", "answers"],
).to_table().to_pylist()
for h in hits:
    print(h["query"])
    print("  selected:", (h.get("selected_passage") or "")[:120])
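
The nprobes and refine_factor values in the call above trade speed for recall: nprobes sets how many IVF partitions are searched, and refine_factor asks for k * refine_factor approximate candidates that are then re-ranked with full-precision vectors. The toy sketch below illustrates only the re-ranking idea in pure Python. It is not Lance's implementation: the function names, the integer rounding used to mimic quantization error, and the use of squared L2 instead of cosine are all assumptions made for illustration.

```python
def squared_l2(a, b):
    """Exact squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def coarse_distance(a, b):
    """Lossy distance: rounds coordinates to integers first (hypothetical stand-in for PQ)."""
    return squared_l2([round(x) for x in a], [round(x) for x in b])

def refine_search(query, vectors, approx_distance, k, refine_factor):
    """Shortlist k * refine_factor rows by approximate distance, then re-rank exactly."""
    n_candidates = k * refine_factor
    # Step 1: cheap, lossy ranking over all rows.
    shortlist = sorted(
        range(len(vectors)),
        key=lambda i: approx_distance(query, vectors[i]),
    )[:n_candidates]
    # Step 2: exact re-ranking of the shortlist using the original vectors.
    return sorted(shortlist, key=lambda i: squared_l2(query, vectors[i]))[:k]

# Toy 2-d vectors (not real data); row 1 is the true nearest neighbour of the query.
vectors = [[0.4, 0.0], [0.1, 0.0], [3.0, 3.0], [0.45, 0.0]]
query = [0.0, 0.0]

no_refine = refine_search(query, vectors, coarse_distance, k=1, refine_factor=1)
with_refine = refine_search(query, vectors, coarse_distance, k=1, refine_factor=3)
print(no_refine, with_refine)
```

In this toy setup the lossy distance cannot tell rows 0, 1 and 3 apart, so without refinement the first tied row wins; widening the shortlist lets the exact distance pick the true nearest row.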

LanceDB semantic query lookup

import lancedb
from sentence_transformers import SentenceTransformer

# No device argument: sentence-transformers uses a GPU when one is available, else the CPU.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q = encoder.encode(["how to compute determinant of a 3x3 matrix"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search(q.tolist(), vector_column_name="query_emb")
    .metric("cosine")
    .select(["query_id", "query", "selected_passage", "answers"])
    .limit(5)
    .to_list()
)

LanceDB full-text search

import lancedb

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("validation")

results = (
    tbl.search("determinant matrix", query_type="fts")  # use the INVERTED (full-text) index explicitly
    .select(["query", "selected_passage"])
    .limit(10)
    .to_list()
)

Get all candidate passages for a query

import lance

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance")
row = (
    ds.scanner(
        filter="query_id = 1185869",
        columns=["query", "passage_text", "passage_is_selected"],
    )
    .to_table()
    .to_pylist()[0]
)
for text, sel in zip(row["passage_text"], row["passage_is_selected"]):
    print("[selected]" if sel else "[other]", text[:120])
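
For retrieval evaluation, the passage_is_selected flags can be turned into a per-query score. The sketch below splits a row's passages into selected and other, and computes the reciprocal rank of the first selected passage. It assumes the list order of passage_text reflects the original Bing ranking, which the dataset card does not state; the helper names and the sample row are hypothetical.

```python
def split_passages(row):
    """Return (selected, other) passage lists from one dataset row."""
    selected, other = [], []
    for text, flag in zip(row["passage_text"], row["passage_is_selected"]):
        (selected if flag else other).append(text)
    return selected, other

def reciprocal_rank(flags):
    """1 / rank of the first selected passage, or 0.0 if none is selected."""
    for rank, flag in enumerate(flags, start=1):
        if flag:
            return 1.0 / rank
    return 0.0

# Hypothetical row with the same shape as the dataset schema (not real data).
row = {
    "query": "example query",
    "passage_text": ["passage A", "passage B", "passage C"],
    "passage_is_selected": [0, 1, 0],
}

selected, other = split_passages(row)
rr = reciprocal_rank(row["passage_is_selected"])
print(selected, len(other), rr)
```

Averaging reciprocal_rank over many rows gives a mean reciprocal rank for whatever ordering you feed in, for example the output of your own re-ranker instead of the stored order.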

Filter by query_type

import lance

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/train.lance")
numeric = ds.scanner(filter="query_type = 'NUMERIC'", columns=["query"], limit=5).to_table()

Filter by query_type with LanceDB

import lancedb

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("train")
numeric = (
    tbl.search()
    .where("query_type = 'NUMERIC'")
    .select(["query"])
    .limit(5)
    .to_list()
)
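
Since query_type takes one of five known values, it can help to build the filter string from validated input instead of typing it by hand. The helper below is a hypothetical convenience, not part of Lance or LanceDB; it only produces a SQL expression of the kind passed to filter= or .where() in the examples above.

```python
# The five values listed in the schema for the query_type column.
QUERY_TYPES = {"DESCRIPTION", "NUMERIC", "ENTITY", "LOCATION", "PERSON"}

def query_type_filter(*query_types):
    """Build a SQL filter on query_type, rejecting values outside the known set."""
    if not query_types:
        raise ValueError("at least one query type is required")
    unknown = set(query_types) - QUERY_TYPES
    if unknown:
        raise ValueError(f"unknown query_type value(s): {sorted(unknown)}")
    if len(query_types) == 1:
        return f"query_type = '{query_types[0]}'"
    quoted = ", ".join(f"'{t}'" for t in query_types)
    return f"query_type IN ({quoted})"

single = query_type_filter("NUMERIC")
multiple = query_type_filter("PERSON", "LOCATION")
print(single)
print(multiple)
```

The resulting string can be passed unchanged, for example as ds.scanner(filter=single, ...) or tbl.search().where(multiple).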

Why Lance?

  • One dataset carries queries + passages + answers + embeddings + indices — no sidecar files.
  • On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub.
  • Schema evolution: add columns (alternate embeddings, generated answers, model predictions) without rewriting the data.

Source & license

Converted from microsoft/ms_marco (v2.1). MS MARCO is released under the MIT license.

Citation

@article{nguyen2016ms,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}