Lance-formatted version of MS MARCO v2.1 — Microsoft’s machine reading comprehension benchmark — with MiniLM query embeddings stored inline alongside the candidate passages and human-written answers.
Why this version?
- One self-contained Lance dataset with ~900 k queries; each row is a query, the 10 candidate passages retrieved by Bing, the relevance flags, and the human-written reference answers.
- Pre-computed query embeddings (sentence-transformers/all-MiniLM-L6-v2, 384-dim, L2-normalized) with an IVF_PQ index — semantic query lookup without re-embedding.
- Full-text inverted indices on the query and the first selected passage.
- Designed for both retrieval research (use the index) and RAG / answer eval (use the passage list + answers).
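For the retrieval-eval use case, the passage_is_selected flags make recall@k a one-liner: check whether any flagged passage index lands in a model's top-k. A minimal sketch; the sample flags and retrieved indices below are hypothetical, not drawn from the dataset:

```python
def recall_at_k(is_selected, retrieved_idx, k=5):
    """Fraction of relevant passages (flag == 1) found in the top-k retrieved indices."""
    relevant = {i for i, flag in enumerate(is_selected) if flag == 1}
    if not relevant:
        return None  # some queries have no selected passage
    hits = relevant & set(retrieved_idx[:k])
    return len(hits) / len(relevant)

# Hypothetical row: passage 3 is the one Bing flagged relevant
flags = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(recall_at_k(flags, retrieved_idx=[3, 7, 1], k=5))  # 1.0
```

Returning None for rows with no selected passage mirrors the nullable selected_passage column, so those queries can be skipped rather than counted as misses.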
Splits
| Split | Rows |
|---|---|
| train.lance | 808,731 |
| validation.lance | 101,093 |
Schema
| Column | Type | Notes |
|---|---|---|
| query_id | int64 | MS MARCO query id |
| query | string | The user’s natural-language query |
| query_type | string | One of DESCRIPTION, NUMERIC, ENTITY, LOCATION, PERSON |
| answers | list<string> | Human-written reference answers |
| well_formed_answers | list<string> | Reference answers re-written as full sentences |
| passage_text | list<string> | Up to 10 candidate passages |
| passage_url | list<string> | Source URLs for each candidate |
| passage_is_selected | list<int8> | 1 if Bing labelled the passage relevant |
| selected_passage | string? | First relevant passage (null if none) |
| query_emb | fixed_size_list<float32, 384> | MiniLM embedding of the query (L2-normalized) |
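Because query_emb vectors are L2-normalized, cosine similarity between two embeddings reduces to a plain dot product. A quick illustration with toy 2-d vectors (not real MiniLM output):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = l2_normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # -> [0.8, 0.6]

# For unit vectors, cosine similarity == dot product
dot = sum(x * y for x, y in zip(a, b))
print(round(dot, 4))  # 0.96
```

This is why the IVF_PQ index below can use metric=cosine cheaply: no per-query re-normalization is needed at search time.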
Pre-built indices
- IVF_PQ on query_emb — metric=cosine
- INVERTED on query and selected_passage
- BTREE on query_id
- BITMAP on query_type
Quick start
```python
import lance

# Stream the validation split directly from the Hugging Face Hub
ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
```
Load with LanceDB
These tables can also be consumed by LanceDB, the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {tbl.count_rows()} queries")
```
Semantic query lookup
```python
import lance
import pyarrow as pa
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["how to compute determinant of a 3x3 matrix"], normalize_embeddings=True)[0]

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance")
emb_field = ds.schema.field("query_emb")
hits = ds.scanner(
    nearest={
        "column": "query_emb",
        "q": pa.array([q.tolist()], type=emb_field.type)[0],
        "k": 5,
        "nprobes": 16,
        "refine_factor": 30,
    },
    columns=["query_id", "query", "selected_passage", "answers"],
).to_table().to_pylist()

for h in hits:
    print(h["query"])
    print("  selected:", (h.get("selected_passage") or "")[:120])
```
LanceDB semantic query lookup
```python
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
q = encoder.encode(["how to compute determinant of a 3x3 matrix"], normalize_embeddings=True)[0]

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("validation")
results = (
    tbl.search(q.tolist(), vector_column_name="query_emb")
    .metric("cosine")
    .select(["query_id", "query", "selected_passage", "answers"])
    .limit(5)
    .to_list()
)
```
LanceDB full-text search
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("validation")
results = (
    # query_type="fts" targets the inverted indices rather than vector search
    tbl.search("determinant matrix", query_type="fts")
    .select(["query", "selected_passage"])
    .limit(10)
    .to_list()
)
```
Get all candidate passages for a query
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/validation.lance")
row = ds.scanner(
    filter="query_id = 1185869",
    columns=["query", "passage_text", "passage_is_selected"],
).to_table().to_pylist()[0]

for text, sel in zip(row["passage_text"], row["passage_is_selected"]):
    print("[selected]" if sel else "[other]   ", text[:120])
```
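For RAG, the passages for a query are already grouped per row, so assembling a prompt context is just ordering and truncating the two parallel lists. A sketch over a hypothetical row (field names match the schema above; the sample strings are made up):

```python
def build_context(passage_text, passage_is_selected, max_passages=3):
    """Put Bing-flagged passages first, then the rest, capped at max_passages."""
    ranked = sorted(
        zip(passage_text, passage_is_selected),
        key=lambda p: p[1],
        reverse=True,  # selected (1) sorts before unselected (0); sort is stable
    )
    return "\n\n".join(text for text, _ in ranked[:max_passages])

row = {
    "passage_text": ["background", "the answer passage", "noise"],
    "passage_is_selected": [0, 1, 0],
}
print(build_context(row["passage_text"], row["passage_is_selected"], max_passages=2))
# the answer passage
#
# background
```

Python's stable sort keeps the original Bing retrieval order within each group, so unselected passages stay in rank order after the flagged ones.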
Filter by query_type
```python
import lance

ds = lance.dataset("hf://datasets/lance-format/ms-marco-v2.1-lance/data/train.lance")
numeric = ds.scanner(filter="query_type = 'NUMERIC'", columns=["query"], limit=5).to_table()
```
Filter by query_type with LanceDB
```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/ms-marco-v2.1-lance/data")
tbl = db.open_table("train")
numeric = (
    tbl.search()
    .where("query_type = 'NUMERIC'")
    .select(["query"])
    .limit(5)
    .to_list()
)
```
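The filter strings above are SQL-style predicates, so matching several query_type values at once is an IN clause. A small helper for building one safely; the helper itself is hypothetical, only the predicate syntax it emits is what the filters above use:

```python
def query_type_filter(types):
    """Build a SQL-style predicate like "query_type IN ('NUMERIC', 'ENTITY')".
    Single quotes inside values are doubled, per SQL string-literal rules."""
    quoted = ", ".join("'" + t.replace("'", "''") + "'" for t in types)
    return f"query_type IN ({quoted})"

print(query_type_filter(["NUMERIC", "ENTITY"]))
# query_type IN ('NUMERIC', 'ENTITY')
```

Since query_type carries a BITMAP index, predicates like this should be cheap on either split.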
Why Lance?
- One dataset carries queries + passages + answers + embeddings + indices — no sidecar files.
- On-disk vector and full-text indices live next to the data, so search works on local copies and on the Hub.
- Schema evolution: add columns (alternate embeddings, generated answers, model predictions) without rewriting the data.
Source & license
Converted from microsoft/ms_marco (v2.1). MS MARCO is released under the MIT license.
Citation
```bibtex
@article{nguyen2016ms,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}
```