
This example walks through fine-tuning an autonomous vehicle (AV) perception model on targeted failure-mode slices of BDD100K — riders, nighttime pedestrians, and distant pedestrians — using LanceDB as a single multimodal table from raw JPEG bytes through to the PyTorch training loop. The full pipeline lives in the lancedb/training repository. This page focuses on the parts most relevant to training: defining curated splits as materialized views, loading them through the Permutation API, and pinning checkpoints to an exact data version.

What you get

Fine-tuning Faster R-CNN ResNet50 FPN v2 for 10 epochs on each curated slice (batch size 64, AMP, A100), starting from the same COCO-pretrained checkpoint and evaluating on the matching validation view:
| Failure mode | Metric | Baseline (COCO) | Fine-tuned | Δ% |
| --- | --- | --- | --- | --- |
| Nighttime pedestrian | mAP@0.5 | 0.4025 | 0.5192 | +29.0% |
| | Recall | 0.5923 | 0.7570 | +27.8% |
| Rider | mAP@0.5 | 0.5563 | 0.6676 | +20.0% |
| | Recall | 0.6788 | 0.7847 | +15.6% |
| Distant pedestrian | mAP@0.5 | 0.4746 | 0.5788 | +22.0% |
| | Recall | 0.6794 | 0.8024 | +18.1% |
No external data added — only training-distribution correction via SQL filters over a single Lance table. Each panel below shows the same frame with three overlaid predictions: green = ground truth · red = pretrained COCO baseline · blue = fine-tuned model.

Rider detection — ground truth vs baseline vs fine-tuned

Nighttime pedestrian detection — ground truth vs baseline vs fine-tuned

Distant pedestrian detection — ground truth vs baseline vs fine-tuned

The rest of the page walks through the pipeline that produced these checkpoints.

The failure modes

A perception model fine-tuned on a generic dataset typically misses the long-tail scenarios that matter most in deployment. Three common failure modes drive this example:
| Failure mode | Curation signal |
| --- | --- |
| Riders (person on bike/motorcycle) | has_rider = true |
| Nighttime pedestrians | timeofday = 'night' AND has_person = true |
| Distant pedestrians | has_person = true AND person_bbox_area_pct < 30.0 |
Each curated slice becomes a materialized view — a named, refreshable SQL filter over the source table — and the training script loads it by name. New footage flows in through add() → backfill() → refresh(); no manifests, no exports, no reshuffling on disk.

1. Schema

The source table holds raw image bytes alongside structured annotations. Bounding boxes and their attributes are stored as parallel lists (one element per box) rather than a list of nested structs, so they remain directly queryable with SQL.
Python
import pyarrow as pa

BDD_SCHEMA = pa.schema([
    pa.field("image_id",   pa.string()),
    pa.field("split",      pa.string()),         # "train" | "val"
    pa.field("image_bytes", pa.large_binary()),  # raw JPEG
    pa.field("width",      pa.int32()),
    pa.field("height",     pa.int32()),

    # scene metadata
    pa.field("weather",    pa.string()),
    pa.field("scene",      pa.string()),
    pa.field("timeofday",  pa.string()),

    # annotations — parallel lists, one element per box
    pa.field("ann_categories", pa.list_(pa.string())),
    pa.field("ann_bboxes",     pa.list_(pa.list_(pa.float32()))),
    pa.field("ann_occluded",   pa.list_(pa.bool_())),
])
Ingestion streams pa.RecordBatches of raw frames + annotations directly into a Lance table — no intermediate preprocessing job. The table can live on local disk, S3, GCS, or Azure; everything downstream (backfills, views, the training loader) opens it in place via lancedb.connect("s3://...") with no local copy step.
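A minimal ingestion sketch (hedged: iter_bdd_batches() is a hypothetical helper that reads frames and BDD labels and yields pa.RecordBatch objects matching BDD_SCHEMA; the actual ingestion script lives in the repo):
Python
import lancedb

db = lancedb.connect("data/bdd100k/lancedb")   # or an s3:// / gs:// / az:// URI

# Stream RecordBatches straight into the table, no intermediate files.
# iter_bdd_batches() is a hypothetical helper that yields pa.RecordBatch
# objects conforming to BDD_SCHEMA (raw JPEG bytes + annotations).
tbl = db.create_table("bdd100k", data=iter_bdd_batches(), schema=BDD_SCHEMA)
print(tbl.count_rows())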

2. Backfill curation features with Geneva

Curation signals are added as columns on the same table using Geneva UDFs. Backfills are incremental and checkpointed: re-running the command after new footage arrives only computes the new rows.
Python
import pyarrow as pa
from geneva.transformer import udf

# Tier 1 — CPU, derived from annotations alone
@udf(data_type=pa.bool_(), input_columns=["ann_categories"])
def has_rider(ann_categories: list[str]) -> bool:
    return "rider" in (ann_categories or [])

# Tier 2 — GPU, runs a Faster R-CNN to find the largest detected person
# as a percentage of frame area. <30% = a distant or small pedestrian,
# the hard case we want to upweight in training.
@udf(data_type=pa.float32(),
     input_columns=["image_bytes", "width", "height"],
     cuda=True, num_gpus=1)
class PersonBboxAreaPct:
    def __init__(self):
        self._model = None

    def __call__(self, image_bytes, width, height):
        # lazy model load — runs once per Ray worker, then reused
        ...
Run the backfill against the live table:
Python
import geneva

gconn = geneva.connect("data/bdd100k/lancedb")
tbl   = gconn.open_table("bdd100k")

tbl.add_columns({"has_rider": has_rider})
tbl.add_columns({"person_bbox_area_pct": PersonBboxAreaPct()})

with gconn.local_ray_context():
    tbl.backfill("has_rider")
    tbl.backfill("person_bbox_area_pct")
Because the curation features are flat scalar columns on the same table, all four retrieval modes — SQL, full-text search, vector search, and SQL-filtered vector search — work directly without joins or exports. See the Geneva end-to-end example for more on the backfill pattern.
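As an illustration, a SQL-filtered vector search over the curated columns might look like the sketch below. It assumes an embedding column backfilled with the CLIP UDF mentioned under Full source, and query_embedding is a hypothetical CLIP vector for a text or image query:
Python
import lancedb

tbl = lancedb.connect("data/bdd100k/lancedb").open_table("bdd100k")

# Nearest frames to the query embedding, restricted to nighttime rider scenes.
# Assumes an "embedding" column exists (CLIP UDF backfill, see Full source).
hits = (
    tbl.search(query_embedding)
       .where("timeofday = 'night' AND has_rider = true")
       .limit(25)
       .to_arrow()
)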

3. Define training splits as materialized views

A training split is a named SQL filter, not a CSV manifest. Each view stays in sync with the source table and bumps its version on every refresh — the link between a checkpoint and the exact data that produced it.
Python
import geneva

gconn = geneva.connect("data/bdd100k/lancedb")
gtbl  = gconn.open_table("bdd100k")

VIEWS = {
    "bdd100k_rider_train":
        "has_rider = true AND split = 'train'",
    "bdd100k_rider_val":
        "has_rider = true AND split = 'val'",
    "bdd100k_nighttime_person_train":
        "timeofday = 'night' AND has_person = true AND split = 'train'",
    "bdd100k_nighttime_person_val":
        "timeofday = 'night' AND has_person = true AND split = 'val'",
    "bdd100k_distant_person_train":
        "has_person = true AND person_bbox_area_pct < 30.0 AND split = 'train'",
    "bdd100k_distant_person_val":
        "has_person = true AND person_bbox_area_pct < 30.0 AND split = 'val'",
}

with gconn.local_ray_context():
    for name, sql_filter in VIEWS.items():
        query = gtbl.search().where(sql_filter)
        mv = gconn.create_materialized_view(name, query)
        mv.refresh()
        print(f"[{name}]  {mv.count_rows()} rows  (version {mv.version})")

4. PyTorch DataLoader via the Permutation API

The training script doesn’t know about the filter — it opens a view by name and reads through the Permutation API. Each DataLoader worker reopens its own connection lazily, reads Arrow batches directly from Lance (zero-copy, no intermediate file format), and the collate function decodes the whole batch in one pass. Permutation provides random-access indexing over the table, so shuffling is a cheap pointer rewrite rather than a full-dataset shuffle on disk.
Python
import lancedb
import torch
import torchvision.io as tio
from lancedb.permutation import Permutation

DETECTION_COLS = ["image_bytes", "ann_categories", "ann_bboxes"]

class LanceDetectionDataset(torch.utils.data.Dataset):
    def __init__(self, uri: str, table_name: str):
        self.uri, self.table_name = uri, table_name
        self._perm = None
        self.length = len(lancedb.connect(uri).open_table(table_name))

    def __len__(self):
        return self.length

    def __getstate__(self):
        # Permutation holds Rust async state — zero it so each worker reopens
        state = self.__dict__.copy()
        state["_perm"] = None
        return state

    def _ensure_open(self):
        if self._perm is None:
            tbl = lancedb.connect(self.uri).open_table(self.table_name)
            self._perm = (
                Permutation.identity(tbl)
                .select_columns(DETECTION_COLS)
                .with_format("arrow")  # zero-copy
            )

    def __getitems__(self, indices: list[int]):
        self._ensure_open()
        return self._perm.__getitems__(indices)
The collate function decodes JPEG bytes and converts BDD category strings into COCO class IDs (so the comparison against the pretrained checkpoint is valid):
Python
BDD_LABEL_MAP = {
    "person": 1, "rider": 1, "bicycle": 2, "car": 3, "motorcycle": 4,
    "bus": 6, "train": 7, "truck": 8, "traffic light": 10,
}


def detection_collate(batch):
    images, targets = [], []
    for raw, cats, bboxes in zip(
        batch.column("image_bytes").to_pylist(),
        batch.column("ann_categories").to_pylist(),
        batch.column("ann_bboxes").to_pylist(),
    ):
        buf = torch.frombuffer(bytearray(raw), dtype=torch.uint8)
        images.append(tio.decode_image(buf, tio.ImageReadMode.RGB).float() / 255.0)

        valid_boxes, valid_labels = [], []
        for cat, box in zip(cats or [], bboxes or []):
            lid = BDD_LABEL_MAP.get(cat)
            if lid is None or box[2] <= box[0] or box[3] <= box[1]:
                continue
            valid_boxes.append(box)
            valid_labels.append(lid)
        targets.append({
            "boxes":  torch.tensor(valid_boxes  or [], dtype=torch.float32).reshape(-1, 4),
            "labels": torch.tensor(valid_labels or [], dtype=torch.int64),
        })
    return images, targets
Wire it into a standard torch.utils.data.DataLoader:
Python
def make_loader(uri, table_name, batch_size=64, num_workers=8, shuffle=False):
    dataset = LanceDetectionDataset(uri, table_name)
    sampler = torch.utils.data.RandomSampler(dataset) if shuffle else None
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        collate_fn=detection_collate,
        pin_memory=torch.cuda.is_available(),
        persistent_workers=(num_workers > 0),
        multiprocessing_context="spawn" if num_workers > 0 else None,
    )
with_format("arrow") keeps batches as zero-copy pa.RecordBatches — no per-row Python boxing, no pickling between worker and main. Each DataLoader worker reopens its own Permutation after fork (the Rust async handle is cleared in __getstate__), so reads scale with num_workers and stream straight from the underlying object store. JPEG decode overlaps with GPU compute via pin_memory + prefetch_factor, which is what keeps the loader from becoming the bottleneck on a fast GPU.

5. Fine-tune Faster R-CNN

The training loop is plain PyTorch — the Lance integration ends at the loader. Mixed precision is enabled on CUDA for ~2× speedup on Ampere GPUs.
Python
import time
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn_v2, FasterRCNN_ResNet50_FPN_V2_Weights,
)

device  = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"

# COCO pretrained weights — head left intact since BDD uses a subset of COCO IDs
model = fasterrcnn_resnet50_fpn_v2(
    weights=FasterRCNN_ResNet50_FPN_V2_Weights.COCO_V1
).to(device)

train_loader = make_loader("data/bdd100k/lancedb",
                           "bdd100k_rider_train",
                           batch_size=64, num_workers=14, shuffle=True)
val_loader   = make_loader("data/bdd100k/lancedb",
                           "bdd100k_rider_val",
                           batch_size=64, num_workers=14)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.04, momentum=0.9, weight_decay=1e-4,
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
scaler    = torch.cuda.amp.GradScaler() if use_amp else None

for epoch in range(1, 11):
    model.train()
    t0 = time.time()
    for images, targets in train_loader:
        images  = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        if all(t["labels"].numel() == 0 for t in targets):
            continue

        with torch.cuda.amp.autocast(enabled=use_amp):
            losses = sum(model(images, targets).values())

        optimizer.zero_grad()
        if use_amp:
            scaler.scale(losses).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 10.0)
            scaler.step(optimizer)
            scaler.update()
        else:
            losses.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 10.0)
            optimizer.step()

    scheduler.step()
    print(f"epoch {epoch}  ({time.time() - t0:.1f}s)")

6. Pin the checkpoint to a data version

Every Lance table — including a materialized view — exposes a monotonically increasing version. Logging it next to the weights gives a permanent, deterministic link between a checkpoint and the exact data snapshot that produced it.
Python
import json
from pathlib import Path

train_tbl = lancedb.connect("data/bdd100k/lancedb").open_table("bdd100k_rider_train")

out = Path("checkpoints/rider")
out.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), out / "fasterrcnn_bdd_finetuned.pt")
with open(out / "metadata.json", "w") as f:
    json.dump({
        "train_table":   train_tbl.name,
        "table_version": train_tbl.version,
        "row_count":     len(train_tbl),
    }, f, indent=2)
To reproduce a run, time-travel the view to the recorded version before opening the loader:
Python
tbl = lancedb.connect("data/bdd100k/lancedb").open_table("bdd100k_rider_train")
tbl.checkout(version=7)   # exact snapshot the checkpoint was trained on
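Or read the pinned version back from the checkpoint's metadata.json rather than hard-coding it; a small sketch along those lines:
Python
import json
import lancedb
from pathlib import Path

meta = json.loads(Path("checkpoints/rider/metadata.json").read_text())

tbl = lancedb.connect("data/bdd100k/lancedb").open_table(meta["train_table"])
tbl.checkout(version=meta["table_version"])   # same rows the checkpoint saw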

7. Continuous updates

When new footage arrives, the same three calls update every downstream view — no view definitions change, no training-script edits required:
Python
# 1. ingest the new footage into the source table
table.add(new_record_batches)

# 2. backfill computes only the new rows (incremental, checkpointed)
with gconn.local_ray_context():
    tbl.backfill("has_rider")
    tbl.backfill("person_bbox_area_pct")

# 3. refresh appends qualifying new rows to every materialized view
for view_name in gconn.table_names():
    if view_name == "bdd100k":
        continue
    mv = gconn.open_table(view_name)
    before = mv.count_rows()
    mv.refresh()
    print(f"[{view_name}]  {before}{mv.count_rows()} rows  (version {mv.version})")
The next training run picks up the new data automatically — and pins itself to the new version.

Full source

The complete code, including a synthetic-data mode for pipeline verification (--synthetic 500), GPU UDFs for CLIP embeddings and dHash deduplication, and the EDA notebook, is in this GitHub repository.