Advanced Job Configuration

Enterprise-only

On LanceDB Enterprise, backfill and refresh jobs run on a managed, distributed execution environment. Two defaults are configured at deployment time:

the default cluster — the compute pool jobs run on, and
the default manifest — the Python dependency environment (image and packages) the distributed workers run with.

These defaults are set in the LanceDB Helm chart and cover most workloads. When a transform needs dependencies that differ from the deployment default, pin a manifest on the transform itself, as described below.

To override the cluster a job runs on — for example to route an embedding backfill to a GPU pool — see Advanced Execution Contexts.

Pinning a dependency manifest

A manifest pins the Python image and packages the distributed workers run with. Build one with the manifest builders, then attach it to your transform with the manifest= argument on @udf, @chunker, or @udtf. The manifest is snapshotted into the column (or view) metadata when the transform is registered, so every backfill or refresh of that transform uses it automatically — there is no per-call manifest argument to remember.

Manifests are immutable at the column / view level. The manifest is snapshotted onto the column (or view) metadata when the transform is registered, and existing columns and views keep using that snapshot. Changing the deployment-default manifest, or the GenevaManifest object in your code, does not affect them. To move a column or view to a new manifest, re-point it to a new (or updated) UDF / chunker / UDTF: for example, use alter_columns() for a column, or recreate the view.
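The snapshot behaviour can be illustrated with a toy model. This is a minimal sketch in plain Python, not Geneva's implementation: `ToyManifest`, `ToyColumn`, and `register_column` are hypothetical names invented for the illustration, and the only assumption is that registration stores an independent copy of the manifest.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyManifest:
    """Hypothetical stand-in for a dependency manifest (not GenevaManifest)."""
    name: str
    pip: list[str] = field(default_factory=list)


@dataclass
class ToyColumn:
    """Hypothetical column record that holds the manifest snapshot."""
    name: str
    manifest_snapshot: ToyManifest


def register_column(name: str, manifest: ToyManifest) -> ToyColumn:
    # Store a deep copy so later edits to `manifest` cannot leak into the column.
    return ToyColumn(name=name, manifest_snapshot=copy.deepcopy(manifest))


manifest = ToyManifest("embedding-deps", ["torch==2.5.1"])
column = register_column("embedding", manifest)

# Editing the in-code object afterwards leaves the registered snapshot unchanged.
manifest.pip.append("numpy==2.1.0")
unchanged = column.manifest_snapshot.pip == ["torch==2.5.1"]

# Picking up the new dependencies requires registering again (a fresh snapshot).
column = register_column("embedding", manifest)
updated = column.manifest_snapshot.pip == ["torch==2.5.1", "numpy==2.1.0"]
```

In this model, re-registering plays the role that re-pointing a column with alter_columns() or recreating a view plays in a real deployment.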

import pyarrow as pa
from typing import Iterator, NamedTuple
from geneva import udf, chunker, udtf
from geneva.manifest import GenevaManifest

# Build a manifest that pins the dependencies these transforms need
embed_manifest = (
    GenevaManifest.create_pip("embedding-deps")
    .pip(["sentence-transformers==3.3.1", "torch==2.5.1"])
    .build()
)
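Because a manifest is meant to pin dependencies, it can help to check the requirement list before building it. The helper below is a hypothetical, standalone sketch: `unpinned_requirements` is not part of Geneva, it assumes simple `name==version` strings, and it does not attempt to parse full PEP 508 requirement syntax.

```python
def unpinned_requirements(requirements: list[str]) -> list[str]:
    """Return the requirements that are not exact `name==version` pins.

    Hypothetical helper for illustration; handles only simple specifiers.
    """
    unpinned = []
    for req in requirements:
        name, sep, version = req.partition("==")
        # A pin needs a package name, the `==` operator, and a concrete version.
        if not (name.strip() and sep and version.strip()) or "*" in version:
            unpinned.append(req)
    return unpinned


deps = ["sentence-transformers==3.3.1", "torch==2.5.1"]
loose = unpinned_requirements(deps + ["numpy>=1.26", "pandas"])
```

An empty result for `deps` suggests the list is fully pinned; anything returned in `loose` would be resolved to whatever version is current when the environment is built.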

`@udf(manifest=...)`

Pin dependencies for a 1:1 computed column:

@udf(data_type=pa.list_(pa.float32(), 384), manifest=embed_manifest)
def embed(text: str) -> list[float]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    return model.encode(text, normalize_embeddings=True).tolist()

tbl.add_columns({"embedding": embed})
tbl.backfill("embedding")   # the backfill job runs with embed_manifest

`@chunker(manifest=...)`

Pin dependencies for a 1:N chunker (scalar UDTF):

class Chunk(NamedTuple):
    chunk_index: int
    chunk_text: str

@chunker(manifest=embed_manifest)
def split_document(text: str) -> Iterator[Chunk]:
    for i, part in enumerate(text.split("\n\n")):
        yield Chunk(chunk_index=i, chunk_text=part)

view = db.create_udtf_view("chunks", source=tbl.search(None), udtf=split_document)
view.refresh()   # the refresh job runs with embed_manifest

`@udtf(manifest=...)`

Pin dependencies for an N:M batch UDTF:

@udtf(
    output_schema=pa.schema([
        pa.field("label", pa.string()),
        pa.field("count", pa.int64()),
    ]),
    manifest=embed_manifest,
)
def group_stats(source) -> Iterator[pa.RecordBatch]:
    df = source.to_pandas()
    agg = df.groupby("label").size().reset_index(name="count")
    yield pa.RecordBatch.from_pandas(agg)

view = db.create_udtf_view("summaries", source=tbl.search(None), udtf=group_stats)
view.refresh()   # the refresh job runs with embed_manifest

Capturing your local environment for testing

When iterating locally, you often want the workers to run with the exact packages from your current environment rather than a curated pip list. Connection.capture_local_environment() zips your workspace (and, optionally, your site-packages), uploads the archives through the connection, and returns a ready-to-use GenevaManifest you can attach to a transform with manifest=.

import os
import pyarrow as pa
import geneva
from geneva import udf

db = geneva.connect(
    uri="db://my-db",
    host_override=os.getenv("LANCEDB_URI"),
    api_key=os.getenv("LANCEDB_API_KEY"),
)

# Capture the local workspace; rely on the worker image for site-packages
manifest = db.capture_local_environment(skip_site_packages=True)

@udf(data_type=pa.string(), manifest=manifest)
def shout(text: str) -> str:
    return text.upper()

tbl = db.open_table("my_table")
tbl.add_columns({"shout": shout})
tbl.backfill("shout")   # workers run with your captured environment

Pass skip_site_packages=False (the default) to also upload your local site-packages.
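To make "zips your workspace" concrete, here is a rough standard-library sketch of archiving a workspace while leaving out site-packages. It is an illustration only and makes no claim about how capture_local_environment() is implemented: `zip_workspace` and its `skip_dirs` parameter are hypothetical, and the directory layout is invented for the example.

```python
import tempfile
import zipfile
from pathlib import Path

# Directory names to leave out of the archive (an assumption for this sketch).
DEFAULT_SKIP = frozenset({"site-packages", "__pycache__", ".git"})


def zip_workspace(
    workspace: Path, archive: Path, skip_dirs: frozenset[str] = DEFAULT_SKIP
) -> list[str]:
    """Zip files under `workspace`, skipping paths that contain a skipped directory."""
    added = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(workspace.rglob("*")):
            rel = path.relative_to(workspace)
            if path.is_file() and not skip_dirs.intersection(rel.parts):
                zf.write(path, rel.as_posix())
                added.append(rel.as_posix())
    return added


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "workspace"
    (root / "pkg").mkdir(parents=True)
    (root / "pkg" / "transforms.py").write_text("def shout(t): return t.upper()\n")
    (root / "venv" / "site-packages").mkdir(parents=True)
    (root / "venv" / "site-packages" / "dep.py").write_text("x = 1\n")

    # The archive is written outside the workspace so it does not include itself.
    archived = zip_workspace(root, Path(tmp) / "workspace.zip")
```

Skipping site-packages keeps the archive small, at the cost of relying on the worker image to provide every third-party package the transform imports.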

Manifest resolution

For a given transform, the manifest is resolved in this order (first match wins):

1. The manifest pinned on the transform via @udf / @chunker / @udtf manifest=.
2. For a materialized view, the manifest snapshotted on the view when it was created.
3. The deployment-default manifest from the LanceDB Helm chart.
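The first-match-wins order above can be sketched as a small fallback function. This is a conceptual model under stated assumptions, not Geneva's code: `resolve_manifest` is a hypothetical name, and manifests are represented by plain strings to keep the example self-contained.

```python
from typing import Optional


def resolve_manifest(
    transform_manifest: Optional[str],
    view_snapshot: Optional[str],
    deployment_default: str,
) -> str:
    """Return the first manifest that is set, following the documented order."""
    for candidate in (transform_manifest, view_snapshot):
        if candidate is not None:
            return candidate
    # Nothing pinned and no view snapshot: fall back to the deployment default.
    return deployment_default


pinned = resolve_manifest("embedding-deps", "view-snapshot", "helm-default")
from_view = resolve_manifest(None, "view-snapshot", "helm-default")
fallback = resolve_manifest(None, None, "helm-default")
```

The deployment default is the only required argument because, in this model, it is the one source that always exists.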

The manifest= argument applies to managed enterprise (db://) jobs. For direct object-storage or local-filesystem connections, configure the dependency environment explicitly with an Advanced Execution Context instead.
