> ## Documentation Index
> Fetch the complete documentation index at: https://docs.lancedb.com/llms.txt
> Use this file to discover all available pages before exploring further.

# User-Defined Functions (UDFs)

> Define 1:1 transforms that add computed columns to your tables — embeddings, enrichment, scoring, and more.

UDFs are the core building block for feature engineering in Geneva. A UDF wraps a Python function and applies it to every row in a table, producing exactly **one output value per input row** (1:1). Use UDFs to compute embeddings, enrich data with external APIs, transform formats, or derive new features from existing columns.

## Defining a UDF

Converting your Python code to a Geneva UDF is simple.  There are three kinds of UDFs that you can provide — scalar UDFs, batched UDFs and stateful UDFs.

In all cases, Geneva uses Python type hints from your functions to infer the input and output
[arrow data types](https://arrow.apache.org/docs/python/api/datatypes.html) that LanceDB uses.

### Scalar UDFs

The **simplest** form is a scalar UDF, which processes one row at a time:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
from geneva import udf

@udf
def area_udf(x: int, y: int) -> int:
    return x * y
```

This UDF will take the value of `x` and value of `y` from each row and return the product.  The `@udf` wrapper is all that is needed.

### Batched UDFs

For **better performance**, you can also define batch UDFs that process multiple rows at once.

You can use `pyarrow.Array`s:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import pyarrow as pa
from geneva import udf

@udf(data_type=pa.int32())
def batch_filename_len(filename: pa.Array) -> pa.Array:
    lengths = [len(str(f)) for f in filename]
    return pa.array(lengths, type=pa.int32())
```

Or take entire rows using `pyarrow.RecordBatch`:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import pyarrow as pa
from geneva import udf

@udf(data_type=pa.int32())
def recordbatch_filename_len(batch: pa.RecordBatch) -> pa.Array:
    filenames = batch["filename"]
    lengths = [len(str(f)) for f in filenames]
    return pa.array(lengths, type=pa.int32())
```

> **Note**:  Batch UDFS require you to specify `data_type` in the `@udf` decorator for batched UDFs which defines `pyarrow.DataType` of the returned `pyarrow.Array`.

### Struct outputs

A UDF can return multiple related values as a single `struct` column by setting `data_type` to a `pa.struct(...)` and returning a tuple (matched by field order) or a `dict` keyed by field name.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import io
import pyarrow as pa
from geneva import udf

@udf(
    data_type=pa.struct(
        [pa.field("width", pa.int32()), pa.field("height", pa.int32())]
    ),
)
def dimensions(image: bytes) -> tuple[int, int]:
    """Extract image dimensions (width, height)."""
    from PIL import Image

    img = Image.open(io.BytesIO(image))
    return img.size
```

Downstream UDFs can then read individual fields via dot notation in `input_columns` (see below).

### Struct fields and list inputs

You can pass nested `struct` fields directly into a UDF by specifying `input_columns` with dot notation. For list-typed inputs, Geneva can pass a NumPy array when the argument is annotated as `np.ndarray` (use `np.ndarray | None` for nullable lists).

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import numpy as np
import pyarrow as pa
from geneva import udf

struct_type = pa.struct([("vals", pa.list_(pa.int32()))])
schema = pa.schema([pa.field("info", struct_type)])

@udf(data_type=pa.int32(), input_columns=["info.vals"])
def sum_vals(vals: np.ndarray | None) -> int | None:
    if vals is None:
        return None
    assert isinstance(vals, np.ndarray)
    return int(np.sum(vals))
```

### Stateful UDFs

You can also define a **stateful** UDF that retains its state across calls.

This can be used to share code and **parameterize your UDFs**.  In the example below, the model being used is a parameter that can be specified at UDF registration time.  It can also be used to parameterize input column names of `pa.RecordBatch` batch UDFS.

This also can be used to **optimize expensive initialization** that may require heavy resources on the distributed workers.  For example, this can be used to load a model to the GPU once for all records sent to a worker instead of once per record or per batch of records.

A stateful UDF is a `Callable` class, with `__call__()` method.  The call method can be a scalar function or a batched function.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
from typing import Callable
from openai import OpenAI

@udf(data_type=pa.list_(pa.float32(), 1536))
class OpenAIEmbedding(Callable):
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        # Per-worker openai client
        self.client: OpenAI | None = None

    def __call__(self, text: str) -> pa.Array:
        if self.client is None:
            self.client = OpenAI()

        resp = self.client.embeddings.create(model=self.model, input=text)
        return pa.array(resp.data[0].embeddings)
```

<Tip>
  For common providers like OpenAI and Gemini, Geneva ships [built-in UDFs](/geneva/udfs/providers) that handle API keys, retries, and batching for you — no custom class needed.
</Tip>

> **Note**:  The state is will be independently managed on each distributed Worker.

## UDF options

The `udf` can have extra annotations that specify resource requirements and operational characteristics.
These are just add parameters to the `udf(...)`.

### Resource requirements for UDFs

Some workers may require specific resources such as gpus, cpus and certain amounts of RAM.

You can provide these requirements by adding `num_cpus`, `num_gpus`, and `memory` parameters to the UDF.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(..., num_cpus=1, num_gpus=0.5, memory = 4 * 1024**3) # require 1 CPU, 0.5 GPU, and 4GiB RAM
def func(...):
    ...
```

### Operational parameters for UDFs

#### checkpoint\_size

`checkpoint_size` controls how many rows are processed before checkpointing, and therefore reporting and saving progress.

UDFs can be quite varied: some can be simple operations where thousands of calls can be completed per second, while others may be slow and require 30s per row. So a simple default like "every 1000 rows" might write once a second or once every 8 hours!

Geneva will handle this internally, using an experimental feature that will adapt checkpoint sizing as a UDF progresses. However, if you want to see writes more or less frequently, you can set this manually. There are three parameters:

* `checkpoint_size`: the seed for the initial checkpoint size
* `min_checkpoint_size`: the minimum value that Geneva will use while adapting checkpoint size
* `max_checkpoint_size`: the maximum value that Geneva will use while adapting checkpoint size

Therefore, to force a checkpoint size (and effectively disable adaptive batch sizing), set all three of these parameters to the same value.

### Error handling

Depending on the UDF, you may want Geneva to ignore rows that hit failures, retry, or fail the entire job. For simple cases, Geneva provides a simple parameter, `on_error`, with the following options:

| Function            | Behavior                                                                    |
| ------------------- | --------------------------------------------------------------------------- |
| `retry_transient()` | Retry `ConnectionError`, `TimeoutError`, `OSError` with exponential backoff |
| `retry_all()`       | Retry any exception with exponential backoff                                |
| `skip_on_error()`   | Return `None` for any exception (skip the row)                              |
| `fail_fast()`       | Fail immediately on any exception (default behavior)                        |

If those are not specific enough, Geneva also provides [many more error handling options](/geneva/udfs/error_handling).

## Registering Features with UDFs

Registering a feature is done by providing the `Table.add_columns()` function a new column name and the Geneva UDF.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import geneva
import numpy as np
import pyarrow as pa

lancedb_uri="gs://bucket/db"
db = geneva.connect(lancedb_uri)

# Define schema for the video table
schema = pa.schema([
    ("filename", pa.string()),
    ("duration_sec", pa.float32()),
    ("x", pa.int32()),
    ("y", pa.int32()),
])
tbl = db.create_table("videos", schema=schema, mode="overwrite")

# Generate fake data
N = 10
data = {
    "filename": [f"video_{i}.mp4" for i in range(N)],
    "duration_sec": np.random.uniform(10, 300, size=N).astype(np.float32),
    "x": np.random.choice([640, 1280, 1920], size=N),
    "y": np.random.choice([360, 720, 1080], size=N),
    "caption": [f"this is video {i}" for i in range(N)]
}

# Convert to Arrow Table and add to LanceDB
batch = pa.table(data, schema=schema)
tbl.add(batch)
```

Here's how to register a simple UDF:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf
def area_udf(x: int, y: int) -> int:
    return x * y

@udf
def download_udf(filename: str) -> bytes:
    ...

# {'new column name': <udf>, ...}
# simple_udf's arguments are `x` and `y` so the input columns are
# inferred to be columns `x` amd `y`
tbl.add_columns({"area": area_udf, "content": download_udf })
```

### Registering Multi-Output UDFs

Use a multi-output UDF when one expensive read or decode can produce several features. For example, if a table stores image bytes, a single UDF can open the image once and return `height`, `width`, and an embedding column together. This avoids separate UDFs that would each read or decode the same image.

Define the output shape with `typing.NamedTuple` and annotate the UDF return type as `geneva.Columns[YourNamedTuple]`. Passing that UDF directly to [`Table.add_columns()`](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.add_columns) expands the result into multiple sibling columns using the `NamedTuple` field names.

If those names need a namespace or would conflict with existing columns, wrap the UDF with `geneva.UnpackedUDF(udf, prefix="...")` before calling `add_columns()`. The prefix is added to each materialized column name while keeping the outputs in one logical feature group.

Manage multi-output sibling columns as a group. Backfill, drop, or alter the full group together instead of changing only one sibling column.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import io
from typing import NamedTuple

import geneva
from PIL import Image

db = geneva.connect("/data/mydb")
tbl = db.open_table("images")


class ImageFeatures(NamedTuple):
    height: int
    width: int
    embedding: list[float]


@geneva.udf
def image_features(image: bytes) -> geneva.Columns[ImageFeatures]:
    img = Image.open(io.BytesIO(image))  # Read and decode the image once.
    embedding = embedding_model.encode(img)
    return ImageFeatures(
        height=img.height,
        width=img.width,
        embedding=embedding,
    )


# Adds sibling columns named "height", "width", and "embedding".
tbl.add_columns(image_features)

# Or add the same outputs with a prefix to avoid name conflicts.
tbl.add_columns(geneva.UnpackedUDF(image_features, prefix="image_"))
```

Batched UDFs require return type in their `udf` annotations

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.int32())
def batch_filename_len(filename: pa.Array) -> pa.Array:
    ...

# {'new column name': <udf>}
# batch_filename_len's input, `filename` input column is
# specified by the UDF's argument name.
tbl.add_columns({"filename_len": batch_filename_len})
```

or

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.int32())
def recordbatch_filename_len(batch: pa.RecordBatch) -> pa.Array:
    ...

# {'new column name': <udf>}
# batch_filename_len's input.  pa.RecordBatch typed UDF
# argument pulls in all the column values for each row.
tbl.add_columns({"filename_len": recordbatch_filename_len})
```

Similarly, a stateful UDF is registered by providing an instance of the Callable object.  The call method may be a per-record function or a batch function.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.list_(pa.float32(), 1536))
class OpenAIEmbedding(Callable):
    ...
    def __call__(self, text: str) -> pa.Array:
        ...

# OpenAIEmbedding's call method input is inferred to be 'text' of
# type string from the __call__'s arguments, and its output type is
# a fixed size list of float32.
tbl.add_columns({"embedding": OpenAIEmbedding()})
```

## Changing data in computed columns

Let's say you backfilled data with your UDF then you noticed that your data has some issues.  Here are a few scenarios:

1. All the values are incorrect due to a bug in the UDF.
2. Most values are correct but some values are incorrect due to a failure in UDF execution.
3. Values calculated correctly and you want to perform a second pass to fixup some of the values.

In scenario 1, you'll most likely want to replace the UDF with a new version and recalculate all the values.  You should perform a `alter_table` and then `backfill`.

In scenario 2, you'll most likely want to re-execute `backfill` to fill in the values.  If the error is in your code (certain cases not handled), you can modify the UDF, and perform an `alter_table`, and then `backfill` with some filters.

In scenario 3, you have a few options. A) You could `alter` your UDF and include the fixup operations in the UDF.  You'd `alter_table` and then `backfill` recalculating all the values.  B) You could have a chain of computed columns -- create a new column, calculate the "fixed" up values and have your application use the new column or a combination of the original column.  This is similar to A but does not recalculate A and can incur more storage.   C) You could `update` the values in the column with the fixed up values.  This may be expedient but also sacrifices reproducibility.

The next section shows you how to change your column definition by `alter`ing the UDF.

## Altering UDFs

You now want to revise the code.  To make the change, you'd update the UDF used to compute the column using the `alter_columns` API and the updated function.  The example below replaces the definition of column `area` to use the `area_udf_v2` function.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
table.alter_columns({"path": "area", "udf": area_udf_v2} )
```

After making this change, the existing data already in the table does not change. However, when you perform your next basic `backfill` operation, all values would be recalculated and updated. If you only wanted some rows updated, you could perform a filtered backfill, targeting the specific rows that need the new upates.

For example, this filter would only update the rows where area was currently null.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
table.backfill("area", where="area is null")
```

## Auto-backfill

For columns whose values should always stay in sync with their source data, set
`auto_backfill=True` on the UDF. On LanceDB Enterprise (`db://` connections), the column is
then recomputed for you automatically — you don't need to call `backfill()` yourself.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.int32(), version="2", auto_backfill=True)
def area_udf(x: int, y: int) -> int:
    return x * y

# Register a new auto-backfill column...
tbl.add_columns({"area": area_udf})

# ...or re-point an existing column to an auto-backfill UDF
tbl.alter_columns({"path": "area", "udf": area_udf})
```

### How it works

The `auto_backfill` flag is recorded in the column's metadata when the column is added or
altered. LanceDB Enterprise's managed agent watches for columns that need recomputation and
dispatches a [distributed backfill job](/geneva/jobs/backfilling/) automatically — there is no
manual trigger and no status polling. A column is recomputed when, for example:

* **New rows are added** to the table (`tbl.add(...)`), leaving the column null for those rows.
* **The UDF version changes** — you bump `version=` and `alter_columns()` to the new function.

<Note>
  Auto-backfill is an enterprise feature. On direct object-storage or local-filesystem
  connections there is no managed agent, so `auto_backfill=True` has no effect and you must run
  `backfill()` explicitly.
</Note>

Reference:

* [`alter_columns` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.alter_columns)
* [`add_columns` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.add_columns)
* [UDF](https://lancedb.github.io/geneva/api/udf/) — full `@udf` decorator reference including `data_type`, `num_gpus`, `auto_backfill`, `batch_size`, and other options
