> ## Documentation Index
> Fetch the complete documentation index at: https://docs.lancedb.com/llms.txt
> Use this file to discover all available pages before exploring further.

# User-Defined Functions (UDFs)

> Define 1:1 transforms that add computed columns to your tables — embeddings, enrichment, scoring, and more.

UDFs are the core building block for feature engineering in Geneva. A UDF wraps a Python function and applies it to every row in a table, producing exactly **one output value per input row** (1:1). Use UDFs to compute embeddings, enrich data with external APIs, transform formats, or derive new features from existing columns.

## Defining a UDF

Converting your Python code to a Geneva UDF is simple.  There are three kinds of UDFs that you can provide — scalar UDFs, batched UDFs and stateful UDFs.

In all cases, Geneva uses Python type hints from your functions to infer the input and output
[arrow data types](https://arrow.apache.org/docs/python/api/datatypes.html) that LanceDB uses.

### Scalar UDFs

The **simplest** form is a scalar UDF, which processes one row at a time:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
from geneva import udf

@udf
def area_udf(x: int, y: int) -> int:
    return x * y
```

This UDF will take the value of `x` and value of `y` from each row and return the product.  The `@udf` wrapper is all that is needed.

### Batched UDFs

For **better performance**, you can also define batch UDFs that process multiple rows at once.

You can use `pyarrow.Array`s:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import pyarrow as pa
from geneva import udf

@udf(data_type=pa.int32())
def batch_filename_len(filename: pa.Array) -> pa.Array:
    lengths = [len(str(f)) for f in filename]
    return pa.array(lengths, type=pa.int32())
```

Or take entire rows using `pyarrow.RecordBatch`:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import pyarrow as pa
from geneva import udf

@udf(data_type=pa.int32())
def recordbatch_filename_len(batch: pa.RecordBatch) -> pa.Array:
    filenames = batch["filename"]
    lengths = [len(str(f)) for f in filenames]
    return pa.array(lengths, type=pa.int32())
```

> **Note**:  Batch UDFS require you to specify `data_type` in the `@udf` decorator for batched UDFs which defines `pyarrow.DataType` of the returned `pyarrow.Array`.

### Struct outputs

A UDF can return multiple related values as a single `struct` column by setting `data_type` to a `pa.struct(...)` and returning a tuple (matched by field order) or a `dict` keyed by field name.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import io
import pyarrow as pa
from geneva import udf

@udf(
    data_type=pa.struct(
        [pa.field("width", pa.int32()), pa.field("height", pa.int32())]
    ),
)
def dimensions(image: bytes) -> tuple[int, int]:
    """Extract image dimensions (width, height)."""
    from PIL import Image

    img = Image.open(io.BytesIO(image))
    return img.size
```

Downstream UDFs can then read individual fields via dot notation in `input_columns` (see below).

### Struct fields and list inputs

You can pass nested `struct` fields directly into a UDF by specifying `input_columns` with dot notation. For list-typed inputs, Geneva can pass a NumPy array when the argument is annotated as `np.ndarray` (use `np.ndarray | None` for nullable lists).

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import numpy as np
import pyarrow as pa
from geneva import udf

struct_type = pa.struct([("vals", pa.list_(pa.int32()))])
schema = pa.schema([pa.field("info", struct_type)])

@udf(data_type=pa.int32(), input_columns=["info.vals"])
def sum_vals(vals: np.ndarray | None) -> int | None:
    if vals is None:
        return None
    assert isinstance(vals, np.ndarray)
    return int(np.sum(vals))
```

### Stateful UDFs

You can also define a **stateful** UDF that retains its state across calls.

This can be used to share code and **parameterize your UDFs**.  In the example below, the model being used is a parameter that can be specified at UDF registration time.  It can also be used to parameterize input column names of `pa.RecordBatch` batch UDFS.

This also can be used to **optimize expensive initialization** that may require heavy resources on the distributed workers.  For example, this can be used to load a model to the GPU once for all records sent to a worker instead of once per record or per batch of records.

A stateful UDF is a `Callable` class, with `__call__()` method.  The call method can be a scalar function or a batched function.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
from typing import Callable
from openai import OpenAI

@udf(data_type=pa.list_(pa.float32(), 1536))
class OpenAIEmbedding(Callable):
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        # Per-worker openai client
        self.client: OpenAI | None = None

    def __call__(self, text: str) -> pa.Array:
        if self.client is None:
            self.client = OpenAI()

        resp = self.client.embeddings.create(model=self.model, input=text)
        return pa.array(resp.data[0].embeddings)
```

<Tip>
  For common providers like OpenAI and Gemini, Geneva ships [built-in UDFs](/geneva/udfs/providers) that handle API keys, retries, and batching for you — no custom class needed.
</Tip>

> **Note**:  The state is will be independently managed on each distributed Worker.

## UDF options

The `udf` can have extra annotations that specify resource requirements and operational characteristics.
These are just add parameters to the `udf(...)`.

### Resource requirements for UDFs

Some workers may require specific resources such as gpus, cpus and certain amounts of RAM.

You can provide these requirements by adding `num_cpus`, `num_gpus`, and `memory` parameters to the UDF.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(..., num_cpus=1, num_gpus=0.5, memory = 4 * 1024**3) # require 1 CPU, 0.5 GPU, and 4GiB RAM
def func(...):
    ...
```

### Operational parameters for UDFs

#### checkpoint\_size

`checkpoint_size` controls how many rows are processed before checkpointing, and therefore reporting and saving progress.

UDFs can be quite varied: some can be simple operations where thousands of calls can be completed per second, while others may be slow and require 30s per row. So a simple default like "every 1000 rows" might write once a second or once every 8 hours!

Geneva will handle this internally, using an experimental feature that will adapt checkpoint sizing as a UDF progresses. However, if you want to see writes more or less frequently, you can set this manually. There are three parameters:

* `checkpoint_size`: the seed for the initial checkpoint size
* `min_checkpoint_size`: the minimum value that Geneva will use while adapting checkpoint size
* `max_checkpoint_size`: the maximum value that Geneva will use while adapting checkpoint size

Therefore, to force a checkpoint size (and effectively disable adaptive batch sizing), set all three of these parameters to the same value.

### Error handling

Depending on the UDF, you may want Geneva to ignore rows that hit failures, retry, or fail the entire job. For simple cases, Geneva provides a simple parameter, `on_error`, with the following options:

| Function            | Behavior                                                                    |
| ------------------- | --------------------------------------------------------------------------- |
| `retry_transient()` | Retry `ConnectionError`, `TimeoutError`, `OSError` with exponential backoff |
| `retry_all()`       | Retry any exception with exponential backoff                                |
| `skip_on_error()`   | Return `None` for any exception (skip the row)                              |
| `fail_fast()`       | Fail immediately on any exception (default behavior)                        |

If those are not specific enough, Geneva also provides [many more error handling options](/geneva/udfs/error_handling).

## Registering Features with UDFs

Registering a feature is done by providing the `Table.add_columns()` function a new column name and the Geneva UDF.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
import geneva
import numpy as np
import pyarrow as pa

lancedb_uri="gs://bucket/db"
db = geneva.connect(lancedb_uri)

# Define schema for the video table
schema = pa.schema([
    ("filename", pa.string()),
    ("duration_sec", pa.float32()),
    ("x", pa.int32()),
    ("y", pa.int32()),
])
tbl = db.create_table("videos", schema=schema, mode="overwrite")

# Generate fake data
N = 10
data = {
    "filename": [f"video_{i}.mp4" for i in range(N)],
    "duration_sec": np.random.uniform(10, 300, size=N).astype(np.float32),
    "x": np.random.choice([640, 1280, 1920], size=N),
    "y": np.random.choice([360, 720, 1080], size=N),
    "caption": [f"this is video {i}" for i in range(N)]
}

# Convert to Arrow Table and add to LanceDB
batch = pa.table(data, schema=schema)
tbl.add(batch)
```

Here's how to register a simple UDF:

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf
def area_udf(x: int, y: int) -> int:
    return x * y

@udf
def download_udf(filename: str) -> bytes:
    ...

# {'new column name': <udf>, ...}
# simple_udf's arguments are `x` and `y` so the input columns are
# inferred to be columns `x` amd `y`
tbl.add_columns({"area": area_udf, "content": download_udf })
```

Batched UDFs require return type in their `udf` annotations

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.int32())
def batch_filename_len(filename: pa.Array) -> pa.Array:
    ...

# {'new column name': <udf>}
# batch_filename_len's input, `filename` input column is
# specified by the UDF's argument name.
tbl.add_columns({"filename_len": batch_filename_len})
```

or

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.int32())
def recordbatch_filename_len(batch: pa.RecordBatch) -> pa.Array:
    ...

# {'new column name': <udf>}
# batch_filename_len's input.  pa.RecordBatch typed UDF
# argument pulls in all the column values for each row.
tbl.add_columns({"filename_len": recordbatch_filename_len})
```

Similarly, a stateful UDF is registered by providing an instance of the Callable object.  The call method may be a per-record function or a batch function.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
@udf(data_type=pa.list_(pa.float32(), 1536))
class OpenAIEmbedding(Callable):
    ...
    def __call__(self, text: str) -> pa.Array:
        ...

# OpenAIEmbedding's call method input is inferred to be 'text' of
# type string from the __call__'s arguments, and its output type is
# a fixed size list of float32.
tbl.add_columns({"embedding": OpenAIEmbedding()})
```

## Changing data in computed columns

Let's say you backfilled data with your UDF then you noticed that your data has some issues.  Here are a few scenarios:

1. All the values are incorrect due to a bug in the UDF.
2. Most values are correct but some values are incorrect due to a failure in UDF execution.
3. Values calculated correctly and you want to perform a second pass to fixup some of the values.

In scenario 1, you'll most likely want to replace the UDF with a new version and recalculate all the values.  You should perform a `alter_table` and then `backfill`.

In scenario 2, you'll most likely want to re-execute `backfill` to fill in the values.  If the error is in your code (certain cases not handled), you can modify the UDF, and perform an `alter_table`, and then `backfill` with some filters.

In scenario 3, you have a few options. A) You could `alter` your UDF and include the fixup operations in the UDF.  You'd `alter_table` and then `backfill` recalculating all the values.  B) You could have a chain of computed columns -- create a new column, calculate the "fixed" up values and have your application use the new column or a combination of the original column.  This is similar to A but does not recalculate A and can incur more storage.   C) You could `update` the values in the column with the fixed up values.  This may be expedient but also sacrifices reproducibility.

The next section shows you how to change your column definition by `alter`ing the UDF.

## Altering UDFs

You now want to revise the code.  To make the change, you'd update the UDF used to compute the column using the `alter_columns` API and the updated function.  The example below replaces the definition of column `area` to use the `area_udf_v2` function.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
table.alter_columns({"path": "area", "udf": area_udf_v2} )
```

After making this change, the existing data already in the table does not change. However, when you perform your next basic `backfill` operation, all values would be recalculated and updated. If you only wanted some rows updated, you could perform a filtered backfill, targeting the specific rows that need the new upates.

For example, this filter would only update the rows where area was currently null.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
table.backfill("area", where="area is null")
```

Reference:

* [`alter_columns` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.alter_columns)
* [`add_columns` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.add_columns)
* [UDF](https://lancedb.github.io/geneva/api/udf/) — full `@udf` decorator reference including `data_type`, `num_gpus`, `batch_size`, and other options
