# Scalar User-Defined Table Functions (UDTFs)

> Use scalar UDTFs for 1:N row expansion — split videos into clips, chunk documents, or tile images with automatic parent column inheritance and incremental refresh.

export const PyDocumentChunkingFull = "from geneva import connect, scalar_udtf, udf\nfrom typing import Iterator, NamedTuple\nimport pyarrow as pa\n\nclass Chunk(NamedTuple):\n    chunk_index: int\n    chunk_text: str\n\n@scalar_udtf\ndef chunk_document(text: str) -> Iterator[Chunk]:\n    \"\"\"Split a document into overlapping chunks.\"\"\"\n    words = text.split()\n    chunk_size = 500\n    overlap = 50\n    for i, start in enumerate(range(0, len(words), chunk_size - overlap)):\n        chunk_words = words[start:start + chunk_size]\n        yield Chunk(chunk_index=i, chunk_text=\" \".join(chunk_words))\n\ndb = connect(\"/data/mydb\")\ndocs = db.open_table(\"documents\")\n\n# Create chunked view — inherits doc_id, title, etc. from source\nchunks = db.create_scalar_udtf_view(\n    \"doc_chunks\",\n    source=docs.search(None).select([\"doc_id\", \"title\", \"text\"]),\n    scalar_udtf=chunk_document,\n)\nchunks.refresh()\n\n# Add embeddings to chunks for semantic search\n@udf(data_type=pa.list_(pa.float32(), 1536))\ndef embed_text(chunk_text: str) -> list[float]:\n    return embedding_model.encode(chunk_text)\n\nchunks.add_columns({\"embedding\": embed_text})\nchunks.backfill(\"embedding\")  # Backfills embeddings on all existing chunks\n\n# Query — parent columns available alongside chunk columns\nchunks.search(None).select([\"doc_id\", \"title\", \"chunk_text\", \"embedding\"]).to_pandas()\n";

export const PyChainingUdtfViews = "# videos → clips (1:N)\nclips = db.create_scalar_udtf_view(\n    \"clips\", source=videos.search(None), scalar_udtf=extract_clips\n)\n\n# clips → frames (1:N)\nframes = db.create_scalar_udtf_view(\n    \"frames\", source=clips.search(None), scalar_udtf=extract_frames\n)\n";

export const PyIncrementalRefresh = "# Add new videos to the source table\nvideos.add(new_video_data)\n\n# Incremental refresh — only processes the new videos\nclips.refresh()\n";

export const PyAddColumnsScalarUdtf = "@udf(data_type=pa.list_(pa.float32(), 512))\ndef clip_embedding(clip_bytes: bytes) -> list[float]:\n    return embed_model.encode(clip_bytes)\n\n# Add an embedding column to the clips table\nclips.add_columns({\"embedding\": clip_embedding})\n\n# Backfill computes embeddings for all existing clips\nclips.backfill(\"embedding\")\n";

export const PyCreateScalarUdtfView = "import geneva\n\ndb = geneva.connect(\"/data/mydb\")\nvideos = db.open_table(\"videos\")\n\n# Create the 1:N materialized view\nclips = db.create_scalar_udtf_view(\n    \"clips\",\n    source=videos.search(None).select([\"video_path\", \"metadata\"]),\n    scalar_udtf=extract_clips,\n)\n\n# Populate — runs the UDTF on every source row\nclips.refresh()\n";

export const PyScalarUdtfBatch = "# clip_schema describes the expanded output rows and must be defined before this point\n@scalar_udtf(batch=True, output_schema=clip_schema)\ndef extract_clips(batch: pa.RecordBatch) -> pa.RecordBatch:\n    \"\"\"Process rows in batches. Same 1:N semantics per row.\"\"\"\n    ...\n";

export const PyScalarUdtfList = "@scalar_udtf\ndef extract_clips(video_path: str, duration: float) -> list[Clip]:\n    clips = []\n    for start in range(0, int(duration), 10):\n        end = min(start + 10, duration)\n        clips.append(Clip(clip_start=start, clip_end=end, clip_bytes=b\"...\"))\n    return clips\n";

export const PyScalarUdtfIterator = "from geneva import scalar_udtf\nfrom typing import Iterator, NamedTuple\n\nclass Clip(NamedTuple):\n    clip_start: float\n    clip_end: float\n    clip_bytes: bytes\n\n@scalar_udtf\ndef extract_clips(video_path: str, duration: float) -> Iterator[Clip]:\n    \"\"\"Yields multiple clips per video.\"\"\"\n    clip_length = 10.0\n    for start in range(0, int(duration), int(clip_length)):\n        end = min(start + clip_length, duration)\n        clip_data = extract_video_segment(video_path, start, end)\n        yield Clip(clip_start=start, clip_end=end, clip_bytes=clip_data)\n";

<Badge>Beta — introduced in Geneva 0.11.0</Badge>

Standard UDFs produce exactly **one output value per input row**. Scalar UDTFs enable **1:N row expansion**: each source row can produce multiple output rows. The results are stored as a materialized view that refreshes incrementally, like other materialized views.

| Source Table   | Derived Table  | Expansion          |
| -------------- | -------------- | ------------------ |
| 1 video row    | → N clip rows  | Video segmentation |
| 1 document row | → N chunk rows | Text chunking      |
| 1 image row    | → N tile rows  | Image tiling       |

## Defining a Scalar UDTF

Use the `@scalar_udtf` decorator on a function that **yields** output rows. Geneva infers the output schema from the return type annotation.

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyScalarUdtfIterator}
  </CodeBlock>
</CodeGroup>

Input parameters are bound to source columns **by name** — the parameter `video_path` binds to source column `video_path`, just like standard UDFs.

<Tip>
  A scalar UDTF can yield **zero rows** for a source row. The source row is still marked as processed and will not be retried on the next refresh.
</Tip>
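As a minimal sketch of the zero-row case, the generator below returns early for very short videos. It is plain Python: the `@scalar_udtf` decorator is left off so the snippet runs without Geneva installed, and both the `min_duration` threshold and the trimmed-down `Clip` type (no `clip_bytes`) are assumptions made for illustration.

```python
from typing import Iterator, NamedTuple


class Clip(NamedTuple):
    clip_start: float
    clip_end: float


def extract_clips(video_path: str, duration: float) -> Iterator[Clip]:
    """Yield 10-second clips; yield nothing for videos under 1 second."""
    min_duration = 1.0  # hypothetical threshold, not a Geneva setting
    if duration < min_duration:
        return  # zero rows for this source row
    clip_length = 10.0
    start = 0.0
    while start < duration:
        yield Clip(clip_start=start, clip_end=min(start + clip_length, duration))
        start += clip_length


short = list(extract_clips("/v/tiny.mp4", 0.5))  # empty list
normal = list(extract_clips("/v/a.mp4", 25.0))   # three clips, the last one shorter
```

Returning before the first `yield` is enough to produce an empty result; no sentinel row is needed.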

### List return pattern

If you prefer to build the full list in memory rather than yielding, you can return a `list` instead of an `Iterator`:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyScalarUdtfList}
  </CodeBlock>
</CodeGroup>

### Batched scalar UDTF

For vectorized processing, use `batch=True`. The function receives a batch of source rows as Arrow data and returns a `pa.RecordBatch` of expanded rows. Because the output schema cannot be inferred from a `pa.RecordBatch` return annotation, you must supply `output_schema` explicitly:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyScalarUdtfBatch}
  </CodeBlock>
</CodeGroup>

## Creating a Scalar UDTF View

Scalar UDTFs use the `create_scalar_udtf_view` API:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyCreateScalarUdtfView}
  </CodeBlock>
</CodeGroup>

The query passed as `source` controls which source columns are inherited. Columns listed in `.select()` are carried into every child row automatically.

## Inherited Columns

Child rows automatically include the parent's columns — no manual join required. The columns available in the child table are determined by the query's `.select()`:

### `videos` table (source)

| video\_path | duration | metadata   |
| ----------- | -------- | ---------- |
| /v/a.mp4    | 120.0    | \{fps: 30} |
| /v/b.mp4    | 60.0     | \{fps: 24} |

### `clips` table (derived, 1:N)

| video\_path | metadata   | clip\_start | clip\_end | clip\_bytes    |
| ----------- | ---------- | ----------- | --------- | -------------- |
| /v/a.mp4    | \{fps: 30} | 0.0         | 10.0      | b"\x00\x1a..." |
| /v/a.mp4    | \{fps: 30} | 10.0        | 20.0      | b"\x00\x2b..." |
| /v/a.mp4    | \{fps: 30} | 20.0        | 30.0      | b"\x00\x3c..." |
|             |            |             |           |                |
| /v/b.mp4    | \{fps: 24} | 0.0         | 10.0      | b"\x00\x4d..." |
| /v/b.mp4    | \{fps: 24} | 10.0        | 20.0      | b"\x00\x5e..." |

The first three rows come from the `/v/a.mp4` source row, the last two from `/v/b.mp4`. Inherited columns (`video_path`, `metadata`) are carried over automatically; `clip_start`, `clip_end`, and `clip_bytes` are generated by the UDTF.
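One way to picture the inheritance is as a nested loop that copies the selected parent columns into every row the UDTF produces. The sketch below is a conceptual model in plain Python, not Geneva's implementation; the `expand` helper and the shortened durations are assumptions chosen so the result matches the five rows in the table above.

```python
from typing import Iterator

# Toy source rows; durations are shortened so the output stays small.
videos = [
    {"video_path": "/v/a.mp4", "duration": 30.0, "metadata": {"fps": 30}},
    {"video_path": "/v/b.mp4", "duration": 20.0, "metadata": {"fps": 24}},
]


def extract_clips(video_path: str, duration: float) -> Iterator[dict]:
    """Stand-in for the UDTF: one dict per 10-second clip."""
    start = 0.0
    while start < duration:
        yield {"clip_start": start, "clip_end": min(start + 10.0, duration)}
        start += 10.0


def expand(rows: list[dict], inherited: list[str]) -> list[dict]:
    """Conceptual 1:N expansion: copy inherited columns into each child row."""
    children = []
    for row in rows:
        parent_part = {name: row[name] for name in inherited}
        for generated in extract_clips(row["video_path"], row["duration"]):
            children.append({**parent_part, **generated})
    return children


clips = expand(videos, inherited=["video_path", "metadata"])
```

In this model `duration` is read by the UDTF but does not appear in the child rows, because it is not in the inherited list.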

## Adding Computed Columns After Creation

Since scalar UDTF views are materialized views, you can add UDF-computed columns to the child table and backfill them:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyAddColumnsScalarUdtf}
  </CodeBlock>
</CodeGroup>

This gives a two-stage pattern: expand source rows with a scalar UDTF, then enrich the expanded rows with standard UDFs.

## Incremental Refresh

Scalar UDTF views support **incremental refresh**, just like standard materialized views:

* **New source rows**: The UDTF runs on the new rows and their child rows are inserted.
* **Deleted source rows**: Child rows linked to the deleted parent are cascade-deleted.
* **Updated source rows**: The old child rows are deleted, the UDTF re-runs, and the new child rows are inserted.

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyIncrementalRefresh}
  </CodeBlock>
</CodeGroup>

Only the new source rows are processed. Existing clips from previous refreshes are untouched.
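The three refresh cases can be modelled with a small in-memory class. This is a toy sketch of the semantics described in this section, not how Geneva tracks changes internally; `ToyView`, `clips_for`, and the string keys are hypothetical names used only here.

```python
from typing import Callable, Iterator


class ToyView:
    """In-memory model of refresh semantics; not Geneva's implementation."""

    def __init__(self, udtf: Callable[[dict], Iterator[dict]]):
        self.udtf = udtf
        self.seen: dict[str, dict] = {}            # source key -> row snapshot
        self.children: dict[str, list[dict]] = {}  # source key -> child rows
        self.udtf_calls = 0

    def refresh(self, source: dict[str, dict]) -> None:
        # Deleted source rows: cascade-delete their children.
        for key in list(self.seen):
            if key not in source:
                del self.seen[key]
                del self.children[key]
        # New or updated source rows: (re)run the UDTF and replace children.
        for key, row in source.items():
            if self.seen.get(key) != row:
                self.udtf_calls += 1
                self.children[key] = list(self.udtf(row))
                self.seen[key] = dict(row)


def clips_for(row: dict) -> Iterator[dict]:
    """Stand-in UDTF: one child row per 10 seconds of video."""
    for start in range(0, int(row["duration"]), 10):
        yield {"clip_start": start}


source = {"a": {"duration": 30.0}}
view = ToyView(clips_for)
view.refresh(source)                 # processes "a"
source["b"] = {"duration": 20.0}
view.refresh(source)                 # processes only "b"
calls_after_add = view.udtf_calls

source["a"] = {"duration": 10.0}     # update "a"
del source["b"]                      # delete "b"
view.refresh(source)                 # re-runs "a", drops children of "b"
```

A source row that yields zero child rows is still recorded in `seen`, so the model also reflects the behavior where such rows are not retried on the next refresh.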

## Chaining UDTF Views

Scalar UDTF views are standard materialized views, so they can serve as the source for further views:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyChainingUdtfViews}
  </CodeBlock>
</CodeGroup>

## Full Example: Document Chunking

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyDocumentChunkingFull}
  </CodeBlock>
</CodeGroup>
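To see where the chunk boundaries fall, the arithmetic in `chunk_document` can be reproduced on its own. The helper below is a hypothetical name (`chunk_spans`) that mirrors the same stride of `chunk_size - overlap` and returns word offsets instead of text.

```python
def chunk_spans(n_words: int, chunk_size: int = 500, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) word offsets using the same stride as chunk_document."""
    stride = chunk_size - overlap
    return [(start, min(start + chunk_size, n_words)) for start in range(0, n_words, stride)]


spans = chunk_spans(1000)       # chunks start at words 0, 450, and 900
tail_spans = chunk_spans(950)   # last chunk lies entirely inside the previous one
```

With these defaults the final chunk can be short, and when the remaining words number no more than `overlap` it only repeats text already covered by the previous chunk. Whether to drop such a tail chunk is a design choice that depends on the retrieval use case.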

For a comparison of all three function types (UDFs, Scalar UDTFs, Batch UDTFs), see [Understanding Transforms](/geneva/udfs).

Reference:

* [`scalar_udtf` API](https://lancedb.github.io/geneva/api/udtf/#geneva.scalar_udtf)
* [`create_scalar_udtf_view` API](https://lancedb.github.io/geneva/api/connection/#geneva.db.Connection.create_scalar_udtf_view)
