
# Chunkers (Scalar UDTFs)

> Use chunkers (scalar UDTFs) for 1:N row expansion — split videos into clips, chunk documents, or tile images with automatic parent column inheritance and incremental refresh.

export const PyDocumentChunkingFull = "from geneva import connect, chunker, udf\nfrom typing import Iterator, NamedTuple\nimport pyarrow as pa\n\nclass Chunk(NamedTuple):\n    chunk_index: int\n    chunk_text: str\n\n@chunker\ndef chunk_document(text: str) -> Iterator[Chunk]:\n    \"\"\"Split a document into overlapping chunks.\"\"\"\n    words = text.split()\n    chunk_size = 500\n    overlap = 50\n    for i, start in enumerate(range(0, len(words), chunk_size - overlap)):\n        chunk_words = words[start:start + chunk_size]\n        yield Chunk(chunk_index=i, chunk_text=\" \".join(chunk_words))\n\ndb = connect(\"/data/mydb\")\ndocs = db.open_table(\"documents\")\n\n# Create chunked view — inherits doc_id, title, etc. from source\nchunks = db.create_udtf_view(\n    \"doc_chunks\",\n    source=docs.search(None).select([\"doc_id\", \"title\", \"text\"]),\n    udtf=chunk_document,\n)\nchunks.refresh()\n\n# Add embeddings to chunks for semantic search\n@udf(data_type=pa.list_(pa.float32(), 1536))\ndef embed_text(chunk_text: str) -> list[float]:\n    return embedding_model.encode(chunk_text)\n\nchunks.add_columns({\"embedding\": embed_text})\nchunks.backfill(\"embedding\")  # Backfills embeddings on all existing chunks\n\n# Query — parent columns available alongside chunk columns\nchunks.search(None).select([\"doc_id\", \"title\", \"chunk_text\", \"embedding\"]).to_pandas()\n";

export const PyChainingUdtfViews = "# videos → clips (1:N)\nclips = db.create_udtf_view(\n    \"clips\", source=videos.search(None), udtf=extract_clips\n)\n\n# clips → frames (1:N)\nframes = db.create_udtf_view(\n    \"frames\", source=clips.search(None), udtf=extract_frames\n)\n";

export const PyIncrementalRefresh = "# Add new videos to the source table\nvideos.add(new_video_data)\n\n# Incremental refresh — only processes the new videos\nclips.refresh()\n";

export const PyAddColumnsScalarUdtf = "@udf(data_type=pa.list_(pa.float32(), 512))\ndef clip_embedding(clip_bytes: bytes) -> list[float]:\n    return embed_model.encode(clip_bytes)\n\n# Add an embedding column to the clips table\nclips.add_columns({\"embedding\": clip_embedding})\n\n# Backfill computes embeddings for all existing clips\nclips.backfill(\"embedding\")\n";

export const PyCreateScalarUdtfView = "import geneva\n\ndb = geneva.connect(\"/data/mydb\")\nvideos = db.open_table(\"videos\")\n\n# Create the 1:N materialized view\nclips = db.create_udtf_view(\n    \"clips\",\n    source=videos.search(None).select([\"video_path\", \"metadata\"]),\n    udtf=extract_clips,\n)\n\n# Populate — runs the UDTF on every source row\nclips.refresh()\n";

export const PyScalarUdtfBatch = "@chunker(batch=True, output_schema=clip_schema)\ndef extract_clips(batch: pa.RecordBatch) -> pa.RecordBatch:\n    \"\"\"Process rows in batches. Same 1:N semantic per row.\"\"\"\n    ...\n";

export const PyScalarUdtfList = "@chunker\ndef extract_clips(video_path: str, duration: float) -> list[Clip]:\n    clips = []\n    for start in range(0, int(duration), 10):\n        end = min(start + 10, duration)\n        clips.append(Clip(clip_start=start, clip_end=end, clip_bytes=b\"...\"))\n    return clips\n";

export const PyScalarUdtfIterator = "from geneva import chunker\nfrom typing import Iterator, NamedTuple\n\nclass Clip(NamedTuple):\n    clip_start: float\n    clip_end: float\n    clip_bytes: bytes\n\n@chunker\ndef extract_clips(video_path: str, duration: float) -> Iterator[Clip]:\n    \"\"\"Yields multiple clips per video.\"\"\"\n    clip_length = 10.0\n    for start in range(0, int(duration), int(clip_length)):\n        end = min(start + clip_length, duration)\n        clip_data = extract_video_segment(video_path, start, end)\n        yield Clip(clip_start=start, clip_end=end, clip_bytes=clip_data)\n";

<Badge>Beta — introduced in Geneva 0.11.0</Badge>

Standard UDFs produce exactly **one output value per input row**. **Chunkers** — also
called scalar UDTFs — enable **1:N row expansion**: each source row can produce multiple
output rows. The results are stored as a materialized view that supports incremental refresh.

| Source Table   | Derived Table  | Expansion          |
| -------------- | -------------- | ------------------ |
| 1 video row    | → N clip rows  | Video segmentation |
| 1 document row | → N chunk rows | Text chunking      |
| 1 image row    | → N tile rows  | Image tiling       |

For example, a chunker that splits documents into passages turns a `documents` table into a
`chunks` table, carrying the parent columns into every child row:

**Source: `documents`**

| doc\_id | title         | text                     |
| ------- | ------------- | ------------------------ |
| 1       | "Intro to AI" | "Machine learning is..." |
| 2       | "Data Guide"  | "Data pipelines are..."  |

**Derived: `chunks`** (1:N expansion)

| doc\_id | title         | chunk\_index | chunk\_text           |
| ------- | ------------- | ------------ | --------------------- |
| 1       | "Intro to AI" | 0            | "Machine learning..." |
| 1       | "Intro to AI" | 1            | "Neural networks..."  |
| 1       | "Intro to AI" | 2            | "Training data..."    |
| 2       | "Data Guide"  | 0            | "Data pipelines..."   |
| 2       | "Data Guide"  | 1            | "ETL processes..."    |

Parent columns (`doc_id`, `title`) are inherited automatically; `chunk_index` and
`chunk_text` are generated by the chunker.
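
The expansion in the tables above can be sketched in plain Python. This is only an illustration of the 1:N semantics, not Geneva's implementation: `expand_rows` and `split_sentences` are hypothetical helpers, and the sample rows are made up for the sketch.

```python
from typing import Callable, Iterable, Iterator


def split_sentences(text: str) -> Iterator[dict]:
    """Hypothetical chunker: one child row per sentence."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        yield {"chunk_index": i, "chunk_text": sentence}


def expand_rows(
    parents: Iterable[dict],
    chunker: Callable[[str], Iterator[dict]],
    inherited: list[str],
    input_column: str,
) -> list[dict]:
    """Emit one child row per chunk, copying the inherited parent columns."""
    children = []
    for parent in parents:
        carried = {name: parent[name] for name in inherited}
        for chunk in chunker(parent[input_column]):
            children.append({**carried, **chunk})
    return children


documents = [
    {"doc_id": 1, "title": "Intro to AI", "text": "Machine learning. Neural networks. Training data."},
    {"doc_id": 2, "title": "Data Guide", "text": "Data pipelines. ETL processes."},
]

chunks = expand_rows(documents, split_sentences, ["doc_id", "title"], "text")
for row in chunks:
    print(row)
```

Each child row is the merge of the carried parent columns and one chunk, so no join back to `documents` is needed to read `doc_id` or `title`.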

## Defining a Chunker

Use the `@chunker` decorator on a function that **yields** output rows. Geneva infers the output schema from the return type annotation.

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyScalarUdtfIterator}
  </CodeBlock>
</CodeGroup>

Input parameters are bound to source columns **by name** — the parameter `video_path` binds to source column `video_path`, just like standard UDFs.
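
Binding by name can be pictured with the standard library's `inspect` module. The sketch below is an assumption about the general idea, not Geneva's actual binding code; `bind_by_name` and `clip_count` are hypothetical names used only for illustration.

```python
import inspect


def bind_by_name(fn, row: dict) -> dict:
    """Pick the row values whose column names match fn's parameter names."""
    params = inspect.signature(fn).parameters
    missing = [name for name in params if name not in row]
    if missing:
        raise KeyError(f"source row has no column(s): {missing}")
    return {name: row[name] for name in params}


def clip_count(video_path: str, duration: float) -> int:
    """Toy function with the same parameter names as extract_clips."""
    return int(duration // 10)


row = {"video_path": "/v/a.mp4", "duration": 120.0, "metadata": {"fps": 30}}
kwargs = bind_by_name(clip_count, row)
print(kwargs)
print(clip_count(**kwargs))
```

Columns that the function does not name, such as `metadata` here, are simply not passed to it.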

<Tip>
  A chunker can yield **zero rows** for a source row. The source row is still marked as processed and will not be retried on the next refresh.
</Tip>
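
As a minimal sketch of the zero-row case, the generator below yields nothing when the input has no usable content. The `@chunker` decorator is left off so the snippet runs without Geneva installed, and `chunk_nonempty` is a hypothetical name.

```python
from typing import Iterator, NamedTuple


class Chunk(NamedTuple):
    chunk_index: int
    chunk_text: str


def chunk_nonempty(text: str) -> Iterator[Chunk]:
    """Yield one chunk per non-blank paragraph; yield nothing for blank input."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        yield Chunk(chunk_index=i, chunk_text=paragraph)


print(list(chunk_nonempty("   ")))
print(list(chunk_nonempty("first\n\nsecond")))
```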

### List return pattern

If you prefer to build the full list in memory rather than yielding, you can return a `list` instead of an `Iterator`:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyScalarUdtfList}
  </CodeBlock>
</CodeGroup>

### Batched chunker

For vectorized processing, use `batch=True`. The function receives its input as Arrow data (a `pa.RecordBatch` in the example below) and returns a `pa.RecordBatch` of expanded rows. Because an output schema cannot be inferred from a `pa.RecordBatch` return annotation, you must supply `output_schema` explicitly:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyScalarUdtfBatch}
  </CodeBlock>
</CodeGroup>

## Creating a Chunker View

Create a chunker view with the `create_udtf_view` API, passing the chunker as the `udtf` argument:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyCreateScalarUdtfView}
  </CodeBlock>
</CodeGroup>

The query passed as `source` controls which source columns are inherited. Columns listed in its `.select()` are carried into every child row automatically.

## Inherited Columns

Child rows automatically include the parent's columns — no manual join required. The columns available in the child table are determined by the query's `.select()`:

**Source: `videos`**

| video\_path | duration | metadata   |
| ----------- | -------- | ---------- |
| /v/a.mp4    | 120.0    | \{fps: 30} |
| /v/b.mp4    | 60.0     | \{fps: 24} |

**Derived: `clips`** (1:N expansion)

| video\_path | metadata   | clip\_start | clip\_end | clip\_bytes    |
| ----------- | ---------- | ----------- | --------- | -------------- |
| /v/a.mp4    | \{fps: 30} | 0.0         | 10.0      | b"\x00\x1a..." |
| /v/a.mp4    | \{fps: 30} | 10.0        | 20.0      | b"\x00\x2b..." |
| /v/a.mp4    | \{fps: 30} | 20.0        | 30.0      | b"\x00\x3c..." |
| /v/b.mp4    | \{fps: 24} | 0.0         | 10.0      | b"\x00\x4d..." |
| /v/b.mp4    | \{fps: 24} | 10.0        | 20.0      | b"\x00\x5e..." |

The first three rows come from the `/v/a.mp4` source row, the last two from `/v/b.mp4`; only a sample of each video's clips is shown. Inherited columns (`video_path`, `metadata`) are carried over automatically; `clip_start`, `clip_end`, and `clip_bytes` are generated by the chunker.
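
To make the column selection concrete, here is a plain-Python sketch of the same two source rows. It is an illustration under assumptions, not Geneva code: `clip_bounds` is a hypothetical helper, `selected` stands in for the `.select([...])` list, and the sketch reads `duration` straight from the source row. With 10-second clips the two videos produce 12 and 6 child rows.

```python
def clip_bounds(duration: float, clip_length: float = 10.0) -> list[tuple[float, float]]:
    """Start/end offsets for fixed-length clips; the last clip may be shorter."""
    bounds = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + clip_length, duration)))
        start += clip_length
    return bounds


videos = [
    {"video_path": "/v/a.mp4", "duration": 120.0, "metadata": {"fps": 30}},
    {"video_path": "/v/b.mp4", "duration": 60.0, "metadata": {"fps": 24}},
]
selected = ["video_path", "metadata"]  # stand-in for the .select([...]) list

# One child row per clip: selected parent columns plus the generated columns.
clips = [
    {**{name: video[name] for name in selected}, "clip_start": start, "clip_end": end}
    for video in videos
    for start, end in clip_bounds(video["duration"])
]

print(len(clips))
print(sorted(clips[0]))
```

Because `duration` is not in `selected`, it does not appear in the child rows, matching the `clips` table above.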

## Adding Computed Columns After Creation

Since chunker views are materialized views, you can add UDF-computed columns to the child table and backfill them:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyAddColumnsScalarUdtf}
  </CodeBlock>
</CodeGroup>

This gives a two-step pattern: expand source rows with a chunker, then enrich the expanded rows with standard UDFs.

## Incremental Refresh

Chunkers support **incremental refresh**, just like standard materialized views:

* **New source rows**: The chunker runs on the new rows and inserts their child rows.
* **Deleted source rows**: Child rows linked to the deleted parent are cascade-deleted.
* **Updated source rows**: Old child rows are deleted, the chunker re-runs, and new child rows are inserted.

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyIncrementalRefresh}
  </CodeBlock>
</CodeGroup>

Only the new source rows are processed. Existing clips from previous refreshes are untouched.
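
The three refresh cases can be simulated with dictionaries. This is a toy model of the semantics, not how Geneva tracks changes: the sketch detects updates by comparing stored text, which is an assumption made for illustration, and `refresh`, `chunk_words`, and `seen` are hypothetical names.

```python
def chunk_words(text: str) -> list[str]:
    """Toy chunker: one child row per word."""
    return text.split()


def refresh(source: dict[int, str], children: dict[int, list[str]], seen: dict[int, str]) -> list[int]:
    """Bring children in line with source; return the parent ids that were (re)processed."""
    # Deleted parents: drop their child rows.
    for parent_id in list(seen):
        if parent_id not in source:
            del seen[parent_id]
            children.pop(parent_id, None)
    processed = []
    for parent_id, text in source.items():
        # New or updated parents: replace their child rows with fresh output.
        if seen.get(parent_id) != text:
            children[parent_id] = chunk_words(text)
            seen[parent_id] = text
            processed.append(parent_id)
    return processed


source = {1: "a b c", 2: "d e"}
children: dict[int, list[str]] = {}
seen: dict[int, str] = {}

first = refresh(source, children, seen)   # initial refresh processes every parent
source[3] = "f"                           # new source row
source[2] = "d e g"                       # updated source row
del source[1]                             # deleted source row
second = refresh(source, children, seen)  # only the changed parents are processed
third = refresh(source, children, seen)   # nothing changed, nothing to do

print(first, second, third)
print(children)
```

A parent whose chunker output is empty is still recorded in `seen`, so it is not reprocessed on the next refresh.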

## Chaining Chunker Views

Chunker views are standard materialized views, so they can serve as the source for further views:

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyChainingUdtfViews}
  </CodeBlock>
</CodeGroup>
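
A rough way to picture chaining is two expansions applied in sequence, where the output rows of the first stage are the input rows of the second. The sketch below uses plain generators with made-up data. `toy_clips` and `toy_frames` are hypothetical stand-ins for `extract_clips` and `extract_frames`, and copying every clip column into each frame row is a simplification.

```python
from typing import Iterator


def toy_clips(video: dict) -> Iterator[dict]:
    """Stage 1: one child row per 10-second clip."""
    for start in range(0, video["duration"], 10):
        yield {"video_path": video["video_path"], "clip_start": start}


def toy_frames(clip: dict) -> Iterator[dict]:
    """Stage 2: two sampled frames per clip, at offsets 0 and 5 seconds."""
    for offset in (0, 5):
        yield {**clip, "frame_time": clip["clip_start"] + offset}


videos = [
    {"video_path": "/v/a.mp4", "duration": 30},
    {"video_path": "/v/b.mp4", "duration": 20},
]
clips = [clip for video in videos for clip in toy_clips(video)]
frames = [frame for clip in clips for frame in toy_frames(clip)]

print(len(videos), len(clips), len(frames))
```

Row counts multiply across stages, which is worth keeping in mind when estimating the size of a chained view.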

## Full Example: Document Chunking

<CodeGroup>
  <CodeBlock filename="Python" language="python" icon="python">
    {PyDocumentChunkingFull}
  </CodeBlock>
</CodeGroup>
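
The overlap arithmetic in `chunk_document` is easy to check in isolation. The helper below reproduces only the `range(0, len(words), chunk_size - overlap)` step from the example; `chunk_starts` is a hypothetical name, and the guard against `overlap >= chunk_size` is an addition that the original function does not have.

```python
def chunk_starts(num_words: int, chunk_size: int = 500, overlap: int = 50) -> list[int]:
    """Start offsets produced by range(0, num_words, chunk_size - overlap)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, num_words, chunk_size - overlap))


starts = chunk_starts(1000)
sizes = [min(start + 500, 1000) - start for start in starts]
print(starts)
print(sizes)
print(chunk_starts(500))
```

With the default settings the stride is 450 words, so a 1000-word document yields three chunks of 500, 500, and 100 words. A document of exactly 500 words yields starts at 0 and 450, so its second chunk contains only words that already appear at the end of the first.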

For a comparison of all three function types (UDFs, Chunkers, Batch UDTFs), see [Understanding Transforms](/geneva/udfs).

Reference:

* [`chunker` API](https://lancedb.github.io/geneva/api/udtf/#geneva.chunker)
* [`create_udtf_view` API](https://lancedb.github.io/geneva/api/connection/#geneva.db.Connection.create_udtf_view)
