# Full-Text Search (FTS) Index

> Create and tune BM25-based full-text search indexes in LanceDB.

export const FtsIndexWait = "table_name = \"fts-index-wait\"\n\ntable = db.open_table(table_name)\ntable.create_fts_index(\"text\")\n\nindex_name = \"text_idx\"\ntable.wait_for_index([index_name])\n";

export const FtsIndexNested = "from lancedb.query import MatchQuery, PhraseQuery\n\ntable = db.open_table(\"fts-index-nested\")\n\n# Index a text leaf inside a struct column using a dotted path.\ntable.create_fts_index(\"payload.text\", with_position=True)\n\n# The same dotted path works in MatchQuery and PhraseQuery.\nmatches = (\n    table.search(MatchQuery(\"puppy\", \"payload.text\")).limit(5).to_list()\n)\nphrases = (\n    table.search(PhraseQuery(\"puppy runs\", \"payload.text\"))\n    .limit(5)\n    .to_list()\n)\n";

export const FtsIndexCreate = "table_name = \"fts-index-create\"\ntable = db.open_table(table_name)\ntable.create_fts_index(\"text\")\n";

export const FtsIndexAsync = "import asyncio\n\nimport lancedb\nimport polars as pl\nfrom lancedb.index import FTS\n\ndata = pl.DataFrame(\n    {\n        \"id\": [1, 2],\n        \"text\": [\n            \"His first language is spanish\",\n            \"Her first language is english\",\n        ],\n    }\n)\n\nasync def main(data: pl.DataFrame):\n    uri = \"ex_lancedb\"\n    db = await lancedb.connect_async(uri)\n    tbl = await db.create_table(\"my_text\", data=data, mode=\"overwrite\")\n\n    await tbl.create_index(\"text\", config=FTS(language=\"English\"))\n\n    response = await tbl.search(\"spanish\", query_type=\"fts\")\n    result = await response.limit(1).to_polars()\n    print(result)\n    return result\n\nif __name__ == \"__main__\":\n    asyncio.run(main(data))\n";

LanceDB provides performant full-text search based on BM25, allowing you to incorporate keyword-based search in your retrieval solutions. This page shows how to create and configure FTS indexes in LanceDB OSS and Enterprise with both the synchronous and asynchronous APIs.

<Note>
  In LanceDB Enterprise, `create_fts_index` returns immediately and the index is built asynchronously in the background.
</Note>

## Creating FTS Indexes

### Synchronous API

Use `create_fts_index` with synchronous LanceDB connections:

<CodeGroup>
  <CodeBlock filename="Python" language="Python" icon="python">
    {FtsIndexCreate}
  </CodeBlock>
</CodeGroup>

Wait for the FTS index to finish building:

<CodeGroup>
  <CodeBlock filename="Python" language="Python" icon="python">
    {FtsIndexWait}
  </CodeBlock>
</CodeGroup>

`wait_for_index(...)` waits until the named FTS index exists and `index_stats(...)` reports `num_unindexed_rows == 0`. It can time out if writes keep adding rows faster than the index catches up. If a table has multiple FTS indexes, specify the target text column when querying instead of relying on implicit selection.
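
The waiting behavior described above can be pictured as a polling loop with a deadline. The following is a minimal sketch in plain Python, not LanceDB's implementation: `wait_until_indexed` and its `get_unindexed_rows` callback are hypothetical names, and the callback stands in for whatever reports the unindexed row count.

```python
import time
from typing import Callable, Optional


def wait_until_indexed(
    get_unindexed_rows: Callable[[], Optional[int]],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.01,
) -> bool:
    """Poll until the index exists and has no unindexed rows.

    `get_unindexed_rows` is a hypothetical callback: it returns None while
    the index does not exist yet, otherwise the number of unindexed rows.
    Returns True on success and False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        remaining = get_unindexed_rows()
        if remaining is not None and remaining == 0:
            return True
        time.sleep(poll_interval_s)
    return False


# Simulated index build: missing at first, then catching up to zero.
build_progress = iter([None, 120, 40, 0])
caught_up = wait_until_indexed(lambda: next(build_progress))

# Simulated write-heavy table: the backlog never reaches zero, so we time out.
timed_out = wait_until_indexed(lambda: 10, timeout_s=0.05)
```

The second call shows the failure mode mentioned above: if new rows keep arriving, the unindexed count never reaches zero and the wait ends at the deadline.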

### Asynchronous API

When using async connections (`connect_async`), use `create_index` with the `FTS` configuration:

<CodeGroup>
  <CodeBlock filename="Python" language="Python" icon="python">
    {FtsIndexAsync}
  </CodeBlock>
</CodeGroup>

<Note>
  The `create_fts_index` method is not available on `AsyncTable`. Use `create_index` with `FTS` config instead.
</Note>

## Nested field paths

FTS indexes can target text leaves inside struct columns by passing a dotted path (for example, `payload.text`). The same path works for [`MatchQuery`](/search/full-text-search) and [`PhraseQuery`](/search/full-text-search), and for the `columns` argument on async `nearest_to_text` queries.

You can point an index at any string leaf nested in a struct, regardless of depth. The struct container itself isn't indexable: you have to name a specific text field.

<CodeGroup>
  <CodeBlock filename="Python" language="Python" icon="python">
    {FtsIndexNested}
  </CodeBlock>
</CodeGroup>

LanceDB rejects paths that don't resolve to a text leaf:

* A struct container (for example, `payload`): raises `ValueError: FTS index cannot be created ...`.
* A non-text leaf such as an integer or float (for example, `payload.count`): raises the same error.
* A path that doesn't exist in the schema (for example, `payload.missing`): raises `ValueError: Field path ... not found`.
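
The three rejection rules can be sketched as a small path check over a nested schema. This is an illustration only: the schema is a hypothetical nested dict, `check_fts_path` is not a LanceDB function, and the error text is loosely modeled on the messages above, not copied from the library.

```python
from typing import Any, Dict

# Hypothetical schema: a dict is a struct, a string is a leaf type name.
SCHEMA: Dict[str, Any] = {
    "id": "int64",
    "payload": {
        "text": "string",
        "count": "int64",
        "meta": {"title": "string"},
    },
}


def check_fts_path(schema: Dict[str, Any], path: str) -> str:
    """Return `path` if it resolves to a string leaf, else raise ValueError."""
    node: Any = schema
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise ValueError(f"Field path {path!r} not found")
        node = node[part]
    if isinstance(node, dict):
        raise ValueError(f"FTS index cannot be created on struct {path!r}")
    if node != "string":
        raise ValueError(f"FTS index cannot be created on {node} field {path!r}")
    return path


def outcome(path: str) -> str:
    """Run the check and report either the accepted path or the error text."""
    try:
        return "ok: " + check_fts_path(SCHEMA, path)
    except ValueError as err:
        return "error: " + str(err)


paths = ["payload.text", "payload.meta.title", "payload", "payload.count", "payload.missing"]
results = {p: outcome(p) for p in paths}
```

Only `payload.text` and `payload.meta.title` pass. The struct container, the integer leaf, and the missing field each raise a `ValueError`.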

The async API accepts the same dotted paths through `create_index`:

```python Python icon="python" theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
from lancedb.index import FTS

# `async_table` is an AsyncTable from a `connect_async` connection; run this inside a coroutine.
await async_table.create_index("payload.text", config=FTS(with_position=True))
```

## Configuration Options

### FTS Parameters

| Parameter           | Type       | Default     | Description                                                                                                    |
| :------------------ | :--------- | :---------- | :------------------------------------------------------------------------------------------------------------- |
| `with_position`     | bool       | `False`     | Store token positions (required for phrase queries)                                                            |
| `base_tokenizer`    | str        | `"simple"`  | Text splitting method (`simple`, `whitespace`, `raw`, `ngram`, `icu`, `jieba/*`, or `lindera/*`)               |
| `language`          | str        | `"English"` | Language for stemming and stop-word filters. Choose CJK and mixed-language segmentation with `base_tokenizer`. |
| `max_token_length`  | int        | `40`        | Maximum token size; longer tokens are omitted                                                                  |
| `lower_case`        | bool       | `True`      | Lowercase tokens                                                                                               |
| `stem`              | bool       | `True`      | Apply stemming (`running` → `run`)                                                                             |
| `remove_stop_words` | bool       | `True`      | Drop common stop words                                                                                         |
| `ascii_folding`     | bool       | `True`      | Normalize accented characters                                                                                  |
| `custom_stop_words` | list\[str] | `None`      | Extra stop words to drop in addition to the language defaults. Requires `remove_stop_words=True`.              |
| `ngram_min_length`  | int        | `3`         | Minimum n-gram length. Applies only when `base_tokenizer="ngram"`.                                             |
| `ngram_max_length`  | int        | `3`         | Maximum n-gram length. Applies only when `base_tokenizer="ngram"`.                                             |
| `prefix_only`       | bool       | `False`     | Index only prefix n-grams rather than all substrings. Applies only when `base_tokenizer="ngram"`.              |

<Note title="Key parameters">
  * `max_token_length` can filter out base64 blobs or long URLs.
  * Leaving `with_position` off (the default) keeps the index smaller but disables phrase queries.
  * `ascii_folding` helps with international text (e.g., “café” → “cafe”).
</Note>
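
To see the effect of `max_token_length`, `lower_case`, and `ascii_folding` on a piece of text, here is a rough standard-library approximation. It is a sketch, not LanceDB's filter chain: `filter_tokens` and `fold_ascii` are hypothetical helpers, tokens are split on whitespace, and token length is measured in characters, which is an assumption.

```python
import unicodedata
from typing import List


def fold_ascii(token: str) -> str:
    """Strip combining accents, approximating what `ascii_folding` is described as doing."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def filter_tokens(
    text: str,
    max_token_length: int = 40,
    lower_case: bool = True,
    ascii_folding: bool = True,
) -> List[str]:
    """Split on whitespace, drop overlong tokens, then normalize the rest."""
    tokens = []
    for token in text.split():
        if len(token) > max_token_length:
            continue  # overlong tokens are omitted, not truncated
        if lower_case:
            token = token.lower()
        if ascii_folding:
            token = fold_ascii(token)
        tokens.append(token)
    return tokens


blob = "QmFzZTY0" * 10  # 80-character stand-in for a base64 blob
tokens = filter_tokens(f"Café menu {blob} Zürich")
```

With the defaults, the 80-character blob is dropped and the accented words are stored as `cafe` and `zurich`, so a query for either spelling can match.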

### Tokenizer choices

`base_tokenizer` controls segmentation before token filters run:

* `simple`, `whitespace`, and `raw` cover common tokenization strategies for space-delimited text.
* `ngram` indexes overlapping character spans for substring-style matching.
* `icu` uses bundled ICU4X word segmentation for mixed-language text and scripts where whitespace splitting is not enough. ICU stands for [International Components for Unicode](https://icu.unicode.org/), and this tokenizer does not need external model files.
* `jieba/*` is for Chinese word segmentation with Jieba.
* `lindera/*` loads a compiled Lindera dictionary, such as `lindera/ipadic` for Japanese or `lindera/ko-dic` for Korean.
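
To make the `ngram` option and its length settings concrete, the sketch below enumerates character n-grams for a single token. `char_ngrams` is a hypothetical helper that mirrors the parameter names from the table above. It does not reproduce how LanceDB handles whitespace, punctuation, or multi-word text.

```python
from typing import List


def char_ngrams(
    token: str,
    min_length: int = 3,
    max_length: int = 3,
    prefix_only: bool = False,
) -> List[str]:
    """Enumerate character n-grams of `token` with lengths in [min_length, max_length]."""
    grams = []
    # With prefix_only, only spans anchored at the first character are kept.
    starts = [0] if prefix_only else range(len(token))
    for start in starts:
        for size in range(min_length, max_length + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


# Defaults: every 3-character span of the token.
all_grams = char_ngrams("puppy")

# Prefix-only n-grams of length 2 to 4.
prefix_grams = char_ngrams("puppy", 2, 4, prefix_only=True)
```

Here `all_grams` is `["pup", "upp", "ppy"]` and `prefix_grams` is `["pu", "pup", "pupp"]`. Prefix-only indexing stores fewer spans, at the cost of matching only from the start of a token.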

Model-backed tokenizers such as `jieba/default`, `lindera/ipadic`, and `lindera/ko-dic` require tokenizer model files in Lance's language model home. Lance looks under the default platform data directory for `lance/language_models`, or you can set `LANCE_LANGUAGE_MODEL_HOME` to point to another model root. For example, `jieba/default` is resolved under `<model-home>/jieba/default/...`.
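
As a rough picture of that lookup, the sketch below maps a tokenizer name to a directory under the model home. `model_dir` is a hypothetical helper and `/opt/models` is a made-up location. The platform default directory is not reproduced here, so it has to be passed in explicitly as `default_home`.

```python
from pathlib import PurePosixPath
from typing import Mapping, Optional


def model_dir(
    tokenizer: str,
    env: Mapping[str, str],
    default_home: Optional[str] = None,
) -> PurePosixPath:
    """Map a tokenizer name such as 'jieba/default' to its model directory."""
    # The environment variable wins; `default_home` is a caller-supplied fallback.
    home = env.get("LANCE_LANGUAGE_MODEL_HOME") or default_home
    if home is None:
        raise ValueError("set LANCE_LANGUAGE_MODEL_HOME or pass default_home")
    return PurePosixPath(home).joinpath(*tokenizer.split("/"))


# /opt/models is a hypothetical location used only for this example.
jieba_dir = model_dir("jieba/default", {"LANCE_LANGUAGE_MODEL_HOME": "/opt/models"})
```

In this example `jieba_dir` is `/opt/models/jieba/default`, which is where the Jieba model files would be expected.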

`language` is used by token filters, not by the base tokenizer. Stemming supports Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish. Built-in stop-word removal supports Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish. For stemming languages without a built-in stop-word list (Arabic, Greek, Romanian, Tamil, and Turkish), set `remove_stop_words=False` or supply your own list with `custom_stop_words`.
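
The two language lists can be turned into a small helper that picks stop-word settings for a given language. This is a sketch under assumptions: `fts_language_options` is a hypothetical function, the sets are copied from the paragraph above, and the returned keys reuse the parameter names from the FTS parameters table.

```python
from typing import Dict, List, Optional

STEMMING_LANGUAGES = {
    "Arabic", "Danish", "Dutch", "English", "Finnish", "French", "German",
    "Greek", "Hungarian", "Italian", "Norwegian", "Portuguese", "Romanian",
    "Russian", "Spanish", "Swedish", "Tamil", "Turkish",
}
STOP_WORD_LANGUAGES = {
    "Danish", "Dutch", "English", "Finnish", "French", "German", "Hungarian",
    "Italian", "Norwegian", "Portuguese", "Russian", "Spanish", "Swedish",
}


def fts_language_options(
    language: str, custom_stop_words: Optional[List[str]] = None
) -> Dict[str, object]:
    """Suggest stop-word settings for a stemming language."""
    if language not in STEMMING_LANGUAGES:
        raise ValueError(f"no stemmer listed for {language!r}")
    options: Dict[str, object] = {"language": language}
    if custom_stop_words:
        # A custom list needs stop-word removal switched on.
        options["remove_stop_words"] = True
        options["custom_stop_words"] = list(custom_stop_words)
    else:
        # Without a custom list, rely on the built-in list only if one exists.
        options["remove_stop_words"] = language in STOP_WORD_LANGUAGES
    return options


no_builtin_stop_words = sorted(STEMMING_LANGUAGES - STOP_WORD_LANGUAGES)
turkish = fts_language_options("Turkish")
greek = fts_language_options("Greek", ["και", "το"])
```

The set difference gives the languages that need one of the two workarounds, and the helper picks between them depending on whether a custom list is supplied.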

### Phrase Query Configuration

Enable phrase queries by setting:

| Parameter           | Required Value | Purpose                                       |
| :------------------ | :------------- | :-------------------------------------------- |
| `with_position`     | `True`         | Track token positions for phrase matching     |
| `remove_stop_words` | `False`        | Preserve stop words for exact phrase matching |
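
To see why both settings matter, consider a toy positional inverted index. This is an illustration in plain Python, not LanceDB's index format or query path: the stop-word list is a tiny made-up one, and `build_index` and `phrase_match` are hypothetical helpers.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set

STOP_WORDS = {"of", "the"}  # tiny illustrative list, not LanceDB's

PositionalIndex = Dict[str, Dict[int, List[int]]]


def tokenize(text: str, remove_stop_words: bool) -> List[str]:
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def build_index(docs: List[str], remove_stop_words: bool) -> PositionalIndex:
    """Map token -> document id -> positions of that token in the document."""
    index: DefaultDict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(tokenize(text, remove_stop_words)):
            index[token].setdefault(doc_id, []).append(pos)
    return dict(index)


def phrase_match(index: PositionalIndex, phrase: str, remove_stop_words: bool) -> Set[int]:
    """Return ids of documents containing the phrase tokens at consecutive positions."""
    tokens = tokenize(phrase, remove_stop_words)
    hits: Set[int] = set()
    if not tokens:
        return hits
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            if all(
                start + offset in index.get(token, {}).get(doc_id, [])
                for offset, token in enumerate(tokens)
            ):
                hits.add(doc_id)
                break
    return hits


docs = ["king of the hill", "king hill"]
exact = phrase_match(build_index(docs, False), "king of the hill", False)
loose = phrase_match(build_index(docs, True), "king of the hill", True)
```

Stored positions are what let the matcher check that tokens are adjacent. With stop words preserved, only the first document matches the phrase. With stop words removed, both documents reduce to `king hill`, so the second one matches even though it does not contain the exact phrase.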

## Indexing nested string fields

You can build an FTS index on a string field inside a struct by passing its full dotted path, like `nested.text`. Use the same path in `fts_columns` when you query the index. `list_indices()` also reports the indexed column by its full path.

```python theme={"theme":{"light":"vitesse-light","dark":"catppuccin-mocha"}}
# Schema: pa.struct([pa.field("text", pa.string())]) stored under the `nested` column.
table.create_fts_index("nested.text")

results = (
    table.search("puppy", query_type="fts", fts_columns="nested.text")
    .limit(5)
    .to_list()
)
```

<Note>
  Use the canonical Lance path: dot-separate each struct field from root to leaf (for example, `metadata.author.name`). The same convention applies to scalar and vector indexes.
</Note>
