In LanceDB Enterprise, the create_fts_index API returns immediately, but index building happens asynchronously.
Creating FTS Indexes
Synchronous API
Use create_fts_index with synchronous LanceDB connections:
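A minimal sketch of the synchronous call, assuming `table` is an already-open sync table (for example from `lancedb.connect(...).open_table(...)`). The helper name `create_text_index` and the default column name `text` are illustrative, not part of LanceDB.

```python
def create_text_index(table, column="text", **fts_options):
    """Request an FTS index on `column` of an open synchronous table.

    `fts_options` are forwarded unchanged, for example with_position=True.
    """
    # On LanceDB Enterprise this call returns before the index is built,
    # so a successful return does not mean the index is ready to query.
    table.create_fts_index(column, **fts_options)
```

Calling `create_text_index(table, "text", with_position=True)` would request a positional index on the `text` column.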
Check FTS index status using the API:
wait_for_index(...) waits until the named FTS index exists and index_stats(...) reports num_unindexed_rows == 0. It can time out if writes keep adding rows faster than the index catches up. If a table has multiple FTS indexes, specify the target text column when querying instead of relying on implicit selection.
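The waiting behavior described above can be approximated with a polling loop. This is only a sketch of the semantics, not how `wait_for_index` is implemented; the helper name, the poll interval, and the example index name `text_idx` are assumptions.

```python
import time


def wait_until_indexed(table, index_name, timeout_s=60.0, poll_s=1.0):
    """Poll until `index_name` exists and reports no unindexed rows."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Step 1: the named index must appear in the table's index list.
        names = [idx.name for idx in table.list_indices()]
        if index_name in names:
            # Step 2: every row must be covered by the index.
            stats = table.index_stats(index_name)
            if stats is not None and stats.num_unindexed_rows == 0:
                return stats
        # Concurrent writes can keep num_unindexed_rows above zero,
        # in which case the loop eventually gives up.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index {index_name!r} not ready after {timeout_s}s")
        time.sleep(poll_s)
```

In practice the built-in `wait_for_index(...)` is the simpler choice; the loop only makes the two readiness conditions and the timeout case explicit.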
Asynchronous API
When using async connections (connect_async), use create_index with the FTS configuration:
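A hedged sketch of the async variant, assuming `table` is an `AsyncTable` opened through `connect_async`. The helper name and the choice of `with_position=True` are illustrative only.

```python
async def create_async_fts_index(table, column="text"):
    """Create an FTS index through AsyncTable.create_index."""
    # Imported inside the function so this sketch can be loaded
    # even where lancedb is not installed.
    from lancedb.index import FTS

    # with_position=True is just an example option; FTS() uses the defaults.
    await table.create_index(column, config=FTS(with_position=True))
```

The coroutine has to be awaited from async code, or driven with `asyncio.run(...)` from a script.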
The create_fts_index method is not available on AsyncTable. Use create_index with FTS config instead.
Nested field paths
FTS indexes can target text leaves inside struct columns by passing a dotted path (for example, payload.text). The same path works for MatchQuery and PhraseQuery, and for the columns argument on async nearest_to_text queries.
You can point an index at any string leaf nested in a struct, regardless of depth. The struct container itself isn’t indexable: you have to name a specific text field.
LanceDB rejects paths that don’t resolve to a text leaf:
- A struct container (for example, payload): raises ValueError: FTS index cannot be created ...
- A non-text leaf such as an integer or float (for example, payload.count): raises the same error.
- A path that doesn’t exist in the schema (for example, payload.missing): raises ValueError: Field path ... not found.
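The three rejection cases above can be modeled in plain Python. This is a sketch of the rules only, not LanceDB's validation code: the nested-dict schema, the `TEXT_TYPES` set, and the error wording are all assumptions made for illustration.

```python
# Hypothetical schema: dicts are struct containers, strings are leaf types.
EXAMPLE_SCHEMA = {
    "id": "int64",
    "payload": {"text": "string", "count": "int32"},
}
TEXT_TYPES = {"string", "large_string"}


def resolve_text_leaf(schema, path):
    """Return `path` if it names a text leaf, otherwise raise ValueError."""
    node = schema
    for part in path.split("."):
        # Missing field, or an attempt to descend below a leaf.
        if not isinstance(node, dict) or part not in node:
            raise ValueError(f"field path {path!r} not found")
        node = node[part]
    if isinstance(node, dict):
        raise ValueError(f"{path!r} is a struct container, not a text leaf")
    if node not in TEXT_TYPES:
        raise ValueError(f"{path!r} has type {node!r}, not a text type")
    return path
```

With this model, `payload.text` resolves, while `payload`, `payload.count`, and `payload.missing` each raise `ValueError`.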
create_index:
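As a rough illustration of nested paths with the async API, the sketch below indexes `payload.text` and then searches it. It assumes `table` is an `AsyncTable` and that `fts_config` is an FTS configuration object (for example `lancedb.index.FTS()`); the helper name and default arguments are hypothetical.

```python
async def index_and_search_nested(table, fts_config, path="payload.text",
                                  text="hello", limit=5):
    """Index a nested text leaf, then run a text query against the same path."""
    # The dotted path names the text leaf, not the struct that contains it.
    await table.create_index(path, config=fts_config)
    # On LanceDB Enterprise the index may still be building at this point.
    query = table.query().nearest_to_text(text, columns=[path])
    return await query.limit(limit).to_list()
```

The same dotted path is passed both when creating the index and in the `columns` argument of the query.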
Configuration Options
FTS Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| with_position | bool | False | Store token positions (required for phrase queries) |
| base_tokenizer | str | "simple" | Text splitting method (simple, whitespace, raw, ngram, icu, jieba/*, or lindera/*) |
| language | str | "English" | Language for stemming and stop-word filters. Choose CJK and mixed-language segmentation with base_tokenizer. |
| max_token_length | int | 40 | Maximum token size; longer tokens are omitted |
| lower_case | bool | True | Lowercase tokens |
| stem | bool | True | Apply stemming (running → run) |
| remove_stop_words | bool | True | Drop common stop words |
| ascii_folding | bool | True | Normalize accented characters |
| custom_stop_words | list[str] | None | Extra stop words to drop in addition to the language defaults. Requires remove_stop_words=True. |
| ngram_min_length | int | 3 | Minimum n-gram length. Applies only when base_tokenizer="ngram". |
| ngram_max_length | int | 3 | Maximum n-gram length. Applies only when base_tokenizer="ngram". |
| prefix_only | bool | False | Index only prefix n-grams rather than all substrings. Applies only when base_tokenizer="ngram". |
- max_token_length can filter out base64 blobs or long URLs.
- Disabling with_position reduces index size but disables phrase queries.
- ascii_folding helps with international text (e.g., “café” → “cafe”).
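To make the effect of these filters concrete, here is a small pure-Python imitation of a token filter chain. It is not Lance's analyzer: the stop-word list is a tiny illustrative subset, stemming is left out, and the order in which filters run is an assumption.

```python
import unicodedata

# Illustrative subset only; the real English stop-word list is longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def fold_ascii(token):
    """Strip combining accents, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, max_token_length=40, lower_case=True,
            remove_stop_words=True, ascii_folding=True):
    """Split on whitespace, then apply the filters named in the table."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()[]")
        # Over-long tokens (base64 blobs, long URLs) are dropped entirely.
        if not token or len(token) > max_token_length:
            continue
        if lower_case:
            token = token.lower()
        if ascii_folding:
            token = fold_ascii(token)
        if remove_stop_words and token in STOP_WORDS:
            continue
        tokens.append(token)
    return tokens
```

For example, `analyze("The café is open")` yields `["cafe", "open"]`, and a 41-character token is omitted under the default `max_token_length=40`.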
Tokenizer choices
base_tokenizer controls segmentation before token filters run:
- simple, whitespace, and raw cover common tokenization strategies for space-delimited text.
- ngram indexes overlapping character spans for substring-style matching.
- icu uses bundled ICU4X word segmentation for mixed-language text and scripts where whitespace splitting is not enough. ICU stands for International Components for Unicode, and this tokenizer does not need external model files.
- jieba/* is for Chinese word segmentation with Jieba.
- lindera/* loads a compiled Lindera dictionary, such as lindera/ipadic for Japanese or lindera/ko-dic for Korean.
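The interaction of ngram_min_length, ngram_max_length, and prefix_only is easier to see on a single token. The function below is a simplified model of character n-gram generation; how Lance's ngram tokenizer handles edge cases such as tokens shorter than the minimum length is not covered here and may differ.

```python
def char_ngrams(token, min_length=3, max_length=3, prefix_only=False):
    """Return the character n-grams of `token` within the given size range."""
    grams = []
    # prefix_only keeps just the n-grams anchored at the first character.
    starts = [0] if prefix_only else range(len(token))
    for start in starts:
        for size in range(min_length, max_length + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams
```

With the defaults, `char_ngrams("lance")` gives `["lan", "anc", "nce"]`. With `min_length=2, max_length=4, prefix_only=True` it gives `["la", "lan", "lanc"]`, which supports prefix matching with far fewer indexed terms.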
jieba/default, lindera/ipadic, and lindera/ko-dic require tokenizer model files in Lance’s language model home. Lance looks under the default platform data directory for lance/language_models, or you can set LANCE_LANGUAGE_MODEL_HOME to point to another model root. For example, jieba/default is resolved under <model-home>/jieba/default/....
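The lookup order described above can be sketched as follows. The helper is hypothetical and only models the documented behavior: `default_home` stands in for the platform-specific `lance/language_models` directory, whose exact location is not spelled out here.

```python
import os
from pathlib import Path


def resolve_model_dir(tokenizer, default_home):
    """Return the directory where files for `tokenizer` are expected."""
    # An explicit LANCE_LANGUAGE_MODEL_HOME takes the place of the default root.
    override = os.environ.get("LANCE_LANGUAGE_MODEL_HOME")
    root = Path(override) if override else Path(default_home)
    # "jieba/default" -> <model-home>/jieba/default
    return root.joinpath(*tokenizer.split("/"))
```

A quick check like this can help confirm that model files were placed where the tokenizer name says they should be before building an index.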
language is used by token filters, not by the base tokenizer. Stemming supports Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish. Built-in stop-word removal supports Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish. For stemming languages without a built-in stop-word list (Arabic, Greek, Romanian, Tamil, and Turkish), set remove_stop_words=False or pass custom_stop_words.
Phrase Query Configuration
Enable phrase queries by setting:
| Parameter | Required Value | Purpose |
|---|---|---|
| with_position | True | Track token positions for phrase matching |
| remove_stop_words | False | Preserve stop words for exact phrase matching |
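A toy positional index shows why both settings matter: phrase matching needs each token's position, and dropping stop words would remove tokens the phrase depends on. This is a conceptual sketch, not the data structure Lance uses.

```python
from collections import defaultdict


def build_positions(docs):
    """Map token -> doc_id -> list of positions, keeping every token."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return ids of documents containing the tokens of `phrase` consecutively."""
    tokens = phrase.lower().split()
    if not tokens:
        return set()
    hits = set()
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            # Token i of the phrase must sit at position start + i.
            if all(start + i in index.get(tok, {}).get(doc_id, [])
                   for i, tok in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits
```

Given documents `{1: "the quick brown fox", 2: "quick the brown fox"}`, the phrase `"quick brown"` matches only document 1, and `"the brown"` matches only document 2. Without stored positions, both documents would look identical to the index.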
Indexing nested string fields
You can build an FTS index on a string field inside a struct by passing its full dotted path, like nested.text. The same path is used when you query the index through fts_columns, and the indexed column is reported back as the full path from list_indices().
Use the canonical Lance path: dot-separate each struct field from root to leaf (for example, metadata.author.name). The same convention applies to scalar and vector indexes.
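Putting the pieces together for a synchronous table, a sketch along these lines indexes a nested field, lists the indexed columns, and queries through fts_columns. The helper name and default arguments are illustrative, and `table` is assumed to be an open sync table whose schema contains a `nested` struct with a `text` string field.

```python
def index_and_query_nested(table, path="nested.text", query="hello", limit=5):
    """Index a nested string field and search it by the same dotted path."""
    table.create_fts_index(path)
    # Each index reports the full dotted path of the column it covers.
    indexed_columns = [list(idx.columns) for idx in table.list_indices()]
    # On LanceDB Enterprise the index may still be building at this point.
    rows = (
        table.search(query, query_type="fts", fts_columns=path)
        .limit(limit)
        .to_list()
    )
    return indexed_columns, rows
```

The returned `indexed_columns` should include `["nested.text"]` once the index has been registered.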