How to ingest data into LanceDB
In this example, we will fetch movie information from the Open Movie Database (OMDb) API and load it into a local LanceDB instance. To follow along, you will need an OMDb API key, which can be created for free.
- Install dlt with LanceDB extras: pip install "dlt[lancedb]"
- Inside an empty directory, initialize a dlt project with: dlt init rest_api lancedb
  This will add all the files necessary to create a dlt pipeline that can ingest data from any REST API (e.g. the OMDb API) and load it into LanceDB.
  dlt has a list of pre-built sources, such as SQL databases, REST APIs, Google Sheets, and Notion, that can be used out of the box by running dlt init <source_name> lancedb. Since dlt is a Python library, it is also easy to modify these pre-built sources or to write your own custom source from scratch.
- Specify necessary credentials and/or embedding model details:
  In order to fetch data from the OMDb API, you will need to pass a valid API key into your pipeline. Depending on whether you’re using LanceDB OSS or LanceDB Cloud, you may also need to provide the credentials to connect to the LanceDB instance. These can be pasted inside .dlt/secrets.toml.
  dlt’s LanceDB integration also allows you to automatically embed the data during ingestion. Depending on the embedding model chosen, you may need to paste the necessary credentials inside .dlt/secrets.toml as well. See the dlt documentation on the LanceDB destination for more information and for a list of available models and model providers.
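For illustration, a .dlt/secrets.toml for this example might look like the sketch below. The section and key names follow dlt's layout for the REST API source and the LanceDB destination as best understood here. Every value is a placeholder or an assumption, including the sentence-transformers provider and model, and is not a setting confirmed by this guide.

```toml
# Sketch only: all values are placeholders or assumptions.
[sources.rest_api]
api_key = "<your OMDb API key>"

[destination.lancedb]
embedding_model_provider = "sentence-transformers"   # assumed provider
embedding_model = "all-MiniLM-L6-v2"                 # assumed model

[destination.lancedb.credentials]
uri = ".lancedb"                                     # local path for LanceDB OSS
api_key = "<LanceDB Cloud API key>"                  # leave out for LanceDB OSS
embedding_model_provider_api_key = "<provider key>"  # leave out if the provider needs no authentication
```

The embedding settings only matter if you embed a field during ingestion. For a plain load into a local database, the OMDb key and the uri should be enough.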
- Write the pipeline code inside rest_api_pipeline.py:
  Configure dlt’s REST API source to connect to the OMDb API, fetch all movies with the word “godzilla” in the title, and load them into a LanceDB table. The REST API source allows you to pull data from any API with minimal code; to learn more, read the dlt docs.
  A script written this way ingests the data into LanceDB as it is, i.e. without creating any embeddings. To embed one of the fields (for example, "Title", which contains the movie titles), use dlt’s lancedb_adapter and modify the script as follows:
  - Add an import statement for the adapter: from dlt.destinations.adapters import lancedb_adapter
  - Modify the pipeline run so that the source is wrapped in lancedb_adapter with embed="Title" before it is passed to pipeline.run. This will use the embedding model specified in .dlt/secrets.toml to embed the field "Title".
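The pipeline described in this step could be sketched roughly as follows. This is a minimal illustration, not the guide's original script. The names build_omdb_config, run_pipeline, movie_search, movies_pipeline and movies_data_pipeline are hypothetical, and the pagination settings are assumptions. The sketch also assumes the OMDb key is stored as api_key under [sources.rest_api] in .dlt/secrets.toml.

```python
from typing import Any, Dict


def build_omdb_config(api_key: str, search_term: str = "godzilla") -> Dict[str, Any]:
    """Build a REST API source config that searches OMDb for movie titles."""
    return {
        "client": {
            "base_url": "https://www.omdbapi.com/",
            # OMDb expects the API key in the `apikey` query parameter.
            "auth": {
                "type": "api_key",
                "name": "apikey",
                "api_key": api_key,
                "location": "query",
            },
            # Assumed pagination: page-number based via the `page` parameter,
            # starting at page 1 and capped with `maximum_page`.
            "paginator": {
                "type": "page_number",
                "base_page": 1,
                "page_param": "page",
                "total_path": None,
                "maximum_page": 5,
            },
        },
        "resources": [
            {
                "name": "movie_search",  # hypothetical resource/table name
                "endpoint": {
                    "path": "/",
                    "params": {"s": search_term, "type": "movie"},
                    # OMDb search responses keep the result list under "Search".
                    "data_selector": "Search",
                },
            }
        ],
    }


def run_pipeline(embed_title: bool = False):
    """Load the OMDb search results into LanceDB, optionally embedding "Title"."""
    # dlt is imported lazily so build_omdb_config stays usable on its own.
    import dlt
    from dlt.destinations.adapters import lancedb_adapter

    try:
        # Recent dlt releases ship the REST API source inside the library.
        from dlt.sources.rest_api import rest_api_source
    except ImportError:
        # Older releases copy a `rest_api` package into the project on `dlt init`.
        from rest_api import rest_api_source

    # Assumes the key is stored as `api_key` under [sources.rest_api].
    api_key = dlt.secrets["sources.rest_api.api_key"]
    movies_source = rest_api_source(build_omdb_config(api_key))

    pipeline = dlt.pipeline(
        pipeline_name="movies_pipeline",  # hypothetical name
        destination="lancedb",
        dataset_name="movies_data_pipeline",  # hypothetical name
    )
    # lancedb_adapter marks the "Title" column for embedding at load time.
    data = lancedb_adapter(movies_source, embed="Title") if embed_title else movies_source
    load_info = pipeline.run(data)
    print(load_info)
    return load_info
```

With this layout, calling run_pipeline() at the bottom of rest_api_pipeline.py loads the records as they are, while run_pipeline(embed_title=True) also embeds the "Title" field. Keeping the configuration in its own function makes it easy to inspect or change the search term without touching the load logic.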
- Install necessary dependencies: pip install -r requirements.txt
  Note: You may need to install the dependencies for your embedding models separately.
- Run the pipeline:
  Finally, running the following command will ingest the data into your LanceDB instance: python rest_api_pipeline.py