In this tutorial, you’ll ingest a dataset from Hugging Face into a LanceDB Cloud table,
connect to a remote LanceDB cluster, and run some search queries.
For interactive code, check out the Python notebook or the TypeScript example.
Getting started
- Sign up for LanceDB Cloud.
- Follow this tutorial to create a LanceDB Cloud project.
1. Installation
pip install lancedb datasets
2. Connect to LanceDB
- For LanceDB Cloud users, the database URI (which starts with
db://) and API key can both be retrieved from the LanceDB Cloud UI.
- For LanceDB Enterprise users, please contact us to obtain your database URI, API key, and
host_override URL.
import lancedb
import numpy as np
import pyarrow as pa
import os
# Connect to LanceDB Cloud/Enterprise
uri = "db://your-database-uri"
api_key = "your-api-key"
region = "us-east-1"
# (Optional) For LanceDB Enterprise, set the host override to your enterprise endpoint
host_override = os.environ.get("LANCEDB_HOST_OVERRIDE")
db = lancedb.connect(
uri=uri,
api_key=api_key,
region=region,
host_override=host_override
)
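Rather than hard-coding credentials, you may prefer to read all of the connection settings from the environment. A minimal sketch (the LANCEDB_URI, LANCEDB_API_KEY, and LANCEDB_REGION variable names are assumptions for this example, not a LanceDB convention):

```python
import os

def lancedb_config():
    """Collect connection settings from environment variables.

    The variable names below are hypothetical; use whatever naming
    convention your deployment follows.
    """
    return {
        "uri": os.environ.get("LANCEDB_URI", "db://your-database-uri"),
        "api_key": os.environ.get("LANCEDB_API_KEY", ""),
        "region": os.environ.get("LANCEDB_REGION", "us-east-1"),
        # Enterprise only; stays None for LanceDB Cloud
        "host_override": os.environ.get("LANCEDB_HOST_OVERRIDE"),
    }

cfg = lancedb_config()
```

You would then pass the settings along with lancedb.connect(**cfg).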
3. Load Dataset
For large datasets, the load should be performed in batches to keep memory usage bounded.
Here we load a 1,000-row sample of a larger dataset.
from datasets import load_dataset
# Load a sample dataset from Hugging Face with pre-computed embeddings
sample_dataset = load_dataset("sunhaozhepy/ag_news_sbert_keywords_embeddings", split="test[:1000]")
print(f"Loaded {len(sample_dataset)} samples")
print(f"Sample features: {sample_dataset.features}")
print(f"Column names: {sample_dataset.column_names}")
# Preview the first sample
print(sample_dataset[0])
# Get embedding dimension
vector_dim = len(sample_dataset[0]["keywords_embeddings"])
print(f"Embedding dimension: {vector_dim}")
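The snippet above loads only the first 1,000 rows. When the full dataset is too large to materialize, a common pattern is to stream it in fixed-size batches and call table.add once per batch. A sketch of the chunking logic (the batch sizes here are arbitrary):

```python
def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# With a real table you would stream the dataset like:
#   for chunk in batched(sample_dataset, 500):
#       table.add(chunk)
chunks = list(batched(range(1000), 300))  # 3 full batches of 300, then one of 100
```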
4. Ingest Data
import pyarrow as pa
# Create a table with the dataset
table_name = "lancedb-cloud-quickstart"
table = db.create_table(table_name, data=sample_dataset, mode="overwrite")
# Convert the vector column from a variable-length list to a fixed-size list
table.alter_columns(dict(path="keywords_embeddings", data_type=pa.list_(pa.float32(), vector_dim)))
print(f"Table '{table_name}' created successfully")
5. Build an Index
After creating a table with vector data, you’ll want to create an index to enable fast similarity searches. The index creation process optimizes the data structure for efficient vector similarity lookups, significantly improving query performance for large datasets.
Unlike in LanceDB OSS, the create_index/createIndex operation executes asynchronously in LanceDB Cloud/Enterprise. To ensure the index is fully built, you can use the wait_timeout parameter or call wait_for_index on the table.
from datetime import timedelta
# Create a vector index and wait for it to complete
table.create_index("cosine", vector_column_name="keywords_embeddings", wait_timeout=timedelta(seconds=120))
print(table.index_stats("keywords_embeddings_idx"))
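The "cosine" argument selects the distance metric the index will use. As a reminder of what that metric computes, here it is in plain numpy (an illustration of the formula, not LanceDB internals):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d_parallel = cosine_distance([1.0, 0.0], [2.0, 0.0])    # 0.0
d_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # 1.0
```

Because the metric depends only on direction, not magnitude, it suits embeddings whose scale carries no meaning.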
6. Vector Search
Once you have created and indexed your table, you can perform vector similarity searches.
LanceDB provides a flexible search API that allows you to find similar vectors, apply filters, and select specific columns to return. The examples below demonstrate basic vector searches as well as filtered searches that combine vector similarity with traditional SQL-style filtering.
query_dataset = load_dataset("sunhaozhepy/ag_news_sbert_keywords_embeddings", split="test[5000:5001]")
print(f"Query keywords: {query_dataset[0]['keywords']}")
query_embed = query_dataset["keywords_embeddings"][0]
# A vector search
result = (
table.search(query_embed)
.select(["text", "keywords", "label"])
.limit(5)
.to_pandas()
)
print("Search results:")
print(result)
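Conceptually, the query above ranks every row by its distance to query_embed and keeps the closest five; the index just avoids scanning every row. A brute-force numpy equivalent on toy data (illustrative only; the dimensions and vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))   # stand-in for the table's embedding column
query = vectors[42] + 0.01            # a query very close to row 42

# Cosine distance from the query to every row, then keep the 5 smallest.
sims = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
top5 = np.argsort(1.0 - sims)[:5]
```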
7. Filtered Search
Add a filter to your vector search query. You can use SQL-style predicates, such as a where clause, for filtering.
filtered_result = (
table.search(query_embed)
.where("label > 2")
.select(["text", "keywords", "label"])
.limit(5)
.to_pandas()
)
print("Filtered search results (label > 2):")
print(filtered_result)
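With prefiltering, the predicate is applied before the top-k limit, so you get the five nearest rows among those matching the filter rather than a filtered subset of an unfiltered top five. A plain-Python sketch of that semantics (toy rows; _distance mimics the score column LanceDB returns):

```python
rows = [
    {"keywords": "markets", "label": 0, "_distance": 0.05},
    {"keywords": "elections", "label": 3, "_distance": 0.20},
    {"keywords": "football", "label": 1, "_distance": 0.10},
    {"keywords": "startups", "label": 3, "_distance": 0.60},
    {"keywords": "science", "label": 3, "_distance": 0.80},
]

# Filter first (label > 2), then rank by distance and keep up to 5.
filtered_top = sorted(
    (r for r in rows if r["label"] > 2),
    key=lambda r: r["_distance"],
)[:5]
```

Note that the two nearest rows overall (distances 0.05 and 0.10) are excluded because their labels fail the predicate.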
What’s Next?
It’s time to use LanceDB Cloud/Enterprise in your own projects!
We’ve prepared more tutorials for you to continue learning. If you
have any questions, reach out via Discord.