Converting functions into UDFs
Converting your Python code to a Geneva UDF is simple. There are three kinds of UDFs you can provide: scalar UDFs, batched UDFs, and stateful UDFs. In all cases, Geneva uses the Python type hints on your functions to infer the input and output Arrow data types that LanceDB uses.
Scalar UDFs
The simplest form is a scalar UDF, which processes one row at a time. Wrapping the function with the @udf decorator is all that is needed.
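For example, a minimal sketch (assuming udf is imported from the geneva package, and hypothetical integer columns x and y):

```python
from geneva import udf

@udf
def area(x: int, y: int) -> int:
    # Geneva infers the Arrow input/output types from the type hints.
    return x * y
```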
Batched UDFs
For better performance, you can also define batched UDFs that process multiple rows at once. The inputs can be pyarrow.Arrays:
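A sketch of the Array form, reusing the hypothetical x and y columns from above:

```python
import pyarrow as pa
import pyarrow.compute as pc
from geneva import udf

@udf(data_type=pa.int64())
def batch_area(x: pa.Array, y: pa.Array) -> pa.Array:
    # Multiply the two input columns element-wise.
    return pc.multiply(x, y)
```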
Or a pyarrow.RecordBatch:
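A sketch of the RecordBatch form, where the UDF pulls its input columns from the batch by name:

```python
import pyarrow as pa
import pyarrow.compute as pc
from geneva import udf

@udf(data_type=pa.int64())
def batch_area(batch: pa.RecordBatch) -> pa.Array:
    # Look up the input columns by name on the RecordBatch.
    return pc.multiply(batch["x"], batch["y"])
```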
Note: Batched UDFs require you to specify data_type in the @udf decorator, which defines the pyarrow.DataType of the returned pyarrow.Array.
Stateful UDFs
You can also define a stateful UDF that retains its state across calls. This can be used to share code and parameterize your UDFs. In the example below, the model being used is a parameter that can be specified at UDF registration time. It can also be used to parameterize the input column names of pa.RecordBatch batched UDFs.
This can also be used to optimize expensive initialization that requires heavy resources on the distributed workers. For example, a stateful UDF can load a model onto the GPU once for all records sent to a worker, instead of once per record or per batch of records.
A stateful UDF is a callable class with a __call__() method. The __call__ method can be a scalar function or a batched function.
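A sketch of such a stateful UDF (assuming @udf can decorate a class; the load_model helper and its model API are hypothetical placeholders):

```python
from geneva import udf

@udf
class CaptionUDF:
    def __init__(self, model_name: str):
        # The model is a parameter chosen at UDF registration time.
        self.model_name = model_name
        self.model = None

    def __call__(self, image: bytes) -> str:
        # __call__ here is a scalar function; it could also be batched.
        if self.model is None:
            # Load the model once per worker instead of once per record.
            self.model = load_model(self.model_name)  # hypothetical helper
        return self.model.caption(image)
```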
Note: State is managed independently on each distributed worker.
UDF options
The udf decorator can take extra annotations that specify resource requirements and operational characteristics.
These are simply additional parameters to udf(...).
Resource requirements for UDFs
Some UDFs may require specific resources from workers, such as GPUs, CPUs, and certain amounts of RAM. You can declare these requirements by adding num_cpus, num_gpus, and memory parameters to the UDF.
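A sketch using the parameter names above (the specific values, and the units accepted by memory, are assumptions):

```python
from geneva import udf

# Request 4 CPUs, 1 GPU, and 8 GiB of RAM per worker running this UDF.
# (Assumes `memory` is specified in bytes.)
@udf(num_cpus=4, num_gpus=1, memory=8 * 1024**3)
def caption(image: bytes) -> str:
    # Placeholder body; a real UDF would run a captioning model here.
    return "a caption"
```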
Operational parameters for UDFs
UDFs can be quite varied: some are simple operations where thousands of calls complete per second, while others are slow and take 30s per row. In LanceDB, the default number of rows per fragment is 1024 * 1024. Consider a captioning UDF that takes 30s per row. It could take nearly a year (!) before any results show up (30s/row * 1,048,576 rows/fragment ≈ 31.5M s/fragment ≈ 8.7k hours/fragment ≈ 364 days/fragment). To enable this work to be parallelized, we provide a batch_size setting so the work can be split between workers, and so that partial results are checkpointed more frequently, enabling finer-grained progress tracking and job recovery.
By default, batch_size is 100 computed rows per checkpoint. So for an expensive captioning UDF that takes 30s per row, you would get a checkpoint every 3000s (50 minutes); with 100 GPUs, the job could finish in about 3.6 days. For cheap operations that compute 100 rows per second, you'd potentially be checkpointing every second. Tuning batch_size can help you see progress more frequently.
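For the slow captioning UDF above, you might shrink batch_size so checkpoints land every few minutes rather than every 50 (a sketch):

```python
from geneva import udf

# At ~30s/row, checkpoint every 10 rows (~5 minutes)
# instead of the default 100 rows (~50 minutes).
@udf(batch_size=10)
def caption(image: bytes) -> str:
    return "a caption"  # placeholder body
```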
Registering Features with UDFs
Registering a feature is done by passing a new column name and the Geneva UDF to the Table.add_columns() function.
Let’s start by obtaining the table tbl:
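A sketch, assuming a connect/open_table flow similar to LanceDB's (the URI and table name are placeholders) and that add_columns accepts a mapping of column name to UDF:

```python
import geneva

# Connect and open the table (URI and table name are placeholders).
db = geneva.connect("db://my_project")
tbl = db.open_table("images")

# Register a computed column "area" backed by the scalar UDF defined earlier.
tbl.add_columns({"area": area})
```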
Changing data in computed columns
Let’s say you backfilled data with your UDF and then noticed that your data has some issues. Here are a few scenarios:
1. All the values are incorrect due to a bug in the UDF.
2. Most values are correct, but some are incorrect due to a failure during UDF execution.
3. The values were calculated correctly, and you want to perform a second pass to fix up some of them.
In scenario 1, you'll want to fix the bug in your UDF, perform an alter_table, and then backfill.
In scenario 2, you'll most likely want to re-execute the backfill to fill in the missing values. If the error is in your code (e.g., certain cases weren't handled), you can modify the UDF, perform an alter_table, and then backfill with some filters.
In scenario 3, you have a few options:
- A) Alter your UDF to include the fixup operations, then alter_table and backfill, recalculating all the values.
- B) Build a chain of computed columns: create a new column, calculate the fixed-up values, and have your application use the new column (or a combination of it and the original). This is similar to A but avoids recalculating the original column, at the cost of more storage.
- C) Update the values in the column directly with the fixed-up values. This may be expedient, but it sacrifices reproducibility.
The next section shows you how to change your column definition by altering the UDF.
Altering UDFs
You now want to revise the code. To make the change, update the UDF used to compute the column using the alter_columns API and the updated function. The example below replaces the definition of the column area to use the area_udf_v2 function.
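A sketch of that call (the exact argument shape accepted by alter_columns is an assumption):

```python
# Repoint the computed column "area" at the revised UDF.
tbl.alter_columns({"path": "area", "udf": area_udf_v2})
```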
If you then run a backfill operation, all values will be recalculated and updated. If you only want some rows updated, you can perform a filtered backfill, targeting the specific rows that need the new updates.
For example, this filter would only update rows where area is currently null:
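A sketch, assuming backfill accepts the column name and a SQL-style filter (the where parameter name is an assumption):

```python
# Recompute only the rows where area is currently null.
tbl.backfill("area", where="area IS NULL")
```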