Stateful UDFs are the most common source of worker memory pressure in Geneva. Unlike scalar UDFs, a stateful UDF instance lives for the entire lifetime of a Ray actor, processing many batches in sequence. Anything your setup() allocates is held for the duration of the job, and anything __call__ retains accumulates batch after batch, sometimes silently, until a worker OOMs partway through a large backfill.
This page shows you how to profile a stateful UDF locally with memray, what to look for, and the common patterns that leak.
Why stateful UDFs leak
A stateful UDF in Geneva is a class. A few facts make memory behavior easy to get wrong:

- One instance per worker, many batches. Geneva instantiates the class once per Ray actor. The same self processes every batch routed to that worker, potentially thousands.
- setup() runs once. Whatever you allocate there stays in memory until the actor dies. That’s intentional for things like ML models, but it’s a footgun for “lazy” caches that grow.
- self.<attr> survives across calls. Anything you attach to self inside __call__ is retained for the rest of the actor’s life.
- Workers don’t restart between batches. Unlike a serverless function, you don’t get a fresh process per invocation. Anything you retain accumulates with batch count until the worker hits its memory cap.
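The lifecycle described above can be sketched in plain Python. This is only an illustration: CountingUDF and run_actor are hypothetical names, and Geneva's own decorator and registration calls are deliberately left out.

```python
class CountingUDF:
    """Plain-Python stand-in for a stateful UDF (Geneva's decorator is omitted)."""

    def setup(self) -> None:
        # Runs once per actor: anything allocated here lives until the actor dies.
        self.model = bytearray(1024 * 1024)  # stand-in for model weights
        self.seen = []                       # state that __call__ will grow

    def __call__(self, batch: list) -> list:
        # Runs once per batch on the same instance, so self.seen keeps growing.
        self.seen.extend(batch)
        return [len(x) for x in batch]


def run_actor(udf, batches):
    """Mimic one worker: a single setup() followed by many __call__ invocations."""
    udf.setup()
    return [udf(b) for b in batches]


udf = CountingUDF()
outputs = run_actor(udf, [["a", "bb"], ["ccc"], ["dddd", "e"]])
retained = len(udf.seen)  # every input from every batch is still referenced
```

The point of the sketch is that nothing resets between calls: whatever __call__ attaches to self is still there for the next batch.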
When to profile
Profile your UDF if any of these are true:

- The UDF loads a model, builds an index, or otherwise allocates more than ~100 MiB in setup().
- The UDF maintains a cache, deduplication table, or running statistic in self.
- A worker is being OOM-killed during backfill (look for FatalWorkerOOMError, see Job troubleshooting).
- Worker RSS grows steadily during a backfill rather than staying flat after setup().

You can skip profiling for stateless UDFs (no self state) or UDFs that only ever read from self; those can’t leak by construction.
Profiling a UDF with memray
memray ships as a dev dependency in Geneva, so it’s already in your environment if you installed with uv sync.
The trick to profiling under Ray is that workers run in separate processes. Wrapping pytest with memray run only sees the driver, not the actors that actually run your UDF. The cleanest pattern is to have the UDF instrument itself, controlled by an environment variable that’s only set when you want a profile.
Step 1 — add an opt-in tracker to your UDF
The tracker is a no-op when MY_UDF_MEMRAY_OUT_DIR isn’t set, so leaving this code in your UDF is safe for production runs.
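A minimal sketch of such an opt-in tracker follows. MemrayProbe and MyStatefulUDF are hypothetical names, the UDF is a plain class without Geneva's decorator, and the atexit hook is a best-effort assumption: an actor that is killed outright will not run it. memray.Tracker is memray's documented context manager, entered and exited by hand here so it can span many calls.

```python
import atexit
import os
import uuid
from pathlib import Path

ENV_VAR = "MY_UDF_MEMRAY_OUT_DIR"


class MemrayProbe:
    """Opt-in memray tracker; a no-op unless ENV_VAR is set."""

    def __init__(self) -> None:
        self._tracker = None
        self.path = None

    def start(self) -> bool:
        out_dir = os.environ.get(ENV_VAR)
        if not out_dir or self._tracker is not None:
            return False  # profiling disabled, or already running
        import memray  # lazy import: only needed when profiling is on

        Path(out_dir).mkdir(parents=True, exist_ok=True)
        self.path = Path(out_dir) / f"memray-{os.getpid()}-{uuid.uuid4().hex}.bin"
        self._tracker = memray.Tracker(str(self.path))
        self._tracker.__enter__()
        atexit.register(self.stop)  # best effort: close the capture on worker exit
        return True

    def stop(self) -> None:
        if self._tracker is not None:
            self._tracker.__exit__(None, None, None)
            self._tracker = None


class MyStatefulUDF:
    def setup(self) -> None:
        self._probe = MemrayProbe()
        self._probe.start()  # start first so setup() allocations are recorded
        self.model = bytearray(1024)  # stand-in for real setup work

    def __call__(self, batch: list) -> list:
        return [len(x) for x in batch]
```

Starting the tracker at the top of setup() means the baseline allocations appear in the capture, which makes it easier to separate them from growth in __call__.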
Step 2 — propagate the env var to Ray workers
Ray workers don’t inherit driver environment variables by default. When you start a local Ray cluster from Geneva, pass extra_env so the variable reaches each worker.
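Geneva's extra_env call is not reproduced here, since its exact signature is not shown on this page. As a stand-in, the sketch below uses Ray's own runtime_env mechanism, which sets environment variables on workers. The helper names are hypothetical, and the output directory is the one used elsewhere on this page.

```python
import os

PROFILE_DIR = "/tmp/my-udf-profile"


def profiling_runtime_env(out_dir: str = PROFILE_DIR) -> dict:
    """Build a Ray runtime_env that sets the opt-in variable on every worker."""
    return {"env_vars": {"MY_UDF_MEMRAY_OUT_DIR": out_dir}}


def start_local_ray_with_profiling(out_dir: str = PROFILE_DIR):
    """Start (or connect to) Ray with the variable propagated to workers."""
    import ray  # imported here so the helper above works without Ray installed

    os.makedirs(out_dir, exist_ok=True)
    return ray.init(runtime_env=profiling_runtime_env(out_dir))
```

If you go through Geneva's own cluster helpers instead, the same key and value are what extra_env needs to carry.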
Step 3 — read the trace
When the backfill finishes, you’ll have one (or more) memray-<pid>-<uuid>.bin files under /tmp/my-udf-profile/. Render and inspect them with memray’s flamegraph and summary reporters.
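One way to run both reporters over every capture is a small driver script. This is a sketch: report_commands and render_all are hypothetical helpers, and they simply shell out to the memray CLI through python -m memray.

```python
import subprocess
import sys
from pathlib import Path


def report_commands(bin_path: Path) -> list:
    """Return the memray CLI invocations for one capture file."""
    base = [sys.executable, "-m", "memray"]
    return [
        base + ["flamegraph", str(bin_path)],  # renders an HTML flamegraph
        base + ["summary", str(bin_path)],     # prints a per-function table
    ]


def render_all(profile_dir: str = "/tmp/my-udf-profile") -> int:
    """Run both reports for every capture in profile_dir; return how many were found."""
    captures = sorted(Path(profile_dir).glob("memray-*.bin"))
    for capture in captures:
        for cmd in report_commands(capture):
            # check=False: keep going if one report fails, e.g. output already exists
            subprocess.run(cmd, check=False)
    return len(captures)
```

With several actors you get several captures, so it helps to compare them: a leak in the UDF should show up in every worker's file, not just one.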
What the numbers mean
memray reports two values you’ll care about most:

- Peak heap (metadata.peak_memory): the high-water mark. This is what triggers OOMs. A peak well above your setup() allocations means a batch transiently balloons memory before freeing it.
- Leaked allocations (get_leaked_allocation_records()): what was still allocated when the tracker ended. This is not necessarily a bug; your setup() model is “leaked” in this sense because it lives for the actor’s lifetime. The signal is how much is leaked above the expected baseline.

In the rendered report, look for allocations under __call__ rising as you scroll through time; that’s the leak.
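Both values can also be read programmatically with memray's FileReader. The sketch below assumes you know roughly how many bytes setup() is expected to hold (baseline_bytes); summarize_capture and leaked_above_baseline are hypothetical helper names.

```python
def leaked_above_baseline(leaked_bytes: int, baseline_bytes: int) -> int:
    """Bytes still allocated at the end beyond what setup() is expected to hold."""
    return max(0, leaked_bytes - baseline_bytes)


def summarize_capture(bin_path: str, baseline_bytes: int) -> dict:
    """Read one memray capture and report peak heap and leak above baseline."""
    import memray  # lazy import so the helper above is usable without memray

    reader = memray.FileReader(bin_path)
    leaked = sum(
        record.size
        for record in reader.get_leaked_allocation_records(merge_threads=True)
    )
    return {
        "peak_bytes": reader.metadata.peak_memory,
        "leaked_bytes": leaked,
        "leak_above_baseline": leaked_above_baseline(leaked, baseline_bytes),
    }
```

Running this on captures from a short and a long backfill is a quick comparison: if leak_above_baseline grows with the batch count, something in __call__ is retaining memory.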
Going deeper: RSS vs Arrow allocations
memray gives you the Python-side allocation story. For real diagnosis of “where is the worker’s memory actually going?”, it pays to watch process RSS and Arrow’s own allocator side by side. Together they tell you which subsystem owns the bytes, often faster than reading a flamegraph. Log a snapshot of three numbers from inside your UDF (or anywhere on the worker); they answer different questions:

- rss_mb: every byte the OS has handed this Python interpreter. Includes Python heap, Arrow, native libraries, and pages the C allocator (glibc/jemalloc) is holding even though Python freed them. This is what triggers cgroup OOM-kills.
- arrow_live_mb: bytes currently held by live PyArrow buffers (RecordBatch, Array, ChunkedArray, etc.). Goes up when you create Arrow data, down when those references are dropped.
- gap_mb = rss − arrow_live: “everything else.” This is the Python heap (your own self.cache, model weights, dicts, lists), native libraries (PyTorch, ONNX), and allocator retention.
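A snapshot helper along these lines would produce the three numbers. It is a sketch that assumes psutil and pyarrow are installed on the worker; breakdown and memory_snapshot are hypothetical names, and pa.total_allocated_bytes() reports PyArrow's default memory pool only.

```python
import os

MB = 1024 * 1024


def breakdown(rss_bytes: int, arrow_bytes: int) -> dict:
    """Convert raw byte counts into the three numbers discussed above."""
    return {
        "rss_mb": round(rss_bytes / MB, 1),
        "arrow_live_mb": round(arrow_bytes / MB, 1),
        "gap_mb": round((rss_bytes - arrow_bytes) / MB, 1),
    }


def memory_snapshot() -> dict:
    """Sample process RSS and Arrow's default-pool allocation right now."""
    import psutil          # assumed available on the worker
    import pyarrow as pa   # assumed available on the worker

    rss = psutil.Process(os.getpid()).memory_info().rss
    return breakdown(rss, pa.total_allocated_bytes())
```

Calling memory_snapshot() every N batches from __call__ and logging the result is usually enough to see which of the three numbers is moving.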
Diagnostic patterns
Log the breakdown every few batches and the shape of growth over time tells you which subsystem to fix:

| Pattern | Diagnosis | First thing to try |
|---|---|---|
| rss climbs slowly, arrow_live flat near zero, big growing gap | Allocator retention: Python freed it but glibc is keeping the pages | ctypes.CDLL("libc.so.6").malloc_trim(0) periodically (Linux only); or set MALLOC_TRIM_THRESHOLD_=131072 |
| rss climbs, arrow_live climbs in lockstep | Real Arrow leak: your code is holding RecordBatch / Array references | Find where you’re appending batches to self, or where checkpoint / error payloads aren’t being released |
| rss spikes hugely on a few calls then settles, eventually one spike OOMs | Peak is too big, not a leak: a single call allocates more than the worker has | Shrink batch_size, blob_read_buffer_size, or split the work |
| rss flat for hours then sudden cliff upward | One pathological row, usually one huge blob (a 4K-resolution image, a 50 MB PDF) | Find the offending row by ID; add a size check at the top of __call__ |
| rss rises during setup(), then flat for the whole run, gap constant | Healthy: that’s your model loaded once per actor | Nothing to do |
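For the allocator-retention row, the periodic malloc_trim call can be wrapped so it is safe on platforms without glibc. This is a sketch with hypothetical names (trim_allocator, TrimEvery); the interval of 100 batches is an arbitrary assumption to tune.

```python
import ctypes
import sys


def trim_allocator() -> bool:
    """Ask glibc to return free heap pages to the OS; False if unavailable."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        return bool(libc.malloc_trim(0))  # 1 if memory was released, else 0
    except (OSError, AttributeError):
        return False  # libc.so.6 missing, or it has no malloc_trim symbol


class TrimEvery:
    """Call trim_allocator() once every n batches."""

    def __init__(self, n: int = 100) -> None:
        self.n = n
        self.calls = 0

    def tick(self) -> bool:
        self.calls += 1
        if self.calls % self.n == 0:
            trim_allocator()
            return True  # a trim was attempted on this call
        return False
```

A UDF would create one TrimEvery in setup() and call tick() at the end of __call__. If rss does not drop after a trim, the growth is not allocator retention and one of the other rows applies.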
Common leak patterns
1. The growing cache
An unbounded cache on self looks harmless. It is fine on a unit test with 10 inputs and catastrophic on a backfill of 10M rows, where most inputs are unique and the cache grows to fill the worker.

Fix: Use a bounded cache (functools.lru_cache with maxsize, or a manual size cap), or skip caching when you don’t know the cardinality.
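A minimal sketch of the pattern and of a manual size cap follows. Both classes are hypothetical plain-Python stand-ins, key.upper() stands in for an expensive computation, and the cap of 1000 entries is an arbitrary assumption.

```python
from collections import OrderedDict


class UnboundedCacheUDF:
    def setup(self) -> None:
        self.cache = {}  # grows by one entry per unique input, forever

    def __call__(self, batch: list) -> list:
        out = []
        for key in batch:
            if key not in self.cache:
                self.cache[key] = key.upper()  # stand-in for expensive work
            out.append(self.cache[key])
        return out


class BoundedCacheUDF:
    def setup(self, max_entries: int = 1000) -> None:
        self.max_entries = max_entries
        self.cache = OrderedDict()

    def __call__(self, batch: list) -> list:
        out = []
        for key in batch:
            if key in self.cache:
                self.cache.move_to_end(key)  # mark as most recently used
                value = self.cache[key]
            else:
                value = key.upper()
                self.cache[key] = value
                if len(self.cache) > self.max_entries:
                    self.cache.popitem(last=False)  # evict least recently used
            out.append(value)
        return out


keys = [f"k{i}" for i in range(5000)]
leaky, bounded = UnboundedCacheUDF(), BoundedCacheUDF()
leaky.setup()
bounded.setup()
leaky(keys)
bounded(keys)
sizes = (len(leaky.cache), len(bounded.cache))
```

After 5000 unique keys the unbounded cache holds all of them while the bounded one stays at its cap, and both still return the same outputs.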
2. Accumulating per-call buffers
Fix: Don’t hold references to inputs past the return of __call__. If you need rolling state, summarize into a small aggregate (counts, sums) instead of holding the raw batches.
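As an illustration, here is a hypothetical UDF that centers each value on the running mean, written once with retained batches and once with a small aggregate. Both are plain-Python sketches without Geneva's decorator.

```python
class LeakyStatsUDF:
    def setup(self) -> None:
        self.history = []

    def __call__(self, batch: list) -> list:
        self.history.append(batch)  # holds every input batch for the actor's life
        total = sum(sum(b) for b in self.history)
        count = sum(len(b) for b in self.history)
        return [x - total / count for x in batch]


class RollingStatsUDF:
    def setup(self) -> None:
        self.count = 0
        self.total = 0.0

    def __call__(self, batch: list) -> list:
        if not batch:
            return []
        # Only two numbers survive the call; the batch itself is not retained.
        self.count += len(batch)
        self.total += sum(batch)
        mean = self.total / self.count
        return [x - mean for x in batch]


batches = [[1.0, 2.0, 3.0], [4.0, 5.0]]
leaky, rolling = LeakyStatsUDF(), RollingStatsUDF()
leaky.setup()
rolling.setup()
leaky_out = [leaky(b) for b in batches]
rolling_out = [rolling(b) for b in batches]
```

The two versions return identical results, but the second keeps constant state regardless of how many batches the worker processes.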
3. Closures capturing batch arrays
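A small sketch of this pattern, with hypothetical class names: a deferred callback stored on self closes over the whole batch, which keeps the batch alive. The second variant binds only the one integer it needs. Batch is a list subclass used purely so the demo can take a weak reference.

```python
import gc
import weakref


class Batch(list):
    """List subclass so the demo can hold a weak reference to a batch."""


class ClosureLeakUDF:
    def setup(self) -> None:
        self.pending = []

    def __call__(self, batch: Batch) -> list:
        # The lambda closes over `batch`, so the whole batch stays alive
        # for as long as the callback sits in self.pending.
        self.pending.append(lambda: len(batch))
        return [len(x) for x in batch]


class ClosureSmallUDF:
    def setup(self) -> None:
        self.pending = []

    def __call__(self, batch: Batch) -> list:
        n_rows = len(batch)  # extract the one small value the callback needs
        self.pending.append(lambda n=n_rows: n)  # binds the int, not the batch
        return [len(x) for x in batch]


def batch_survives(udf) -> bool:
    """True if the batch is still reachable after the caller drops it."""
    udf.setup()
    batch = Batch(["a", "bb", "ccc"])
    ref = weakref.ref(batch)
    udf(batch)
    del batch
    gc.collect()
    return ref() is not None
```

Note that the list of pending callbacks itself still grows in both variants; the difference is only what each callback keeps alive.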
Fix: Extract only the small values you actually need into the closure, or execute the work eagerly.
4. ML model state that grows
Some ML libraries retain per-call state internally (KV caches, gradient buffers, autograd graphs). If you’re using PyTorch, run inference with gradient tracking disabled. For Hugging Face pipelines, ensure the model is in eval() mode and that you are not accumulating gradients. For long-running stateful UDFs on GPUs, also consider calling torch.cuda.empty_cache() between large batches.
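For PyTorch, a forward pass without autograd state might look like the sketch below. embed_batch is a hypothetical helper, model is assumed to be a torch.nn.Module, inputs is assumed to be a tensor it accepts, and torch is assumed to be installed on the worker.

```python
def embed_batch(model, inputs):
    """Run one forward pass without recording an autograd graph."""
    import torch  # assumed installed on workers that run this UDF

    model.eval()  # inference mode for layers such as dropout and batch norm
    with torch.no_grad():  # no graph or gradient buffers are kept for this call
        outputs = model(inputs)
    # Hand back a detached CPU copy so the caller holds no GPU tensor.
    return outputs.detach().cpu()
```

In a real UDF, model.eval() belongs in setup() since it only needs to run once; it is repeated here to keep the helper self-contained.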
A confidence check
A useful “does my profiling actually work?” sanity check: temporarily introduce a deliberate leak and confirm memray catches it.

If memray summary doesn’t show leaked bytes growing roughly with batch count after this change, your tracker isn’t actually attached (most often: the env var isn’t reaching workers; re-check extra_env).
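A deliberate leak for this check can be as small as the sketch below. DeliberateLeakUDF is a hypothetical name and 64 KiB per call is an arbitrary size, chosen only so that the growth is easy to spot in the report.

```python
class DeliberateLeakUDF:
    LEAK_PER_CALL = 64 * 1024  # bytes retained on every call

    def setup(self) -> None:
        self._leak = []

    def __call__(self, batch: list) -> list:
        # Intentional leak for the sanity check; remove it once profiling is verified.
        self._leak.append(bytearray(self.LEAK_PER_CALL))
        return [len(x) for x in batch]

    def leaked_bytes(self) -> int:
        return sum(len(chunk) for chunk in self._leak)


udf = DeliberateLeakUDF()
udf.setup()
for _ in range(10):
    udf(["x"])
expected_leak = udf.leaked_bytes()  # grows by LEAK_PER_CALL per batch
```

Because the leak is a known number of bytes per batch, you can compare the leaked total in the capture against batches times LEAK_PER_CALL.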
Geneva’s own test suite ships a reference implementation of this pattern in src/stress_tests/_memray_probe.py and src/stress_tests/test_memray_stateful_udf.py, plus a GitHub Actions workflow (memray-stateful-udf-profile.yml) that uploads the per-actor .bin and rendered flamegraph as a CI artifact. Feel free to copy that scaffolding for your own project’s UDFs.
Related
- UDFs — defining stateful UDFs
- Job troubleshooting — diagnosing OOMs and other worker errors
- Advanced configuration — admission control and resource limits
- memray documentation — flamegraph, summary, and tree report formats