Common Issues to Verify
Here are some areas to verify to identify the source of problems with your Geneva deployment:
- Version compatibility (Ray, Python, Lance)
- Remote Ray execution and hardware resource availability
- Sufficient permissions to access data
- Worker code only returns serializable values (no open files, no GPU-resident buffers)
Confirming Dependency Versions
Geneva uses Ray for distributed execution. Ray requires that the versions deployed on cluster services and on clients be exactly the same. The minor version of Python must also match between the client and cluster services (e.g. 3.10.3 and 3.10.5 are ok, but 3.10.3 and 3.12.1 are not). Geneva has been tested with Ray 2.44+ and Python 3.10.x and 3.12.x. You can run this code in your notebook to verify your environment matches your expectations:
Confirming Remote Ray Execution
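A quick way to check both concerns is a small helper that reports the client-side versions; running the same helper inside a @ray.remote task and comparing the results confirms both version alignment and that remote execution works. This is a sketch, not Geneva's own tooling, and the package names are assumptions about your environment (Lance ships on PyPI as pylance):

```python
import sys
from importlib.metadata import version, PackageNotFoundError

def runtime_versions() -> dict:
    """Collect the versions that must line up between client and cluster."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    # Package names are assumptions: Lance is published on PyPI as "pylance".
    for pkg in ("ray", "pylance", "geneva", "attrs"):
        try:
            info[pkg] = version(pkg)
        except PackageNotFoundError:
            info[pkg] = "not installed"
    return info

print(runtime_versions())
```

Wrapping runtime_versions in a @ray.remote task and calling ray.get on it returns the worker-side versions, which you can then compare field by field with the client-side output.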
Geneva allows you to specify the resources of your worker nodes. You can verify that your cluster has the resources (e.g. GPUs) available for your jobs and that remote execution is working properly. You can get some basic information about the resources available to your Ray cluster:

Note: You should execute Geneva code from a machine or VM that has the same architecture and OS type as the nodes in your cluster so that shared libraries can be shipped correctly. For example, if you use a Mac to host a Jupyter notebook, Geneva will push Mac libraries to your Linux cluster, which will likely result in module-not-found errors due to the OS/architecture mismatch.

For GPU-dependent UDFs and jobs, you can verify that GPU worker nodes have the CUDA library:
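For instance, a @ray.remote(num_gpus=1) task that imports torch and returns torch.cuda.is_available() is a common way to probe the CUDA runtime on a GPU worker. More generally, ray.cluster_resources() returns a plain dict of resource totals that you can validate against what your job needs. The helper below is a sketch: the hard-coded dicts stand in for real cluster output.

```python
def check_resources(cluster: dict, required: dict) -> list:
    """Compare required resources against what the cluster reports.

    In practice `cluster` comes from ray.cluster_resources(); plain dicts
    are used here so the helper can be shown standalone.
    """
    problems = []
    for name, needed in required.items():
        have = cluster.get(name, 0.0)
        if have < needed:
            problems.append(f"need {needed} {name}, cluster reports {have}")
    return problems

# A CPU-only cluster cannot satisfy a GPU-dependent job:
print(check_resources({"CPU": 64.0}, {"CPU": 8.0, "GPU": 1.0}))
# -> ['need 1.0 GPU, cluster reports 0.0']
```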
Confirming Sufficient Permissions
While your notebook or working environment may have credentials to read and write to particular buckets, your jobs also need sufficient rights to read and write to them. Adding import geneva to any remote function can help verify that your workers have sufficient grants.
Here we add import geneva to help trigger potential permissions problems:
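A minimal sketch of the idea, assuming a Ray-style remote task (the function name and bucket URI are placeholders; in a real job the function would be decorated with @ray.remote and executed on the cluster):

```python
def probe_worker(uri: str) -> str:
    """Body of a task to run remotely (e.g. via @ray.remote) on a worker.

    Importing geneva inside the function forces every worker to resolve the
    same packages and credentials the real job will need, surfacing
    permission and dependency problems early.
    """
    try:
        import geneva  # noqa: F401
    except ImportError as exc:
        return f"worker cannot import geneva: {exc}"
    # A real probe might now attempt a small read from `uri`.
    return f"geneva import ok; ready to access {uri}"

print(probe_worker("s3://example-bucket/my-table"))
```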
GCE Permissions Errors in Job Logs
If you are using Geneva-managed Ray deployed on GKE, the errors may look like this:

These errors typically indicate that the Kubernetes service account the Ray cluster runs under (e.g. service_account="geneva-integ-test") lacks the required grants; ensure it has read and write access to the relevant buckets.
Serialization Errors
Serialization is a critical subsystem of Geneva. In order to store UDFs and perform distributed execution, both code and data must be serializable. Errors in this area can be subtle and difficult to find. There are a few basic rules:
- Python objects passed to distributed processes or written to LanceDB must be able to be pickled and unpickled using the Python pickle or cloudpickle library.
- Python code used for distributed execution, including UDFs used to calculate values written to columns, must likewise be picklable and unpicklable with pickle or cloudpickle.
- Python code and objects must have the same encoding and representation on the client side and the server side.
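These rules can be checked up front: a quick pickle round-trip on your UDFs and payloads catches many problems before a job is submitted. Note that Ray uses cloudpickle, which handles somewhat more than the stdlib pickle shown here (e.g. lambdas), so this is a conservative sketch:

```python
import os
import pickle

def survives_pickle(obj) -> bool:
    """Round-trip an object through pickle; a False result flags state
    that will not ship cleanly to workers or into LanceDB."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

print(survives_pickle({"id": 1, "vec": [0.1, 0.2]}))  # True: plain data
print(survives_pickle(open(os.devnull)))              # False: open file handle
```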
Serialization Library Mismatches
Any Python code and objects must be able to be serialized by the client and deserialized on the server side, and vice versa. This includes objects that are generated on the fly, such as those created using the attrs library.
The distributed processing engine Geneva uses, Ray, also depends on the attrs library. Different versions may create different object signatures that are not compatible when shipped from client-side to server-side and vice versa. This means you’ll need to have compatible versions of this library on both sides.
Here’s an example error message. It is subtle and does not directly point to the attrs library:
To resolve this, pin the attrs module on the client side to use the same version found on the server side.
Objects with Unserializable Elements
Python objects may have internal references to unpickleable objects such as open file handles or open network clients with machine-specific state. There are two strategies here:
- Remove the reference to unpickleable objects.
- Keep objects with unserializable state only on the client or only on the server. This could mean moving client construction into the UDF function, or converting objects into serializable versions before transmitting them.
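The second strategy can be sketched as follows. The Client class is a stand-in for a real network client, with a lock playing the role of unpicklable connection state (sockets, sessions, etc.):

```python
import pickle
import threading

class Client:
    """Stand-in for a real network client; the lock mimics socket-like
    machine-specific state that cannot be pickled."""
    def __init__(self):
        self._lock = threading.Lock()

    def get(self, key: str) -> str:
        return f"value-for-{key}"

def picklable(obj) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# Holding the client in state that gets shipped will fail:
print(picklable(Client()))   # False: the lock blocks serialization

# Instead, construct the client inside the UDF so it only ever
# exists on the worker and is never serialized:
def udf(key: str) -> str:
    client = Client()        # created per-worker, never shipped
    return client.get(key)

print(udf("a"))              # value-for-a
```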
Disconnect or Serialization Errors with GPU Dependent UDFs
When using GPU code, the typical process loads values and tensors from CPU memory into GPU memory. Even after moving data back (data.cpu().tolist()), there may still be references to GPU memory. While this is not a problem with local execution, in a distributed job it can cause failures because the GPU references are neither serializable nor needed. You must take steps to eliminate references to structures in GPU memory, since they cannot be serialized and sent between workers. This can be achieved by explicitly disconnecting references to the GPU memory (data.cpu().detach().tolist()) to obtain CPU-resident, fully serializable objects.
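A sketch of the pattern inside a UDF (not runnable as-is: model is a hypothetical PyTorch model, and the code assumes torch with CUDA is available on the worker):

```python
def embed_udf(batch):
    data = model(batch)   # hypothetical model; output tensor lives on the GPU
    # detach() drops the autograd graph, cpu() copies out of GPU memory, and
    # tolist() yields plain Python floats that pickle cleanly.
    return data.cpu().detach().tolist()
```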
Here are some typical error messages: