Cubed with Lithops for complex data transformations hits invocation hangup #841
---
Hey folks, I'm trying to use Cubed with Lithops for some complex data transformations, and the invocations hang. The gist of the code is here:

```python
import numpy as np
import xarray as xr
import cubed

# `spec` (the cubed.Spec configured for the Lithops executor) is
# defined elsewhere in the script.

def write_relative_humidity(t2_path, q2_path, psfc_path):
    """Compute relative humidity from 2m temperature, specific humidity, and surface pressure."""
    # Derive the output path from the t2 input path
    output_path = t2_path.replace("t2", "rh")
    output_path = output_path.replace("cadcat", "cadcat-tmp")
    print(f"Processing target: {output_path}")

    print("  Opening datasets...")
    t2_ds = xr.open_zarr(
        t2_path,
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )
    q2_ds = xr.open_zarr(
        q2_path,
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )
    psfc_ds = xr.open_zarr(
        psfc_path,
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )

    # Select variables
    t2 = t2_ds["t2"]
    q2 = q2_ds["q2"]
    psfc = psfc_ds["psfc"]

    print("  Setting up calculation graph...")
    # --- Unit conversions ---
    # Temperature: Kelvin -> Celsius
    t2_c = t2 - 273.15
    # Surface pressure: Pa -> hPa (mb)
    psfc_hpa = psfc / 100.0
    # Specific humidity: kg/kg -> g/kg
    q2_gkg = q2 * 1000.0

    # Saturation vapor pressure (hPa)
    e_s = 6.11 * 10 ** (7.5 * (t2_c / (237.7 + t2_c)))
    # Saturation mixing ratio (g/kg)
    w_s = 621.97 * (e_s / (psfc_hpa - e_s))
    # Relative humidity, 0 to 100
    rel_hum = 100 * (q2_gkg / w_s)

    # Create constants with the same spec to avoid Spec mismatch errors
    c_0_5 = xr.DataArray(cubed.from_array(np.array(0.5), spec=spec, chunks=()))
    c_100 = xr.DataArray(cubed.from_array(np.array(100), spec=spec, chunks=()))

    # Reset unrealistically low relative humidity values;
    # the lowest recorded relative humidity value in CA is 0.8%
    rel_hum = xr.where(rel_hum > 0.5, rel_hum, c_0_5)
    # Reset values above 100 to 100
    rel_hum = xr.where(rel_hum < 100, rel_hum, c_100)

    # Reassign coordinate attributes
    for coord in list(rel_hum.coords):
        if coord in t2.coords:
            rel_hum[coord].attrs = t2[coord].attrs

    # Assign descriptive name and units
    rel_hum.name = "rh"
    rel_hum.attrs["units"] = "[0 to 100]"

    print("  Executing and writing to Zarr...")
    # Use RichProgressBar for visualization if running in interactive terminal
    rel_hum.to_zarr(output_path, mode="w", zarr_format=2)
    return rel_hum.nbytes
```
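As a sanity check on the humidity math, the same conversions and formula can be evaluated on plain scalars with NumPy alone (a sketch independent of Cubed/xarray; the input values below are illustrative, not taken from the dataset):

```python
import numpy as np

def relative_humidity(t2_k, q2_kgkg, psfc_pa):
    """Relative humidity (%) from 2 m temperature (K), specific
    humidity (kg/kg), and surface pressure (Pa), mirroring the
    steps in write_relative_humidity above."""
    t2_c = t2_k - 273.15          # Kelvin -> Celsius
    psfc_hpa = psfc_pa / 100.0    # Pa -> hPa
    q2_gkg = q2_kgkg * 1000.0     # kg/kg -> g/kg
    # Saturation vapor pressure (hPa), Magnus-style approximation
    e_s = 6.11 * 10 ** (7.5 * t2_c / (237.7 + t2_c))
    # Saturation mixing ratio (g/kg)
    w_s = 621.97 * e_s / (psfc_hpa - e_s)
    rh = 100 * q2_gkg / w_s
    # Clamp to [0.5, 100], matching the two xr.where steps
    return float(np.clip(rh, 0.5, 100.0))

# 20 °C, 10 g/kg specific humidity, standard sea-level pressure
print(relative_humidity(293.15, 0.010, 101325.0))  # roughly 68%
```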
Replies: 8 comments 6 replies
---
Hi @neilSchroeder - thanks for the report. Can you share the plan visualization SVG? It would be useful to look at the Lithops logs to get a clue about what's happening. Have you set Lithops logging to DEBUG?
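For reference, the Lithops logger can be turned up from Python with the standard `logging` module (a generic sketch; Lithops also accepts a log level in its config file, but the exact key depends on the version you're running):

```python
import logging

# Show log records, then turn the "lithops" logger up to DEBUG so
# the monitor/invoker messages appear in the output.
logging.basicConfig(level=logging.INFO)
logging.getLogger("lithops").setLevel(logging.DEBUG)
```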
---
Hey Tom, thanks so much for the quick reply. I'm struggling to get the plan to visualize, and it looks like that's on my end. I have tried to write a partial selection of ~30 years' worth of 3 km resolution data instead of ~75 years, and that also hangs. Here are the log files; apologies for the large quantity. Most of these are fairly small jobs with either 1 or 64 function activations.
---
Fascinating output here. Based on the output of the …
---
Okay, I think I finally made some progress debugging this.

**Lithops `StorageMonitor` race condition**

The Lithops storage monitor appears to have a race condition that results in miscounting "Done" tasks across jobs within an `ExecutorID` session. I ran with the logging for Lithops set to DEBUG and found these lines in the output:

```
2025-11-29 16:36:49,038 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 0 - Running: 0 - Done: 3328
... [a bunch of invoker/wait/futures/etc. outputs here] ...
# the very next monitor statements say:
2025-11-29 16:37:00,166 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 3226 - Running: 54 - Done: 3376
2025-11-29 16:37:00,175 [DEBUG] monitor.py:481 …
```
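To make the failure mode concrete, here is a toy model of the check-then-act race (my own sketch, not Lithops code): a monitor that exits the first time it observes zero pending tasks will die in the gap between two jobs of the same session and never see the second job's tasks.

```python
import threading
import time

class ToyMonitor:
    """Minimal stand-in for a storage monitor; not Lithops code."""

    def __init__(self):
        self.pending = 0
        self.lock = threading.Lock()

    def submit(self, n):
        with self.lock:
            self.pending += n

    def complete(self, n):
        with self.lock:
            self.pending -= n

    def _all_ready(self):
        with self.lock:
            return self.pending == 0

    def run_buggy(self):
        # Bug: exits the first instant all tasks look done, even
        # though the session may submit another job right after.
        while not self._all_ready():
            time.sleep(0.005)

mon = ToyMonitor()
mon.submit(4)                      # job 1 of the session
t = threading.Thread(target=mon.run_buggy)
t.start()
time.sleep(0.02)
mon.complete(4)                    # job 1 finishes -> pending == 0
time.sleep(0.05)                   # gap between jobs: monitor exits here
mon.submit(4)                      # job 2 arrives, but nobody is watching
t.join(timeout=1)
print(t.is_alive(), mon.pending)   # False 4 -> job 2 is never monitored
```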
---
I made a local patch to Lithops:

1. Fixed `is_alive()` not returning a value:

```python
# Before (bug):
def is_alive(self):
    self.monitor.is_alive()  # Returns None!

# After (fixed):
def is_alive(self):
    return self.monitor.is_alive()
```

2. Fixed the monitor loop exiting the moment `_all_ready()` first returns True:

```python
# Before: exits immediately when _all_ready() returns True
while not self._all_ready():
    ...

# After: requires 3 consecutive True checks (~3 second window)
consecutive_all_ready = 0
while self.should_run:
    ...
    if self._all_ready():
        consecutive_all_ready += 1
        if consecutive_all_ready >= 3:
            break
    else:
        consecutive_all_ready = 0
```

Now I'm able to process 671 GB in ~14 minutes (~48 GB/min throughput).
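The first bug is a classic Python pitfall: a method without a `return` yields `None`, which is falsy, so any caller doing `if wrapper.is_alive():` sees the monitor as dead even while its thread is running. A minimal repro (hypothetical names, not the Lithops code):

```python
import threading
import time

class MonitorWrapper:
    def __init__(self):
        # A worker thread that stays alive for a little while
        self.monitor = threading.Thread(target=time.sleep, args=(0.5,))
        self.monitor.start()

    def is_alive_buggy(self):
        self.monitor.is_alive()          # result discarded -> returns None

    def is_alive_fixed(self):
        return self.monitor.is_alive()

w = MonitorWrapper()
print(w.is_alive_buggy())   # None -- falsy even though the thread runs
print(w.is_alive_fixed())   # True while the thread is still sleeping
w.monitor.join()
```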
---
A more robust solution is to make sure the monitor never exits on its own and only exits when explicitly told to:

```python
while self.should_run:
    ...
    # No exit condition - just keep polling until done
```

The monitor thread stays alive slightly longer than strictly necessary, but for me the reliability gain is worth the trade-off.
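The "exit only when told" loop can be sketched with a `threading.Event`, which also gives you an interruptible sleep (again my own sketch, not the actual Lithops implementation):

```python
import threading
import time

class PollingMonitor:
    def __init__(self, poll_interval=0.01):
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self.polls = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # No completion-based exit condition: poll until stop() is called.
        while not self._stop.is_set():
            self.polls += 1
            # Event.wait doubles as an interruptible sleep
            self._stop.wait(self._poll_interval)

    def start(self):
        self._thread.start()

    def stop(self):
        # The only way the monitor exits
        self._stop.set()
        self._thread.join()

m = PollingMonitor()
m.start()
time.sleep(0.05)
m.stop()
print(m._thread.is_alive(), m.polls > 0)   # False True
```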
---
@neilSchroeder That's amazing! Thanks for tracking down and coming up with a fix for this problem! Do you have any thoughts about how we could improve things on the Cubed side to make it easier to diagnose problems like this in the future? Perhaps improve the docs on how to configure Lithops to get logs out of it?
---
I think this discussion is probably mostly closed at this point. Lithops has merged a fix for this. |

Okay, I think I finally made some progress debugging this.
Lithops
StorageMonitorrace conditionThe lithops storage monitor appears to have a race condition that results in miscounting "Done" tasks across jobs within an
ExecutorIDsession.I ran with the logging for lithops set to DEBUG and found these lines in the output:
2025-11-29 16:36:49,038 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 0 - Running: 0 - Done: 3328 ... [a bunch of invoker/wait/futures/etc. outputs here]... # the very next monitor statements say: 2025-11-29 16:37:00,166 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 3226 - Running: 54 - Done: 3376 2025-11-29 16:37:00,175 [DEBUG] monitor.py:481 …