Cubed with Lithops for complex data transformations hits invocation hangup #841
---
Hey folks, I'm trying to use Cubed with Lithops for some complex data transformations, and the invocations hang. The gist of the code is here:

```python
import numpy as np
import xarray as xr
import cubed

# `spec` (the cubed.Spec configured for the Lithops executor) is
# defined elsewhere in the script.

def write_relative_humidity(t2_path, q2_path, psfc_path):
    """Compute relative humidity from 2m temperature, specific humidity, and surface pressure."""
    # Derive the output path from the t2 input path
    output_path = t2_path.replace("t2", "rh")
    output_path = output_path.replace("cadcat", "cadcat-tmp")
    print(f"Processing target: {output_path}")

    print("  Opening datasets...")
    t2_ds = xr.open_zarr(
        t2_path,
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )
    q2_ds = xr.open_zarr(
        q2_path,
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )
    psfc_ds = xr.open_zarr(
        psfc_path,
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )

    # Select variables
    t2 = t2_ds["t2"]
    q2 = q2_ds["q2"]
    psfc = psfc_ds["psfc"]

    print("  Setting up calculation graph...")
    # --- Unit conversions ---
    # Temperature: Kelvin -> Celsius
    t2_c = t2 - 273.15
    # Surface pressure: Pa -> hPa (mb)
    psfc_hpa = psfc / 100.0
    # Specific humidity: kg/kg -> g/kg
    q2_gkg = q2 * 1000.0

    # Saturation vapor pressure (hPa)
    e_s = 6.11 * 10 ** (7.5 * (t2_c / (237.7 + t2_c)))
    # Saturation mixing ratio (g/kg)
    w_s = 621.97 * (e_s / (psfc_hpa - e_s))
    # Relative humidity, 0 to 100
    rel_hum = 100 * (q2_gkg / w_s)

    # Create constants with the same spec to avoid Spec mismatch errors
    c_0_5 = xr.DataArray(cubed.from_array(np.array(0.5), spec=spec, chunks=()))
    c_100 = xr.DataArray(cubed.from_array(np.array(100), spec=spec, chunks=()))

    # Reset unrealistically low relative humidity values;
    # the lowest recorded relative humidity value in CA is 0.8%
    rel_hum = xr.where(rel_hum > 0.5, rel_hum, c_0_5)
    # Reset values above 100 to 100
    rel_hum = xr.where(rel_hum < 100, rel_hum, c_100)

    # Reassign coordinate attributes
    for coord in list(rel_hum.coords):
        if coord in t2.coords:
            rel_hum[coord].attrs = t2[coord].attrs

    # Assign descriptive name and units
    rel_hum.name = "rh"
    rel_hum.attrs["units"] = "[0 to 100]"

    print("  Executing and writing to Zarr...")
    # Use RichProgressBar for visualization if running in interactive terminal
    rel_hum.to_zarr(output_path, mode="w", zarr_format=2)
    return rel_hum.nbytes
```
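As a sanity check on the humidity math, the same conversions and formula can be evaluated on plain scalars with NumPy alone (a sketch independent of Cubed/xarray; the input values below are illustrative, not taken from the dataset):

```python
import numpy as np

def relative_humidity(t2_k, q2_kgkg, psfc_pa):
    """Relative humidity (%) from 2 m temperature (K), specific
    humidity (kg/kg), and surface pressure (Pa), mirroring the
    steps in write_relative_humidity above."""
    t2_c = t2_k - 273.15          # Kelvin -> Celsius
    psfc_hpa = psfc_pa / 100.0    # Pa -> hPa
    q2_gkg = q2_kgkg * 1000.0     # kg/kg -> g/kg
    # Saturation vapor pressure (hPa), Magnus-style approximation
    e_s = 6.11 * 10 ** (7.5 * t2_c / (237.7 + t2_c))
    # Saturation mixing ratio (g/kg)
    w_s = 621.97 * e_s / (psfc_hpa - e_s)
    rh = 100 * q2_gkg / w_s
    # Clamp to [0.5, 100], matching the two xr.where steps
    return float(np.clip(rh, 0.5, 100.0))

# 20 °C, 10 g/kg specific humidity, standard sea-level pressure
print(relative_humidity(293.15, 0.010, 101325.0))  # roughly 68%
```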
Replies: 8 comments 6 replies
---
Hi @neilSchroeder - thanks for the report. Can you share the plan visualization SVG? It would be useful to look at the Lithops logs to get a clue about what's happening. Have you set Lithops logging to DEBUG?
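For reference, the Lithops logger can be turned up from Python with the standard `logging` module (a generic sketch; Lithops also accepts a log level in its config file, but the exact key depends on the version you're running):

```python
import logging

# Show log records, then turn the "lithops" logger up to DEBUG so
# the monitor/invoker messages appear in the output.
logging.basicConfig(level=logging.INFO)
logging.getLogger("lithops").setLevel(logging.DEBUG)
```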
---
Hey Tom, thanks so much for the quick reply. I'm struggling to get the plan to visualize, and it looks like that's on my end. I have tried to write a partial selection of ~30 years' worth of 3 km resolution data instead of ~75 years, and that also hangs. Here are the log files; apologies for the large quantity. Most of these are fairly small jobs with either 1 or 64 function activations.
---
Fascinating output here. Based on the output of the …
---
Okay, I think I finally made some progress debugging this.

**Lithops `StorageMonitor` race condition**

The Lithops storage monitor appears to have a race condition that results in miscounting "Done" tasks across jobs within an `ExecutorID` session. I ran with the logging for Lithops set to DEBUG and found these lines in the output:

```
2025-11-29 16:36:49,038 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 0 - Running: 0 - Done: 3328
... [a bunch of invoker/wait/futures/etc. outputs here] ...
# the very next monitor statements say:
2025-11-29 16:37:00,166 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 3226 - Running: 54 - Done: 3376
2025-11-29 16:37:00,175 [DEBUG] monitor.py:481 …
```
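To make the failure mode concrete, here is a toy model of the check-then-act race (my own sketch, not Lithops code): a monitor that exits the first time it observes zero pending tasks will die in the gap between two jobs of the same session and never see the second job's tasks.

```python
import threading
import time

class ToyMonitor:
    """Minimal stand-in for a storage monitor; not Lithops code."""

    def __init__(self):
        self.pending = 0
        self.lock = threading.Lock()

    def submit(self, n):
        with self.lock:
            self.pending += n

    def complete(self, n):
        with self.lock:
            self.pending -= n

    def _all_ready(self):
        with self.lock:
            return self.pending == 0

    def run_buggy(self):
        # Bug: exits the first instant all tasks look done, even
        # though the session may submit another job right after.
        while not self._all_ready():
            time.sleep(0.005)

mon = ToyMonitor()
mon.submit(4)                      # job 1 of the session
t = threading.Thread(target=mon.run_buggy)
t.start()
time.sleep(0.02)
mon.complete(4)                    # job 1 finishes -> pending == 0
time.sleep(0.05)                   # gap between jobs: monitor exits here
mon.submit(4)                      # job 2 arrives, but nobody is watching
t.join(timeout=1)
print(t.is_alive(), mon.pending)   # False 4 -> job 2 is never monitored
```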
---
I made a local patch to Lithops:

1. Fixed `is_alive()` not returning a value:

```python
# Before (bug):
def is_alive(self):
    self.monitor.is_alive()  # Returns None!

# After (fixed):
def is_alive(self):
    return self.monitor.is_alive()
```

2. Fixed the monitor loop exiting the moment `_all_ready()` first returns True:

```python
# Before: exits immediately when _all_ready() returns True
while not self._all_ready():
    ...

# After: requires 3 consecutive True checks (~3 second window)
consecutive_all_ready = 0
while self.should_run:
    ...
    if self._all_ready():
        consecutive_all_ready += 1
        if consecutive_all_ready >= 3:
            break
    else:
        consecutive_all_ready = 0
```

Now I'm able to process 671 GB in ~14 minutes (~48 GB/min throughput).
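The first bug is a classic Python pitfall: a method without a `return` yields `None`, which is falsy, so any caller doing `if wrapper.is_alive():` sees the monitor as dead even while its thread is running. A minimal repro (hypothetical names, not the Lithops code):

```python
import threading
import time

class MonitorWrapper:
    def __init__(self):
        # A worker thread that stays alive for a little while
        self.monitor = threading.Thread(target=time.sleep, args=(0.5,))
        self.monitor.start()

    def is_alive_buggy(self):
        self.monitor.is_alive()          # result discarded -> returns None

    def is_alive_fixed(self):
        return self.monitor.is_alive()

w = MonitorWrapper()
print(w.is_alive_buggy())   # None -- falsy even though the thread runs
print(w.is_alive_fixed())   # True while the thread is still sleeping
w.monitor.join()
```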
---
A more robust solution is to make sure the monitor never exits on its own and only exits when explicitly told to:

```python
while self.should_run:
    ...
    # No exit condition - just keep polling until done
```

The monitor thread stays alive slightly longer than strictly necessary, but for me the reliability gain is worth the trade-off.
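The "exit only when told" loop can be sketched with a `threading.Event`, which also gives you an interruptible sleep (again my own sketch, not the actual Lithops implementation):

```python
import threading
import time

class PollingMonitor:
    def __init__(self, poll_interval=0.01):
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self.polls = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # No completion-based exit condition: poll until stop() is called.
        while not self._stop.is_set():
            self.polls += 1
            # Event.wait doubles as an interruptible sleep
            self._stop.wait(self._poll_interval)

    def start(self):
        self._thread.start()

    def stop(self):
        # The only way the monitor exits
        self._stop.set()
        self._thread.join()

m = PollingMonitor()
m.start()
time.sleep(0.05)
m.stop()
print(m._thread.is_alive(), m.polls > 0)   # False True
```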
---
@neilSchroeder That's amazing! Thanks for tracking down and coming up with a fix for this problem! Do you have any thoughts about how we could improve things on the Cubed side to make it easier to diagnose problems like this in the future? Perhaps improve the docs on how to configure Lithops to get logs out of it?
---
I think this discussion is probably mostly closed at this point. Lithops has merged a fix for this. |

Okay, I think I finally made some progress debugging this.
Lithops
StorageMonitorrace conditionThe lithops storage monitor appears to have a race condition that results in miscounting "Done" tasks across jobs within an
ExecutorIDsession.I ran with the logging for lithops set to DEBUG and found these lines in the output:
2025-11-29 16:36:49,038 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 0 - Running: 0 - Done: 3328 ... [a bunch of invoker/wait/futures/etc. outputs here]... # the very next monitor statements say: 2025-11-29 16:37:00,166 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 3226 - Running: 54 - Done: 3376 2025-11-29 16:37:00,175 [DEBUG] monitor.py:481 …