[batch] Use gcloud to localize large input files from GCS#15350

Open
kush-chandra wants to merge 2 commits into hail-is:main from kush-chandra:kchandra-copier-memory-fix

Conversation

@kush-chandra
Contributor

Change Description

Fixes #15011

This is a redo of #15273 which addresses the scalability issues with that approach:

  • We now only use the chunked gcloud downloader if the file is large enough to benefit; files below the chunk size are downloaded in a single process, as before.
  • Both gcloud download paths are now bounded by thread count and memory-buffer usage, preventing out-of-memory errors on low-memory VMs.

The timing improvements for large files remain the same as in the original change, and there is now no spike in latency when downloading hundreds of smaller files.
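The size-based dispatch described above can be sketched as follows. This is purely illustrative; the names (`CHUNK_SIZE`, `chunk_ranges`, `choose_strategy`) and the specific chunk size are hypothetical, not the actual Hail Batch code:

```python
# Hypothetical sketch of dispatching between the single-process and chunked
# download paths based on object size. All names and constants here are
# illustrative assumptions, not Hail's real implementation.
CHUNK_SIZE = 128 * 1024 * 1024  # assumed 128 MiB chunk size


def chunk_ranges(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Split a file into [start, end) byte ranges of at most chunk_size."""
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]


def choose_strategy(file_size: int, chunk_size: int = CHUNK_SIZE) -> str:
    """Files below the chunk size use the single-process path, as before."""
    return 'chunked' if file_size > chunk_size else 'single'
```

Under this sketch, a fleet of small files never pays the overhead of the chunked machinery, while a single large object is split into bounded ranges that can be fetched in parallel.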

Security Assessment

  • This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP

Impact Rating

  • This change has a medium security impact

Impact Description

This change instantiates a new Google client to download files from GCS buckets. It uses the same local credential files as the existing custom clients we've created. I manually verified that the problem cases from the original change now work as intended, and am working on adding additional testing for those paths.

Appsec Review

  • Required: The impact has been assessed and approved by appsec

…15273)

## Change Description

Fixes hail-is#15011 

Google vends a tool for parallelized file downloads in their Python
client library:
https://docs.cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.transfer_manager.html
Using this instead of our copier code to download from GCS buckets to
local storage lets us download larger files which our copier code cannot
handle without frequent timeouts & failures.

I compared the performance when a job downloads an input file using the
copier vs using the Google library:

| File size | Copier code | transfer_manager |
|--------|--------|--------|
| 500B | 1s | 0.8s |
| 5M | 0.6s | 0.8s |
| 1G | 50s | 2.7s |
| 5G | 100s | 13.8s |
| 10G | 219s | 27s |
| 21G | failed | 54s |
| 54G | failed | 135s |

Performance is similar for smaller files with variance mostly due to
network bandwidth. At 1G, the copier starts to hit transient errors on
some chunks & retries. Beyond 20G, jobs using the copier failed in the
input stage after an indeterminate length of time.
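The parallel-chunk idea behind `transfer_manager` can be illustrated with a small, self-contained simulation: split the object into byte ranges and fetch them concurrently, writing each range into its slot of a preallocated destination file. Here a local file stands in for GCS ranged GETs; `fetch_range` and `download_chunked` are illustrative names, not the library's API:

```python
# Minimal simulation of chunked-parallel download, assuming a local source
# file in place of HTTP ranged GETs against GCS. Illustrative only.
import os
from concurrent.futures import ThreadPoolExecutor


def fetch_range(path: str, start: int, end: int) -> bytes:
    # Stand-in for a ranged GET of bytes [start, end) of the object.
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(end - start)


def download_chunked(src: str, dest: str, chunk_size: int, max_workers: int = 8) -> None:
    size = os.path.getsize(src)
    ranges = [(s, min(s + chunk_size, size)) for s in range(0, size, chunk_size)]
    # Preallocate the destination, then let each worker write its own slice.
    with open(dest, 'wb') as f:
        f.truncate(size)

    def worker(rng):
        start, end = rng
        data = fetch_range(src, start, end)
        with open(dest, 'r+b') as f:
            f.seek(start)
            f.write(data)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(worker, ranges))
```

Because each worker owns a disjoint byte range, no coordination beyond the thread pool is needed, which is why this approach scales to the multi-gigabyte files in the table above.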

## Security Assessment

- This change potentially impacts the Hail Batch instance as deployed by
Broad Institute in GCP

### Impact Rating

- This change has a medium security impact

### Impact Description


This change instantiates a new Google client to download files from GCS
buckets. It uses the same local credential files as the existing custom
clients we've created.

### Appsec Review

- [x] Required: The impact has been assessed and approved by appsec

---------

Co-authored-by: Chris Llanwarne <cjllanwarne@users.noreply.github.com>
@kush-chandra kush-chandra requested a review from a team as a code owner March 19, 2026 15:00
@kush-chandra kush-chandra force-pushed the kchandra-copier-memory-fix branch from fbc83d5 to c514da6 on March 19, 2026 15:45
timeout=self._timeout,
)
except FileNotFoundError:
os.makedirs(os.path.dirname(local_dest), exist_ok=True)
Collaborator
Can we check ahead of time that the directory exists (instead of letting it throw an exception that we then have to catch, only to resubmit the exact same command in the except block)?
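The reviewer's suggestion could look roughly like this: create the parent directory up front so the download never needs to be reissued. `run_download` is a hypothetical stand-in for the gcloud invocation, not the actual code:

```python
# Sketch of creating the destination directory before the first attempt,
# instead of retrying after FileNotFoundError. `run_download` is a
# hypothetical callable standing in for the gcloud download command.
import os


def localize(local_dest: str, run_download) -> None:
    # exist_ok=True makes this a no-op when the directory is already there,
    # so there is no exception path to catch and no command to reissue.
    os.makedirs(os.path.dirname(local_dest), exist_ok=True)
    run_download(local_dest)
```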

max_workers=8,
)
except FileNotFoundError:
os.makedirs(os.path.dirname(local_dest), exist_ok=True)
Collaborator

Same thought here

success = False
threads_acquired = 0
try:
for _ in range(0, 8):
Collaborator

@cjllanwarne cjllanwarne Mar 19, 2026

There are two risky things here:

  1. We already acquired a sema slot from bounded_gather2. Especially with the xfer_sema waits introducing an easy place for tasks to be paused or suspended, it's very possible that many tasks get bounded_gather2 sema slots, emptying the sema pool before we try to acquire another 8 slots for the download here
  2. Because this acquiring logic is not atomic, it's possible that a thread might pick up a sub-portion of its 8 required slots here, then get suspended, leaving those reserved slots locked up in a task which isn't doing anything with them yet. It looks like weighted-semaphore semantics may be more what you're looking for here (you can atomically request a number of slots)?
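The weighted-semaphore semantics the review points at can be sketched with an `asyncio.Condition`: a single `acquire(n)` either takes all n slots or takes none, so a suspended task never sits on a partial reservation. This is an illustrative sketch, not Hail's own concurrency utility:

```python
# Illustrative weighted semaphore with an atomic acquire(n), assuming
# asyncio. Not Hail's actual implementation.
import asyncio


class WeightedSemaphore:
    def __init__(self, slots: int):
        self._slots = slots
        self._cond = asyncio.Condition()

    async def acquire(self, n: int) -> None:
        # Wait until all n slots are free, then take them in one step under
        # the condition's lock, avoiding partial reservations.
        async with self._cond:
            await self._cond.wait_for(lambda: self._slots >= n)
            self._slots -= n

    async def release(self, n: int) -> None:
        async with self._cond:
            self._slots += n
            self._cond.notify_all()
```

With this shape, the download path would request its 8 slots in one call rather than looping over eight single-slot acquisitions.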

)
credentials = kwargs.get('credentials')
if isinstance(credentials, GoogleCredentials):
access_token = credentials.access_token
Collaborator

@cjllanwarne cjllanwarne Mar 19, 2026

I believe these access tokens have a one-hour lifespan before they'll be rejected. We can split on the type of Google credentials we have and use one of: Credentials.from_service_account_info(credentials.key) for service accounts; Credentials(token=None, refresh_token=credentials.credentials['refresh_token'], client_id=credentials.credentials['client_id'], client_secret=credentials.credentials['client_secret'], token_uri='https://oauth2.googleapis.com/token') for application-default credentials; or Client(credentials=None) for automatic application-default credentials.

In fact, credentials.access_token looks like a method (and an async method at that), so this would return a method to produce a coroutine rather than a value, so I'm not sure this code path would work even temporarily. I think we must be going the credentials=None / application-default route below during the on-VM testing, but I'm not sure it'd work if we tried to pass credentials in on a local laptop
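The three-way split the review describes can be captured as pure dispatch logic, without constructing any real google-auth objects. The function name and the returned labels are illustrative; only the classification rule comes from the comment above:

```python
# Hypothetical classifier for the credential dispatch the review suggests.
# It only decides which google-auth constructor would be appropriate; it
# does not build Credentials objects. Names and labels are illustrative.
from typing import Optional


def credential_strategy(credentials: Optional[dict]) -> str:
    if credentials is None:
        # Client(credentials=None): let the client do automatic
        # application-default credential lookup.
        return 'automatic-adc'
    if 'refresh_token' in credentials:
        # google.oauth2.credentials.Credentials(token=None, refresh_token=...,
        # client_id=..., client_secret=..., token_uri=...)
        return 'authorized-user'
    # Credentials.from_service_account_info(...) for service-account keys.
    return 'service-account'
```

Routing on the credential material itself, rather than passing a short-lived access token, sidesteps the one-hour expiry problem because google-auth can refresh tokens on demand.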

Development

Successfully merging this pull request may close these issues.

Large downloads stall and fail

2 participants