ADR: Global cache #6796

Draft

jorgee wants to merge 3 commits into master from 20260202-global-cache

Conversation

@jorgee
Contributor

@jorgee jorgee commented Feb 3, 2026

Add ADR for Global Cache Feature

This PR introduces an Architecture Decision Record (ADR) for a global cache system that enables cross-pipeline task result sharing in Nextflow.

Overview

The global cache extends Nextflow's existing task caching mechanism to allow different pipeline executions to reuse computational results across:

  • Different users running the same analysis
  • Development and production pipeline versions
  • Parameter sweeps with shared preprocessing steps
  • Multi-environment deployments

Key Design Decisions

Architecture:

  • Builds on existing nf-cloudcache plugin infrastructure
  • Uses cloud object storage (S3, GCS, Azure Blob) as backend
  • Leverages strong consistency guarantees for concurrent access control

Content-Addressable Hashing:

  • Removes sessionId and processName from task hash computation
  • Enables content-based file hashing instead of path-based hashing
  • Allows identical tasks to share cache regardless of pipeline or session

Concurrency Control:

  • Simple collision-avoidance strategy using atomic cloud storage operations
  • Conditional PUT with preconditions (If-None-Match, ifGenerationMatch=0), as sketched below
  • Hash increment on collision rather than waiting/polling (trades rare cache misses for simplicity)
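
A minimal sketch of the collision-avoidance write, assuming the AWS SDK for Java v2 with S3 conditional-write support (the `ifNoneMatch` precondition); the bucket and key layout are illustrative only:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Exception;

class GlobalCacheWriter {

    private final S3Client s3 = S3Client.create();

    /**
     * Attempt to claim the cache entry for a task hash.
     * Returns true if this execution owns the entry, false if another execution got there first.
     */
    boolean tryClaim(String bucket, String taskHash, byte[] entry) {
        PutObjectRequest request = PutObjectRequest.builder()
            .bucket(bucket)
            .key("global-cache/" + taskHash)    // illustrative key layout
            .ifNoneMatch("*")                   // conditional PUT: fail if the object already exists
            .build();
        try {
            s3.putObject(request, RequestBody.fromBytes(entry));
            return true;
        }
        catch (S3Exception e) {
            if (e.statusCode() == 412)          // 412 Precondition Failed: lost the race; the caller
                return false;                   // can increment the hash instead of waiting/polling
            throw e;
        }
    }
}
```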

Optimizations Considered:

  • Lineage ID (lid) reuse to reduce checksum computations
  • Cloud storage native checksums for hash computation

Implementation Phases

Phase 0 (Proof of Concept - #6100):

  • Associate nf-cloudcache path with global cache path
  • Use constant sessionId, remove processName from hash
  • Optional deep cache mode

Future Phases:

  • Content-based file hashing implementation
  • Cloud storage atomic lock acquisition
  • Configuration options and cleanup commands

Trade-offs and Limitations

  • Performance: Content hashing overhead for large files (mitigated by proposed optimizations)
  • Concurrency: Simultaneous execution of identical tasks results in redundant work (~1% of cases)
  • Compatibility: Conflicts with planned automatic workflow cleanup feature
  • Storage: Cloud storage costs (offset by compute savings from cache hits)

Related Issues


Status: Draft design document for discussion and feedback
Version: 1.0


jorgee force-pushed the 20260202-global-cache branch from 861368a to 46832f3 on February 3, 2026 at 10:56

## Non-goals

- **Maintaining local filesystem cache**: Global cache is cloud storage only
Member


For what it's worth, it's actually quite easy to support local filesystems because nf-cloudcache works out-of-the-box with them. We only disallow it in the runtime, but #6100 re-allows it to help with testing.

The only consideration is handling race conditions. I assume you could just use regular file locks.
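
If a local backend were allowed, a plain java.nio file lock would be one minimal way to guard concurrent writers; a rough sketch under that assumption (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LocalCacheLock {

    /** Run the cache update only if the per-entry lock file can be acquired. */
    static boolean withLock(Path lockFile, Runnable update) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();   // non-blocking; null if another process holds it
            if (lock == null)
                return false;                    // someone else is writing this entry
            try {
                update.run();
                return true;
            }
            finally {
                lock.release();
            }
        }
    }
}
```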


Does this mean that it will eventually work for non-cloud users too?

@pditommaso
Member

I'd like to explore an approach based on File Content Sketch + Bloom Filter.

This should allow avoiding re-hashing large files by caching sketch → fullHash mappings, with a Bloom filter as a fast pre-check (a rough sketch follows the flow below).

Components:

  • Sketch→Hash store: persistent mapping of file sketches to their full BLAKE3 hashes
  • Bloom filter: fast check to avoid unnecessary store lookups

Flow:

  1. Compute cheap sketch: hash(size, first4KB, last4KB, middle4KB) — ~1 ms for any file size
  2. Check bloomFilter.mightContain(sketch)
    • NO → sketch definitely not in store, compute full BLAKE3 hash, save mapping, update Bloom filter
    • MAYBE → lookup sketch in store
      • Found → return cached full hash (skip expensive BLAKE3)
      • Not found (false positive) → compute full BLAKE3 hash, save mapping
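
A rough sketch of that flow, assuming Guava's BloomFilter, an in-memory map standing in for the persistent sketch→hash store, and SHA-256 as a stand-in for BLAKE3:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SketchHashCache {

    private static final int CHUNK = 4 * 1024;

    // stand-in for the persistent sketch -> full hash store
    private final Map<String,String> store = new ConcurrentHashMap<>();
    // fast pre-check to skip store lookups for unseen sketches
    private final BloomFilter<CharSequence> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    String hashOf(Path file) throws Exception {
        String sketch = sketch(file);
        if (bloom.mightContain(sketch)) {
            String cached = store.get(sketch);
            if (cached != null)
                return cached;                  // found: skip the expensive full hash
        }
        String full = fullHash(file);           // SHA-256 here; BLAKE3 in the proposal
        store.put(sketch, full);
        bloom.put(sketch);
        return full;
    }

    // cheap sketch: size + first/middle/last 4 KB
    private String sketch(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            digest.update(Long.toString(size).getBytes(StandardCharsets.UTF_8));
            long[] offsets = { 0, Math.max(0, size/2 - CHUNK/2), Math.max(0, size - CHUNK) };
            byte[] buffer = new byte[CHUNK];
            for (long offset : offsets) {
                raf.seek(offset);
                int n = raf.read(buffer);
                if (n > 0) digest.update(buffer, 0, n);
            }
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    // full-content hash over the whole file
    private String fullHash(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) > 0)
                digest.update(buffer, 0, n);
        }
        return HexFormat.of().formatHex(digest.digest());
    }
}
```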

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I will have a look at the file sketches. The trick with this solution is to find a sampling scheme with a very low chance of sketch collisions, since collisions could produce false positives in the global cache.

@pditommaso
Member

Worth giving a try!

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I looked deeper into the proposed solution:

  • I see some redundancy in the solution. The Bloom filter and the sketch store are good for detecting that a file has not been seen before and that its hash must be computed, but what is incorrect is using them to reuse the hash. If two different files can generate the same sketch, going to the store (whether through the Bloom filter or not) returns the same hash for both. If we accept that, why not use the sketch itself as the hash? The result would be the same, and there would be no need for the store or the Bloom filter.
  • Regarding the computation of sketches, options with a low chance of collision (MinHash, ...) require accessing different blocks across the file. When computing them for files in cloud storage, we either need to download the whole file or make several calls, so it will not be fast.

@pditommaso
Member

When computing them for cloud storage, either we need to download the whole file

Not really, the S3 API allows access to arbitrary byte ranges of a file.
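
For reference, a ranged read with the AWS SDK for Java v2 (bucket and key are placeholders):

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

class RangedRead {

    /** Fetch only the first 4 KB of an object instead of downloading the whole file. */
    static byte[] firstChunk(S3Client s3, String bucket, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
            .bucket(bucket)
            .key(key)
            .range("bytes=0-4095")      // HTTP Range header: first 4 KB only
            .build();
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
        return bytes.asByteArray();
    }
}
```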

@jorgee
Contributor Author

jorgee commented Feb 10, 2026

ADR Updates: File Hashing Alternatives and Cache Cleanup

I've updated the ADR with expanded analysis based on our discussion:

File Hashing Alternatives

Alternative 6: File Content Sketches + Bloom Filter (Rejected)

  • Documented the sketch-based approach (hash of size + first4KB + last4KB + middle4KB)
  • Rejection rationale: Collision correctness problem - when two different files produce the same sketch, the store returns the same hash for both, causing false cache hits in the global cache
  • Additional issues: redundancy between Bloom filter and store, cloud storage I/O overhead

Alternative 7: Fast Hash → Deep Hash Cache with Bloom Filter

  • Uses Nextflow's existing fast hash (path + size + mtime) as a lookup key to cache expensive deep content hashes
  • Key advantage: Fast hash is only a lookup key, not used for correctness - deep hash always computed correctly
  • Bloom filter avoids ~100ms cloud store GET + API cost for new files
  • No collision risk (unlike sketches): worst case is a cache miss requiring recomputation

Alternative 3: Extended with Copy-Tracking

  • Added simplified approach that tracks original fast hash when files are copied/published
  • Provides similar benefits to lineage-based approach without requiring full lineage infrastructure
  • Works by propagating original hash through Nextflow-managed copy operations

Cache Cleanup Mechanisms

Added comprehensive analysis of cache cleanup strategies:

Approach 1: Cloud Storage Lifecycle Policies (Rejected)

  • Simple but risky - may delete task outputs while downstream tasks are using them
  • Object-level deletion can't distinguish between actively-used and stale cache entries

Approach 2: Access-Time Based Cleanup (Recommended)

  • Task-level management with configurable TTL (e.g., 24h)
  • Downstream tasks "touch" upstream dependencies to update access times (see the sketch after this list)
  • Prevents deletion of actively-used cache entries
  • Race condition handling with atomic operations (conditional delete, atomic touch)
  • Manual cleanup commands for user control
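
A rough sketch of the touch-and-sweep idea, assuming per-entry access markers whose LastModified timestamp records the last access; the conditional-delete / atomic-touch race handling mentioned above is deliberately left out, and the key layout is illustrative:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Object;

import java.time.Duration;
import java.time.Instant;

class CacheCleanup {

    private final S3Client s3 = S3Client.create();

    /** Downstream task touches an upstream entry by rewriting its access marker. */
    void touch(String bucket, String taskHash) {
        PutObjectRequest put = PutObjectRequest.builder()
            .bucket(bucket)
            .key("global-cache/" + taskHash + "/.last-access")   // illustrative layout
            .build();
        s3.putObject(put, RequestBody.empty());
    }

    /** Remove access markers (and, in a real implementation, the task outputs) not touched within the TTL. */
    void sweep(String bucket, Duration ttl) {
        Instant cutoff = Instant.now().minus(ttl);
        ListObjectsV2Request list = ListObjectsV2Request.builder()
            .bucket(bucket)
            .prefix("global-cache/")
            .build();
        for (S3Object obj : s3.listObjectsV2Paginator(list).contents()) {
            if (obj.key().endsWith("/.last-access") && obj.lastModified().isBefore(cutoff)) {
                // a production version would use a conditional delete here
                // to avoid racing with a concurrent touch
                s3.deleteObject(DeleteObjectRequest.builder().bucket(bucket).key(obj.key()).build());
            }
        }
    }
}
```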

Updated Comparison Tables

Extended comparison summaries to include all alternatives with analysis of performance, correctness, and implementation complexity trade-offs.
