ADR: Global cache #6796

Draft

jorgee wants to merge 3 commits into master from 20260202-global-cache

Conversation

@jorgee
Contributor

@jorgee jorgee commented Feb 3, 2026

Add ADR for Global Cache Feature

This PR introduces an Architecture Decision Record (ADR) for a global cache system that enables cross-pipeline task result sharing in Nextflow.

Overview

The global cache extends Nextflow's existing task caching mechanism to allow different pipeline executions to reuse computational results across:

  • Different users running the same analysis
  • Development and production pipeline versions
  • Parameter sweeps with shared preprocessing steps
  • Multi-environment deployments

Key Design Decisions

Architecture:

  • Builds on existing nf-cloudcache plugin infrastructure
  • Uses cloud object storage (S3, GCS, Azure Blob) as backend
  • Leverages strong consistency guarantees for concurrent access control

Content-Addressable Hashing:

  • Removes sessionId and processName from task hash computation
  • Enables content-based file hashing instead of path-based hashing
  • Allows identical tasks to share cache regardless of pipeline or session

Concurrency Control:

  • Simple collision-avoidance strategy using atomic cloud storage operations
  • Conditional PUT with preconditions (If-None-Match, ifGenerationMatch=0), as sketched below
  • Hash increment on collision rather than waiting/polling (trades rare cache misses for simplicity)
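
A minimal sketch of the collision-avoidance write, assuming the AWS SDK for Java v2 with S3 conditional-write support (the `ifNoneMatch` precondition); the bucket and key layout are illustrative only:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Exception;

class GlobalCacheWriter {

    private final S3Client s3 = S3Client.create();

    /**
     * Attempt to claim the cache entry for a task hash.
     * Returns true if this execution owns the entry, false if another execution got there first.
     */
    boolean tryClaim(String bucket, String taskHash, byte[] entry) {
        PutObjectRequest request = PutObjectRequest.builder()
            .bucket(bucket)
            .key("global-cache/" + taskHash)    // illustrative key layout
            .ifNoneMatch("*")                   // conditional PUT: fail if the object already exists
            .build();
        try {
            s3.putObject(request, RequestBody.fromBytes(entry));
            return true;
        }
        catch (S3Exception e) {
            if (e.statusCode() == 412)          // 412 Precondition Failed: lost the race; the caller
                return false;                   // can increment the hash instead of waiting/polling
            throw e;
        }
    }
}
```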

Optimizations Considered:

  • Lineage ID (lid) reuse to reduce checksum computations
  • Cloud storage native checksums for hash computation

Implementation Phases

Phase 0 (Proof of Concept - #6100):

  • Associate nf-cloudcache path with global cache path
  • Use constant sessionId, remove processName from hash
  • Optional deep cache mode

Future Phases:

  • Content-based file hashing implementation
  • Cloud storage atomic lock acquisition
  • Configuration options and cleanup commands

Trade-offs and Limitations

  • Performance: Content hashing overhead for large files (mitigated by proposed optimizations)
  • Concurrency: Simultaneous execution of identical tasks results in redundant work (~1% of cases)
  • Compatibility: Conflicts with planned automatic workflow cleanup feature
  • Storage: Cloud storage costs (offset by compute savings from cache hits)

Related Issues


Status: Draft design document for discussion and feedback
Version: 1.0


jorgee force-pushed the 20260202-global-cache branch from 861368a to 46832f3 on February 3, 2026 at 10:56

## Non-goals

- **Maintaining local filesystem cache**: Global cache is cloud storage only
Member


For what it's worth, it's actually quite easy to support local filesystems because nf-cloudcache works out-of-the-box with them. We only disallow it in the runtime, but #6100 re-allows it to help with testing.

The only consideration is handling race conditions. I assume you could just use regular file locks.
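
If a local backend were allowed, a plain java.nio file lock would be one minimal way to guard concurrent writers; a rough sketch under that assumption (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LocalCacheLock {

    /** Run the cache update only if the per-entry lock file can be acquired. */
    static boolean withLock(Path lockFile, Runnable update) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();   // non-blocking; null if another process holds it
            if (lock == null)
                return false;                    // someone else is writing this entry
            try {
                update.run();
                return true;
            }
            finally {
                lock.release();
            }
        }
    }
}
```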


Does this mean that it will eventually work for non-cloud users too?

@pditommaso
Member

I'd like to explore an approach based on File Content Sketch + Bloom Filter.

This should allow avoiding re-hashing large files by caching sketch → fullHash mappings, with a Bloom filter as a fast pre-check (a rough sketch follows the flow below).

Components:

  • Sketch→Hash store: persistent mapping of file sketches to their full BLAKE3 hashes
  • Bloom filter: fast check to avoid unnecessary store lookups

Flow:

  1. Compute cheap sketch: hash(size, first4KB, last4KB, middle4KB) — ~1 ms for any file size
  2. Check bloomFilter.mightContain(sketch)
    • NO → sketch definitely not in store, compute full BLAKE3 hash, save mapping, update Bloom filter
    • MAYBE → lookup sketch in store
      • Found → return cached full hash (skip expensive BLAKE3)
      • Not found (false positive) → compute full BLAKE3 hash, save mapping
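
A rough sketch of that flow, assuming Guava's BloomFilter, an in-memory map standing in for the persistent sketch→hash store, and SHA-256 as a stand-in for BLAKE3:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SketchHashCache {

    private static final int CHUNK = 4 * 1024;

    // stand-in for the persistent sketch -> full hash store
    private final Map<String,String> store = new ConcurrentHashMap<>();
    // fast pre-check to skip store lookups for unseen sketches
    private final BloomFilter<CharSequence> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    String hashOf(Path file) throws Exception {
        String sketch = sketch(file);
        if (bloom.mightContain(sketch)) {
            String cached = store.get(sketch);
            if (cached != null)
                return cached;                  // found: skip the expensive full hash
        }
        String full = fullHash(file);           // SHA-256 here; BLAKE3 in the proposal
        store.put(sketch, full);
        bloom.put(sketch);
        return full;
    }

    // cheap sketch: size + first/middle/last 4 KB
    private String sketch(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            digest.update(Long.toString(size).getBytes(StandardCharsets.UTF_8));
            long[] offsets = { 0, Math.max(0, size/2 - CHUNK/2), Math.max(0, size - CHUNK) };
            byte[] buffer = new byte[CHUNK];
            for (long offset : offsets) {
                raf.seek(offset);
                int n = raf.read(buffer);
                if (n > 0) digest.update(buffer, 0, n);
            }
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    // full-content hash over the whole file
    private String fullHash(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) > 0)
                digest.update(buffer, 0, n);
        }
        return HexFormat.of().formatHex(digest.digest());
    }
}
```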

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I will have a look at the file sketches. The trick with this solution is to find a sampling scheme with a very low chance of sketch collisions, since collisions could produce false positives in the global cache.

@pditommaso
Member

Worth giving a try!

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I looked deeper into the proposed solution:

  • I see some redundancy in the solution. The Bloom filter and the sketch store are good for detecting that a file has not been seen before and that its hash must be computed, but what is incorrect is using them to reuse the hash. If two different files can generate the same sketch, going to the store (whether through the Bloom filter or not) returns the same hash for both. If we accept that, why not use the sketch itself as the hash? The result would be the same, and there would be no need for the store or the Bloom filter.
  • Regarding the computation of sketches, options with a low chance of collision (MinHash, ...) require accessing different blocks across the file. When computing them for files in cloud storage, we either need to download the whole file or make several calls, so it will not be fast.

@pditommaso
Member

When computing them for cloud storage, either we need to download the whole file

Not really, the S3 API allows access to arbitrary byte ranges of a file.
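
For reference, a ranged read with the AWS SDK for Java v2 (bucket and key are placeholders):

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

class RangedRead {

    /** Fetch only the first 4 KB of an object instead of downloading the whole file. */
    static byte[] firstChunk(S3Client s3, String bucket, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
            .bucket(bucket)
            .key(key)
            .range("bytes=0-4095")      // HTTP Range header: first 4 KB only
            .build();
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
        return bytes.asByteArray();
    }
}
```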

@jorgee
Contributor Author

jorgee commented Feb 10, 2026

ADR Updates: File Hashing Alternatives and Cache Cleanup

I've updated the ADR with expanded analysis based on our discussion:

File Hashing Alternatives

Alternative 6: File Content Sketches + Bloom Filter (Rejected)

  • Documented the sketch-based approach (hash of size + first4KB + last4KB + middle4KB)
  • Rejection rationale: Collision correctness problem - when two different files produce the same sketch, the store returns the same hash for both, causing false cache hits in the global cache
  • Additional issues: redundancy between Bloom filter and store, cloud storage I/O overhead

Alternative 7: Fast Hash → Deep Hash Cache with Bloom Filter

  • Uses Nextflow's existing fast hash (path + size + mtime) as a lookup key to cache expensive deep content hashes
  • Key advantage: Fast hash is only a lookup key, not used for correctness - deep hash always computed correctly
  • Bloom filter avoids ~100ms cloud store GET + API cost for new files
  • No collision risk (unlike sketches): worst case is a cache miss requiring recomputation

Alternative 3: Extended with Copy-Tracking

  • Added simplified approach that tracks original fast hash when files are copied/published
  • Provides similar benefits to lineage-based approach without requiring full lineage infrastructure
  • Works by propagating original hash through Nextflow-managed copy operations

Cache Cleanup Mechanisms

Added comprehensive analysis of cache cleanup strategies:

Approach 1: Cloud Storage Lifecycle Policies (Rejected)

  • Simple but risky - may delete task outputs while downstream tasks are using them
  • Object-level deletion can't distinguish between actively-used and stale cache entries

Approach 2: Access-Time Based Cleanup (Recommended)

  • Task-level management with configurable TTL (e.g., 24h)
  • Downstream tasks "touch" upstream dependencies to update access times (see the sketch after this list)
  • Prevents deletion of actively-used cache entries
  • Race condition handling with atomic operations (conditional delete, atomic touch)
  • Manual cleanup commands for user control
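
A rough sketch of the touch-and-sweep idea, assuming per-entry access markers whose LastModified timestamp records the last access; the conditional-delete / atomic-touch race handling mentioned above is deliberately left out, and the key layout is illustrative:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Object;

import java.time.Duration;
import java.time.Instant;

class CacheCleanup {

    private final S3Client s3 = S3Client.create();

    /** Downstream task touches an upstream entry by rewriting its access marker. */
    void touch(String bucket, String taskHash) {
        PutObjectRequest put = PutObjectRequest.builder()
            .bucket(bucket)
            .key("global-cache/" + taskHash + "/.last-access")   // illustrative layout
            .build();
        s3.putObject(put, RequestBody.empty());
    }

    /** Remove access markers (and, in a real implementation, the task outputs) not touched within the TTL. */
    void sweep(String bucket, Duration ttl) {
        Instant cutoff = Instant.now().minus(ttl);
        ListObjectsV2Request list = ListObjectsV2Request.builder()
            .bucket(bucket)
            .prefix("global-cache/")
            .build();
        for (S3Object obj : s3.listObjectsV2Paginator(list).contents()) {
            if (obj.key().endsWith("/.last-access") && obj.lastModified().isBefore(cutoff)) {
                // a production version would use a conditional delete here
                // to avoid racing with a concurrent touch
                s3.deleteObject(DeleteObjectRequest.builder().bucket(bucket).key(obj.key()).build());
            }
        }
    }
}
```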

Updated Comparison Tables

Extended comparison summaries to include all alternatives with analysis of performance, correctness, and implementation complexity trade-offs.
