Description of feature
Follow-up to #1759. That one asked for a tool swap to samtools markdup, and was (fairly) declined on complexity and operational grounds — @pinin4fjords explicitly noted that any change here would need to be a straight replacement, not an option, and that Picard is predictable and well understood. This proposal keeps Picard untouched. It's a one-line change to the resource label the module uses.
The finding
modules/nf-core/picard/markduplicates/main.nf uses label 'process_medium', and conf/base.config sets process_medium = 36.GB × task.attempt. The JVM is launched with heap = task.memory.mega × 0.8 ⇒ -Xmx28g on first attempt, -Xmx57g on retry. There is no sample-size-dependent routing — every invocation from a 2 GB smoke BAM to a 13 GB ENCODE replicate gets the same reservation.
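The heap arithmetic can be reproduced with a small sketch (`xmx_gb` is a hypothetical helper, not module code; the 0.8 fraction and the 36 GB reservation are the values quoted above):

```python
def xmx_gb(reserved_gb: int, attempt: int, fraction: float = 0.8) -> int:
    """JVM heap in whole GB derived from a Nextflow memory reservation,
    mirroring the module's heap = task.memory.mega * 0.8 rule."""
    mega = reserved_gb * attempt * 1024       # task.memory.mega after retry scaling
    return int(mega * fraction) // 1024       # floored to whole gigabytes for -Xmx

print(xmx_gb(36, 1))  # 28 -> -Xmx28g on first attempt
print(xmx_gb(36, 2))  # 57 -> -Xmx57g on retry
```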
On an 8-sample ENCODE RNA-seq corpus (1.55 B records, same dataset as #1759 but re-run with Picard 3.4.0):
- the measured working set never exceeds ~8 GB: under tighter -Xmx settings peak per-JVM RSS sits close to the heap limit (see the sweep below), while under the default -Xmx28g the JVM simply grows into the oversized heap.
- on a 30 GB node (e.g. Hetzner cpx62, or most single-box cloud instances with < 36 GB), Nextflow can't schedule process_medium at all: the 36 GB request exceeds available memory.
- on a 72 GB node, exactly one MarkDuplicates task fits at a time, where 3–4 would fit by actual footprint.
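A memory-only toy model makes the placement arithmetic concrete (the usable-RAM figures of 28 GB and 70 GB are illustrative assumptions for nominal 30 GB and 72 GB nodes after OS overhead, not measurements; real scheduling also weighs CPUs):

```python
def max_concurrent(usable_gb: float, request_gb: float) -> int:
    """Tasks a node could host if the memory reservation were the only constraint."""
    return int(usable_gb // request_gb)

for usable in (28.0, 70.0):            # assumed usable RAM on 30 GB / 72 GB nodes
    print(usable,
          max_concurrent(usable, 36),  # current process_medium request
          max_concurrent(usable, 12))  # proposed process_low request
```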
Sweep result
Full matrix in picard-tuning-proposal.md. Summary on the 8 ENCODE samples, same box, same Picard 3.4.0 invocation flags:
| Config | Heap × parallelism | 8-sample wall | vs default | Peak per-JVM RSS | OOM |
|---|---|---|---|---|---|
| A (nf-core default) | -Xmx28g × 1 | 12947 s | 1.00× | 30.75 GB | 0 |
| B | -Xmx7g × 3 | 5514 s | 2.35× | 7.95 GB | 0 |
| C (proposed) | -Xmx6g × 4 | 4240 s | 3.05× | 7.11 GB | 0 |
| D | -Xmx5g × 4 | 4247 s | 3.05× | 6.16 GB | 0 |
Per-sample wall under 4-way parallelism is +2.0–2.1% vs the single-JVM baseline (shared /tmp spill) — the total wall compresses ~3× because 4 samples process concurrently.
Retry envelope: a "werewolf" BAM (2× largest ENCODE sample, 445 M records, 17 GB) completes at -Xmx9g with peak RSS 9.6 GB. Two werewolves in parallel at -Xmx9g → sum RSS 20.8 GB, same wall as one, no OOM. So process_low = 12.GB × task.attempt (heap -Xmx9g on first attempt, -Xmx19g on retry) has comfortable margin for RNA-seq library sizes well beyond what the default retries cover.
Parity
For each of the 8 samples we computed `samtools view X.mkdup.bam | awk '{print $1"\t"$2}' | LC_ALL=C sort | md5sum` and compared A vs C. Metrics files were diffed with `#`-prefixed header lines excluded.
Result: 8/8 byte-identical. Zero QNAME+FLAG divergence, zero metrics data divergence. Full md5 table in the linked doc. Heap size affects GC scheduling and spill batching, not the duplicate-detection algorithm — Picard MarkDuplicates is deterministic given identical input, flags, and ASSUME_SORT_ORDER=coordinate.
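The order-insensitive QNAME+FLAG comparison can be sketched in Python (`qname_flag_digest` is a hypothetical helper; the actual check used the samtools pipeline quoted above):

```python
import hashlib

def qname_flag_digest(records):
    """MD5 over sorted 'QNAME<TAB>FLAG' lines, mirroring the
    samtools view | awk | LC_ALL=C sort | md5sum pipeline."""
    body = "".join(line + "\n" for line in sorted(f"{q}\t{f}" for q, f in records))
    return hashlib.md5(body.encode()).hexdigest()

# Identical record sets in different file order hash identically, so two BAMs
# can be compared without caring how records are laid out on disk.
a = qname_flag_digest([("r1", 99), ("r2", 1171)])   # 1171 = 147 | 0x400 (dup flag)
b = qname_flag_digest([("r2", 1171), ("r1", 99)])
print(a == b)  # True
```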
Proposed change
One line, in nf-core/modules/modules/nf-core/picard/markduplicates/main.nf:
```diff
 process PICARD_MARKDUPLICATES {
-    label 'process_medium'
+    label 'process_low'
```
process_low in nf-core's standard conf/base.config is memory = 12.GB × task.attempt ⇒ heap -Xmx9g on first attempt, -Xmx19g on retry. This sits comfortably inside the measured werewolf envelope.
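As a sanity check on those numbers, applying the same 0.8 heap rule to the process_low ladder (values from the text; `heap_gb` is an illustrative helper):

```python
def heap_gb(reserved_gb: int) -> int:
    """-Xmx value in whole GB for a given reservation, using the
    heap = memory_mb * 0.8 rule described above."""
    return int(reserved_gb * 1024 * 0.8) // 1024

WEREWOLF_PEAK_GB = 9.6                    # measured peak RSS at -Xmx9g
for attempt in (1, 2):
    reserved = 12 * attempt               # process_low = 12.GB * task.attempt
    print(attempt, f"{reserved} GB", f"-Xmx{heap_gb(reserved)}g",
          round(reserved - WEREWOLF_PEAK_GB, 1))  # headroom over measured peak
```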
What this does and doesn't touch
- Doesn't add a tool-choice parameter, doesn't add a config knob, doesn't change Picard's CLI, doesn't change output.
- Doesn't affect the UMI dedup path (which stays on Picard).
- Does preserve Nextflow retry semantics — task.attempt = 2 still doubles the reservation (to process_low × 2 = 24 GB / -Xmx19g), which is inside the werewolf-at-retry envelope with headroom.
- Does unblock scheduling on any node with ≥ 12 GB available, and enables real parallelism on 30 GB / 72 GB nodes that currently serialize.
Scope & caveats
The benchmark corpus is RNA-seq only, ASSUME_SORT_ORDER=coordinate (nf-core default), optical dedup off (nf-core default). Explicitly not covered: WGS / WGBS / long-read (RSS scales with library complexity, may need more heap), MarkDuplicatesSpark (different code path), unsorted BAMs, very large libraries (> 2× werewolf, i.e. > 445 M records per BAM). If any of those workloads are known to need the current 36 GB reservation, that's exactly the kind of tribal knowledge I'd want to hear — I'm not a practicing bioinformatician, and the measurement above is the only ground truth I have.
Related work (context, not part of this ask)
A Rust rewrite of Picard MarkDuplicates — WeTheAgents/markdup8x-wea — produces byte-identical output to Picard 3.4.0 on the same 8 ENCODE samples (8/8 parity, 934 M duplicates, zero flag divergence) at 3.26× wall and 54× less RAM. Published under the rewrites.bio policy as a reference / drop-in, not as a proposed swap here — I understand from #1759 that tool swaps are off the table, and that's fine. Mentioning it only because it's what surfaced the over-allocation finding in the first place.