
PICARD: demote process_medium → process_low (byte-identical, ~3× throughput) #1801

@peachgabba22


Description of feature

Follow-up to #1759. That one asked for a tool swap to samtools markdup, and was (fairly) declined on complexity and operational grounds — @pinin4fjords explicitly noted that any change here would need to be a straight replacement, not an option, and that Picard is predictable and well understood. This proposal keeps Picard untouched. It's a one-line change to the resource label the module uses.

The finding

`modules/nf-core/picard/markduplicates/main.nf` uses `label 'process_medium'`, and `conf/base.config` sets `process_medium` to `36.GB × task.attempt`. The JVM heap is derived as `task.memory.mega × 0.8`, i.e. `-Xmx28g` on first attempt and `-Xmx57g` on retry. There is no sample-size-dependent routing — every invocation from a 2 GB smoke BAM to a 13 GB ENCODE replicate gets the same reservation.
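The heap arithmetic above can be sketched in a few lines (a hypothetical helper, not pipeline code — it just mirrors the `task.memory.mega × 0.8` rule and the per-attempt doubling quoted in this issue):

```python
def xmx_gb(reservation_gb: int, attempt: int = 1) -> int:
    """Heap size (-Xmx, whole GB) for a base reservation and retry attempt.

    Mirrors the rule described above: the reservation scales with
    task.attempt, the JVM gets 80% of it, floored to whole gigabytes.
    """
    mem_mb = reservation_gb * attempt * 1024   # task.memory.mega
    heap_mb = int(mem_mb * 0.8)                # 80% of the reservation
    return heap_mb // 1024                     # -XmxNg takes whole GB

# process_medium (36 GB base): -Xmx28g first attempt, -Xmx57g on retry
print(xmx_gb(36, 1), xmx_gb(36, 2))
# process_low (12 GB base): -Xmx9g first attempt, -Xmx19g on retry
print(xmx_gb(12, 1), xmx_gb(12, 2))
```

This reproduces every heap figure quoted below (28/57 for `process_medium`, 9/19 for `process_low`).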

On an 8-sample ENCODE RNA-seq corpus (1.55 B records, same dataset as #1759 but re-run under Picard 3.4.0):

  • the measured per-JVM working set is ~8 GB: under tighter heaps, peak RSS sits just above the heap limit (see the sweep below), while under the default -Xmx28g the JVM grows its heap toward the limit, so RSS balloons far past what the job actually needs.
  • on a 30 GB node (e.g. Hetzner cpx62, or most single-box cloud instances with < 36 GB), Nextflow can't schedule process_medium at all — the 36 GB request exceeds available memory.
  • on a 72 GB node, exactly one MarkDuplicates task fits at a time, where 3–4 would fit by actual footprint.

Sweep result

Full matrix in picard-tuning-proposal.md. Summary on the 8 ENCODE samples, same box, same Picard 3.4.0 invocation flags:

| Config | Heap × parallelism | 8-sample wall | vs default | Peak per-JVM RSS | OOM |
|---|---|---|---|---|---|
| A (nf-core default) | -Xmx28g × 1 | 12947 s | 1.00× | 30.75 GB | 0 |
| B | -Xmx7g × 3 | 5514 s | 2.35× | 7.95 GB | 0 |
| C (proposed) | -Xmx6g × 4 | 4240 s | 3.05× | 7.11 GB | 0 |
| D | -Xmx5g × 4 | 4247 s | 3.05× | 6.16 GB | 0 |

Per-sample wall under 4-way parallelism is +2.0–2.1% vs the single-JVM baseline (shared /tmp spill) — the total wall compresses ~3× because 4 samples process concurrently.

Retry envelope: a "werewolf" BAM (2× largest ENCODE sample, 445 M records, 17 GB) completes at -Xmx9g with peak RSS 9.6 GB. Two werewolves in parallel at -Xmx9g → sum RSS 20.8 GB, same wall as one, no OOM. So process_low = 12.GB × task.attempt (heap -Xmx9g on first attempt, -Xmx19g on retry) has comfortable margin for RNA-seq library sizes well beyond what the default retries cover.

Parity

For each of the 8 samples we computed `samtools view X.mkdup.bam | awk '{print $1"\t"$2}' | LC_ALL=C sort | md5sum` and compared A vs C. Metrics files were diffed with `#`-prefixed lines excluded.

Result: 8/8 byte-identical. Zero QNAME+FLAG divergence, zero metrics data divergence. Full md5 table in the linked doc. Heap size affects GC scheduling and spill batching, not the duplicate-detection algorithm — Picard MarkDuplicates is deterministic given identical input, flags, and ASSUME_SORT_ORDER=coordinate.
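The parity check described above can be reproduced with a short script (a sketch, not the exact tooling used — feed `qname_flag_digest` the lines emitted by `samtools view X.mkdup.bam`):

```python
import hashlib
from typing import Iterable

def qname_flag_digest(sam_lines: Iterable[str]) -> str:
    """md5 over sorted (QNAME, FLAG) pairs, mirroring the shell pipeline above.

    Two MarkDuplicates runs agree iff the same reads carry the same FLAG
    bits (including the 0x400 duplicate bit), regardless of record order.
    """
    pairs = sorted("\t".join(line.split("\t")[:2]) for line in sam_lines)
    return hashlib.md5(("\n".join(pairs) + "\n").encode()).hexdigest()

def metrics_data_lines(text: str) -> list[str]:
    """Drop '#'-prefixed comment lines before diffing two metrics files,
    since those carry timestamps and command lines that always differ."""
    return [l for l in text.splitlines() if not l.startswith("#")]
```

Comparing `qname_flag_digest` of run A against run C per sample, and `metrics_data_lines` of the two metrics files, is the whole parity test.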

Proposed change

One line, in `nf-core/modules/modules/nf-core/picard/markduplicates/main.nf`:

```diff
 process PICARD_MARKDUPLICATES {
-    label 'process_medium'
+    label 'process_low'
```

process_low in nf-core's standard conf/base.config is memory = 12.GB × task.attempt ⇒ heap -Xmx9g on first attempt, -Xmx19g on retry. This sits comfortably inside the measured werewolf envelope.
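For anyone who wants to trial the demotion today without waiting on the module change, the same effect can be had from a user-side custom config (a hypothetical override passed via `-c custom.config`; the process name selector assumes the module's default name):

```groovy
// Hypothetical user-side override: same reservation as process_low,
// applied without touching the module source.
process {
    withName: 'PICARD_MARKDUPLICATES' {
        memory = { 12.GB * task.attempt }
    }
}
```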

What this does and doesn't touch

  • Doesn't add a tool-choice parameter, doesn't add a config knob, doesn't change Picard's CLI, doesn't change output.
  • Doesn't affect the UMI dedup path (which stays on Picard).
  • Does preserve Nextflow retry semantics — task.attempt = 2 still doubles the reservation (to process_low × 2 = 24 GB / -Xmx19g), which is inside the werewolf-at-retry envelope with headroom.
  • Does unblock scheduling on any node with ≥ 12 GB available, and enables real parallelism on 30 GB / 72 GB nodes that currently serialize.

Scope & caveats

The benchmark corpus is RNA-seq only, ASSUME_SORT_ORDER=coordinate (nf-core default), optical dedup off (nf-core default). Explicitly not covered: WGS / WGBS / long-read (RSS scales with library complexity, may need more heap), MarkDuplicatesSpark (different code path), unsorted BAMs, very large libraries (> 2× werewolf, i.e. > 445 M records per BAM). If any of those workloads are known to need the current 36 GB reservation, that's exactly the kind of tribal knowledge I'd want to hear — I'm not a practicing bioinformatician, and the measurement above is the only ground truth I have.

Related work (context, not part of this ask)

A Rust rewrite of Picard MarkDuplicates — WeTheAgents/markdup8x-wea — produces byte-identical output to Picard 3.4.0 on the same 8 ENCODE samples (8/8 parity, 934 M duplicates, zero flag divergence) at 3.26× wall and 54× less RAM. Published under the rewrites.bio policy as a reference / drop-in, not as a proposed swap here — I understand from #1759 that tool swaps are off the table, and that's fine. Mentioning it only because it's what surfaced the over-allocation finding in the first place.
