
PICARD: demote process_medium → process_low (byte-identical, ~3× throughput) #1801

@peachgabba22


Description of feature

Follow-up to #1759. That one asked for a tool swap to samtools markdup, and was (fairly) declined on complexity and operational grounds — @pinin4fjords explicitly noted that any change here would need to be a straight replacement, not an option, and that Picard is predictable and well understood. This proposal keeps Picard untouched. It's a one-line change to the resource label the module uses.

The finding

`modules/nf-core/picard/markduplicates/main.nf` uses `label 'process_medium'`, and `conf/base.config` sets `process_medium` to `36.GB × task.attempt`. The JVM heap is derived as `task.memory.mega × 0.8`, i.e. `-Xmx28g` on first attempt and `-Xmx57g` on retry. There is no sample-size-dependent routing — every invocation from a 2 GB smoke BAM to a 13 GB ENCODE replicate gets the same reservation.
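The heap arithmetic above can be sketched in a few lines (a hypothetical helper, not pipeline code — it just mirrors the `task.memory.mega × 0.8` rule and the per-attempt doubling quoted in this issue):

```python
def xmx_gb(reservation_gb: int, attempt: int = 1) -> int:
    """Heap size (-Xmx, whole GB) for a base reservation and retry attempt.

    Mirrors the rule described above: the reservation scales with
    task.attempt, the JVM gets 80% of it, floored to whole gigabytes.
    """
    mem_mb = reservation_gb * attempt * 1024   # task.memory.mega
    heap_mb = int(mem_mb * 0.8)                # 80% of the reservation
    return heap_mb // 1024                     # -XmxNg takes whole GB

# process_medium (36 GB base): -Xmx28g first attempt, -Xmx57g on retry
print(xmx_gb(36, 1), xmx_gb(36, 2))
# process_low (12 GB base): -Xmx9g first attempt, -Xmx19g on retry
print(xmx_gb(12, 1), xmx_gb(12, 2))
```

This reproduces every heap figure quoted below (28/57 for `process_medium`, 9/19 for `process_low`).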

On an 8-sample ENCODE RNA-seq corpus (1.55 B records, same dataset as #1759 but re-run under Picard 3.4.0):

  • the measured per-JVM working set is ~8 GB: under tighter heaps, peak RSS sits just above the heap limit (see the sweep below), while under the default -Xmx28g the JVM grows its heap toward the limit, so RSS balloons far past what the job actually needs.
  • on a 30 GB node (e.g. Hetzner cpx62, or most single-box cloud instances with < 36 GB), Nextflow can't schedule process_medium at all — the 36 GB request exceeds available memory.
  • on a 72 GB node, exactly one MarkDuplicates task fits at a time, where 3–4 would fit by actual footprint.

Sweep result

Full matrix in picard-tuning-proposal.md. Summary on the 8 ENCODE samples, same box, same Picard 3.4.0 invocation flags:

| Config | Heap × parallelism | 8-sample wall | vs default | Peak per-JVM RSS | OOM |
|---|---|---|---|---|---|
| A (nf-core default) | -Xmx28g × 1 | 12947 s | 1.00× | 30.75 GB | 0 |
| B | -Xmx7g × 3 | 5514 s | 2.35× | 7.95 GB | 0 |
| C (proposed) | -Xmx6g × 4 | 4240 s | 3.05× | 7.11 GB | 0 |
| D | -Xmx5g × 4 | 4247 s | 3.05× | 6.16 GB | 0 |

Per-sample wall under 4-way parallelism is +2.0–2.1% vs the single-JVM baseline (shared /tmp spill) — the total wall compresses ~3× because 4 samples process concurrently.

Retry envelope: a "werewolf" BAM (2× largest ENCODE sample, 445 M records, 17 GB) completes at -Xmx9g with peak RSS 9.6 GB. Two werewolves in parallel at -Xmx9g → sum RSS 20.8 GB, same wall as one, no OOM. So process_low = 12.GB × task.attempt (heap -Xmx9g on first attempt, -Xmx19g on retry) has comfortable margin for RNA-seq library sizes well beyond what the default retries cover.

Parity

For each of the 8 samples we computed `samtools view X.mkdup.bam | awk '{print $1"\t"$2}' | LC_ALL=C sort | md5sum` and compared A vs C. Metrics files were diffed with `#`-prefixed lines excluded.

Result: 8/8 byte-identical. Zero QNAME+FLAG divergence, zero metrics data divergence. Full md5 table in the linked doc. Heap size affects GC scheduling and spill batching, not the duplicate-detection algorithm — Picard MarkDuplicates is deterministic given identical input, flags, and ASSUME_SORT_ORDER=coordinate.
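The parity check described above can be reproduced with a short script (a sketch, not the exact tooling used — feed `qname_flag_digest` the lines emitted by `samtools view X.mkdup.bam`):

```python
import hashlib
from typing import Iterable

def qname_flag_digest(sam_lines: Iterable[str]) -> str:
    """md5 over sorted (QNAME, FLAG) pairs, mirroring the shell pipeline above.

    Two MarkDuplicates runs agree iff the same reads carry the same FLAG
    bits (including the 0x400 duplicate bit), regardless of record order.
    """
    pairs = sorted("\t".join(line.split("\t")[:2]) for line in sam_lines)
    return hashlib.md5(("\n".join(pairs) + "\n").encode()).hexdigest()

def metrics_data_lines(text: str) -> list[str]:
    """Drop '#'-prefixed comment lines before diffing two metrics files,
    since those carry timestamps and command lines that always differ."""
    return [l for l in text.splitlines() if not l.startswith("#")]
```

Comparing `qname_flag_digest` of run A against run C per sample, and `metrics_data_lines` of the two metrics files, is the whole parity test.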

Proposed change

One line, in `nf-core/modules/modules/nf-core/picard/markduplicates/main.nf`:

```diff
 process PICARD_MARKDUPLICATES {
-    label 'process_medium'
+    label 'process_low'
```

process_low in nf-core's standard conf/base.config is memory = 12.GB × task.attempt ⇒ heap -Xmx9g on first attempt, -Xmx19g on retry. This sits comfortably inside the measured werewolf envelope.
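For anyone who wants to trial the demotion today without waiting on the module change, the same effect can be had from a user-side custom config (a hypothetical override passed via `-c custom.config`; the process name selector assumes the module's default name):

```groovy
// Hypothetical user-side override: same reservation as process_low,
// applied without touching the module source.
process {
    withName: 'PICARD_MARKDUPLICATES' {
        memory = { 12.GB * task.attempt }
    }
}
```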

What this does and doesn't touch

  • Doesn't add a tool-choice parameter, doesn't add a config knob, doesn't change Picard's CLI, doesn't change output.
  • Doesn't affect the UMI dedup path (which stays on Picard).
  • Does preserve Nextflow retry semantics — task.attempt = 2 still doubles the reservation (to process_low × 2 = 24 GB / -Xmx19g), which is inside the werewolf-at-retry envelope with headroom.
  • Does unblock scheduling on any node with ≥ 12 GB available, and enables real parallelism on 30 GB / 72 GB nodes that currently serialize.

Scope & caveats

The benchmark corpus is RNA-seq only, ASSUME_SORT_ORDER=coordinate (nf-core default), optical dedup off (nf-core default). Explicitly not covered: WGS / WGBS / long-read (RSS scales with library complexity, may need more heap), MarkDuplicatesSpark (different code path), unsorted BAMs, very large libraries (> 2× werewolf, i.e. > 445 M records per BAM). If any of those workloads are known to need the current 36 GB reservation, that's exactly the kind of tribal knowledge I'd want to hear — I'm not a practicing bioinformatician, and the measurement above is the only ground truth I have.

Related work (context, not part of this ask)

A Rust rewrite of Picard MarkDuplicates — WeTheAgents/markdup8x-wea — produces byte-identical output to Picard 3.4.0 on the same 8 ENCODE samples (8/8 parity, 934 M duplicates, zero flag divergence) at 3.26× wall and 54× less RAM. Published under the rewrites.bio policy as a reference / drop-in, not as a proposed swap here — I understand from #1759 that tool swaps are off the table, and that's fine. Mentioning it only because it's what surfaced the over-allocation finding in the first place.
