Skip to content

fix: per-sample MultiQC progressive closure via structural .join chain#1803

Draft
pinin4fjords wants to merge 4 commits intodevfrom
multiqc-per-sample-join
Draft

fix: per-sample MultiQC progressive closure via structural .join chain#1803
pinin4fjords wants to merge 4 commits intodevfrom
multiqc-per-sample-join

Conversation

@pinin4fjords
Copy link
Copy Markdown
Member

@pinin4fjords pinin4fjords commented Apr 18, 2026

Summary

Fixes #1797. In --skip_quantification_merge mode the MULTIQC step blocks until every sample has finished, because ch_multiqc_files is one mixed queue grouped by groupTuple — and that groupTuple doesn't close until the slowest cross-sample contributor (ultimately the workflow-global versions topic) closes.

This PR replaces the single mixed queue with a per-sample bundle. Each MultiQC contributor joins a [id, meta, file…] tuple keyed by meta.id; the bundle grows per sample and closes per sample, so each sample's MULTIQC fires as soon as its own files are in. Matched samples fire progressively; contributors that never produce for a sample (feature off, optional upstream output, filter-excluded sample, pre-aligned BAM path) emit null via .join(..., remainder: true) and drop out on collapse.

What moved where

workflows/rnaseq/main.nf

  • New ch_mqc_per_sample_bundle seeded from ch_fastq plus the pre-aligned BAM branch (so BAM-only samples also get per-sample reports).
  • At each subworkflow call site, the existing .mix(X.out.Y) into ch_multiqc_files is paired with a .join(X.out.Y, remainder: true) extension of the per-sample bundle.
  • Multi-output subworkflows (FASTQ_QC, BAM_MARKDUPLICATES, BAM_QC_RNASEQ, RUSTQC, BAM_DEDUP_UMI, alignment-stats triples) join their own outputs by meta first, then fold into the bundle as a single grouped contribution — one re-key instead of one per output.

subworkflows/local/multiqc_rnaseq

  • Now owns the fail_trimmed / fail_mapped / fail_strand per-sample TSVs and their merged-mode aggregates (previously scattered across main.nf and one nf-core subworkflow).
  • Each fail_* stream is anchored on the bundle seed (every bundle sample, fastq + pre-aligned BAM) so every sample has a match on the join, keeping progressive closure intact even when no samples actually fail, alignment is skipped, or input is BAM-only.
  • Collapses the raw [id, meta, file…] bundle into [meta, files_list] and adds the run-level globals (workflow summary, methods description, a pipeline-identity versions YAML in per-sample mode).
  • Keeps the merged-mode path effectively unchanged from dev, just accumulating the same contributor files and global context into one MULTIQC call.

subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness

  • Subtractive: drops the in-subworkflow fail_trimmed_samples_mqc.tsv aggregation. trim_read_count is already an emit, so the caller builds per-sample or aggregate forms from it. Will need a matching nf-core/subworkflows PR before final merge.

subworkflows/nf-core/bam_dedup_umi

  • Adds genome_stats / genome_flagstat / genome_idxstats emits alongside the existing pre-mixed stats / flagstat / idxstats. The pre-mixed emits cover genome+transcriptome (two tuples per sample) which can't be 1:1 joined; the new emits are genome-only and match the subworkflow's own multiqc_files selection. Will need a matching nf-core/subworkflows PR before final merge.

The nf-core lint check fails on these two local-copy drifts — that's the only red mark once CI settles and is expected until the upstream PRs land.

Why remainder: true

A contributor's channel closes when its per-sample process finishes (fast), not when the workflow finishes. So remainder: true only waits the per-contributor amount for unmatched samples — no workflow-global barrier. Verified with a 60 s artificial delay on one sample's SALMON_QUANT: the other samples' MULTIQC tasks fire within milliseconds of each other, the slowed sample fires ~50 s later.

Known limitation

Per-sample reports carry a pipeline-identity multiqc_software_versions.txt (pipeline name + version + commit SHA + Nextflow version, via the existing workflowVersionToYAML() helper) rather than the full collated tool versions. The full YAML would close only after the workflow-global versions topic closes, reintroducing a full-run barrier. The full collated YAML is still published unchanged to pipeline_info/ for consumers who need it.

Tests

All existing snapshot tests pass without edits: tests/default.nf.test, tests/default.nf.test (stub), tests/skip_quantification_merge.nf.test, tests/skip_quantification_merge.nf.test (stub), tests/skip_trimming.nf.test, the four UMI cases (tests/umi.nf.test, including stub), and the two BAM-input cases (tests/bam_input.nf.test).

Progressive-closure trace (on --skip_quantification_merge --pseudo_aligner salmon --skip_alignment with a 60 s sleep on WT_REP2's SALMON_QUANT):

18:49:27.040  MULTIQC (RAP1_UNINDUCED_REP2)
18:49:27.047  MULTIQC (RAP1_IAA_30M_REP1)
18:49:27.053  MULTIQC (RAP1_UNINDUCED_REP1)
18:49:30.365  MULTIQC (WT_REP1)
18:50:20.291  MULTIQC (WT_REP2)       ← +53 s, bounded by the delay

Supersedes

#1800 (groupKey count helper), #1802 (.join(remainder: true) with the pre-existing bundle shape).

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 18, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit ca3e7b2

+| ✅ 202 tests passed       |+
#| ❔  12 tests were ignored |#
!| ❗   8 tests had warnings |!
Details

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in nextflow.config: Specify any additional parameters here
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.5.2
  • Run at 2026-04-18 21:37:18

@pinin4fjords pinin4fjords marked this pull request as ready for review April 18, 2026 16:15
@pinin4fjords pinin4fjords marked this pull request as draft April 18, 2026 16:43
pinin4fjords added a commit that referenced this pull request Apr 18, 2026
- Move `byId` rekey into `subworkflows/local/utils_nfcore_rnaseq_pipeline`
  as a shared function; main.nf and multiqc_rnaseq both include it.
  Kills the two-copy DRY violation the reviewer flagged.
- Drop the pass-branch `[meta, []]` placeholder `.mix`es and the
  subsequent `.flatten()` in both fail_trimmed and fail_mapped streams:
  `.join(..., remainder: true)` in the collapse already handles absent
  samples, so the placeholders were doing nothing except forcing the
  flatten.
- Compute `skip:` for the merged-mode fail_* `collectFile` from
  `sample_status_header.readLines().size() + 1` rather than the hardcoded
  `4`, so editing the header file no longer silently mis-skips.
- Final collapse switches from `findAll()` (which relied on Groovy truth)
  to `findAll { it != null }`; explicit and safer once placeholders are
  out. One-line note on why `.flatten()` would be wrong here (Paths are
  iterable).
- Delete the now-unreferenced `multiqcTsvFromList` function from
  `subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness`.
- Tighten the bundle-join comment in main.nf to spell out the per-
  contributor (not workflow-global) barrier that `remainder: true`
  introduces for unmatched samples, per reviewer §2.
- Trim the `ch_collated_versions` omission block to six lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alongside the existing `stats` / `flagstat` / `idxstats` emits (which
mix genome + transcriptome samtools output, two tuples per sample),
add three new named emits — `genome_stats`, `genome_flagstat`,
`genome_idxstats` — sourced from `UMI_DEDUP_GENOME` only. These are
one-tuple-per-sample and can be joined 1:1 against a per-sample
bundle, enabling progressive per-sample MultiQC. The transcriptome
outputs are deliberately not bundled (same rationale as the existing
in-subworkflow `multiqc_files` selection).
…ication_merge

In `--skip_quantification_merge` mode MULTIQC blocked until every
sample finished because `ch_multiqc_files` was grouped via
`groupTuple` on the full mixed queue, which only closes once the
slowest cross-sample contributor closes (ultimately the workflow-
global `versions` topic).

Replace the mixed-queue-plus-`groupTuple` pattern with a per-sample
bundle: each contributor extends a `[id, meta, file...]` tuple keyed
on `meta.id` via `.join(..., remainder: true)`, and MULTIQC_RNASEQ
collapses it to `[meta, files_list]`. Matched samples fire
progressively — each sample's MULTIQC task submits as soon as its
contributors have emitted. Verified under a synthetic 60 s delay on
one sample's `SALMON_QUANT`: the other samples' MULTIQC tasks fire
within milliseconds of each other, the slowed sample ~50 s later.

Multi-output subworkflows (FASTQ_QC, BAM_QC_RNASEQ, BAM_MARKDUPLICATES,
RUSTQC, BAM_DEDUP_UMI, alignment stats triples) are aggregated on
meta internally before being folded into the bundle once, so the
bundle grows by one position per subworkflow rather than per output.

fail_trimmed / fail_mapped / fail_strand TSVs are built per sample —
anchored on the all-fastq-samples channel so every bundle sample has
a match, keeping the join progressive — plus a merged-mode aggregate
for the flat `ch_multiqc_files` path. The in-subworkflow
`fail_trimmed_samples_mqc.tsv` aggregation in
`subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness` is
removed; the caller owns per-sample and merged forms.

Per-sample reports carry a pipeline-identity YAML
(`multiqc_software_versions.txt` = name + version + commit SHA +
Nextflow version, via `workflowVersionToYAML()`) rather than the full
collated tool versions; the full versions YAML is still published
unchanged to `pipeline_info/`. The full YAML would only close once
the workflow-global `versions` topic closes, reintroducing a full-run
barrier.

Closes #1797. Supersedes #1800 (groupKey count helper), #1802
(`.join(remainder: true)` with the pre-existing bundle shape).
@pinin4fjords pinin4fjords force-pushed the multiqc-per-sample-join branch from 6c692e6 to 77c76de Compare April 18, 2026 20:11
…count

Under `--skip_trimming` the FASTQ_QC subworkflow never emits
`trim_read_count`, so the per-sample anchor used to thread fail_*
streams through the bundle was empty. When `ch_strand_comparison`
still had items (alignment runs regardless of trimming), the
anchor-side-empty remainder join produced right-only unmatched tuples
`[id, null, f]`, and the downstream re-key `.map { meta, f -> [meta.id, f] }`
then hit a null meta and threw "Cannot get property 'id' on null object".

`ch_fastq` is the canonical "every fastq-branch sample" channel in the
workflow — already passed to the subworkflow for name replacements —
and is populated whether or not trimming runs. Use it as the anchor.
…ch samples

The previous anchor (`ch_fastq`) is empty for the BAM-only input case
(samples that come in via samplesheet `genome_bam` / `transcriptome_bam`
columns under `--skip_alignment`), so `ch_fail_*_all` had no left side
and the right-only `.join(..., remainder: true)` produced
`[id, null, f]` tuples whose downstream re-key exploded with
"Cannot get property 'id' on null object".

Expose the bundle seed in `workflows/rnaseq/main.nf` as
`ch_mqc_bundle_seed` (mix of the fastq and pre-aligned BAM branches)
and pass it through as the subworkflow's per-sample anchor. Every
bundle sample now has a matching placeholder on every fail_* stream
regardless of which branch it came in on.

Verified on the VM: default, default-stub, skip_quantification_merge
(real + stub), skip_trimming, and both bam_input cases all pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant