fix: per-sample MultiQC progressive closure via structural .join chain#1803
Draft
pinin4fjords wants to merge 4 commits intodevfrom
Draft
fix: per-sample MultiQC progressive closure via structural .join chain#1803pinin4fjords wants to merge 4 commits intodevfrom
pinin4fjords wants to merge 4 commits intodevfrom
Conversation
|
pinin4fjords
added a commit
that referenced
this pull request
Apr 18, 2026
- Move `byId` rekey into `subworkflows/local/utils_nfcore_rnaseq_pipeline`
as a shared function; main.nf and multiqc_rnaseq both include it.
Kills the two-copy DRY violation the reviewer flagged.
- Drop the pass-branch `[meta, []]` placeholder `.mix`es and the
subsequent `.flatten()` in both fail_trimmed and fail_mapped streams:
`.join(..., remainder: true)` in the collapse already handles absent
samples, so the placeholders were doing nothing except forcing the
flatten.
- Compute `skip:` for the merged-mode fail_* `collectFile` from
`sample_status_header.readLines().size() + 1` rather than the hardcoded
`4`, so editing the header file no longer silently mis-skips.
- Final collapse switches from `findAll()` (which relied on Groovy truth)
to `findAll { it != null }`; explicit and safer once placeholders are
out. One-line note on why `.flatten()` would be wrong here (Paths are
iterable).
- Delete the now-unreferenced `multiqcTsvFromList` function from
`subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness`.
- Tighten the bundle-join comment in main.nf to spell out the per-
contributor (not workflow-global) barrier that `remainder: true`
introduces for unmatched samples, per reviewer §2.
- Trim the `ch_collated_versions` omission block to six lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alongside the existing `stats` / `flagstat` / `idxstats` emits (which mix genome + transcriptome samtools output, two tuples per sample), add three new named emits — `genome_stats`, `genome_flagstat`, `genome_idxstats` — sourced from `UMI_DEDUP_GENOME` only. These are one-tuple-per-sample and can be joined 1:1 against a per-sample bundle, enabling progressive per-sample MultiQC. The transcriptome outputs are deliberately not bundled (same rationale as the existing in-subworkflow `multiqc_files` selection).
…ication_merge In `--skip_quantification_merge` mode MULTIQC blocked until every sample finished because `ch_multiqc_files` was grouped via `groupTuple` on the full mixed queue, which only closes once the slowest cross-sample contributor closes (ultimately the workflow- global `versions` topic). Replace the mixed-queue-plus-`groupTuple` pattern with a per-sample bundle: each contributor extends a `[id, meta, file...]` tuple keyed on `meta.id` via `.join(..., remainder: true)`, and MULTIQC_RNASEQ collapses it to `[meta, files_list]`. Matched samples fire progressively — each sample's MULTIQC task submits as soon as its contributors have emitted. Verified under a synthetic 60 s delay on one sample's `SALMON_QUANT`: the other samples' MULTIQC tasks fire within milliseconds of each other, the slowed sample ~50 s later. Multi-output subworkflows (FASTQ_QC, BAM_QC_RNASEQ, BAM_MARKDUPLICATES, RUSTQC, BAM_DEDUP_UMI, alignment stats triples) are aggregated on meta internally before being folded into the bundle once, so the bundle grows by one position per subworkflow rather than per output. fail_trimmed / fail_mapped / fail_strand TSVs are built per sample — anchored on the all-fastq-samples channel so every bundle sample has a match, keeping the join progressive — plus a merged-mode aggregate for the flat `ch_multiqc_files` path. The in-subworkflow `fail_trimmed_samples_mqc.tsv` aggregation in `subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness` is removed; the caller owns per-sample and merged forms. Per-sample reports carry a pipeline-identity YAML (`multiqc_software_versions.txt` = name + version + commit SHA + Nextflow version, via `workflowVersionToYAML()`) rather than the full collated tool versions; the full versions YAML is still published unchanged to `pipeline_info/`. The full YAML would only close once the workflow-global `versions` topic closes, reintroducing a full-run barrier. Closes #1797. Supersedes #1800 (groupKey count helper), #1802 (`.join(remainder: true)` with the pre-existing bundle shape).
6c692e6 to
77c76de
Compare
…count
Under `--skip_trimming` the FASTQ_QC subworkflow never emits
`trim_read_count`, so the per-sample anchor used to thread fail_*
streams through the bundle was empty. When `ch_strand_comparison`
still had items (alignment runs regardless of trimming), the
anchor-side-empty remainder join produced right-only unmatched tuples
`[id, null, f]`, and the downstream re-key `.map { meta, f -> [meta.id, f] }`
then hit a null meta and threw "Cannot get property 'id' on null object".
`ch_fastq` is the canonical "every fastq-branch sample" channel in the
workflow — already passed to the subworkflow for name replacements —
and is populated whether or not trimming runs. Use it as the anchor.
…ch samples The previous anchor (`ch_fastq`) is empty for the BAM-only input case (samples that come in via samplesheet `genome_bam` / `transcriptome_bam` columns under `--skip_alignment`), so `ch_fail_*_all` had no left side and the right-only `.join(..., remainder: true)` produced `[id, null, f]` tuples whose downstream re-key exploded with "Cannot get property 'id' on null object". Expose the bundle seed in `workflows/rnaseq/main.nf` as `ch_mqc_bundle_seed` (mix of the fastq and pre-aligned BAM branches) and pass it through as the subworkflow's per-sample anchor. Every bundle sample now has a matching placeholder on every fail_* stream regardless of which branch it came in on. Verified on the VM: default, default-stub, skip_quantification_merge (real + stub), skip_trimming, and both bam_input cases all pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1797. In
--skip_quantification_mergemode the MULTIQC step blocks until every sample has finished, becausech_multiqc_filesis one mixed queue grouped bygroupTuple— and that groupTuple doesn't close until the slowest cross-sample contributor (ultimately the workflow-globalversionstopic) closes.This PR replaces the single mixed queue with a per-sample bundle. Each MultiQC contributor joins a
[id, meta, file…]tuple keyed bymeta.id; the bundle grows per sample and closes per sample, so each sample's MULTIQC fires as soon as its own files are in. Matched samples fire progressively; contributors that never produce for a sample (feature off, optional upstream output, filter-excluded sample, pre-aligned BAM path) emitnullvia.join(..., remainder: true)and drop out on collapse.What moved where
workflows/rnaseq/main.nfch_mqc_per_sample_bundleseeded fromch_fastqplus the pre-aligned BAM branch (so BAM-only samples also get per-sample reports)..mix(X.out.Y)intoch_multiqc_filesis paired with a.join(X.out.Y, remainder: true)extension of the per-sample bundle.metafirst, then fold into the bundle as a single grouped contribution — one re-key instead of one per output.subworkflows/local/multiqc_rnaseqfail_trimmed/fail_mapped/fail_strandper-sample TSVs and their merged-mode aggregates (previously scattered across main.nf and one nf-core subworkflow).[id, meta, file…]bundle into[meta, files_list]and adds the run-level globals (workflow summary, methods description, a pipeline-identity versions YAML in per-sample mode).dev, just accumulating the same contributor files and global context into one MULTIQC call.subworkflows/nf-core/fastq_qc_trim_filter_setstrandednessfail_trimmed_samples_mqc.tsvaggregation.trim_read_countis already an emit, so the caller builds per-sample or aggregate forms from it. Will need a matching nf-core/subworkflows PR before final merge.subworkflows/nf-core/bam_dedup_umigenome_stats/genome_flagstat/genome_idxstatsemits alongside the existing pre-mixedstats/flagstat/idxstats. The pre-mixed emits cover genome+transcriptome (two tuples per sample) which can't be 1:1 joined; the new emits are genome-only and match the subworkflow's ownmultiqc_filesselection. Will need a matching nf-core/subworkflows PR before final merge.Why
remainder: trueA contributor's channel closes when its per-sample process finishes (fast), not when the workflow finishes. So
remainder: trueonly waits the per-contributor amount for unmatched samples — no workflow-global barrier. Verified with a 60 s artificial delay on one sample'sSALMON_QUANT: the other samples' MULTIQC tasks fire within milliseconds of each other, the slowed sample fires ~50 s later.Known limitation
Per-sample reports carry a pipeline-identity
multiqc_software_versions.txt(pipeline name + version + commit SHA + Nextflow version, via the existingworkflowVersionToYAML()helper) rather than the full collated tool versions. The full YAML would close only after the workflow-globalversionstopic closes, reintroducing a full-run barrier. The full collated YAML is still published unchanged topipeline_info/for consumers who need it.Tests
All existing snapshot tests pass without edits:
tests/default.nf.test,tests/default.nf.test(stub),tests/skip_quantification_merge.nf.test,tests/skip_quantification_merge.nf.test(stub),tests/skip_trimming.nf.test, the four UMI cases (tests/umi.nf.test, including stub), and the two BAM-input cases (tests/bam_input.nf.test).Progressive-closure trace (on
--skip_quantification_merge --pseudo_aligner salmon --skip_alignmentwith a 60 s sleep onWT_REP2'sSALMON_QUANT):Supersedes
#1800 (groupKey count helper), #1802 (
.join(remainder: true)with the pre-existing bundle shape).