feat(qc): recombination: C: Cluster Gaps #1738
Open
ivan-aksamentov wants to merge 2 commits into master
## Recombination Detection: Strategy C - Cluster Gaps
### Scientific Motivation
Viral recombination creates distinct genomic architecture patterns. When two parental
lineages recombine, each contributes a contiguous segment of its genome to the offspring.
Within each segment, the recombinant inherits the mutation profile of that parent. Since
different lineages accumulate mutations in different positions, recombinants typically
show mutation clusters that are spatially separated by large gaps - regions where neither
parent has mutations relative to the reference.
The cluster gaps strategy exploits this breakpoint signature. A non-recombinant sequence
acquires mutations gradually over time, resulting in more uniformly distributed SNPs
across the genome. In contrast, a recombinant sequence inherits pre-existing mutation
clusters from each parent, with clean gaps between them marking the approximate locations
of recombination breakpoints.
### Mechanism
This strategy reuses the SNP cluster detection already performed by the snpClusters QC
rule. The algorithm:
1. Retrieves the list of SNP clusters from `QcResultSnpClusters.clusteredSnps`
2. If fewer than 2 clusters exist, returns zero score (no gap evidence)
3. Computes gaps between consecutive clusters: `gap = cluster[i+1].start - cluster[i].end`
4. Identifies the maximum gap size
5. Computes score only if `maxGap >= minGapSize`:
   - `score = (numClusters - 1) * weightPerGap + (maxGap / minGapSize) * weightGapSize`
The score combines two signals:
- Number of gaps (more clusters = more potential breakpoints)
- Largest gap size (wider gaps = stronger breakpoint evidence)
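
The steps above can be sketched in Rust as follows. This is a minimal illustration, not the code from this PR: the `Cluster` and `ClusterGapsConfig` types and the `cluster_gaps_score` function are hypothetical stand-ins, the input is assumed to be sorted by start position, and the division is assumed to be real-valued.

```rust
/// Hypothetical stand-in for one SNP cluster (positions in bp).
#[derive(Debug, Clone, Copy)]
struct Cluster {
    start: usize,
    end: usize,
}

/// Hypothetical mirror of the `clusterGaps` config block.
/// `min_gap_size` is assumed to be greater than zero.
struct ClusterGapsConfig {
    min_gap_size: usize,
    weight_per_gap: f64,
    weight_gap_size: f64,
}

/// Scores a list of clusters that is assumed to be sorted by `start`.
fn cluster_gaps_score(clusters: &[Cluster], cfg: &ClusterGapsConfig) -> f64 {
    // Step 2: fewer than two clusters means there is no gap to measure.
    if clusters.len() < 2 {
        return 0.0;
    }
    // Steps 3-4: gaps between consecutive clusters, then the largest one.
    // `saturating_sub` guards against overlapping clusters (an assumption).
    let max_gap = clusters
        .windows(2)
        .map(|pair| pair[1].start.saturating_sub(pair[0].end))
        .max()
        .unwrap_or(0);
    // Step 5: score only when the largest gap reaches the threshold.
    if max_gap < cfg.min_gap_size {
        return 0.0;
    }
    let num_gaps = (clusters.len() - 1) as f64;
    num_gaps * cfg.weight_per_gap
        + (max_gap as f64 / cfg.min_gap_size as f64) * cfg.weight_gap_size
}
```

Returning `0.0` for both early exits keeps "no clusters", "one cluster" and "gap below threshold" indistinguishable in the score alone, which is why a result struct that also reports the cluster count and gap sizes is useful.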
### Configuration
The clusterGaps strategy requires the snpClusters rule to be enabled in pathogen.json.
Configuration parameters:
```json
{
  "qc": {
    "snpClusters": {
      "enabled": true,
      "windowSize": 100,
      "clusterCutOff": 5,
      "scoreWeight": 50.0
    },
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "clusterGaps": {
        "enabled": true,
        "minGapSize": 1000,
        "weightPerGap": 25.0,
        "weightGapSize": 0.01
      }
    }
  }
}
```
Parameters:
- `minGapSize`: Minimum gap (bp) to trigger scoring (default: 1000)
- `weightPerGap`: Score contribution per gap (default: 25.0)
- `weightGapSize`: Weight applied to the `maxGap / minGapSize` ratio (default: 0.01)
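
As a worked example with the default parameters, consider a hypothetical sequence with two SNP clusters separated by a 3000 bp gap. Assuming the division is real-valued, the strategy score from the formula is:

$$
\text{score} = (2 - 1) \times 25.0 + \frac{3000}{1000} \times 0.01 = 25.03
$$

With these defaults the per-gap term dominates: each additional cluster adds 25.0, while the gap-size term adds only 0.01 per multiple of `minGapSize`.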
### Advantages
- Leverages existing SNP cluster infrastructure - no additional computation for clustering
- Directly targets breakpoint signatures - the gaps between clusters
- Low false positive rate for clean recombinants with distinct parental contributions
- Interpretable output - cluster count and gap sizes are biologically meaningful
- Works well when parental lineages have accumulated sufficient divergence
### Limitations
- Requires snpClusters rule to be enabled and configured appropriately
- May miss recombinants with similar parental mutation profiles (small gaps)
- Parameter tuning needed per pathogen based on typical mutation density
- Cannot detect recombination when breakpoints occur within mutation-dense regions
- Single-breakpoint recombinants may not produce multiple distinct clusters
### Comparison to Other Strategies
| Strategy | Signal | Best For |
|----------|--------|----------|
| A: Weighted Threshold | Total mutation excess | High-divergence recombinants |
| B: Spatial Uniformity | CV of mutations across segments | Any spatial clustering |
| **C: Cluster Gaps** | **Gaps between SNP clusters** | **Clear breakpoint signatures** |
| D: Reversion Clustering | Clustered reversions | Ancestral segment detection |
| E: Multi-Ancestor | Per-segment ancestor affinity | Multiple reference comparison |
| F: Label Switching | Labeled mutation regions | Lineage-annotated datasets |
Strategy C is most effective when recombination breakpoints create clean boundaries
between parental contributions. It complements Strategy B (spatial uniformity) - B
detects general non-uniformity while C specifically targets the gap pattern.
### Implementation Summary
Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigClusterGaps` config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Implemented `strategy_cluster_gaps` function
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - UI formatting for cluster gaps
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Schema updates for new fields
Key implementation details:
- `RecombResultClusterGaps` struct captures: numClusters, maxGap, gaps array, score
- Gap calculation uses `windows(2)` on sorted cluster list
- Scoring triggers only when maxGap is at least minGapSize (`maxGap >= minGapSize`)
- Web UI displays cluster count and max gap when >= 2 clusters detected
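
To illustrate the `windows(2)` approach, here is a hedged sketch of collecting the gaps array and the maximum gap. `ClusterGapsSummary` and `summarize_gaps` are hypothetical names, not the `RecombResultClusterGaps` struct from this PR. The field types are assumptions, and the score field is left out.

```rust
/// Hypothetical result shape; field names follow the summary above,
/// the types are assumptions.
#[derive(Debug, PartialEq)]
struct ClusterGapsSummary {
    num_clusters: usize,
    max_gap: usize,
    gaps: Vec<usize>,
}

/// Builds the summary from `(start, end)` cluster ranges given in any order.
fn summarize_gaps(mut clusters: Vec<(usize, usize)>) -> ClusterGapsSummary {
    // `windows(2)` only pairs true neighbours if the list is sorted first.
    clusters.sort_by_key(|&(start, _)| start);
    // One gap per adjacent pair; overlapping clusters yield a gap of 0.
    let gaps: Vec<usize> = clusters
        .windows(2)
        .map(|w| w[1].0.saturating_sub(w[0].1))
        .collect();
    ClusterGapsSummary {
        num_clusters: clusters.len(),
        // With fewer than two clusters there are no gaps, so report 0.
        max_gap: gaps.iter().copied().max().unwrap_or(0),
        gaps,
    }
}
```

For `n` clusters, `windows(2)` yields `n - 1` pairs, so the gaps array is empty for zero or one cluster without any special-casing.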
Tests added:
- Disabled config returns None
- No SNP clusters returns zero score
- Single cluster returns zero score
- Two clusters with gap below threshold returns zero score
- Two clusters with gap above threshold returns expected score
- Three clusters correctly calculates max gap
- Exact threshold boundary case
Test dataset:
- `data/recomb/enpen/enterovirus/ev-d68/` - Enterovirus D68 dataset with recombinants enabled
### Future Work
- Adaptive minGapSize based on genome length or expected mutation density
- Weighted gaps considering mutation count difference between clusters
- Integration with breakpoint detection for visualization
- Correlation with phylogenetic placement confidence
Co-Authored-By: Claude <noreply@anthropic.com>
Test with strategy-specific dataset:
ivan-aksamentov added a commit that referenced this pull request on Jan 20, 2026:
Closes #1699
Combines four recombination detection strategies:
- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)
Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`
Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click
Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example
CLI test:
```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```
Note: The current weighted score aggregation (simple sum of strategy scores) is a temporary solution. The scoring mechanism needs further discussion to determine the optimal combination approach.