feat(qc): recombination: D: Reversion Clustering #1739
Open
ivan-aksamentov wants to merge 2 commits into master
## Recombination Detection: Strategy D - Reversion Clustering
### Scientific Motivation
Recombinant sequences inherit different genomic regions from distinct parental
lineages. When a sequence is compared to a single reference or assigned clade,
positions where the "wrong" parent was inherited appear as reversions - mutations
back to an ancestral state.
These reversions cluster spatially because recombination breakpoints define
contiguous regions inherited from each parent. A sequence inheriting region A
from parent X and region B from parent Y will show reversions concentrated in
whichever region mismatches the assigned lineage. Non-recombinant sequences
typically have reversions scattered randomly across the genome.
The key insight is that clustered reversions indicate mosaic inheritance -
a hallmark of recombination.
### Mechanism
The algorithm proceeds in three steps:
1. **Reversion ratio calculation**: Compute the fraction of private substitutions
that are reversions: `reversionRatio = numReversions / totalPrivateSubstitutions`.
A high ratio (>30% by default) suggests lineage mixing.
2. **Cluster detection**: Group reversion positions using a sliding window approach.
Positions within `clusterWindowSize` nucleotides of each other are grouped.
Only groups with at least `minClusterSize` positions are retained as clusters.
3. **Scoring**: If both conditions are met (ratio above threshold AND at least
one cluster exists), the score is: `reversionRatio * numClusters * weight`.
This rewards both higher reversion density and multiple distinct clusters
(suggesting multiple breakpoints).
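The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `cluster_positions` and `reversion_clustering_score` are hypothetical, the grouping rule assumes that "within `clusterWindowSize` of each other" means the gap between consecutive sorted positions, and the threshold comparison is assumed to be strict (`>`).

```rust
/// Groups positions into clusters. A position joins the current group when it
/// lies within `window` nucleotides of the previous (sorted) position.
/// Groups with fewer than `min_size` positions are discarded.
/// Hypothetical helper: an assumption about the grouping rule, not the PR's
/// `find_position_clusters()`.
fn cluster_positions(positions: &[usize], window: usize, min_size: usize) -> Vec<Vec<usize>> {
    let mut sorted = positions.to_vec();
    sorted.sort_unstable();

    let mut clusters = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    for pos in sorted {
        if let Some(&last) = current.last() {
            // Gap too large: close the current group and start a new one.
            if pos - last > window {
                if current.len() >= min_size {
                    clusters.push(std::mem::take(&mut current));
                } else {
                    current.clear();
                }
            }
        }
        current.push(pos);
    }
    // Flush the trailing group.
    if current.len() >= min_size {
        clusters.push(current);
    }
    clusters
}

/// Computes `reversionRatio * numClusters * weight`, or 0.0 when the ratio
/// does not exceed the threshold or no cluster is found.
/// Hypothetical signature; the strict `>` comparison is an assumption.
fn reversion_clustering_score(
    reversions: &[usize],
    total_private_substitutions: usize,
    ratio_threshold: f64,
    window: usize,
    min_size: usize,
    weight: f64,
) -> f64 {
    if total_private_substitutions == 0 {
        return 0.0;
    }
    let ratio = reversions.len() as f64 / total_private_substitutions as f64;
    let clusters = cluster_positions(reversions, window, min_size);
    if ratio <= ratio_threshold || clusters.is_empty() {
        return 0.0;
    }
    ratio * clusters.len() as f64 * weight
}
```

For example, with 7 reversions out of 14 private substitutions (ratio 0.5), a window of 500, a minimum cluster size of 3, and a weight of 50, reversions at positions 100, 250, 400, 5000, 9000, 9100, 9300 form two clusters, so the sketch yields 0.5 * 2 * 50 = 50.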
### Configuration
Add to `pathogen.json` under `qc.recombinants`:
```json
{
"qc": {
"recombinants": {
"enabled": true,
"scoreWeight": 100.0,
"reversionClustering": {
"enabled": true,
"weight": 50.0,
"ratioThreshold": 0.3,
"clusterWindowSize": 500,
"minClusterSize": 3
}
}
}
}
```
Parameters:
- `enabled`: Activate this strategy
- `weight`: Multiplier for the strategy's contribution to overall score
- `ratioThreshold`: Minimum reversion ratio to trigger detection (0.3 = 30%)
- `clusterWindowSize`: Maximum gap (bp) between positions in the same cluster
- `minClusterSize`: Minimum reversions required to form a cluster
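To make the parameter semantics concrete, here is a small sketch of how the `reversionClustering` block could be mirrored and sanity-checked in Rust. The struct name `ReversionClusteringConfig`, the `example()` constructor, and the `validate()` rules are all hypothetical and are not taken from `qc_config.rs`; the values are copied from the JSON example above and are not necessarily the shipped defaults.

```rust
/// Hypothetical mirror of the `reversionClustering` JSON block.
/// Field names follow the JSON keys in snake_case; this is not the PR's
/// `QcRecombConfigReversionClustering`.
#[derive(Debug, Clone, PartialEq)]
struct ReversionClusteringConfig {
    enabled: bool,
    weight: f64,
    ratio_threshold: f64,
    cluster_window_size: usize,
    min_cluster_size: usize,
}

impl ReversionClusteringConfig {
    /// Values copied from the example configuration above.
    fn example() -> Self {
        Self {
            enabled: true,
            weight: 50.0,
            ratio_threshold: 0.3,
            cluster_window_size: 500,
            min_cluster_size: 3,
        }
    }

    /// Assumed sanity checks; the real implementation may accept other ranges.
    fn validate(&self) -> Result<(), String> {
        // A ratio is a fraction of private substitutions, so it must lie in [0, 1].
        if !(0.0..=1.0).contains(&self.ratio_threshold) {
            return Err(format!("ratioThreshold must be in [0, 1], got {}", self.ratio_threshold));
        }
        if !self.weight.is_finite() || self.weight < 0.0 {
            return Err(format!("weight must be finite and non-negative, got {}", self.weight));
        }
        if self.cluster_window_size == 0 {
            return Err("clusterWindowSize must be at least 1 bp".to_owned());
        }
        if self.min_cluster_size == 0 {
            return Err("minClusterSize must be at least 1".to_owned());
        }
        Ok(())
    }
}
```

Checking the ranges up front keeps a mistyped threshold (for example `30` instead of `0.3`) from silently disabling the strategy.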
### Advantages
- Exploits a biological signal characteristic of recombinants
- Works without requiring labeled mutations or multiple ancestors
- Computationally efficient - linear time in the number of reversions once positions are sorted
- Detects recombination even when one parent is the assigned clade
- Parameters are intuitive and biologically interpretable
### Limitations
- Requires sufficient reversions to form clusters (may miss recent recombinants)
- Cannot distinguish recombination from convergent evolution in hotspots
- Window size and thresholds may need tuning per pathogen
- Does not identify breakpoint locations precisely
- Less effective when both parents are equally distant from reference
### Comparison to Other Strategies
- **vs Strategy A (Weighted Threshold)**: A counts mutations; D examines spatial
distribution. D can detect recombinants with moderate mutation counts if
reversions are clustered.
- **vs Strategy B (Spatial Uniformity)**: B looks at all mutations; D focuses
specifically on reversions which are more diagnostic of recombination.
- **vs Strategy C (Cluster Gaps)**: C uses SNP clusters from a separate QC rule;
D is self-contained using only private mutation data.
- **vs Strategy E (Multi-Ancestor)**: E requires ancestral search results; D
works with single-reference analysis.
- **vs Strategy F (Label Switching)**: F requires mutation labels in dataset; D
works on any dataset with a reference.
Choose D when: datasets lack labels, ancestral search is disabled, or you want
a signal complementary to total mutation counts.
### Implementation Summary
Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added QcRecombConfigReversionClustering
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Added strategy_reversion_clustering()
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added find_position_clusters() utility
- `packages/nextclade/src/qc/mod.rs` - Registered new modules
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - UI formatting for clusters
- `packages/nextclade-web/src/components/Results/ListOfQcIsuues.tsx` - Display integration
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Updated JSON schemas
Test coverage:
- Unit tests for find_position_clusters() with various cluster configurations
- Unit tests for strategy_reversion_clustering() covering disabled, empty, below-threshold,
no-clusters, single-cluster, and multiple-cluster scenarios
- Test dataset: data/recomb/enpen/enterovirus/ev-d68/ with sequences showing reversion patterns
### Future Work
- Adaptive threshold based on genome length and mutation rate
- Report detected cluster positions in output for visualization
- Combine with breakpoint detection algorithms
- Weight clusters by size (larger clusters = stronger signal)
- Consider gaps between clusters as additional signal
Co-Authored-By: Claude <noreply@anthropic.com>
Test with strategy-specific dataset:
ivan-aksamentov added a commit that referenced this pull request on Jan 20, 2026:

Closes #1699

Combines four recombination detection strategies:
- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:
```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (simple sum of strategy scores) is a temporary solution. The scoring mechanism needs further discussion to determine the optimal combination approach.