feat(qc): recombination: D: Reversion Clustering by ivan-aksamentov · Pull Request #1739 · nextstrain/nextclade

ivan-aksamentov · 2026-01-13T13:11:23Z

Recombination Detection: Strategy D - Reversion Clustering

Scientific Motivation

Recombinant sequences inherit different genomic regions from distinct parental lineages. When a sequence is compared to a single reference or assigned clade, positions where the "wrong" parent was inherited appear as reversions - mutations back to an ancestral state.

These reversions cluster spatially because recombination breakpoints define contiguous regions inherited from each parent. A sequence inheriting region A from parent X and region B from parent Y will show reversions concentrated in whichever region mismatches the assigned lineage. Non-recombinant sequences typically have reversions scattered randomly across the genome.

The key insight is that clustered reversions indicate mosaic inheritance - a hallmark of recombination.

Mechanism

The algorithm proceeds in three steps:

1. Reversion ratio calculation: Compute the fraction of private substitutions that are reversions: reversionRatio = numReversions / totalPrivateSubstitutions. A high ratio (>30% by default) suggests lineage mixing.
2. Cluster detection: Group reversion positions using a sliding window approach. Positions within clusterWindowSize nucleotides of each other are grouped. Only groups with at least minClusterSize reversions are retained as clusters.
3. Scoring: If both conditions are met (ratio above threshold AND at least one cluster exists), the score is reversionRatio * numClusters * weight. This rewards both a higher reversion ratio and multiple distinct clusters (suggesting multiple breakpoints).
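
The three steps can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the names `Params`, `find_clusters` and `reversion_clustering_score` are hypothetical, the grouping rule (chain sorted positions whose neighbour-to-neighbour gap is at most the window size) is one possible reading of the "sliding window" description, and the strict `>` comparison against the threshold is assumed from the ">30%" wording.

```rust
/// Hypothetical parameter bundle; fields mirror the JSON keys of `reversionClustering`.
#[derive(Debug, Clone)]
pub struct Params {
    pub weight: f64,
    pub ratio_threshold: f64,
    pub cluster_window_size: usize,
    pub min_cluster_size: usize,
}

/// Step 2 (assumed grouping rule): sort positions, chain neighbours whose gap
/// is at most `window`, and keep only groups with at least `min_size` members.
pub fn find_clusters(positions: &[usize], window: usize, min_size: usize) -> Vec<Vec<usize>> {
    let mut sorted = positions.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut clusters = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    for pos in sorted {
        if let Some(&last) = current.last() {
            if pos - last > window {
                // Gap too large: close the current group, keep it only if big enough.
                if current.len() >= min_size {
                    clusters.push(std::mem::take(&mut current));
                } else {
                    current.clear();
                }
            }
        }
        current.push(pos);
    }
    if current.len() >= min_size {
        clusters.push(current);
    }
    clusters
}

/// Steps 1 and 3: reversion ratio, then `ratio * num_clusters * weight`
/// when the ratio exceeds the threshold and at least one cluster exists.
pub fn reversion_clustering_score(
    reversion_positions: &[usize],
    total_private_substitutions: usize,
    params: &Params,
) -> f64 {
    if total_private_substitutions == 0 || reversion_positions.is_empty() {
        return 0.0;
    }
    let ratio = reversion_positions.len() as f64 / total_private_substitutions as f64;
    // Assumption: strictly greater than the threshold triggers detection.
    if ratio <= params.ratio_threshold {
        return 0.0;
    }
    let clusters = find_clusters(
        reversion_positions,
        params.cluster_window_size,
        params.min_cluster_size,
    );
    if clusters.is_empty() {
        return 0.0;
    }
    ratio * clusters.len() as f64 * params.weight
}
```

With the example configuration (weight 50, threshold 0.3, window 500, minimum size 3), six reversions at positions 100, 250, 400, 5000, 5100 and 5300 out of 12 private substitutions give a ratio of 0.5 and two clusters, so the sketch returns 0.5 * 2 * 50 = 50. Four reversions spread thousands of bases apart would pass the ratio check but form no cluster, and score 0.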

Configuration

Add to pathogen.json under qc.recombinants:

{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "reversionClustering": {
        "enabled": true,
        "weight": 50.0,
        "ratioThreshold": 0.3,
        "clusterWindowSize": 500,
        "minClusterSize": 3
      }
    }
  }
}

Parameters:

enabled: Activate this strategy
weight: Multiplier for the strategy's contribution to overall score
ratioThreshold: Minimum reversion ratio to trigger detection (0.3 = 30%)
clusterWindowSize: Maximum gap (bp) between positions in same cluster
minClusterSize: Minimum reversions required to form a cluster
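
To make the parameter semantics concrete, here is a hedged sketch of how the five settings could be represented and sanity-checked in Rust. The struct name `ReversionClusteringParams` and the `validate` method are hypothetical and are not the `QcRecombConfigReversionClustering` type added in this PR. The default values are copied from the example above and may not match the shipped defaults, and the validation rules are assumptions about what sensible values look like.

```rust
/// Hypothetical in-memory form of the `reversionClustering` block.
#[derive(Debug, Clone, PartialEq)]
pub struct ReversionClusteringParams {
    pub enabled: bool,
    pub weight: f64,
    pub ratio_threshold: f64,
    pub cluster_window_size: usize,
    pub min_cluster_size: usize,
}

impl Default for ReversionClusteringParams {
    /// Values taken from the example config, not necessarily the real defaults.
    fn default() -> Self {
        Self {
            enabled: true,
            weight: 50.0,
            ratio_threshold: 0.3,
            cluster_window_size: 500,
            min_cluster_size: 3,
        }
    }
}

impl ReversionClusteringParams {
    /// Assumed sanity checks: the ratio is a fraction, the weight is a finite
    /// non-negative multiplier, and window and cluster sizes are positive.
    pub fn validate(&self) -> Result<(), String> {
        if !(0.0..=1.0).contains(&self.ratio_threshold) {
            return Err(format!(
                "ratioThreshold must be within [0, 1], got {}",
                self.ratio_threshold
            ));
        }
        if !self.weight.is_finite() || self.weight < 0.0 {
            return Err(format!(
                "weight must be finite and non-negative, got {}",
                self.weight
            ));
        }
        if self.cluster_window_size == 0 {
            return Err("clusterWindowSize must be at least 1 bp".to_owned());
        }
        if self.min_cluster_size == 0 {
            return Err("minClusterSize must be at least 1".to_owned());
        }
        Ok(())
    }
}
```

Checking values up front like this would catch a threshold entered as a percentage (for example 30 instead of 0.3), which would otherwise silently disable detection because no ratio can exceed 1.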

Advantages

Exploits a biological signal characteristic of recombinants
Works without requiring labeled mutations or multiple ancestors
Computationally efficient - linear time in number of reversions
Detects recombination even when one parent is the assigned clade
Parameters are intuitive and biologically interpretable

Limitations

Requires sufficient reversions to form clusters (may miss recent recombinants)
Cannot distinguish recombination from convergent evolution in hotspots
Window size and thresholds may need tuning per pathogen
Does not identify breakpoint locations precisely
Less effective when both parents are equally distant from reference

Comparison to Other Strategies

vs Strategy A (Weighted Threshold): A counts mutations; D examines spatial distribution. D can detect recombinants with moderate mutation counts if reversions are clustered.
vs Strategy B (Spatial Uniformity): B looks at all mutations; D focuses specifically on reversions which are more diagnostic of recombination.
vs Strategy C (Cluster Gaps): C uses SNP clusters from a separate QC rule; D is self-contained using only private mutation data.
vs Strategy E (Multi-Ancestor): E requires ancestral search results; D works with single-reference analysis.
vs Strategy F (Label Switching): F requires mutation labels in dataset; D works on any dataset with a reference.

Choose D when: datasets lack labels, ancestral search is disabled, or you want a signal complementary to total mutation counts.

Implementation Summary

Files modified:

packages/nextclade/src/qc/qc_config.rs - Added QcRecombConfigReversionClustering
packages/nextclade/src/qc/qc_rule_recombinants.rs - Added strategy_reversion_clustering()
packages/nextclade/src/qc/qc_recomb_utils.rs - Added find_position_clusters() utility
packages/nextclade/src/qc/mod.rs - Registered new modules
packages/nextclade-web/src/helpers/formatQCRecombinants.ts - UI formatting for clusters
packages/nextclade-web/src/components/Results/ListOfQcIsuues.tsx - Display integration
packages/nextclade-schemas/*.schema.{json,yaml} - Updated JSON schemas

Test coverage:

Unit tests for find_position_clusters() with various cluster configurations
Unit tests for strategy_reversion_clustering() covering disabled, empty, below-threshold, no-clusters, single-cluster, and multiple-cluster scenarios
Test dataset: data/recomb/enpen/enterovirus/ev-d68/ with sequences showing reversion patterns

Future Work

Adaptive threshold based on genome length and mutation rate
Report detected cluster positions in output for visualization
Combine with breakpoint detection algorithms
Weight clusters by size (larger clusters = stronger signal)
Consider gaps between clusters as additional signal

Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-01-13T14:21:21Z

Preview: https://nextstrain--nextclade--pr-1739.previews.neherlab.click

(ci)

ivan-aksamentov · 2026-01-13T14:36:39Z

Test with strategy-specific dataset:

Preview with EV-D68 test dataset

Closes #1699

Combines four recombination detection strategies:

- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:

```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (simple sum of strategy scores) is a temporary solution. The scoring mechanism needs further discussion to determine the optimal combination approach.

ivan-aksamentov requested a review from Copilot January 13, 2026 13:13

Copilot started reviewing on behalf of ivan-aksamentov January 13, 2026 13:13

ivan-aksamentov mentioned this pull request Jan 13, 2026

QC label for recombinant sequences #1699

Open

This comment was marked as resolved.

fix: remove unnecessary deprecation attribute

0ce1fe1

Co-Authored-By: Claude <noreply@anthropic.com>

ivan-aksamentov mentioned this pull request Jan 20, 2026

feat(qc): add combined recombination detection strategies B+C+D+F #1742

Open

ivan-aksamentov wants to merge 2 commits into master from feat/qc-recomb-strategy-d

2 participants