feat(qc): recombination: D: Reversion Clustering #1739

Open
ivan-aksamentov wants to merge 2 commits into master from feat/qc-recomb-strategy-d
@ivan-aksamentov
Member

## Recombination Detection: Strategy D - Reversion Clustering

### Scientific Motivation

Recombinant sequences inherit different genomic regions from distinct parental
lineages. When a sequence is compared to a single reference or assigned clade,
positions where the "wrong" parent was inherited appear as reversions - mutations
back to an ancestral state.

These reversions cluster spatially because recombination breakpoints define
contiguous regions inherited from each parent. A sequence inheriting region A
from parent X and region B from parent Y will show reversions concentrated in
whichever region mismatches the assigned lineage. Non-recombinant sequences
typically have reversions scattered randomly across the genome.

The key insight is that clustered reversions indicate mosaic inheritance -
a hallmark of recombination.

### Mechanism

The algorithm proceeds in three steps:

1. **Reversion ratio calculation**: Compute the fraction of private substitutions
   that are reversions: `reversionRatio = numReversions / totalPrivateSubstitutions`.
   A high ratio (>30% by default) suggests lineage mixing.

2. **Cluster detection**: Group reversion positions using a sliding window approach.
   Positions within `clusterWindowSize` nucleotides of each other are grouped.
   Only groups meeting `minClusterSize` are retained as clusters.

3. **Scoring**: If both conditions are met (ratio above threshold AND at least
   one cluster exists), the score is: `reversionRatio * numClusters * weight`.
   This rewards both higher reversion density and multiple distinct clusters
   (suggesting multiple breakpoints).
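
The three steps above can be sketched in Rust as follows. This is a minimal illustration, not the code in this PR: `find_position_clusters()` is named in the implementation summary, but the signature used here, the `reversion_clustering_score()` helper, and the gap-based chaining (a position joins the current cluster when it lies within `window` of the previous position) are assumptions, as is the strict `>` comparison against the threshold.

```rust
/// Hypothetical sketch: chain sorted positions into groups where each
/// consecutive gap is at most `window`, then keep groups with at least
/// `min_size` members.
fn find_position_clusters(positions: &[usize], window: usize, min_size: usize) -> Vec<Vec<usize>> {
    let mut sorted = positions.to_vec();
    sorted.sort_unstable();

    let mut clusters: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    for pos in sorted {
        let last = current.last().copied();
        if let Some(last) = last {
            if pos - last > window {
                // Gap too large: close the current group.
                let finished = std::mem::take(&mut current);
                if finished.len() >= min_size {
                    clusters.push(finished);
                }
            }
        }
        current.push(pos);
    }
    if current.len() >= min_size {
        clusters.push(current);
    }
    clusters
}

/// Hypothetical sketch of the scoring rule:
/// `reversionRatio * numClusters * weight` when the ratio exceeds the
/// threshold and at least one cluster exists, otherwise 0.
fn reversion_clustering_score(
    reversion_positions: &[usize],
    total_private_substitutions: usize,
    ratio_threshold: f64,
    window: usize,
    min_size: usize,
    weight: f64,
) -> f64 {
    if total_private_substitutions == 0 {
        return 0.0; // No private substitutions: the ratio is undefined.
    }
    let ratio = reversion_positions.len() as f64 / total_private_substitutions as f64;
    let clusters = find_position_clusters(reversion_positions, window, min_size);
    if ratio > ratio_threshold && !clusters.is_empty() {
        ratio * clusters.len() as f64 * weight
    } else {
        0.0
    }
}
```

For example, reversions at positions 100, 250, 400, 5000, 5100, 5300 and 9000 with a 500 bp window and a minimum size of 3 form two clusters (the isolated position 9000 is dropped). With 14 private substitutions in total, the ratio is 0.5 and the score is 0.5 * 2 * 50 = 50.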

### Configuration

Add to `pathogen.json` under `qc.recombinants`:

```json
{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "reversionClustering": {
        "enabled": true,
        "weight": 50.0,
        "ratioThreshold": 0.3,
        "clusterWindowSize": 500,
        "minClusterSize": 3
      }
    }
  }
}
```

Parameters:
- `enabled`: Activate this strategy
- `weight`: Multiplier for the strategy's contribution to the overall score
- `ratioThreshold`: Minimum reversion ratio to trigger detection (0.3 = 30%)
- `clusterWindowSize`: Maximum gap (bp) between positions in the same cluster
- `minClusterSize`: Minimum number of reversions required to form a cluster
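
As a worked example with the values above (the counts are hypothetical, chosen only to illustrate the arithmetic): a sequence has 10 private substitutions, 4 of which are reversions lying within 500 bp of one another, so they form a single cluster.

```math
\begin{aligned}
\text{reversionRatio} &= \frac{4}{10} = 0.4 > 0.3 \\
\text{score} &= 0.4 \times 1 \times 50.0 = 20.0
\end{aligned}
```

By contrast, 3 reversions among 12 private substitutions give a ratio of 0.25, below `ratioThreshold`, so the strategy contributes 0 even if those three reversions form a cluster.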

### Advantages

- Exploits a biological signal characteristic of recombinants
- Works without requiring labeled mutations or multiple ancestors
- Computationally efficient - linear time in the number of reversions
- Detects recombination even when one parent is the assigned clade
- Parameters are intuitive and biologically interpretable

### Limitations

- Requires sufficient reversions to form clusters (may miss recent recombinants)
- Cannot distinguish recombination from convergent evolution in hotspots
- Window size and thresholds may need tuning per pathogen
- Does not identify breakpoint locations precisely
- Less effective when both parents are equally distant from the reference

### Comparison to Other Strategies

- **vs Strategy A (Weighted Threshold)**: A counts mutations; D examines spatial
  distribution. D can detect recombinants with moderate mutation counts if
  reversions are clustered.
- **vs Strategy B (Spatial Uniformity)**: B looks at all mutations; D focuses
  specifically on reversions, which are more diagnostic of recombination.
- **vs Strategy C (Cluster Gaps)**: C uses SNP clusters from a separate QC rule;
  D is self-contained using only private mutation data.
- **vs Strategy E (Multi-Ancestor)**: E requires ancestral search results; D
  works with single-reference analysis.
- **vs Strategy F (Label Switching)**: F requires mutation labels in dataset; D
  works on any dataset with a reference.

Choose D when: datasets lack labels, ancestral search is disabled, or you want
a signal complementary to total mutation counts.

### Implementation Summary

Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigReversionClustering`
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Added `strategy_reversion_clustering()`
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added `find_position_clusters()` utility
- `packages/nextclade/src/qc/mod.rs` - Registered new modules
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - UI formatting for clusters
- `packages/nextclade-web/src/components/Results/ListOfQcIsuues.tsx` - Display integration
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Updated JSON schemas

Test coverage:
- Unit tests for `find_position_clusters()` with various cluster configurations
- Unit tests for `strategy_reversion_clustering()` covering disabled, empty, below-threshold,
  no-clusters, single-cluster, and multiple-cluster scenarios
- Test dataset: `data/recomb/enpen/enterovirus/ev-d68/` with sequences showing reversion patterns

### Future Work

- Adaptive threshold based on genome length and mutation rate
- Report detected cluster positions in output for visualization
- Combine with breakpoint detection algorithms
- Weight clusters by size (larger clusters = stronger signal)
- Consider gaps between clusters as additional signal

Co-Authored-By: Claude <noreply@anthropic.com>

@ivan-aksamentov
Member Author

Test with strategy-specific dataset:

Preview with EV-D68 test dataset

ivan-aksamentov added a commit that referenced this pull request Jan 20, 2026
Closes #1699

Combines four recombination detection strategies:
- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:
```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (simple sum of strategy
scores) is a temporary solution. The scoring mechanism needs further
discussion to determine optimal combination approach.