feat(qc): recombination: C: Cluster Gaps #1738

Open
ivan-aksamentov wants to merge 2 commits into master from feat/qc-recomb-strategy-c

@ivan-aksamentov
Member

## Recombination Detection: Strategy C - Cluster Gaps

### Scientific Motivation

Viral recombination creates distinct genomic architecture patterns. When two parental
lineages recombine, each contributes a contiguous segment of its genome to the offspring.
Within each segment, the recombinant inherits the mutation profile of that parent. Since
different lineages accumulate mutations in different positions, recombinants typically
show mutation clusters that are spatially separated by large gaps - regions where neither
parent has mutations relative to the reference.

The cluster gaps strategy exploits this breakpoint signature. A non-recombinant sequence
acquires mutations gradually over time, resulting in more uniformly distributed SNPs
across the genome. In contrast, a recombinant sequence inherits pre-existing mutation
clusters from each parent, with clean gaps between them marking the approximate locations
of recombination breakpoints.

### Mechanism

This strategy reuses the SNP cluster detection already performed by the snpClusters QC
rule. The algorithm:

1. Retrieves the list of SNP clusters from `QcResultSnpClusters.clusteredSnps`
2. If fewer than 2 clusters exist, returns zero score (no gap evidence)
3. Computes gaps between consecutive clusters: `gap = cluster[i+1].start - cluster[i].end`
4. Identifies the maximum gap size
5. Computes score only if `maxGap >= minGapSize`:
   - `score = (numClusters - 1) * weightPerGap + (maxGap / minGapSize) * weightGapSize`

The score combines two signals:
- Number of gaps (more clusters = more potential breakpoints)
- Largest gap size (wider gaps = stronger breakpoint evidence)
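
The steps above can be sketched in Rust roughly as follows. This is a minimal illustration, not the code from `qc_rule_recombinants.rs`: the `Cluster`, `ClusterGapsConfig` and `ClusterGapsResult` types and the `cluster_gaps_score` function are hypothetical stand-ins, clusters are assumed to expose `start` and `end` positions, and the use of `saturating_sub` for overlapping clusters is an added assumption.

```rust
/// Hypothetical stand-in for one SNP cluster (start and end positions in bp).
#[derive(Debug, Clone, Copy)]
struct Cluster {
    start: usize,
    end: usize,
}

/// Hypothetical mirror of the `clusterGaps` parameters. Assumes `min_gap_size > 0`.
struct ClusterGapsConfig {
    min_gap_size: usize,
    weight_per_gap: f64,
    weight_gap_size: f64,
}

/// Hypothetical result: cluster count, largest gap, all gaps, and the score.
#[derive(Debug, PartialEq)]
struct ClusterGapsResult {
    num_clusters: usize,
    max_gap: usize,
    gaps: Vec<usize>,
    score: f64,
}

fn cluster_gaps_score(clusters: &[Cluster], cfg: &ClusterGapsConfig) -> ClusterGapsResult {
    // Sort by start so that consecutive pairs are neighbours on the genome.
    let mut sorted = clusters.to_vec();
    sorted.sort_by_key(|c| c.start);

    // gap = cluster[i+1].start - cluster[i].end
    // (saturating_sub yields 0 instead of underflowing if clusters overlap).
    let gaps: Vec<usize> = sorted
        .windows(2)
        .map(|w| w[1].start.saturating_sub(w[0].end))
        .collect();

    // With fewer than two clusters there are no gaps, so max_gap is 0.
    let max_gap = gaps.iter().copied().max().unwrap_or(0);

    // Score only when at least two clusters exist and the largest gap
    // reaches the threshold; otherwise the score is zero.
    let score = if sorted.len() >= 2 && max_gap >= cfg.min_gap_size {
        (sorted.len() - 1) as f64 * cfg.weight_per_gap
            + (max_gap as f64 / cfg.min_gap_size as f64) * cfg.weight_gap_size
    } else {
        0.0
    };

    ClusterGapsResult { num_clusters: sorted.len(), max_gap, gaps, score }
}
```

With the default parameters and three illustrative clusters at 100-200, 1500-1600 and 5200-5300, this sketch yields gaps of 1300 and 3600 and a score of 2 * 25.0 + 3.6 * 0.01 = 50.036.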

### Configuration

The `clusterGaps` strategy requires the `snpClusters` rule to be enabled in `pathogen.json`.
Configuration parameters:

```json
{
  "qc": {
    "snpClusters": {
      "enabled": true,
      "windowSize": 100,
      "clusterCutOff": 5,
      "scoreWeight": 50.0
    },
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "clusterGaps": {
        "enabled": true,
        "minGapSize": 1000,
        "weightPerGap": 25.0,
        "weightGapSize": 0.01
      }
    }
  }
}
```

Parameters:
- `minGapSize`: Minimum gap (bp) to trigger scoring (default: 1000)
- `weightPerGap`: Score contribution per gap (default: 25.0)
- `weightGapSize`: Score contribution scaled by the `maxGap / minGapSize` ratio (default: 0.01)
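
As a worked example of how these defaults interact, assume a hypothetical sequence with three clusters whose largest gap is 4000 bp (illustrative values, not taken from the test dataset):

```math
\text{score} = (3 - 1) \times 25.0 + \frac{4000}{1000} \times 0.01 = 50.0 + 0.04 = 50.04
```

With the default weights, the per-gap term dominates: each additional cluster adds 25.0, while the gap-size term adds only 0.01 per multiple of `minGapSize`.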

### Advantages

- Leverages existing SNP cluster infrastructure - no additional computation for clustering
- Directly targets breakpoint signatures - the gaps between clusters
- Low false positive rate for clean recombinants with distinct parental contributions
- Interpretable output - cluster count and gap sizes are biologically meaningful
- Works well when parental lineages have accumulated sufficient divergence

### Limitations

- Requires snpClusters rule to be enabled and configured appropriately
- May miss recombinants with similar parental mutation profiles (small gaps)
- Parameter tuning needed per pathogen based on typical mutation density
- Cannot detect recombination when breakpoints occur within mutation-dense regions
- Single-breakpoint recombinants may not produce multiple distinct clusters

### Comparison to Other Strategies

| Strategy | Signal | Best For |
|----------|--------|----------|
| A: Weighted Threshold | Total mutation excess | High-divergence recombinants |
| B: Spatial Uniformity | CV of mutations across segments | Any spatial clustering |
| **C: Cluster Gaps** | **Gaps between SNP clusters** | **Clear breakpoint signatures** |
| D: Reversion Clustering | Clustered reversions | Ancestral segment detection |
| E: Multi-Ancestor | Per-segment ancestor affinity | Multiple reference comparison |
| F: Label Switching | Labeled mutation regions | Lineage-annotated datasets |

Strategy C is most effective when recombination breakpoints create clean boundaries
between parental contributions. It complements Strategy B (spatial uniformity) - B
detects general non-uniformity while C specifically targets the gap pattern.

### Implementation Summary

Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigClusterGaps` config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Implemented `strategy_cluster_gaps` function
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - UI formatting for cluster gaps
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Schema updates for new fields

Key implementation details:
- `RecombResultClusterGaps` struct captures: numClusters, maxGap, gaps array, score
- Gap calculation uses `windows(2)` on sorted cluster list
- Scoring triggers only when `maxGap` is at least `minGapSize` (`maxGap >= minGapSize`)
- Web UI displays cluster count and max gap when >= 2 clusters detected

Tests added:
- Disabled config returns None
- No SNP clusters returns zero score
- Single cluster returns zero score
- Two clusters with gap below threshold returns zero score
- Two clusters with gap above threshold returns expected score
- Three clusters correctly calculates max gap
- Exact threshold boundary case

Test dataset:
- `data/recomb/enpen/enterovirus/ev-d68/` - Enterovirus D68 dataset with recombinants enabled

### Future Work

- Adaptive minGapSize based on genome length or expected mutation density
- Weighted gaps considering mutation count difference between clusters
- Integration with breakpoint detection for visualization
- Correlation with phylogenetic placement confidence

Co-Authored-By: Claude <noreply@anthropic.com>

@ivan-aksamentov
Member Author

Test with strategy-specific dataset:

Preview with EV-D68 test dataset

ivan-aksamentov added a commit that referenced this pull request Jan 20, 2026
Closes #1699

Combines four recombination detection strategies:
- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:
```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (a simple sum of strategy
scores) is a temporary solution. The scoring mechanism needs further
discussion to determine the optimal combination approach.