epic: Better Reverse Translation #746

@bencap

Description

Problem

Three gaps in MaveDB's current reverse translation system:

  1. Coverage ceiling: We can only generate reverse translations for variants already registered in ClinGen. Variants absent from ClinGen have no nucleotide-level equivalents surfaced.
  2. Annotation linkage: Derived nucleotide alleles (the NT variants encoding a protein change) have no first-class representation in our data model and cannot be independently annotated.
  3. API semantics: There is no structured way to tell the UI and API consumers that a set of nucleotide variants is a categorical representation of a protein-level scored variant rather than a set of directly measured variants.

Architecture

Data model (AssayedVariant → MappingRecord → Allele):

  • MappingRecord replaces MappedVariant as the provenance record per mapping run. It carries vrs_digest (indexed, pre-mapped variant), pre_mapped JSONB, assay_level, mapping metadata, and QC fields. It has an M:N relationship to Allele rows via a mapping_record_alleles association table.

  • Allele is a flat table (no inheritance) deduplicated by VRS digest across all score sets. Key columns:

    Column                     Notes
    vrs_digest                 unique
    level                      enum: genomic | coding | protein
    transcript                 NOT NULL; present for all levels
    hgvs_g / hgvs_c / hgvs_p   nullable; populated in post-processing, enforced at application layer
    clingen_allele_id          nullable; populated where available
    post_mapped                JSONB; raw mapper output for this allele at this level

    The same allele appearing in multiple score sets shares a single row; annotation results are shared accordingly.

  • Annotation FKs point to allele_id. AnnotationStatus is scoped to QC/audit only. Annotation data lives in first-class per-type tables with superseded_at for temporal queries: VEPAnnotation (new), GnomADVariant (updated), ClinicalControl (updated).
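The data model above can be sketched with a throwaway SQLite schema. This is a minimal illustration, not the actual migration: SQLite stands in for PostgreSQL (so JSONB becomes TEXT), the helper names get_or_create_allele and link are hypothetical, and the digests and transcript accession are placeholders.

```python
import sqlite3

# Illustrative schema only; column subset taken from the description above.
SCHEMA = """
CREATE TABLE alleles (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL UNIQUE,
    level TEXT NOT NULL CHECK (level IN ('genomic', 'coding', 'protein')),
    transcript TEXT NOT NULL,
    hgvs_g TEXT, hgvs_c TEXT, hgvs_p TEXT,
    clingen_allele_id TEXT,
    post_mapped TEXT              -- JSONB in PostgreSQL
);
CREATE TABLE mapping_records (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL,     -- digest of the pre-mapped variant
    pre_mapped TEXT,              -- JSONB in PostgreSQL
    assay_level TEXT NOT NULL
);
CREATE INDEX ix_mapping_records_vrs_digest ON mapping_records (vrs_digest);
CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL REFERENCES mapping_records (id),
    allele_id INTEGER NOT NULL REFERENCES alleles (id),
    PRIMARY KEY (mapping_record_id, allele_id)
);
"""


def get_or_create_allele(conn, vrs_digest, level, transcript):
    """Return the id of the allele row for this digest, inserting it if absent."""
    conn.execute(
        "INSERT OR IGNORE INTO alleles (vrs_digest, level, transcript) VALUES (?, ?, ?)",
        (vrs_digest, level, transcript),
    )
    row = conn.execute(
        "SELECT id FROM alleles WHERE vrs_digest = ?", (vrs_digest,)
    ).fetchone()
    return row[0]


def link(conn, mapping_record_id, allele_id):
    """Associate a mapping record with an allele (M:N, idempotent)."""
    conn.execute(
        "INSERT OR IGNORE INTO mapping_record_alleles VALUES (?, ?)",
        (mapping_record_id, allele_id),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Two mapping records (for example, from two score sets) that yield the same allele.
record_ids = []
for pre_mapped_digest in ("pre-digest-1", "pre-digest-2"):
    cur = conn.execute(
        "INSERT INTO mapping_records (vrs_digest, assay_level) VALUES (?, 'protein')",
        (pre_mapped_digest,),
    )
    record_ids.append(cur.lastrowid)

allele_ids = [
    get_or_create_allele(conn, "allele-digest-A", "coding", "NM_000000.0")
    for _ in record_ids
]
for record_id, allele_id in zip(record_ids, allele_ids):
    link(conn, record_id, allele_id)

allele_count = conn.execute("SELECT COUNT(*) FROM alleles").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM mapping_record_alleles").fetchone()[0]
```

Both mapping records end up linked to one shared allele row, which is what lets annotation results be reused across score sets.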

Schema rule: Fields stable by construction (HGVS strings, ClinGen IDs) live as columns on the alleles / mapping_records tables. External interpretations subject to revision (VEP, gnomAD, ClinVar) live in temporal annotation tables with superseded_at.
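As a rough sketch of how a temporal annotation table with superseded_at could behave, the example below again uses SQLite. Only allele_id and superseded_at come from the description; the table name vep_annotations, the consequence and created_at columns, and both helper functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vep_annotations (
        id INTEGER PRIMARY KEY,
        allele_id INTEGER NOT NULL,
        consequence TEXT NOT NULL,   -- hypothetical payload column
        created_at TEXT NOT NULL,    -- ISO-8601 date, hypothetical column
        superseded_at TEXT           -- NULL means this row is current
    );
    """
)


def record_annotation(conn, allele_id, consequence, now):
    """Close out the current row for this allele, then insert the new one."""
    conn.execute(
        "UPDATE vep_annotations SET superseded_at = ? "
        "WHERE allele_id = ? AND superseded_at IS NULL",
        (now, allele_id),
    )
    conn.execute(
        "INSERT INTO vep_annotations (allele_id, consequence, created_at) "
        "VALUES (?, ?, ?)",
        (allele_id, consequence, now),
    )


def annotation_as_of(conn, allele_id, when):
    """Return the consequence that was current at `when`, or None."""
    row = conn.execute(
        "SELECT consequence FROM vep_annotations "
        "WHERE allele_id = ? AND created_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (allele_id, when, when),
    ).fetchone()
    return row[0] if row else None


# ISO-8601 date strings sort lexicographically, so string comparison is enough here.
record_annotation(conn, 1, "missense_variant", "2024-01-01")
record_annotation(conn, 1, "splice_region_variant", "2025-01-01")

earlier = annotation_as_of(conn, 1, "2024-06-01")
latest = annotation_as_of(conn, 1, "2025-06-01")
before_any = annotation_as_of(conn, 1, "2023-01-01")
```

Stable fields such as HGVS strings never need this treatment, which is why the rule keeps them as plain columns.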

Pipeline:

  1. Mapping pipeline (dcd_mapping) produces a MappingRecord + Allele rows at all applicable levels for every variant. For protein-level targets, reverse translation enumerates all coding variants encoding each amino acid change.
  2. Coding and genomic Allele rows (level = 'coding' or level = 'genomic') are submitted to ClinGen pre-registration. Alleles already registered are skipped.
  3. Existing annotation jobs (VEP, ClinVar, gnomAD) are extended to annotate Allele rows via allele_id, with level-appropriate routing driven by the level column.
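To make the reverse translation in step 1 concrete, here is a minimal sketch that enumerates the codons encoding a target amino acid under the standard genetic code. It is not the dcd_mapping implementation; the function name reverse_translate is hypothetical, and real code would also need transcript coordinates and HGVS construction.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_translate(ref_codon, alt_aa):
    """List every codon, other than the reference, that encodes alt_aa.

    Each returned codon corresponds to one candidate coding variant at
    the position occupied by ref_codon.
    """
    if ref_codon not in CODON_TABLE:
        raise ValueError(f"not a DNA codon: {ref_codon!r}")
    return sorted(
        codon
        for codon, aa in CODON_TABLE.items()
        if aa == alt_aa and codon != ref_codon
    )


met_to_ile = reverse_translate("ATG", "I")
cys_to_trp = reverse_translate("TGT", "W")
met_to_leu = reverse_translate("ATG", "L")
```

A Met to Ile change, for instance, expands to three candidate codons, each of which would become its own coding Allele row.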

API transit:

  • CatVRS (GA4GH Categorical Variation Representation Specification) is used as a transit layer in API responses to express that a set of coding alleles share an implied score — they encode a protein change that was scored, but are not scored directly.
  • Traversal: AssayedVariant with MappingRecord.assay_level = protein → mapping_record_alleles → Allele rows where level = 'coding' → CatVRS members.
  • Storage remains in the alleles table; CatVRS lives only in responses.
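The traversal can be sketched as a response-building step. The payload keys below are illustrative and are not the normative Cat-VRS schema; the Allele dataclass, the function implied_score_payload, the digests, and the placeholder accessions are all assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Allele:
    vrs_digest: str
    level: str  # 'genomic' | 'coding' | 'protein'
    hgvs: str


def implied_score_payload(
    assay_level: str, score: float, linked: List[Allele]
) -> Optional[dict]:
    """Build a categorical payload for a protein-level scored variant.

    Returns None when the variant was not assayed at the protein level,
    because its nucleotide alleles were then measured directly.
    """
    if assay_level != "protein":
        return None
    protein = next((a for a in linked if a.level == "protein"), None)
    if protein is None:
        return None
    members = [a for a in linked if a.level == "coding"]
    return {
        "kind": "categorical_variant",  # hypothetical key, not Cat-VRS
        "defined_by": {"vrs_digest": protein.vrs_digest, "hgvs": protein.hgvs},
        "members": [{"vrs_digest": a.vrs_digest, "hgvs": a.hgvs} for a in members],
        "score": score,
        "score_is_implied": True,  # members were not scored directly
    }


linked_alleles = [
    Allele("prot-1", "protein", "NP_000000.0:p.Met1Ile"),
    Allele("cod-1", "coding", "NM_000000.0:c.3G>A"),
    Allele("cod-2", "coding", "NM_000000.0:c.3G>C"),
    Allele("gen-1", "genomic", "NC_000000.0:g.100G>A"),
]

payload = implied_score_payload("protein", 0.42, linked_alleles)
member_digests = [m["vrs_digest"] for m in payload["members"]]
direct = implied_score_payload("nucleotide", 0.42, linked_alleles)
```

Because the payload is assembled at response time from alleles rows, nothing categorical needs to be persisted.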

Child Issues

dcd_mapping2

Data Model & Storage

ClinGen Pre-Registration

Annotation Pipeline

API Layer

Backfill

Deferred

Metadata

Assignees

No one assigned

    Labels

    app: backend (Task implementation touches the backend)
    app: database (Task implementation requires database changes)
    app: frontend (Task implementation touches the frontend)
    app: mapper (Task implementation touches the mapper)
    app: worker (Task implementation touches the worker)
    type: feature (New feature)
    workstream: clinical (Task relates to clinical features)
