epic: Better Reverse Translation #746

## Problem

Three gaps in MaveDB's current reverse translation system:

1. **Coverage ceiling**: We can only generate reverse translations for variants already registered in ClinGen. Variants absent from ClinGen have no nucleotide-level equivalents surfaced.
2. **Annotation linkage**: Derived nucleotide alleles (the NT variants encoding a protein change) have no first-class representation in our data model and cannot be independently annotated.
3. **API semantics**: There is no structured way to tell the UI and API consumers that a set of nucleotide variants is a categorical representation of a protein-level scored variant rather than a set of directly measured variants.

## Architecture

**Data model** (`AssayedVariant` → `MappingRecord` → `Allele`):

- `MappingRecord` replaces `MappedVariant` as the provenance record per mapping run. It carries `vrs_digest` (indexed, pre-mapped variant), `pre_mapped` JSONB, `assay_level`, mapping metadata, and QC fields. It has an M:N relationship to `Allele` rows via a `mapping_record_alleles` association table.
- `Allele` is a **flat table** (no inheritance) deduplicated by VRS digest across all score sets. Key columns:

  | Column | Notes |
  |---|---|
  | `vrs_digest` | unique |
  | `level` | enum: `genomic` \| `coding` \| `protein` |
  | `transcript` | NOT NULL — present for all levels |
  | `hgvs_g` / `hgvs_c` / `hgvs_p` | nullable; populated in post-processing, enforced at application layer |
  | `clingen_allele_id` | nullable, populated where available |
  | `post_mapped` | JSONB — raw mapper output for this allele at this level |

  The same allele appearing in multiple score sets shares a single row; annotation results are shared accordingly.

- Annotation FKs point to `allele_id`. `AnnotationStatus` is scoped to QC/audit only. Annotation data lives in first-class per-type tables with `superseded_at` for temporal queries: `VEPAnnotation` (new), `GnomADVariant` (updated), `ClinicalControl` (updated).
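
The shape described above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the real schema targets PostgreSQL (JSONB, native enums), the `id` columns and the `get_or_create_allele` helper are hypothetical names, and the mapping metadata and QC fields on `mapping_records` are omitted.

```python
import sqlite3

# Hypothetical DDL mirroring the columns listed above (SQLite stand-in for PostgreSQL).
SCHEMA = """
CREATE TABLE mapping_records (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL,           -- digest of the pre-mapped variant
    pre_mapped TEXT,                    -- JSONB in PostgreSQL
    assay_level TEXT NOT NULL
);
CREATE INDEX ix_mapping_records_vrs_digest ON mapping_records (vrs_digest);

CREATE TABLE alleles (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL UNIQUE,    -- deduplication key across score sets
    level TEXT NOT NULL CHECK (level IN ('genomic', 'coding', 'protein')),
    transcript TEXT NOT NULL,
    hgvs_g TEXT,
    hgvs_c TEXT,
    hgvs_p TEXT,
    clingen_allele_id TEXT,
    post_mapped TEXT                    -- JSONB in PostgreSQL
);

CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL REFERENCES mapping_records (id),
    allele_id INTEGER NOT NULL REFERENCES alleles (id),
    PRIMARY KEY (mapping_record_id, allele_id)
);
"""


def get_or_create_allele(conn, vrs_digest, level, transcript):
    """Return the id of the allele with this digest, inserting it only if absent."""
    conn.execute(
        "INSERT INTO alleles (vrs_digest, level, transcript) VALUES (?, ?, ?) "
        "ON CONFLICT (vrs_digest) DO NOTHING",
        (vrs_digest, level, transcript),
    )
    row = conn.execute(
        "SELECT id FROM alleles WHERE vrs_digest = ?", (vrs_digest,)
    ).fetchone()
    return row[0]
```

Because the lookup key is the VRS digest, two mapping runs in different score sets that produce the same allele resolve to the same `alleles.id`, which is what lets annotation results be shared.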

**Schema rule**: Fields stable by construction (HGVS strings, ClinGen IDs) live as columns on the `alleles` / `mapping_records` tables. External interpretations subject to revision (VEP, gnomAD, ClinVar) live in temporal annotation tables with `superseded_at`.
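
As a rough sketch of what `superseded_at` buys, the snippet below retires the current annotation row when a new one arrives and then answers an "as of" query. Only `allele_id` and `superseded_at` come from the design above; the `consequence` and `created_at` columns, the table name, and both helper functions are assumptions made for illustration.

```python
import sqlite3

# Hypothetical temporal annotation table (SQLite stand-in for PostgreSQL).
ANNOTATION_SCHEMA = """
CREATE TABLE vep_annotations (
    id INTEGER PRIMARY KEY,
    allele_id INTEGER NOT NULL,
    consequence TEXT NOT NULL,   -- assumed payload column
    created_at TEXT NOT NULL,    -- assumed; ISO-8601 UTC string
    superseded_at TEXT           -- NULL while the row is current
);
"""


def record_annotation(conn, allele_id, consequence, now):
    """Mark the current row (if any) as superseded, then insert the new one."""
    conn.execute(
        "UPDATE vep_annotations SET superseded_at = ? "
        "WHERE allele_id = ? AND superseded_at IS NULL",
        (now, allele_id),
    )
    conn.execute(
        "INSERT INTO vep_annotations (allele_id, consequence, created_at) "
        "VALUES (?, ?, ?)",
        (allele_id, consequence, now),
    )


def annotation_as_of(conn, allele_id, when):
    """Return the consequence that was current at `when`, or None."""
    row = conn.execute(
        "SELECT consequence FROM vep_annotations "
        "WHERE allele_id = ? AND created_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (allele_id, when, when),
    ).fetchone()
    return row[0] if row else None
```

Timestamps are compared as strings here, which is only safe when every value uses the same ISO-8601 format and timezone; a PostgreSQL implementation would presumably use a timestamp column instead.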

**Pipeline**:

1. Mapping pipeline (dcd_mapping) produces a `MappingRecord` + `Allele` rows at all applicable levels for every variant. For protein-level targets, reverse translation enumerates all coding variants encoding each amino acid change.
2. Coding and genomic `Allele` rows (`level = 'coding'` or `level = 'genomic'`) are submitted to ClinGen pre-registration. Alleles already registered are skipped.
3. Existing annotation jobs (VEP, ClinVar, gnomAD) are extended to annotate `Allele` rows via `allele_id`, with level-appropriate routing driven by the `level` column.
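
The enumeration in step 1 can be illustrated at the single-codon level. This is a minimal sketch, not the dcd_mapping implementation: it assumes the standard genetic code (NCBI translation table 1), and the function names are hypothetical. Whether the real pipeline restricts candidates (for example to single-nucleotide changes) is not specified here, so the sketch simply reports how many bases each candidate changes.

```python
# Standard genetic code, bases ordered T, C, A, G (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES), AMINO_ACIDS
    )
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def candidate_codons(ref_codon, alt_amino_acid):
    """Codons encoding `alt_amino_acid`, with the number of bases changed from `ref_codon`.

    The reference codon itself is excluded, since it is not a variant.
    """
    candidates = []
    for codon in codons_for(alt_amino_acid):
        if codon == ref_codon:
            continue
        changed = sum(1 for r, c in zip(ref_codon, codon) if r != c)
        candidates.append((codon, changed))
    return candidates


# Example: an Ala (GCT) to Val change has four possible encodings.
print(candidate_codons("GCT", "V"))
```

Each candidate codon would then correspond to one `Allele` row at `level = 'coding'`, linked back to the protein-level `MappingRecord` through `mapping_record_alleles`.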

**API transit**:

- CatVRS (GA4GH Categorical Variation Representation Specification) is used as a transit layer in API responses to express that a set of coding alleles share an implied score — they encode a protein change that was scored, but are not scored directly.
- Traversal: `AssayedVariant` with `MappingRecord.assay_level = protein` → `mapping_record_alleles` → `Allele` rows where `level = 'coding'` → CatVRS members.
- Storage remains in the `alleles` table; CatVRS lives only in responses.
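
The traversal above might look like the following when building a response. This is a sketch under several assumptions: the dataclasses are in-memory stand-ins for the ORM models, the HGVS strings and digests in the usage example are placeholders, and the returned dictionary is only loosely modeled on a CatVRS categorical variant. Its field names (in particular `score_relationship`) are hypothetical and should be checked against the CatVRS specification and the API contract from #743.

```python
from dataclasses import dataclass, field


@dataclass
class Allele:
    vrs_digest: str
    level: str   # 'genomic' | 'coding' | 'protein'
    hgvs: str    # simplified: one HGVS string per allele


@dataclass
class MappingRecord:
    assay_level: str
    alleles: list = field(default_factory=list)


def catvrs_transit(record):
    """Build a response-only payload for a protein-level record; nothing is persisted."""
    if record.assay_level != "protein":
        return None  # directly measured nucleotide variants need no categorical wrapper
    scored = next((a for a in record.alleles if a.level == "protein"), None)
    if scored is None:
        return None
    members = [a for a in record.alleles if a.level == "coding"]
    return {
        "type": "CategoricalVariant",
        "label": f"Coding alleles encoding {scored.hgvs}",
        "members": [{"id": a.vrs_digest, "hgvs": a.hgvs} for a in members],
        "score_relationship": "implied",  # hypothetical field name
    }


# Usage with placeholder identifiers.
record = MappingRecord(
    assay_level="protein",
    alleles=[
        Allele("ga4gh:VA.protein-placeholder", "protein", "NP_000000.0:p.Ala2Val"),
        Allele("ga4gh:VA.coding-placeholder-1", "coding", "NM_000000.0:c.5C>T"),
        Allele("ga4gh:VA.genomic-placeholder", "genomic", "NC_000000.0:g.100C>T"),
    ],
)
payload = catvrs_transit(record)
```

Keeping this construction in the serialization layer matches the storage rule above: the `alleles` table stays the source of truth and the categorical wrapper is derived on each request.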

## Child Issues

### dcd_mapping2

- [feat: Redesign mapper output to produce MappingRecord + Allele rows at all levels dcd_mapping2#100](https://github.com/VariantEffect/dcd_mapping2/issues/100) — Redesign mapper output interface

### Data Model & Storage

- [feat: Add Allele data model and MappingRecord migration #739](https://github.com/VariantEffect/mavedb-api/issues/739) — Replace MappedVariant with new schema
- [feat: Persist MappingRecord and Allele rows from mapping pipeline results #740](https://github.com/VariantEffect/mavedb-api/issues/740) — Ingest new mapper output

### ClinGen Pre-Registration

- [feat: Pre-register coding and genomic Allele rows with ClinGen #741](https://github.com/VariantEffect/mavedb-api/issues/741) — Guarantee ClinGen coverage before translation lookup

### Annotation Pipeline

- [feat: Extend annotation pipeline to cover Allele entities #742](https://github.com/VariantEffect/mavedb-api/issues/742) — Annotate alleles at all levels

### API Layer

- [spike: Design variant + annotation API shape for score set and variant pages #743](https://github.com/VariantEffect/mavedb-api/issues/743) — API contract design _(blocks [feat: Implement CatVRS transit layer on variant endpoints #744](https://github.com/VariantEffect/mavedb-api/issues/744))_
- [feat: Implement CatVRS transit layer on variant endpoints #744](https://github.com/VariantEffect/mavedb-api/issues/744) — CatVRS transit layer

### Backfill

- [feat: Retroactive backfill — migrate MappedVariant data and run reverse translation for existing score sets #747](https://github.com/VariantEffect/mavedb-api/issues/747) — Retroactive coverage for existing data

### Deferred

- [refactor: Rename Variant → AssayedVariant and MappedVariant → MappingRecord #745](https://github.com/VariantEffect/mavedb-api/issues/745) — Naming refactor

