Description
Problem
The Association class (and all ~75 subclasses) requires an id field, but the schema provides no mechanism for generating one. This forces every downstream consumer to invent its own ID generation strategy, typically resulting in random UUIDs or arbitrary placeholder values. The consequence: the same real-world association (same subject, predicate, object, and qualifiers) gets a different ID every time it is instantiated.
Concrete problems
1. Users must supply meaningless placeholder IDs
Since id is required, users who just want to construct an Association programmatically must invent a value. This leads to patterns like id=2, id="foo", or id=str(uuid.uuid4()) scattered across codebases (see e.g. #1507 where a user passes id=2 just to instantiate a GeneToGoTermAssociation). The required id becomes ceremony rather than a meaningful identifier.
2. Reproducibility breaks across pipeline runs
Running the same ingest pipeline twice on identical input produces different IDs. This makes diff-based change detection, incremental updates, and auditing harder than they need to be.
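A minimal sketch of this failure mode, assuming a toy ingest step that assigns random UUIDs. The `ingest` and `content` helpers and the `EX:` CURIEs are hypothetical placeholders, not part of any Biolink tooling:

```python
import uuid

def ingest(rows):
    """Toy ingest step (hypothetical): assign a random UUID to every row."""
    return [{"id": f"uuid:{uuid.uuid4()}", **row} for row in rows]

def content(assoc):
    """Everything except the id, i.e. what the association actually asserts."""
    return {k: v for k, v in assoc.items() if k != "id"}

# Placeholder input; the CURIEs are made up for illustration.
rows = [{"subject": "EX:gene1", "predicate": "biolink:related_to", "object": "EX:term1"}]

run_1 = ingest(rows)
run_2 = ingest(rows)

# Identical input, identical content, but the ids differ between runs.
same_content = content(run_1[0]) == content(run_2[0])
same_id = run_1[0]["id"] == run_2[0]["id"]
print(same_content, same_id)
```

An id-keyed diff between the two runs would report every association as removed and re-added, even though nothing changed.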
3. No standard identity semantics
The schema doesn't define what combination of fields makes an association unique. There are no id_prefixes on Association or any subclass, no unique_keys declarations, and no guidance on how to construct an identifier. Every project reinvents this.
The model already encodes identity-relevant structure
The biolink slot is_a hierarchy draws a clean line between qualifier slots (which refine an assertion's semantics) and metadata slots (evidence, provenance, sources). Every qualifier slot descends from qualifier; no metadata slot does. This structure could inform which fields participate in identity — but it's not currently leveraged.
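One way this split could be computed is by walking each slot's is_a chain and checking whether it reaches qualifier. The sketch below uses a small hand-written parent map as a stand-in for the real schema. The map contents and the helper names `descends_from` and `split_slots` are assumptions for illustration, not copied from the Biolink Model:

```python
# Hypothetical child -> parent (is_a) map; a stand-in for the real slot hierarchy.
SLOT_PARENTS = {
    "association slot": None,
    "qualifier": "association slot",
    "aspect qualifier": "qualifier",
    "subject aspect qualifier": "aspect qualifier",
    "object aspect qualifier": "aspect qualifier",
    "publications": "association slot",
    "has evidence": "association slot",
}

def descends_from(slot, ancestor, parents=SLOT_PARENTS):
    """Return True if `slot` is `ancestor` or reaches it by following is_a links."""
    seen = set()
    current = slot
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guard against accidental cycles in the map
        current = parents.get(current)
    return False

def split_slots(slots):
    """Partition slots into identity-relevant (qualifier) and metadata slots."""
    identity = [s for s in slots if descends_from(s, "qualifier")]
    metadata = [s for s in slots if not descends_from(s, "qualifier")]
    return identity, metadata

identity, metadata = split_slots(
    ["subject aspect qualifier", "publications", "has evidence", "object aspect qualifier"]
)
print(identity)
print(metadata)
```

In a real implementation the parent map would presumably be read from the schema itself rather than written by hand.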
Scope
- 75+ Association subclasses affected, each with a different set of qualifier slots (ranging from 2 on base Association to 15+ on specialized subclasses)
- No id_prefixes defined on Association or any subclass
- No unique_keys defined anywhere in the Association hierarchy
Desired outcome
Association instances should support deterministic, content-based IDs derived from their populated fields, so that:
- The same inputs always produce the same ID
- Users can construct associations without manually supplying an ID
- Explicitly-supplied IDs are still respected when provided
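These three properties could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the simplified `Association` dataclass, the `content_id` helper, the `assoc` prefix, the choice of SHA-256 truncated to 16 hex characters, and the decision to hash only subject, predicate, object, and qualifiers are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional

def content_id(subject, predicate, obj, qualifiers, prefix="assoc"):
    """Hash the identity-relevant fields into a stable id (hypothetical scheme)."""
    payload = {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        # Only populated qualifiers participate in identity.
        "qualifiers": {k: v for k, v in qualifiers.items() if v is not None},
    }
    # sort_keys gives a canonical serialization regardless of insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest[:16]}"

@dataclass
class Association:
    """Simplified stand-in for the generated Biolink class."""
    subject: str
    predicate: str
    object: str
    qualifiers: dict = field(default_factory=dict)
    publications: list = field(default_factory=list)  # metadata, excluded from the hash
    id: Optional[str] = None

    def __post_init__(self):
        # Respect an explicitly supplied id; otherwise derive one from content.
        if self.id is None:
            self.id = content_id(self.subject, self.predicate, self.object, self.qualifiers)

# Placeholder CURIEs and qualifier values, made up for illustration.
a = Association("EX:gene1", "biolink:related_to", "EX:term1",
                qualifiers={"subject aspect qualifier": "activity", "object aspect qualifier": None})
b = Association("EX:gene1", "biolink:related_to", "EX:term1",
                qualifiers={"subject aspect qualifier": "activity"},
                publications=["EX:paper1"])
c = Association("EX:gene1", "biolink:related_to", "EX:term1", id="EX:custom-id")
d = Association("EX:gene1", "biolink:related_to", "EX:term2")
```

Here `a` and `b` receive the same id because they differ only in metadata and an unpopulated qualifier, `c` keeps its supplied id, and `d` gets a different id because its object differs.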