Association instances require manually-supplied IDs, producing non-deterministic identifiers #1707

@kevinschaper

Problem

The Association class (and all ~75 subclasses) requires an id field, but the schema provides no mechanism for generating one. This forces every downstream consumer to invent its own ID generation strategy, typically resulting in random UUIDs or arbitrary placeholder values. The consequence: the same real-world association (same subject, predicate, object, and qualifiers) gets a different ID every time it is instantiated.

Concrete problems

1. Users must supply meaningless placeholder IDs

Since id is required, users who just want to construct an Association programmatically must invent a value. This leads to patterns like id=2, id="foo", or id=str(uuid.uuid4()) scattered across codebases (see e.g. #1507 where a user passes id=2 just to instantiate a GeneToGoTermAssociation). The required id becomes ceremony rather than a meaningful identifier.

2. Reproducibility breaks across pipeline runs

Running the same ingest pipeline twice on identical input produces different IDs. This makes diff-based change detection, incremental updates, and auditing harder than they need to be.
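
To make the reproducibility problem concrete, here is a minimal sketch that simulates two runs of a pipeline assigning random UUIDs. The `ingest` function and the example CURIEs are hypothetical stand-ins, not code from any real ingest.

```python
import uuid


def ingest(rows):
    """Simulate one pipeline run that assigns a random ID to each association.

    `rows` is a list of (subject, predicate, object) tuples; this shape is an
    assumption made for illustration only.
    """
    return {str(uuid.uuid4()): row for row in rows}


# Illustrative input; the identifiers are placeholders.
rows = [
    ("HGNC:1100", "biolink:related_to", "MONDO:0007254"),
    ("HGNC:1101", "biolink:related_to", "MONDO:0007254"),
]

run_a = ingest(rows)
run_b = ingest(rows)

# The content of both runs is identical...
same_content = sorted(run_a.values()) == sorted(run_b.values())
# ...but no ID carries over, so an ID-based diff reports every record as changed.
shared_ids = set(run_a) & set(run_b)
```

With random IDs, `same_content` is true while `shared_ids` is empty, so a diff keyed on id cannot tell an unchanged record from a new one.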

3. No standard identity semantics

The schema doesn't define what combination of fields makes an association unique. There are no id_prefixes on Association or any subclass, no unique_keys declarations, and no guidance on how to construct an identifier. Every project reinvents this.

The model already encodes identity-relevant structure

The biolink slot is_a hierarchy draws a clean line between qualifier slots (which refine an assertion's semantics) and metadata slots (evidence, provenance, sources). Every qualifier slot descends from qualifier; no metadata slot does. This structure could inform which fields participate in identity, but it is not currently leveraged.
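
As a rough illustration of how that hierarchy could be used, the sketch below walks is_a parents to decide whether a slot descends from qualifier. The `SLOT_PARENTS` table is a hand-written toy excerpt: the slot names and parent links are illustrative assumptions, not copied from the schema, and a real implementation would read them from the model itself.

```python
# Toy is_a table (slot -> parent). Illustrative only; not the actual schema.
SLOT_PARENTS = {
    "qualifier": None,
    "aspect qualifier": "qualifier",
    "subject aspect qualifier": "aspect qualifier",
    "object direction qualifier": "qualifier",
    "association slot": None,
    "has evidence": "association slot",
    "publications": "association slot",
}


def descends_from(slot, ancestor, parents=SLOT_PARENTS):
    """Return True if `slot` is `ancestor` or reaches it by following is_a links."""
    current = slot
    seen = set()  # guards against accidental cycles in the table
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = parents.get(current)
    return False


def identity_slots(slots):
    """Pick the slots that descend from `qualifier`, in a stable sorted order."""
    return sorted(s for s in slots if descends_from(s, "qualifier"))


candidates = [
    "publications",
    "subject aspect qualifier",
    "has evidence",
    "object direction qualifier",
]
selected = identity_slots(candidates)
```

Under these assumptions, the qualifier slots are selected for identity and the evidence and publication slots are left out.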

Scope

  • 75+ Association subclasses affected, each with a different set of qualifier slots (ranging from 2 on base Association to 15+ on specialized subclasses)
  • No id_prefixes defined on Association or any subclass
  • No unique_keys defined anywhere in the Association hierarchy

Desired outcome

Association instances should support deterministic, content-based IDs derived from their populated fields, so that:

  • The same inputs always produce the same ID
  • Users can construct associations without manually supplying an ID
  • Explicitly-supplied IDs are still respected when provided
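
One possible shape for this, sketched under stated assumptions: hash a canonical JSON serialization of subject, predicate, object, and the populated qualifiers. The function name `association_id`, the `urn:assoc:` prefix, and the choice of SHA-256 are all hypothetical; the schema currently defines no prefix or hashing scheme, and the set of fields that should participate is exactly what this issue asks to be decided.

```python
import hashlib
import json


def association_id(subject, predicate, object_, qualifiers=None,
                   explicit_id=None, prefix="urn:assoc:"):
    """Return a deterministic, content-based ID for an association.

    Hypothetical helper: `prefix` is a placeholder, and qualifier values are
    assumed to be JSON-serializable (e.g. strings).
    """
    # An explicitly supplied ID always wins.
    if explicit_id is not None:
        return explicit_id

    # Only populated qualifiers participate, so unset fields do not change the ID.
    populated = {k: v for k, v in (qualifiers or {}).items() if v is not None}

    # sort_keys and fixed separators give one canonical string per content.
    payload = json.dumps(
        {
            "subject": subject,
            "predicate": predicate,
            "object": object_,
            "qualifiers": populated,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return prefix + digest


# Illustrative values; the CURIEs and qualifier names are placeholders.
first = association_id(
    "HGNC:1100", "biolink:related_to", "MONDO:0007254",
    {"subject aspect qualifier": "expression", "object direction qualifier": "increased"},
)
second = association_id(
    "HGNC:1100", "biolink:related_to", "MONDO:0007254",
    {"object direction qualifier": "increased", "subject aspect qualifier": "expression",
     "unset qualifier": None},
)
kept = association_id("HGNC:1100", "biolink:related_to", "MONDO:0007254",
                      explicit_id="my:custom-id")
```

In this sketch, `first` and `second` are equal because key order and unset qualifiers are normalized away, and `kept` returns the caller's ID unchanged.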
