Add deterministic hash-based IDs for Association classes #1708

Draft
kevinschaper wants to merge 1 commit into master from issue-1707-deterministic-association-ids

@kevinschaper kevinschaper commented Feb 18, 2026

Summary

Closes #1707

Association instances (and all ~75 subclasses) currently require a manually-supplied id, leading to non-deterministic identifiers — random UUIDs, placeholder values like id=2 (see #1507), or project-specific hashing scattered across codebases. This PR adds automatic deterministic ID generation: when no id is provided, a SHA-256 hash is computed from the association's identity-defining fields. Same inputs always produce the same ID. Explicitly-supplied IDs are preserved unchanged.

How it works

This feature sits at the intersection of three design decisions — in LinkML's generator architecture, in Pydantic v2's validation model, and in biolink's own schema structure — that together make the implementation clean and maintainable.

Layer 1: LinkML's PydanticGenerator was built to be extended

LinkML's PydanticGenerator is a @dataclass that inherits from OOCodeGenerator and LifecycleMixin. It was explicitly designed for subclassing: downstream projects can create a @dataclass subclass that inherits all parent fields and overrides specific lifecycle hooks without needing to copy-paste or monkey-patch generator internals.

The extension points we use:

  • injected_classes: a list of Python classes (as source strings or live types) that get merged into the generated module alongside the default LinkMLMeta. We inject IdStrategy, AssociationIdConfig, the two precomputed slot maps, and DeterministicIdMixin.
  • imports: an Imports container that merges with the generator's default imports (which already include pydantic.BaseModel, typing, etc.). We add hashlib, json, enum, and pydantic.model_validator.
  • after_render_template: a LifecycleMixin hook that receives the final serialized Python source as a string, after Jinja2 rendering and Black formatting. We use it to splice DeterministicIdMixin into Association's base classes, rewriting class Association(Entity): to class Association(DeterministicIdMixin, Entity):.

Because PydanticGenerator is a dataclass, our BiolinkPydanticGenerator subclass can do all its work in __post_init__ — computing slot maps from the live SchemaView, populating injected_classes and imports, and then letting the parent's serialize() handle the rest. The gen-pydantic CLI hardcodes PydanticGenerator, so the Makefile calls our subclass directly via python scripts/biolink_pydantic_generator.py instead.
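The base-class splice can be sketched as a plain regex substitution over the rendered source. This is an illustrative sketch only, not the code in this PR: the helper name splice_mixin and the sample source string are hypothetical, and the real after_render_template hook works on whatever LinkML hands it.

```python
import re


def splice_mixin(source: str, class_name: str = "Association",
                 mixin: str = "DeterministicIdMixin") -> str:
    """Prepend `mixin` to the base-class list of `class_name` in generated source."""
    pattern = re.compile(
        rf"^class {re.escape(class_name)}\((?P<bases>[^)]*)\):", re.MULTILINE
    )

    def add_mixin(match: re.Match) -> str:
        bases = [b.strip() for b in match.group("bases").split(",") if b.strip()]
        if mixin in bases:
            # Already spliced: leave the declaration untouched (idempotent).
            return match.group(0)
        return f"class {class_name}({', '.join([mixin, *bases])}):"

    return pattern.sub(add_mixin, source, count=1)


# Hypothetical fragment of generated output, for illustration only.
generated = (
    "class Entity(ConfiguredBaseModel):\n"
    "    id: str\n"
    "\n"
    "class Association(Entity):\n"
    "    subject: str\n"
)
patched = splice_mixin(generated)
```

Anchoring the pattern to the start of a line and to the exact class name keeps the substitution from touching subclasses such as GeneToDiseaseAssociation(Association).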

Layer 2: Pydantic v2's model_validator(mode='before') gives us a clean hook

The challenge: id is a required field on Association. If a user omits it, Pydantic's field validation will reject the input before any custom logic can run. Pydantic v2's model_validator(mode='before') solves this precisely:

  • It runs before any field validation or type coercion, receiving the raw input as a plain dict
  • The validator can inspect the dict, compute a deterministic ID from the other fields, and insert data["id"] — all before Pydantic checks whether id is present
  • If the user did supply an id, the validator detects it and passes through unchanged
  • Crucially, mode='before' validators are inherited by subclasses via Pydantic's MRO. Adding the validator to Association automatically covers all 104 descendant classes — GeneToDiseaseAssociation, ChemicalAffectsGeneAssociation, etc. — with zero per-class changes. Pydantic's DecoratorInfos.build() walks the MRO to collect validators from base classes and bind them to the current class.

The guard if isinstance(data, dict) is important: when models appear in Union discriminators, Pydantic may pass an already-instantiated object to the validator rather than a dict.
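A reduced version of the pattern, assuming Pydantic v2 is installed. The class HashedIdMixin and the two-field seed are hypothetical stand-ins; the real DeterministicIdMixin selects fields by strategy and is not reproduced here.

```python
import hashlib
from typing import Any

from pydantic import BaseModel, model_validator


class HashedIdMixin(BaseModel):
    """Hypothetical stand-in for DeterministicIdMixin (simplified field selection)."""

    @model_validator(mode="before")
    @classmethod
    def _fill_missing_id(cls, data: Any) -> Any:
        # Only raw dict input is touched; anything else passes through unchanged.
        if isinstance(data, dict) and not data.get("id"):
            seed = "|".join(
                [cls.__name__, str(data.get("subject")), str(data.get("object"))]
            )
            digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()
            # Copy rather than mutate the caller's dict.
            data = {**data, "id": "uuid:" + digest}
        return data


class Entity(BaseModel):
    id: str  # required, as on the generated Entity class


class Association(HashedIdMixin, Entity):
    subject: str
    object: str


class GeneToDiseaseAssociation(Association):
    pass  # inherits the validator without any per-class code


a = Association(subject="HGNC:1234", object="HP:0001234")
b = Association(subject="HGNC:1234", object="HP:0001234")
explicit = Association(id="my:custom-id", subject="HGNC:1234", object="HP:0001234")
sub = GeneToDiseaseAssociation(subject="HGNC:1234", object="HP:0001234")
```

Because the validator runs before field validation, the required id field never sees the omission, and the subclass needs no validator of its own.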

Layer 3: Biolink's slot is_a hierarchy already encodes identity semantics

The simplest approach would be to hash all fields (ALL_FIELDS strategy), and indeed this PR supports that. But the biolink schema already encodes a more sophisticated distinction between fields that define what an association asserts versus fields that describe metadata about the assertion.

Every slot on Association inherits from association slot. The qualifier slots form a clean sub-hierarchy under qualifier:

association slot
├── qualifier                              ← IDENTITY (all descendants)
│   ├── form or variant qualifier
│   │   ├── subject form or variant qualifier
│   │   └── object form or variant qualifier
│   ├── aspect qualifier / direction qualifier / context qualifier / ...
│   │   └── (subject/object variants of each)
│   ├── statement qualifier
│   │   ├── causal mechanism qualifier
│   │   ├── anatomical context qualifier
│   │   └── species context qualifier
│   ├── qualified predicate
│   ├── sex qualifier / onset qualifier / frequency qualifier
│   └── ...
│
├── publications                           ← METADATA (not under qualifier)
├── has evidence                           ← METADATA
├── knowledge source                       ← METADATA
│   ├── primary knowledge source
│   └── aggregator knowledge source
├── sources                                ← METADATA
├── subject closure / object closure       ← METADATA (denormalized)
└── ...

Every qualifier slot has qualifier as a transitive is_a ancestor. No metadata/evidence slot does. Zero exceptions across the entire schema. This means SchemaView.slot_ancestors() can programmatically classify each slot at generation time:

identity_slots = []
for slot_name in sv.class_slots("association"):
    ancestors = sv.slot_ancestors(slot_name, reflexive=True)
    if "qualifier" in ancestors:
        # This is a qualifier slot → include in identity hash
        identity_slots.append(slot_name)

This automatically adapts per subclass — each Association descendant inherits a different set of qualifier slots from its mixins and parent classes. The generator walks the hierarchy once, producing a per-class slot map with the right fields for each of the 104 classes. No hardcoded lists, no new schema annotations needed.
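The classification can be tried without loading the schema. The sketch below replaces SchemaView with a small hand-written is_a table; SLOT_PARENT, slot_ancestors, and split_slots are hypothetical names, and the table covers only a few of the slots from the tree above.

```python
# Toy is_a table: slot name -> parent slot (None at the root).
SLOT_PARENT = {
    "association slot": None,
    "qualifier": "association slot",
    "statement qualifier": "qualifier",
    "species context qualifier": "statement qualifier",
    "sex qualifier": "qualifier",
    "publications": "association slot",
    "knowledge source": "association slot",
    "primary knowledge source": "knowledge source",
}


def slot_ancestors(slot_name, reflexive=True):
    """Walk is_a links upward; a stand-in for SchemaView.slot_ancestors()."""
    ancestors = [slot_name] if reflexive else []
    parent = SLOT_PARENT[slot_name]
    while parent is not None:
        ancestors.append(parent)
        parent = SLOT_PARENT[parent]
    return ancestors


def split_slots(slot_names):
    """Partition slots into those under `qualifier` and everything else."""
    qualifier_slots, other_slots = [], []
    for name in slot_names:
        bucket = qualifier_slots if "qualifier" in slot_ancestors(name) else other_slots
        bucket.append(name)
    return sorted(qualifier_slots), sorted(other_slots)


qualifier_slots, other_slots = split_slots(
    [
        "species context qualifier",
        "sex qualifier",
        "qualifier",
        "publications",
        "primary knowledge source",
    ]
)
```

Note that reflexive=True matters: without it, the qualifier slot itself would not be classified as a qualifier.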

Putting it together

The generator precomputes two slot maps at generation time and injects them as dict literals into the generated module. Both maps have one entry per Association class (104 classes total), but they differ in which fields each entry contains:

  • _QUALIFIER_IDENTITY_SLOTS — each class maps to SPO + qualifiers + KL + AT + primary_knowledge_source (8–24 fields per class depending on qualifier count)
  • _STANDARD_IDENTITY_SLOTS — same as above plus publications (9–25 fields per class)

Generation-time vs runtime boundary. SchemaView and linkml_runtime are generation-time only dependencies — they run once when biolink_pydantic_generator.py produces pydanticmodel_v2.py, not when Association classes are instantiated. The slot maps above are emitted as plain Python dict literals in the generated module; there is no schema parsing, no YAML loading, and no linkml_runtime import at runtime. The only runtime imports the feature adds are stdlib (hashlib, json, enum) and pydantic.model_validator. This is a key design property: the generated module is self-contained and has zero additional runtime dependencies beyond what LinkML's Pydantic output already requires.
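To make the boundary concrete, here is one way a precomputed map could be turned into source text at generation time. The helper render_slot_map and the two-class map are hypothetical; the generator in this PR may format its literals differently.

```python
import pprint

# Hypothetical, heavily shortened slot map; the real maps have one entry per class.
slot_map = {
    "Association": ["subject", "predicate", "object", "qualifier"],
    "GeneToDiseaseAssociation": [
        "subject", "predicate", "object", "qualified_predicate", "qualifier",
    ],
}


def render_slot_map(name, mapping):
    """Return Python source that assigns `mapping` to `name` as a dict literal."""
    return f"{name} = {pprint.pformat(mapping, sort_dicts=True, width=88)}\n"


source = render_slot_map("_QUALIFIER_IDENTITY_SLOTS", slot_map)

# Simulate "runtime": executing the emitted text needs no schema tooling at all.
namespace = {}
exec(source, namespace)
```

The emitted text contains only literals, so importing the generated module never touches SchemaView or YAML.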

At runtime, the DeterministicIdMixin's model_validator(mode='before') checks AssociationIdConfig.strategy, looks up the appropriate slot map (or computes ALL_FIELDS/CUSTOM dynamically), builds a canonical string from the field values, and produces a SHA-256 hash. The ALL_FIELDS and CUSTOM strategies don't need precomputed maps — they resolve fields at runtime from cls.model_fields.
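The hashing step can be approximated with the standard library. This is a sketch under assumptions: the description above does not spell out the canonical string format, so the JSON encoding, the sorting of list values, and the skipping of missing fields are illustrative choices, and deterministic_id is a hypothetical name.

```python
import hashlib
import json


def deterministic_id(class_name, data, identity_fields):
    """Hash the class name plus the selected field values into a stable ID."""
    payload = {"class": class_name}  # class name acts as a discriminator
    for field in sorted(identity_fields):
        value = data.get(field)
        if value is None:
            continue  # assumption: absent fields do not contribute
        # Assumption: multivalued fields are order-insensitive, so sort them.
        payload[field] = sorted(value) if isinstance(value, list) else value
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "uuid:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


fields = ["subject", "predicate", "object", "publications"]
first = deterministic_id(
    "Association",
    {
        "subject": "HGNC:1234",
        "predicate": "biolink:related_to",
        "object": "HP:0001234",
        "publications": ["PMID:2", "PMID:1"],
        "has_evidence": ["ECO:0000304"],  # not in `fields`, so ignored
    },
    fields,
)
second = deterministic_id(
    "Association",
    {
        "publications": ["PMID:1", "PMID:2"],
        "object": "HP:0001234",
        "subject": "HGNC:1234",
        "predicate": "biolink:related_to",
    },
    fields,
)
other_class = deterministic_id(
    "GeneToDiseaseAssociation",
    {"subject": "HGNC:1234", "predicate": "biolink:related_to", "object": "HP:0001234"},
    fields,
)
```

Sorting keys and using fixed separators is what makes the string canonical: two dicts with the same content always serialize to the same bytes.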

Runtime-configurable strategies

Different downstream projects have different needs for what constitutes "the same association." Four strategies are available, configured once at process startup:

| Strategy | Fields in hash | Use case |
| --- | --- | --- |
| SPO_Q_KLAT_PKS (default) | SPO + qualifiers + KL + AT + primary_knowledge_source + publications | Matches previous community work (SPO+Q+KL+AT+pubs). Good balance of deduplication and provenance tracking. |
| QUALIFIER_BASED | SPO + qualifiers + KL + AT + primary_knowledge_source (no pubs) | Maximum deduplication: the same assertion from different papers gets one ID. |
| ALL_FIELDS | Every field on the class except id | Maximum granularity: any metadata difference produces a distinct ID. |
| CUSTOM | User-specified field list | Full control for project-specific needs. |

All strategies include the class name as a discriminator, so a GeneToDiseaseAssociation and a plain Association with identical field values get different IDs.

from biolink_model.datamodel.pydanticmodel_v2 import (
    Association, AssociationIdConfig, IdStrategy,
)

# Set once at process startup — applies to all Association instantiations
AssociationIdConfig.strategy = IdStrategy.SPO_Q_KLAT_PKS    # default
AssociationIdConfig.strategy = IdStrategy.QUALIFIER_BASED    # no publications
AssociationIdConfig.strategy = IdStrategy.ALL_FIELDS         # hash everything
AssociationIdConfig.strategy = IdStrategy.CUSTOM             # user-defined
AssociationIdConfig.custom_fields = ["subject", "predicate", "object", "publications"]

Identity field composition

For the curated strategies (QUALIFIER_BASED and SPO_Q_KLAT_PKS), each field in the hash comes from a specific source:

| Category | Fields | Source |
| --- | --- | --- |
| Core triple | subject, predicate, object | Fixed list |
| Statement modifier | negated | Fixed list |
| Required metadata | knowledge_level, agent_type | Fixed list |
| Provenance | primary_knowledge_source | Fixed list |
| All qualifiers | Varies per subclass (1–17 slots) | Discovered via slot_ancestors() at generation time |
| Publications | publications | Fixed list (SPO_Q_KLAT_PKS only) |
| Class discriminator | cls.__name__ | Built into hash string |

Auto-discovered qualifier fields per subclass

The generator automatically discovers qualifier slots for all 104 Association classes. Representative examples:

| Class | Qualifier slots | Total identity fields (default strategy) |
| --- | --- | --- |
| Association | 1 (qualifier) | 9 |
| GeneToDiseaseAssociation | 5 (object_direction_qualifier, qualified_predicate, qualifier, subject_aspect_qualifier, subject_form_or_variant_qualifier) | 13 |
| DiseaseToPhenotypicFeatureAssociation | 10 (includes onset_qualifier, sex_qualifier, frequency_qualifier, disease_context_qualifier, ...) | 18 |
| ChemicalAffectsGeneAssociation | 16 (includes anatomical_context_qualifier, causal_mechanism_qualifier, species_context_qualifier, ...) | 24 |
| GeneAffectsChemicalAssociation | 17 (adds object_derivative_qualifier) | 25 |

Usage examples

Default behavior (SPO+Q+KL+AT+pubs)

from biolink_model.datamodel.pydanticmodel_v2 import (
    Association, AssociationIdConfig, IdStrategy
)

# Just construct — no id needed, default strategy applies
assoc = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
)
print(assoc.id)
# → "uuid:a3f8b2c1d4e5f6..."  (deterministic SHA-256 hex)

# Same inputs → same ID (always)
assoc2 = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
    has_evidence=["ECO:0000304"],          # ← NOT in hash (metadata)
    aggregator_knowledge_source=["infores:aggregator"],  # ← NOT in hash
)
assert assoc.id == assoc2.id  # True! Evidence/aggregator don't affect ID

# Explicit ID still works
assoc3 = Association(
    id="my:custom-id",
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
)
assert assoc3.id == "my:custom-id"  # Preserved

Switching to ALL_FIELDS (hash everything)

# Set once at process startup
AssociationIdConfig.strategy = IdStrategy.ALL_FIELDS

assoc = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
    has_evidence=["ECO:0000304"],
    aggregator_knowledge_source=["infores:aggregator"],
)

# Now ANY field change → different ID, including evidence
assoc2 = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
    has_evidence=["ECO:0000501"],          # different evidence
    aggregator_knowledge_source=["infores:aggregator"],
)
assert assoc.id != assoc2.id  # Different — everything is in the hash

Custom field set

# Minimal SPO-only hashing
AssociationIdConfig.strategy = IdStrategy.CUSTOM
AssociationIdConfig.custom_fields = ["subject", "predicate", "object"]

assoc = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
)
assoc2 = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="knowledge_assertion",  # different KL
    agent_type="manual_agent",              # different AT
    primary_knowledge_source="infores:other",  # different source
)
assert assoc.id == assoc2.id  # Same — only SPO matters

Design tradeoffs

| Concern | Analysis |
| --- | --- |
| Hash format | Full SHA-256 hex (64 chars) with uuid: prefix. Keeping the full digest rather than truncating it makes accidental collisions negligible. |
| Module-level global state | AssociationIdConfig is process-global. Fine for most pipelines (set once at startup), but not suitable if you need two strategies simultaneously in the same process. |
| Thread safety | If multiple threads use different strategies, the global config creates a race. A future enhancement could use Pydantic v2's model_validate(data, context={"strategy": ...}), but that only works with model_validate(), not the Association(...) constructor. |
| "Hash everything" reproducibility | If the schema adds a new slot in a future release, ALL_FIELDS hashes change for the same data. The curated presets (SPO_Q_KLAT_PKS, QUALIFIER_BASED) are version-stable. |
| Default strategy | SPO_Q_KLAT_PKS is the default because it matches previous community work. Projects that want maximum deduplication safety can switch to QUALIFIER_BASED; projects that want maximum granularity can use ALL_FIELDS. |
| SHA-256 vs UUID5 | SHA-256 was chosen over UUID5 (RFC 4122) because UUID5 is internally SHA-1 (weaker). The uuid: prefix provides namespace clarity without constraining the hash algorithm. |

Future work: unique_keys as the authoritative source

LinkML supports unique_keys — a schema-level declaration that a tuple of slots uniquely identifies a class instance (mapped to owl:hasKey in OWL). For example:

GeneToDiseaseAssociation:
  unique_keys:
    identity:
      unique_key_slots:
        - subject
        - predicate
        - object
        - subject_aspect_qualifier
        - subject_form_or_variant_qualifier
        - object_direction_qualifier
        - qualified_predicate

Biolink currently has no unique_keys declarations on any Association class. This PR uses the slot is_a hierarchy as a pragmatic alternative — it already encodes the right information and requires zero new annotations across 104 classes.

However, unique_keys would be the more principled long-term approach:

  • Schema-authoritative: The identity semantics would live in biolink-model.yaml itself, not in generator logic. Other generators (JSON Schema, SQL DDL, SHACL) could consume the same declarations.
  • Per-class precision: Some subclasses may want identity definitions that don't follow the qualifier hierarchy exactly. unique_keys allows per-class overrides.
  • Validation beyond ID generation: unique_keys can drive uniqueness constraints in databases, SPARQL shapes, and other downstream artifacts — not just Pydantic ID hashing.

A natural evolution would be to add unique_keys declarations to Association subclasses over time, and have the generator prefer unique_keys where declared, falling back to the slot is_a hierarchy discovery for classes that don't have them yet. The runtime strategies (ALL_FIELDS, CUSTOM) would remain available regardless.
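The preference order described here could look roughly like the following. Everything in this sketch is hypothetical: the two dicts, their contents, and resolve_identity_slots do not exist in this PR, and a real implementation would read unique_keys from the schema rather than from a literal.

```python
# Hypothetical: slots declared via unique_keys in the schema, keyed by class name.
DECLARED_UNIQUE_KEYS = {
    "GeneToDiseaseAssociation": [
        "subject", "predicate", "object", "qualified_predicate",
    ],
}

# Hypothetical: slots discovered from the slot is_a hierarchy (the current approach).
DISCOVERED_SLOTS = {
    "Association": ["subject", "predicate", "object", "qualifier"],
    "GeneToDiseaseAssociation": [
        "subject", "predicate", "object", "qualifier", "qualified_predicate",
    ],
}


def resolve_identity_slots(class_name):
    """Prefer declared unique_key_slots; otherwise fall back to discovery."""
    declared = DECLARED_UNIQUE_KEYS.get(class_name)
    if declared:
        return list(declared)
    return list(DISCOVERED_SLOTS[class_name])


declared_case = resolve_identity_slots("GeneToDiseaseAssociation")
fallback_case = resolve_identity_slots("Association")
```

Since the lookup happens at generation time, adding unique_keys to a class later would change only the emitted slot map, not the runtime code.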

Files changed

| File | Change |
| --- | --- |
| scripts/biolink_pydantic_generator.py | New: custom generator that discovers qualifier slots via SchemaView.slot_ancestors() and injects IdStrategy, AssociationIdConfig, precomputed slot maps, and DeterministicIdMixin |
| Makefile | 1-line edit: use custom generator instead of gen-pydantic |
| src/biolink_model/datamodel/pydanticmodel_v2.py | Regenerated: gains injected classes and modified Association(DeterministicIdMixin, Entity) declaration |
| tests/test_deterministic_ids.py | New: 12 tests covering core behavior and all 4 strategies |

When constructing Association instances (and all ~75 subclasses) without
an explicit id, a deterministic SHA-256 hash is now generated from
identity-defining fields. Same inputs always produce the same ID,
solving reproducibility, deduplication, and data integration issues.

Uses the slot is_a hierarchy to automatically discover qualifier fields
per subclass — no hardcoded lists or schema annotations needed.

Supports four runtime-configurable strategies:
- SPO_Q_KLAT_PKS (default): SPO + qualifiers + KL + AT + PKS + pubs
- QUALIFIER_BASED: same but without publications
- ALL_FIELDS: every field except id
- CUSTOM: user-specified field list

Closes #1707

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kevinschaper kevinschaper marked this pull request as draft February 18, 2026 01:46