Add deterministic hash-based IDs for Association classes by kevinschaper · Pull Request #1708 · biolink/biolink-model

kevinschaper · 2026-02-18T00:31:51Z

Summary

Association instances (and all ~75 subclasses) currently require a manually-supplied id, leading to non-deterministic identifiers — random UUIDs, placeholder values like id=2 (see #1507), or project-specific hashing scattered across codebases. This PR adds automatic deterministic ID generation: when no id is provided, a SHA-256 hash is computed from the association's identity-defining fields. Same inputs always produce the same ID. Explicitly-supplied IDs are preserved unchanged.

How it works

This feature sits at the intersection of three design decisions — in LinkML's generator architecture, in Pydantic v2's validation model, and in biolink's own schema structure — that together make the implementation clean and maintainable.

Layer 1: LinkML's PydanticGenerator was built to be extended

LinkML's PydanticGenerator is a @dataclass that inherits from OOCodeGenerator and LifecycleMixin. It was explicitly designed for subclassing: downstream projects can create a @dataclass subclass that inherits all parent fields and overrides specific lifecycle hooks without needing to copy-paste or monkey-patch generator internals.

The extension points we use:

injected_classes — a list of Python classes (as source strings or live types) that get merged into the generated module alongside the default LinkMLMeta. We inject IdStrategy, AssociationIdConfig, the two precomputed slot maps, and DeterministicIdMixin.
imports — an Imports container that merges with the generator's default imports (which already include pydantic.BaseModel, typing, etc.). We add hashlib, json, enum, and pydantic.model_validator.
after_render_template — a LifecycleMixin hook that receives the final serialized Python source as a string, after Jinja2 rendering and Black formatting. We use it to splice DeterministicIdMixin into Association's base classes: class Association(Entity): → class Association(DeterministicIdMixin, Entity):.

Because PydanticGenerator is a dataclass, our BiolinkPydanticGenerator subclass can do its setup in __post_init__ (computing slot maps from the live SchemaView, populating injected_classes and imports) and then let the parent's serialize() handle the rest, with after_render_template applying the final splice. Because the gen-pydantic CLI hardcodes PydanticGenerator, the Makefile calls our subclass directly via python scripts/biolink_pydantic_generator.py instead.
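
The base-class splice performed in after_render_template amounts to a string rewrite of the rendered module. A minimal sketch of just that step, without importing LinkML; the function name splice_mixin and the fail-loudly check are illustrative assumptions, not the PR's actual hook body:

```python
def splice_mixin(source: str) -> str:
    """Insert DeterministicIdMixin as the first base class of Association."""
    old = "class Association(Entity):"
    new = "class Association(DeterministicIdMixin, Entity):"
    if old not in source:
        # Fail loudly if the upstream template ever changes the class header.
        raise ValueError("Association class header not found in rendered source")
    # Replace only the first occurrence; subclasses pick up the mixin via the MRO.
    return source.replace(old, new, 1)


rendered = (
    "class Entity(BaseModel):\n    id: str\n\n\n"
    "class Association(Entity):\n    subject: str\n"
)
patched = splice_mixin(rendered)
```

Raising when the header is missing is a design choice in this sketch: a silent no-op would produce a module that looks valid but never generates IDs.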

Layer 2: Pydantic v2's `model_validator(mode='before')` gives us a clean hook

The challenge: id is a required field on Association. If a user omits it, Pydantic's field validation will reject the input before any custom logic can run. Pydantic v2's model_validator(mode='before') solves this precisely:

It runs before any field validation or type coercion, receiving the raw input as a plain dict
The validator can inspect the dict, compute a deterministic ID from the other fields, and insert data["id"] — all before Pydantic checks whether id is present
If the user did supply an id, the validator detects it and passes through unchanged

Crucially, mode='before' validators are inherited by subclasses via Pydantic's MRO. Adding the validator to Association automatically covers all 104 descendant classes (GeneToDiseaseAssociation, ChemicalAffectsGeneAssociation, etc.) with zero per-class changes. Pydantic's DecoratorInfos.build() walks the MRO to collect validators from base classes and bind them to the current class.

The guard if isinstance(data, dict) is important: when models appear in Union discriminators, Pydantic may pass an already-instantiated object to the validator rather than a dict.
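
The hook described in this layer can be sketched with a small stand-in model. This assumes Pydantic v2 is installed; the class bodies are heavily simplified, and hashing only the core triple plus the class name is an assumption made for brevity, not the PR's field selection:

```python
import hashlib
import json
from typing import Any

from pydantic import BaseModel, model_validator


class DeterministicIdMixin(BaseModel):
    """Illustrative stand-in; the real mixin consults AssociationIdConfig."""

    @model_validator(mode="before")
    @classmethod
    def _fill_missing_id(cls, data: Any) -> Any:
        # Non-dict input (e.g. an existing instance) passes through untouched.
        if isinstance(data, dict) and data.get("id") is None:
            # Assumption: hash only the core triple plus the class name.
            key = {f: data.get(f) for f in ("subject", "predicate", "object")}
            key["class"] = cls.__name__
            digest = hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()
            data = {**data, "id": f"uuid:{digest}"}
        return data


class Entity(BaseModel):
    id: str  # required, as on the real Association


class Association(DeterministicIdMixin, Entity):
    subject: str
    predicate: str
    object: str


class GeneToDiseaseAssociation(Association):
    """Inherits the validator with no per-class code."""


triple = dict(subject="HGNC:1234", predicate="biolink:related_to", object="HP:0001234")
a = Association(**triple)                      # id filled in before field validation
b = GeneToDiseaseAssociation(**triple)         # subclass is covered automatically
c = Association(id="my:custom-id", **triple)   # explicit id is preserved
```

The validator returns a new dict rather than mutating its input, so the caller's data is left as it was passed in.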

Layer 3: Biolink's slot `is_a` hierarchy already encodes identity semantics

The simplest approach would be to hash all fields (ALL_FIELDS strategy), and indeed this PR supports that. But the biolink schema already encodes a more sophisticated distinction between fields that define what an association asserts versus fields that describe metadata about the assertion.

Every slot on Association inherits from association slot. The qualifier slots form a clean sub-hierarchy under qualifier:

association slot
├── qualifier                              ← IDENTITY (all descendants)
│   ├── form or variant qualifier
│   │   ├── subject form or variant qualifier
│   │   └── object form or variant qualifier
│   ├── aspect qualifier / direction qualifier / context qualifier / ...
│   │   └── (subject/object variants of each)
│   ├── statement qualifier
│   │   ├── causal mechanism qualifier
│   │   ├── anatomical context qualifier
│   │   └── species context qualifier
│   ├── qualified predicate
│   ├── sex qualifier / onset qualifier / frequency qualifier
│   └── ...
│
├── publications                           ← METADATA (not under qualifier)
├── has evidence                           ← METADATA
├── knowledge source                       ← METADATA
│   ├── primary knowledge source
│   └── aggregator knowledge source
├── sources                                ← METADATA
├── subject closure / object closure       ← METADATA (denormalized)
└── ...

Every qualifier slot has qualifier as a transitive is_a ancestor. No metadata/evidence slot does. Zero exceptions across the entire schema. This means SchemaView.slot_ancestors() can programmatically classify each slot at generation time:

qualifier_slots = []
for slot_name in sv.class_slots("association"):  # sv: a SchemaView over the biolink schema
    ancestors = sv.slot_ancestors(slot_name, reflexive=True)
    if "qualifier" in ancestors:
        # This is a qualifier slot → include in identity hash
        qualifier_slots.append(slot_name)

This automatically adapts per subclass — each Association descendant inherits a different set of qualifier slots from its mixins and parent classes. The generator walks the hierarchy once, producing a per-class slot map with the right fields for each of the 104 classes. No hardcoded lists, no new schema annotations needed.
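
The classification rule can be tried without LinkML by walking a hand-written parent map. In this sketch the IS_A dict is a small stand-in that mirrors part of the tree above; the helper names are hypothetical and do not come from the PR:

```python
# Toy is_a parent map (child -> parent); a hand-written stand-in for the schema.
IS_A = {
    "qualifier": "association slot",
    "statement qualifier": "qualifier",
    "species context qualifier": "statement qualifier",
    "qualified predicate": "qualifier",
    "publications": "association slot",
    "knowledge source": "association slot",
    "primary knowledge source": "knowledge source",
}


def ancestors(slot: str) -> list[str]:
    """Return the slot itself followed by its transitive is_a ancestors."""
    chain = [slot]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def is_identity_slot(slot: str) -> bool:
    # Reflexive check: "qualifier" itself counts, like reflexive=True above.
    return "qualifier" in ancestors(slot)


identity = sorted(s for s in IS_A if is_identity_slot(s))
metadata = sorted(s for s in IS_A if not is_identity_slot(s))
```

Here identity holds the four slots under qualifier and metadata holds the provenance slots, matching the split drawn in the tree.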

Putting it together

The generator precomputes two slot maps at generation time and injects them as dict literals into the generated module. Both maps have one entry per Association class (104 classes total), but they differ in which fields each entry contains:

_QUALIFIER_IDENTITY_SLOTS: each class maps to SPO + negated + qualifiers + KL + AT + primary_knowledge_source (8–24 fields per class depending on qualifier count)
_STANDARD_IDENTITY_SLOTS: same as above plus publications (9–25 fields per class)

Generation-time vs runtime boundary. SchemaView and linkml_runtime are generation-time only dependencies — they run once when biolink_pydantic_generator.py produces pydanticmodel_v2.py, not when Association classes are instantiated. The slot maps above are emitted as plain Python dict literals in the generated module; there is no schema parsing, no YAML loading, and no linkml_runtime import at runtime. The only runtime imports the feature adds are stdlib (hashlib, json, enum) and pydantic.model_validator. This is a key design property: the generated module is self-contained and has zero additional runtime dependencies beyond what LinkML's Pydantic output already requires.

At runtime, the DeterministicIdMixin's model_validator(mode='before') checks AssociationIdConfig.strategy, looks up the appropriate slot map (or computes ALL_FIELDS/CUSTOM dynamically), builds a canonical string from the field values, and produces a SHA-256 hash. The ALL_FIELDS and CUSTOM strategies don't need precomputed maps — they resolve fields at runtime from cls.model_fields.
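
The runtime step (look up the slot list, build a canonical string, hash it) might look roughly like the following stdlib-only sketch. The JSON serialization with sorted keys, the abbreviated _SLOTS map, and the compute_id helper are all assumptions for illustration; the PR's actual canonical string format is not shown in this description:

```python
import hashlib
import json

# Hypothetical, abbreviated slot map; the generated module holds one entry per class.
_SLOTS = {"Association": ["subject", "predicate", "object", "negated", "publications"]}


def compute_id(class_name: str, data: dict, slots: list[str]) -> str:
    """Hash the class name plus the selected fields into a uuid:-prefixed hex ID."""
    # Fields outside `slots` are ignored; an omitted field and an explicit None hash alike.
    selected = {name: data.get(name) for name in slots}
    # sort_keys gives a stable key order; this serialization is an assumption.
    canonical = json.dumps({"class": class_name, "fields": selected}, sort_keys=True)
    return "uuid:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = {"subject": "HGNC:1234", "predicate": "biolink:related_to", "object": "HP:0001234"}
with_evidence = {**record, "has_evidence": ["ECO:0000304"]}

id_plain = compute_id("Association", record, _SLOTS["Association"])
id_evidence = compute_id("Association", with_evidence, _SLOTS["Association"])
id_other_class = compute_id("GeneToDiseaseAssociation", record, _SLOTS["Association"])
```

In this sketch id_plain equals id_evidence because has_evidence is not in the slot list, while id_other_class differs because the class name is part of the hashed payload.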

Runtime-configurable strategies

Different downstream projects have different needs for what constitutes "the same association." Four strategies are available, configured once at process startup:

Strategy	Fields in hash	Use case
`SPO_Q_KLAT_PKS` (default)	SPO + qualifiers + KL + AT + primary_knowledge_source + publications	Matches previous community work (SPO+Q+KL+AT+pubs). Good balance of deduplication and provenance tracking.
`QUALIFIER_BASED`	SPO + qualifiers + KL + AT + primary_knowledge_source (no pubs)	Maximum deduplication — same assertion from different papers gets one ID.
`ALL_FIELDS`	Every field on the class except `id`	Maximum granularity — any metadata difference produces a distinct ID.
`CUSTOM`	User-specified field list	Full control for project-specific needs.

All strategies include the class name as a discriminator, so a GeneToDiseaseAssociation and a plain Association with identical field values get different IDs.

from biolink_model.datamodel.pydanticmodel_v2 import (
    Association, AssociationIdConfig, IdStrategy,
)

# Set once at process startup; applies to all Association instantiations.
# Pick ONE of the following strategy assignments (the last one executed wins):
AssociationIdConfig.strategy = IdStrategy.SPO_Q_KLAT_PKS    # default
AssociationIdConfig.strategy = IdStrategy.QUALIFIER_BASED    # no publications
AssociationIdConfig.strategy = IdStrategy.ALL_FIELDS         # hash everything
AssociationIdConfig.strategy = IdStrategy.CUSTOM             # user-defined
# CUSTOM additionally needs the list of fields to hash:
AssociationIdConfig.custom_fields = ["subject", "predicate", "object", "publications"]
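
How the four strategies could map to a field list is sketched below. The enum member names come from this PR, but the string values, the resolve_fields helper, and the toy maps are placeholders and may differ from the generated module:

```python
from enum import Enum


class IdStrategy(Enum):
    # Member names come from the PR; the string values are placeholders.
    SPO_Q_KLAT_PKS = "spo_q_klat_pks"
    QUALIFIER_BASED = "qualifier_based"
    ALL_FIELDS = "all_fields"
    CUSTOM = "custom"


def resolve_fields(strategy, class_name, model_fields, standard, qualifier, custom):
    """Return the field names that feed the hash for one class."""
    if strategy is IdStrategy.SPO_Q_KLAT_PKS:
        return list(standard[class_name])    # precomputed map lookup
    if strategy is IdStrategy.QUALIFIER_BASED:
        return list(qualifier[class_name])   # precomputed map lookup
    if strategy is IdStrategy.ALL_FIELDS:
        return sorted(f for f in model_fields if f != "id")  # resolved at runtime
    return list(custom)                      # CUSTOM: caller-supplied list


qualifier_map = {"Association": ["subject", "predicate", "object", "qualifier"]}
standard_map = {"Association": qualifier_map["Association"] + ["publications"]}
fields = ["id", "subject", "predicate", "object", "qualifier", "publications", "has_evidence"]

chosen = {
    s.name: resolve_fields(s, "Association", fields, standard_map, qualifier_map, ["subject"])
    for s in IdStrategy
}
```

Only the two curated strategies touch the precomputed maps; ALL_FIELDS and CUSTOM work from whatever the class or the caller provides.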

Identity field composition

For the curated strategies (QUALIFIER_BASED and SPO_Q_KLAT_PKS), each field in the hash comes from a specific source:

Category	Fields	Source
Core triple	`subject`, `predicate`, `object`	Fixed list
Statement modifier	`negated`	Fixed list
Required metadata	`knowledge_level`, `agent_type`	Fixed list
Provenance	`primary_knowledge_source`	Fixed list
All qualifiers	Varies per subclass (1–17 slots)	Discovered via `slot_ancestors()` at generation time
Publications	`publications`	Fixed list (`SPO_Q_KLAT_PKS` only)
Class discriminator	`cls.__name__`	Built into hash string

Auto-discovered qualifier fields per subclass

The generator automatically discovers qualifier slots for all 104 Association classes. Representative examples:

Class	Qualifier slots	Total identity fields (default strategy)
`Association`	1 (`qualifier`)	9
`GeneToDiseaseAssociation`	5 (`object_direction_qualifier`, `qualified_predicate`, `qualifier`, `subject_aspect_qualifier`, `subject_form_or_variant_qualifier`)	13
`DiseaseToPhenotypicFeatureAssociation`	10 (includes `onset_qualifier`, `sex_qualifier`, `frequency_qualifier`, `disease_context_qualifier`, ...)	18
`ChemicalAffectsGeneAssociation`	16 (includes `anatomical_context_qualifier`, `causal_mechanism_qualifier`, `species_context_qualifier`, ...)	24
`GeneAffectsChemicalAssociation`	17 (adds `object_derivative_qualifier`)	25
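
The totals in the last column follow from the composition table: seven fixed fields, plus the per-class qualifier count, plus publications under the default strategy. A quick arithmetic check using the counts listed above:

```python
# The seven fixed-list fields from the "Identity field composition" table.
FIXED = [
    "subject", "predicate", "object", "negated",
    "knowledge_level", "agent_type", "primary_knowledge_source",
]

# Qualifier slot counts as reported in the table above.
qualifier_counts = {
    "Association": 1,
    "GeneToDiseaseAssociation": 5,
    "DiseaseToPhenotypicFeatureAssociation": 10,
    "ChemicalAffectsGeneAssociation": 16,
    "GeneAffectsChemicalAssociation": 17,
}

# Default strategy: fixed fields + qualifiers + publications.
totals = {name: len(FIXED) + n + 1 for name, n in qualifier_counts.items()}
```

The resulting totals are 9, 13, 18, 24, and 25, in table order.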

Usage examples

Default behavior (SPO+Q+KL+AT+pubs)

from biolink_model.datamodel.pydanticmodel_v2 import (
    Association, AssociationIdConfig, IdStrategy
)

# Just construct — no id needed, default strategy applies
assoc = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
)
print(assoc.id)
# → "uuid:a3f8b2c1d4e5f6..."  (deterministic SHA-256 hex)

# Same inputs → same ID (always)
assoc2 = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
    has_evidence=["ECO:0000304"],          # ← NOT in hash (metadata)
    aggregator_knowledge_source=["infores:aggregator"],  # ← NOT in hash
)
assert assoc.id == assoc2.id  # True! Evidence/aggregator don't affect ID

# Explicit ID still works
assoc3 = Association(
    id="my:custom-id",
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
)
assert assoc3.id == "my:custom-id"  # Preserved

Switching to ALL_FIELDS (hash everything)

# Set once at process startup
AssociationIdConfig.strategy = IdStrategy.ALL_FIELDS

assoc = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
    has_evidence=["ECO:0000304"],
    aggregator_knowledge_source=["infores:aggregator"],
)

# Now ANY field change → different ID, including evidence
assoc2 = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
    publications=["PMID:12345"],
    has_evidence=["ECO:0000501"],          # different evidence
    aggregator_knowledge_source=["infores:aggregator"],
)
assert assoc.id != assoc2.id  # Different — everything is in the hash

Custom field set

# Minimal SPO-only hashing
AssociationIdConfig.strategy = IdStrategy.CUSTOM
AssociationIdConfig.custom_fields = ["subject", "predicate", "object"]

assoc = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="not_provided",
    agent_type="not_provided",
    primary_knowledge_source="infores:monarchinitiative",
)
assoc2 = Association(
    subject="HGNC:1234",
    predicate="biolink:related_to",
    object="HP:0001234",
    knowledge_level="knowledge_assertion",  # different KL
    agent_type="manual_agent",              # different AT
    primary_knowledge_source="infores:other",  # different source
)
assert assoc.id == assoc2.id  # Same — only SPO matters

Design tradeoffs

Concern	Analysis
Hash format	Full SHA-256 hex (64 chars) with `uuid:` prefix. Full length avoids collision risk.
Module-level global state	`AssociationIdConfig` is process-global. Fine for most pipelines (set once at startup), but not suitable if you need two strategies simultaneously in the same process.
Thread safety	If multiple threads use different strategies, the global config creates a race. A future enhancement could use Pydantic v2's `model_validate(data, context={"strategy": ...})` — but that only works with `model_validate()`, not the `Association(...)` constructor.
"Hash everything" reproducibility	If the schema adds a new slot in a future release, `ALL_FIELDS` hashes change for the same data. The curated presets (`SPO_Q_KLAT_PKS`, `QUALIFIER_BASED`) are version-stable.
Default strategy	`SPO_Q_KLAT_PKS` is the default because it matches previous community work. Projects that want maximum deduplication safety can switch to `QUALIFIER_BASED`; projects that want maximum granularity can use `ALL_FIELDS`.
SHA-256 vs UUID5	SHA-256 was chosen over UUID5 (RFC 4122) because UUID5 is internally SHA-1 (weaker). The `uuid:` prefix provides namespace clarity without constraining the hash algorithm.
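
To make the SHA-256 vs UUID5 row concrete, the two formats can be compared with the standard library. The canonical string and the choice of uuid.NAMESPACE_URL below are made up for illustration:

```python
import hashlib
import uuid

# Made-up canonical string; the PR's real serialization is not shown here.
canonical = "Association|HGNC:1234|biolink:related_to|HP:0001234"

# Approach described in this PR: full SHA-256 hex digest behind a uuid: prefix.
sha_id = "uuid:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Alternative that was not chosen: UUID5, a SHA-1 digest truncated to a 128-bit UUID.
uuid5_id = str(uuid.uuid5(uuid.NAMESPACE_URL, canonical))
```

The SHA-256 form carries 64 hex characters after the prefix, while the UUID5 form is the usual 36-character string; both are deterministic for the same input.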

Future work: `unique_keys` as the authoritative source

LinkML supports unique_keys — a schema-level declaration that a tuple of slots uniquely identifies a class instance (mapped to owl:hasKey in OWL). For example:

GeneToDiseaseAssociation:
  unique_keys:
    identity:
      unique_key_slots:
        - subject
        - predicate
        - object
        - subject_aspect_qualifier
        - subject_form_or_variant_qualifier
        - object_direction_qualifier
        - qualified_predicate

Biolink currently has no unique_keys declarations on any Association class. This PR uses the slot is_a hierarchy as a pragmatic alternative — it already encodes the right information and requires zero new annotations across 104 classes.

However, unique_keys would be the more principled long-term approach:

Schema-authoritative: The identity semantics would live in biolink-model.yaml itself, not in generator logic. Other generators (JSON Schema, SQL DDL, SHACL) could consume the same declarations.
Per-class precision: Some subclasses may want identity definitions that don't follow the qualifier hierarchy exactly. unique_keys allows per-class overrides.
Validation beyond ID generation: unique_keys can drive uniqueness constraints in databases, SPARQL shapes, and other downstream artifacts — not just Pydantic ID hashing.

A natural evolution would be to add unique_keys declarations to Association subclasses over time, and have the generator prefer unique_keys where declared, falling back to the slot is_a hierarchy discovery for classes that don't have them yet. The runtime strategies (ALL_FIELDS, CUSTOM) would remain available regardless.
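
The proposed preference order could be as small as the following sketch. Nothing here exists in the PR; identity_slots and both input dicts are hypothetical, and the slot lists are shortened:

```python
def identity_slots(class_name, unique_keys, discovered):
    """Prefer declared unique_key_slots; otherwise use is_a-based discovery."""
    declared = unique_keys.get(class_name)
    return list(declared) if declared else list(discovered[class_name])


# Hypothetical inputs: one class has a declaration, the other does not.
unique_keys = {
    "GeneToDiseaseAssociation": ["subject", "predicate", "object", "qualified_predicate"],
}
discovered = {
    "GeneToDiseaseAssociation": ["subject", "predicate", "object", "qualifier"],
    "Association": ["subject", "predicate", "object", "qualifier"],
}

from_schema = identity_slots("GeneToDiseaseAssociation", unique_keys, discovered)
from_hierarchy = identity_slots("Association", unique_keys, discovered)
```

Classes could then migrate to unique_keys one at a time without changing IDs for classes that have not been annotated yet.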

Files changed

File	Change
`scripts/biolink_pydantic_generator.py`	New — Custom generator that discovers qualifier slots via `SchemaView.slot_ancestors()` and injects `IdStrategy`, `AssociationIdConfig`, precomputed slot maps, and `DeterministicIdMixin`
`Makefile`	1-line edit — Use custom generator instead of `gen-pydantic`
`src/biolink_model/datamodel/pydanticmodel_v2.py`	Regenerated — Gains injected classes and modified `Association(DeterministicIdMixin, Entity)` declaration
`tests/test_deterministic_ids.py`	New — 12 tests covering core behavior and all 4 strategies

When constructing Association instances (and all ~75 subclasses) without an explicit id, a deterministic SHA-256 hash is now generated from identity-defining fields. Same inputs always produce the same ID, solving reproducibility, deduplication, and data integration issues.

Uses the slot is_a hierarchy to automatically discover qualifier fields per subclass; no hardcoded lists or schema annotations needed.

Supports four runtime-configurable strategies:
- SPO_Q_KLAT_PKS (default): SPO + qualifiers + KL + AT + PKS + pubs
- QUALIFIER_BASED: same but without publications
- ALL_FIELDS: every field except id
- CUSTOM: user-specified field list

Closes #1707

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sierra-moxon approved these changes Feb 18, 2026

kevinschaper marked this pull request as draft February 18, 2026 01:46

kevinschaper wants to merge 1 commit into master from issue-1707-deterministic-association-ids
