Add deterministic hash-based IDs for Association classes #1708
Draft
kevinschaper wants to merge 1 commit into master
When constructing Association instances (and all ~75 subclasses) without an explicit id, a deterministic SHA-256 hash is now generated from identity-defining fields. Same inputs always produce the same ID, solving reproducibility, deduplication, and data integration issues.

Uses the slot is_a hierarchy to automatically discover qualifier fields per subclass, with no hardcoded lists or schema annotations needed.

Supports four runtime-configurable strategies:
- SPO_Q_KLAT_PKS (default): SPO + qualifiers + KL + AT + PKS + pubs
- QUALIFIER_BASED: same but without publications
- ALL_FIELDS: every field except id
- CUSTOM: user-specified field list

Closes #1707

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sierra-moxon approved these changes on Feb 18, 2026
Summary
Closes #1707
`Association` instances (and all ~75 subclasses) currently require a manually supplied `id`, leading to non-deterministic identifiers: random UUIDs, placeholder values like `id=2` (see #1507), or project-specific hashing scattered across codebases. This PR adds automatic deterministic ID generation: when no `id` is provided, a SHA-256 hash is computed from the association's identity-defining fields. Same inputs always produce the same ID. Explicitly supplied IDs are preserved unchanged.

How it works
This feature sits at the intersection of three design decisions — in LinkML's generator architecture, in Pydantic v2's validation model, and in biolink's own schema structure — that together make the implementation clean and maintainable.
Layer 1: LinkML's PydanticGenerator was built to be extended
LinkML's `PydanticGenerator` is a `@dataclass` that inherits from `OOCodeGenerator` and `LifecycleMixin`. It was explicitly designed for subclassing: downstream projects can create a `@dataclass` subclass that inherits all parent fields and overrides specific lifecycle hooks without needing to copy-paste or monkey-patch generator internals.

The extension points we use:

- `injected_classes`: a list of Python classes (as source strings or live types) that get merged into the generated module alongside the default `LinkMLMeta`. We inject `IdStrategy`, `AssociationIdConfig`, the two precomputed slot maps, and `DeterministicIdMixin`.
- `imports`: an `Imports` container that merges with the generator's default imports (which already include `pydantic.BaseModel`, `typing`, etc.). We add `hashlib`, `json`, `enum`, and `pydantic.model_validator`.
- `after_render_template`: a `LifecycleMixin` hook that receives the final serialized Python source as a string, after Jinja2 rendering and Black formatting. We use it to splice `DeterministicIdMixin` into `Association`'s base classes: `class Association(Entity):` → `class Association(DeterministicIdMixin, Entity):`.

Because `PydanticGenerator` is a dataclass, our `BiolinkPydanticGenerator` subclass can do all its work in `__post_init__` (computing slot maps from the live `SchemaView`, populating `injected_classes` and `imports`) and then let the parent's `serialize()` handle the rest. The `gen-pydantic` CLI hardcodes `PydanticGenerator`, so the Makefile calls our subclass directly via `python scripts/biolink_pydantic_generator.py` instead.

Layer 2: Pydantic v2's `model_validator(mode='before')` gives us a clean hook

The challenge: `id` is a required field on `Association`. If a user omits it, Pydantic's field validation will reject the input before any custom logic can run. Pydantic v2's `model_validator(mode='before')` solves this precisely:

- The validator runs before field validation and receives the raw input, typically a `dict`.
- It computes the hash and writes it to `data["id"]`, all before Pydantic checks whether `id` is present.
- If the caller supplied an explicit `id`, the validator detects it and passes the data through unchanged.

`mode='before'` validators are inherited by subclasses via Pydantic's MRO. Adding the validator to `Association` automatically covers all 104 descendant classes (`GeneToDiseaseAssociation`, `ChemicalAffectsGeneAssociation`, etc.) with zero per-class changes. Pydantic's `DecoratorInfos.build()` walks the MRO to collect validators from base classes and bind them to the current class.

The guard `if isinstance(data, dict)` is important: when models appear in `Union` discriminators, Pydantic may pass an already-instantiated object to the validator rather than a dict.

Layer 3: Biolink's slot `is_a` hierarchy already encodes identity semantics

The simplest approach would be to hash all fields (`ALL_FIELDS` strategy), and indeed this PR supports that. But the biolink schema already encodes a more sophisticated distinction between fields that define what an association asserts versus fields that describe metadata about the assertion.

Every slot on `Association` inherits from `association slot`. The qualifier slots form a clean sub-hierarchy under `qualifier`.

Every qualifier slot has `qualifier` as a transitive `is_a` ancestor. No metadata/evidence slot does. Zero exceptions across the entire schema. This means `SchemaView.slot_ancestors()` can programmatically classify each slot at generation time.

This automatically adapts per subclass: each Association descendant inherits a different set of qualifier slots from its mixins and parent classes. The generator walks the hierarchy once, producing a per-class slot map with the right fields for each of the 104 classes. No hardcoded lists, no new schema annotations needed.
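The ancestor-based classification can be sketched with a toy hierarchy. This is an illustration only: `SLOT_PARENTS` is a hand-written stand-in for the schema, and the local `slot_ancestors` function is a simplified substitute for `SchemaView.slot_ancestors()`, so the sketch runs without `linkml_runtime`. The slot entries are illustrative and do not reproduce the actual biolink hierarchy.

```python
# Toy slot hierarchy: slot name -> is_a parent (None marks the root).
# These entries are illustrative stand-ins, not the actual biolink schema.
SLOT_PARENTS = {
    "association slot": None,
    "subject": "association slot",
    "qualifier": "association slot",
    "aspect qualifier": "qualifier",
    "subject aspect qualifier": "aspect qualifier",
    "sex qualifier": "qualifier",
    "publications": "association slot",
}


def slot_ancestors(slot: str) -> list[str]:
    """Return the slot itself followed by its transitive is_a ancestors."""
    chain = []
    current = slot
    while current is not None:
        chain.append(current)
        current = SLOT_PARENTS[current]
    return chain


def is_qualifier(slot: str) -> bool:
    """A slot counts as a qualifier if 'qualifier' is among its ancestors."""
    return "qualifier" in slot_ancestors(slot)


# Hypothetical slot list for one Association subclass.
class_slots = ["subject", "subject aspect qualifier", "sex qualifier", "publications"]
qualifier_slots = sorted(s for s in class_slots if is_qualifier(s))
```

Running the same filter over each class's slot list is what would yield a per-class map; metadata slots such as `publications` fall outside it because `qualifier` never appears among their ancestors.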
Putting it together
The generator precomputes two slot maps at generation time and injects them as dict literals into the generated module. Both maps have one entry per Association class (104 classes total), but they differ in which fields each entry contains:
- `_QUALIFIER_IDENTITY_SLOTS`: each class maps to SPO + qualifiers + KL + AT + `primary_knowledge_source` (8–24 fields per class depending on qualifier count)
- `_STANDARD_IDENTITY_SLOTS`: same as above plus `publications` (9–25 fields per class)

Generation-time vs runtime boundary. `SchemaView` and `linkml_runtime` are generation-time-only dependencies: they run once when `biolink_pydantic_generator.py` produces `pydanticmodel_v2.py`, not when Association classes are instantiated. The slot maps above are emitted as plain Python dict literals in the generated module; there is no schema parsing, no YAML loading, and no `linkml_runtime` import at runtime. The only runtime imports the feature adds are stdlib (`hashlib`, `json`, `enum`) and `pydantic.model_validator`. This is a key design property: the generated module is self-contained and has zero additional runtime dependencies beyond what LinkML's Pydantic output already requires.

At runtime, the `DeterministicIdMixin`'s `model_validator(mode='before')` checks `AssociationIdConfig.strategy`, looks up the appropriate slot map (or computes `ALL_FIELDS`/`CUSTOM` dynamically), builds a canonical string from the field values, and produces a SHA-256 hash. The `ALL_FIELDS` and `CUSTOM` strategies don't need precomputed maps; they resolve fields at runtime from `cls.model_fields`.

Runtime-configurable strategies
Different downstream projects have different needs for what constitutes "the same association." Four strategies are available, configured once at process startup:
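- `SPO_Q_KLAT_PKS` (default): SPO + qualifiers + KL + AT + PKS + publications
- `QUALIFIER_BASED`: same as the default, but without publications
- `ALL_FIELDS`: every field except `id`
- `CUSTOM`: user-specified field list

All strategies include the class name as a discriminator, so a `GeneToDiseaseAssociation` and a plain `Association` with identical field values get different IDs.

Identity field composition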
For the curated strategies (`QUALIFIER_BASED` and `SPO_Q_KLAT_PKS`), the hash is built from the following fields:

- `subject`, `predicate`, `object`
- `negated`
- `knowledge_level`, `agent_type`
- `primary_knowledge_source`
- qualifier slots, discovered via `slot_ancestors()` at generation time
- `publications` (`SPO_Q_KLAT_PKS` only)
- the class name, `cls.__name__`

Auto-discovered qualifier fields per subclass
The generator automatically discovers qualifier slots for all 104 Association classes. Representative examples:
- `Association`: `qualifier`
- `GeneToDiseaseAssociation`: `object_direction_qualifier`, `qualified_predicate`, `qualifier`, `subject_aspect_qualifier`, `subject_form_or_variant_qualifier`
- `DiseaseToPhenotypicFeatureAssociation`: `onset_qualifier`, `sex_qualifier`, `frequency_qualifier`, `disease_context_qualifier`, ...
- `ChemicalAffectsGeneAssociation`: `anatomical_context_qualifier`, `causal_mechanism_qualifier`, `species_context_qualifier`, ...
- `GeneAffectsChemicalAssociation`: [...] `object_derivative_qualifier`

Usage examples
Default behavior (SPO+Q+KL+AT+pubs)
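A minimal stdlib-only sketch of the default behavior, standing in for the generated Pydantic classes. The helper `with_deterministic_id`, the tuple `STANDARD_SLOTS`, the JSON canonicalization, and the `uuid:` plus full SHA-256 hex format are assumptions made for illustration; in the PR this logic lives in `DeterministicIdMixin`'s before-validator. The CURIEs are example values.

```python
import hashlib
import json

# Assumed identity slots for a plain Association under the default strategy:
# SPO, negated, one qualifier, KL, AT, primary knowledge source, publications.
STANDARD_SLOTS = (
    "subject", "predicate", "object", "negated", "qualifier",
    "knowledge_level", "agent_type", "primary_knowledge_source", "publications",
)


def with_deterministic_id(class_name: str, data: dict) -> dict:
    """Mimic the before-validator: fill in `id` only when the caller omitted it."""
    if data.get("id"):
        return data  # explicit IDs pass through unchanged
    payload = {"class": class_name}  # class name acts as a discriminator
    payload.update({slot: data.get(slot) for slot in STANDARD_SLOTS})
    canonical = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**data, "id": f"uuid:{digest}"}


fields = {
    "subject": "HGNC:1100",
    "predicate": "biolink:associated_with",
    "object": "MONDO:0007254",
    "knowledge_level": "knowledge_assertion",
    "agent_type": "manual_agent",
}
first = with_deterministic_id("Association", fields)
second = with_deterministic_id("Association", dict(fields))
other_class = with_deterministic_id("GeneToDiseaseAssociation", fields)
explicit = with_deterministic_id("Association", {**fields, "id": "my:custom-id"})
```

Here `first` and `second` carry the same ID, `other_class` differs because the class name is part of the hashed payload, and `explicit` keeps the caller's ID.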
Switching to ALL_FIELDS (hash everything)
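In the generated module the switch is presumably made by assigning to `AssociationIdConfig.strategy`, since that is the attribute the validator reads; the exact call is not confirmed here. The sketch below only illustrates the effect of hashing everything, using a hypothetical helper `all_fields_id` that hashes every supplied key except `id`. The PR resolves fields from `cls.model_fields` instead, and the hash format is an assumption.

```python
import hashlib
import json


def all_fields_id(class_name: str, data: dict) -> str:
    """Hash every supplied field except `id` (assumed ALL_FIELDS behavior)."""
    payload = {key: value for key, value in data.items() if key != "id"}
    canonical = json.dumps({"class": class_name, "fields": payload}, sort_keys=True)
    return "uuid:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "subject": "HGNC:1100",
    "predicate": "biolink:associated_with",
    "object": "MONDO:0007254",
}
# Adding a metadata-style field changes the hash under this strategy.
annotated = {**base, "description": "curated from a review article"}

base_id = all_fields_id("Association", base)
annotated_id = all_fields_id("Association", annotated)
```

Because every field participates, two records that differ only in descriptive metadata get different IDs, which is the granularity this strategy trades against deduplication.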
Custom field set
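A sketch of what a custom field set implies, again with stand-ins rather than the generated API: `custom_id` is a hypothetical helper, and how the field list is registered on `AssociationIdConfig` is not shown here. The publication identifier is a placeholder.

```python
import hashlib
import json


def custom_id(class_name: str, data: dict, fields: list[str]) -> str:
    """Hash only the caller-chosen fields (assumed CUSTOM behavior)."""
    payload = {name: data.get(name) for name in sorted(fields)}
    canonical = json.dumps({"class": class_name, "fields": payload}, sort_keys=True)
    return "uuid:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = {
    "subject": "HGNC:1100",
    "predicate": "biolink:associated_with",
    "object": "MONDO:0007254",
    "publications": ["PMID:0000001"],  # placeholder identifier
}
spo_only = ["subject", "predicate", "object"]

with_pubs = custom_id("Association", record, spo_only)
without_pubs = custom_id("Association", {**record, "publications": []}, spo_only)
```

With an SPO-only field list, records that differ only in `publications` collapse to the same ID, so the choice of fields directly sets how aggressively associations are deduplicated.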
Design tradeoffs
- IDs use the full-length hash with a `uuid:` prefix. Full length avoids collision risk.
- `AssociationIdConfig` is process-global. Fine for most pipelines (set once at startup), but not suitable if you need two strategies simultaneously in the same process.
- A per-call alternative is `model_validate(data, context={"strategy": ...})`, but that only works with `model_validate()`, not the `Association(...)` constructor.
- When the schema changes, `ALL_FIELDS` hashes change for the same data. The curated presets (`SPO_Q_KLAT_PKS`, `QUALIFIER_BASED`) are version-stable.
- `SPO_Q_KLAT_PKS` is the default because it matches previous community work. Projects that want maximum deduplication safety can switch to `QUALIFIER_BASED`; projects that want maximum granularity can use `ALL_FIELDS`.
- The `uuid:` prefix provides namespace clarity without constraining the hash algorithm.

Future work: `unique_keys` as the authoritative source

LinkML supports `unique_keys`, a schema-level declaration that a tuple of slots uniquely identifies a class instance (mapped to `owl:hasKey` in OWL).

Biolink currently has no `unique_keys` declarations on any Association class. This PR uses the slot `is_a` hierarchy as a pragmatic alternative: it already encodes the right information and requires zero new annotations across 104 classes.

However, `unique_keys` would be the more principled long-term approach:

- Identity would be declared in `biolink-model.yaml` itself, not in generator logic. Other generators (JSON Schema, SQL DDL, SHACL) could consume the same declarations.
- `unique_keys` allows per-class overrides.
- `unique_keys` can drive uniqueness constraints in databases, SPARQL shapes, and other downstream artifacts, not just Pydantic ID hashing.

A natural evolution would be to add `unique_keys` declarations to Association subclasses over time, and have the generator prefer `unique_keys` where declared, falling back to the slot `is_a` hierarchy discovery for classes that don't have them yet. The runtime strategies (`ALL_FIELDS`, `CUSTOM`) would remain available regardless.

Files changed
- `scripts/biolink_pydantic_generator.py`: computes slot maps via `SchemaView.slot_ancestors()` and injects `IdStrategy`, `AssociationIdConfig`, precomputed slot maps, and `DeterministicIdMixin`
- `Makefile`: calls the generator script instead of `gen-pydantic`
- `src/biolink_model/datamodel/pydanticmodel_v2.py`: regenerated with the `Association(DeterministicIdMixin, Entity)` declaration
- `tests/test_deterministic_ids.py`