feat(checkpoint): state migration for checkpoints (proposal 0014)#46
Conversation
…rface Implements pipeline-utilities spec §10.12 (proposal 0014). - New errors: CheckpointStateMigrationMissing, CheckpointStateMigrationFailed. Both non-transient per §10.10. The missing-chain error carries from_version / to_version / registered_migrations_count / registry_description for actionable operator diagnostics. - New types: StateMigration (frozen dataclass — from_version, to_version, migrate callable) and MigrationRegistry (BFS chain resolution + ambiguity detection per §10.12.2). - Multi-shortest-path detection: when BFS finds a shortest path AND a second distinct path of equal length exists, the registry raises ValueError per the spec's ambiguous-chain rule. Resume surfaces this as CheckpointStateMigrationMissing with the ambiguity description in the payload. - State.schema_version: ClassVar[str] = '' (per spec §10.2's per-language carve-out). Empty-string sentinel; the framework reads type(state).schema_version at save time. - Checkpointer Protocol: supports_state_migration: ClassVar[bool] marker per §10.12.1. InMemoryCheckpointer: False (typed in- memory references can't expose a class-independent intermediate). SQLiteCheckpointer: True in JSON mode, False in pickle mode (pickle holds class identity and round-trips to typed instances; can't bridge versions). - GraphBuilder.with_state_migration / with_state_migrations thread a populated MigrationRegistry into CompiledGraph at compile time. - Resume-path routing (compiled.py): version mismatch → unsupported-backend check → registry lookup → chain application (with per-migration failure wrap) → final deserialization. The post-migration deserialization failure still surfaces as CheckpointRecordInvalid per §10.12.4; pre-migration version mismatch routes through the new two categories. Order matters; documented inline so a future reader doesn't swap it back. - Parent-state migration: same chain applied to each entry of parent_states in lockstep with the outer state per §10.12.2. Code comment records the spec-mandated equivalence so future contributors don't add per-parent metadata without a follow-on proposal. - Drop the CHECKPOINT_SCHEMA_VERSION = '1' constant: per Q1 spec answer, the old backend-internal record-shape role had no spec slot anyway. SQLiteCheckpointer no longer rejects records with non-default versions on load — that routing is now the engine's concern at resume time. Existing records carrying schema_version='1' get reinterpreted as user-facing v1 identifiers (single-user dev, no compat shim needed per Chris's note).
New tests/conformance/test_state_migration.py drives all 8 spec state-migration fixtures end-to-end against the real engine (SQLite JSON-mode backend, real graph compile, real resume path). Harness pieces: - Migration mock library: add_new_field_default, add_v2_field, add_v3_field, identity_passthrough, raises_keyerror, should_not_run, irrelevant. Each fixture's migrate: <name> resolves through the library. - _MigrationTrace wraps each mock to capture invocation order for the migrations_run / migration_count / single_migration_invocation / migration_order_matches_chain assertions. Consecutive duplicates collapse (fixture 043 runs each step once for outer + once for each parent under the lockstep ordering; the assertion is per-step, not per-entity). - _seed_record persists a checkpoint matching the fixture's seeded_record: block before invoke(resume_invocation=...) so the resume path has data to load. Harness/model adjustments: - StateSchema in tests/conformance/harness/directives.py gains optional schema_version (default '') and the required field knob (no-default Pydantic field, used by fixture 044's required_v2_field deserialization-failure case). - _DEFERRED_FIXTURES in test_fixture_parsing.py loses the 039-046 rows; the CasesFixture model parses them via permissive extras on CaseSpec. - Initial-state construction in the resume path uses state_cls.model_construct() so fixtures with required fields (044) can pass a placeholder past Pydantic's validator before the resume even starts; the engine loads state from the checkpoint, not from the placeholder. Protocol-attribute shape: - Checkpointer.supports_state_migration declared as bool = False (not ClassVar) so SQLiteCheckpointer can set the value per-instance in __init__ based on serialization mode. Backends with a static answer (InMemoryCheckpointer) override at the class level with bool = False — Pyright accepts either shape because Protocol attribute conformance ignores the ClassVar marker on subclasses.
… + CHANGELOG tests/unit/test_state_migration.py (16 tests) covers gaps the conformance fixtures don't exercise directly: BFS edge cases on the registry, multi-shortest-path ambiguity detection, GraphBuilder ergonomics (singular + plural registration, duplicate-pair ValueError), and error attribute carriage including __cause__ preservation on CheckpointStateMigrationFailed. docs/concepts/checkpointing.md gains a State migrations section covering: the State.schema_version declaration, the registration surface (with_state_migration / with_state_migrations), BFS chain resolution and ambiguity cases (duplicate edges + multi-shortest- path), the two new error categories and how they relate to CheckpointRecordInvalid (§10.12.4), backend support and why SQLite-pickle / InMemory aren't migration-eligible, the lockstep parent_states migration rule, and the migrations-MUST-be-pure contract. CHANGELOG.md gains two new Added entries (state-migration surface + Checkpointer.supports_state_migration Protocol attribute) plus a Changed entry documenting the CheckpointRecord.schema_version semantic shift and the CHECKPOINT_SCHEMA_VERSION constant removal. Pre-1.0 breaking change covered by the consolidated-release flag.
Spec proposal 0018 (chain-ambiguity category) landed in spec v0.16.0 between PR-4 push and the spec agent's code review. This commit adds the new error category and the required routing swaps; PR-4 stays open as the carrier. - Submodule pinned to v0.16.0; pyproject + __spec_version__ + test_smoke pin assertions follow. - New canonical error class CheckpointStateMigrationChainAmbiguous (and the matching category string) added to checkpoint/errors.py, re-exported from openarmature.checkpoint. Non-transient. Carries optional from_version / to_version when known (always set for the registration-time case; resume-time multi-shortest-path also populates). - MigrationRegistry.register raises CheckpointStateMigrationChainAmbiguous directly at registration time per spec §10.10 (proposal 0018), so the canonical category surfaces at the registration boundary without wrapping. The internal BFS keeps raising ValueError for multi-shortest-path detection; CompiledGraph._migrate_record's except branch now routes that to CheckpointStateMigrationChainAmbiguous at the resume boundary. - Routing precedence on resume per §10.10 (v0.16.0): chain-ambiguous → missing → failed → record-invalid. Code-comment in _migrate_record documents the ordering so a future reader doesn't swap it back. - Code comments record two design-choice notes the spec agent flagged: the load-bearing equal-depth-re-entry behavior in BFS cycle protection (tightening to <= breaks multi-shortest-path detection), and the registration-order ordering of describe() output. - Conformance harness gains the expected_chain_ambiguity_error primitive. Accepts the canonical category at EITHER the case top-level (registration-time detection: registration raises, resume block unreachable) OR inside resume (load-time detection during BFS). Spec §10.12.2's compile-time-SHOULD / load-time-acceptable carve-out lands in the driver as a single try/except wrap around both phases. - Fixture 047 (state-migration-chain-ambiguous) covered; all 9 state-migration fixtures (039-047) pass. - 3 new unit tests for the new category's class shape + attribute carriage; existing duplicate-pair tests updated to assert the canonical category instead of ValueError. - docs/concepts/checkpointing.md updated for the third category + the v0.16.0 routing precedence. CHANGELOG entry updated and Pinned-version line bumped to 0.10.0 → 0.16.0. OTel openarmature.checkpoint.migrate span emission deferred to a follow-on; documented inline in _migrate_record. Spec §6 cross-ref is SHOULD-level, and the emission needs the _InvocationContext that's currently created AFTER the migration path runs. The follow-on can either restructure resume to build the context first or use a side-channel observer dispatch.
Lands the §6 cross-ref piece of proposal 0014 that the previous commit deferred. Versioned resumes whose migration chain runs now emit a zero-duration `openarmature.checkpoint.migrate` span on the OTel observer, parented under the invocation root. - New _MigrationSummary frozen dataclass in graph/compiled.py carries the chain metadata (from_version, to_version, chain_length) out of _migrate_record back to invoke(). - _migrate_record return type becomes tuple[CheckpointRecord, _MigrationSummary]; invoke() captures the summary across the migration → context-creation gap. - After context + delivery-worker setup but before the main execution loop, invoke() dispatches a synthetic NodeEvent with phase='checkpoint_migrated' carrying the summary on pre_state. Mirrors the checkpoint_saved pattern (state-on-pre, post=None) so the OTel observer's existing routing shape picks it up cleanly. - New phase 'checkpoint_migrated' added to NodeEvent's Literal union and to KNOWN_PHASES (rejects typos in observer.phases subscriptions). NodeEvent.pre_state typed as Any so the permissive _MigrationSummary payload fits; the OTel observer narrows defensively via isinstance. - OTelObserver._emit_checkpoint_migrate_span emits the span with the three normative attributes. Parented under the invocation root span (opened lazily if not yet present — versioned resumes are the first event on a new invocation_id, before any node-span emission). Mirrors _emit_checkpoint_save_span. - Two OTel unit tests: one positive (versioned resume emits the span with the right attrs) and one negative (version-match fast path emits no span, per spec §10.12.3). CHANGELOG entry promoted from 'deferred to follow-on' to the landed shape. Phase gated off default subscriptions so legacy observers don't surface a new event without opt-in.
There was a problem hiding this comment.
Pull request overview
Implements proposal 0014 (state migration for checkpoints) and bumps the spec submodule to v0.16.0. Repurposes CheckpointRecord.schema_version from a backend-internal record-shape constant to a user-facing state-schema identifier, adds a StateMigration/MigrationRegistry pair with BFS chain resolution, builds them into GraphBuilder (with_state_migration / with_state_migrations), and threads them through CompiledGraph._migrate_record on resume. Adds three new error categories (CheckpointStateMigrationMissing, CheckpointStateMigrationFailed, CheckpointStateMigrationChainAmbiguous), a new Checkpointer.supports_state_migration Protocol attribute, and a checkpoint_migrated synthetic observer phase that the OTel observer turns into an openarmature.checkpoint.migrate span.
Changes:
- New
openarmature.checkpoint.migrationmodule withStateMigration+MigrationRegistry(BFS resolution with duplicate-edge and multi-shortest-path ambiguity detection). - Engine-side resume routing in
CompiledGraph._migrate_recordplus a syntheticcheckpoint_migratedevent surfaced as a new OTel span. - Removal of the
CHECKPOINT_SCHEMA_VERSIONconstant and the SQLite version-equality rejection; conformance + unit test coverage and a docs section.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| src/openarmature/checkpoint/migration.py | New BFS migration registry + StateMigration dataclass |
| src/openarmature/checkpoint/errors.py | Three new canonical error categories with carriage attributes |
| src/openarmature/checkpoint/protocol.py | schema_version semantic shift; new supports_state_migration Protocol attr; constant removed |
| src/openarmature/checkpoint/backends/sqlite.py | Per-instance supports_state_migration; drops backend-level version-mismatch reject |
| src/openarmature/checkpoint/backends/memory.py | Declares migration ineligibility |
| src/openarmature/checkpoint/init.py | Re-exports new types/errors; drops removed constant |
| src/openarmature/graph/state.py | New schema_version: ClassVar[str] = "" on State |
| src/openarmature/graph/builder.py | with_state_migration[s] registration surface, threaded into compile() |
| src/openarmature/graph/compiled.py | _migrate_record resume routing, _MigrationSummary, synthetic checkpoint_migrated event dispatch |
| src/openarmature/graph/events.py | pre_state: Any; new checkpoint_migrated phase literal |
| src/openarmature/graph/observer.py | Adds checkpoint_migrated to KNOWN_PHASES |
| src/openarmature/observability/otel/observer.py | Emits openarmature.checkpoint.migrate span on the synthetic event |
| src/openarmature/init.py / pyproject.toml | Spec bump 0.15.0 → 0.16.0 |
| tests/conformance/test_state_migration.py | New conformance driver for fixtures 039–047 (range includes 047) |
| tests/conformance/test_fixture_parsing.py | Removes 039–046 from deferred-fixtures list |
| tests/conformance/harness/directives.py | StateSchema.schema_version; StateFieldSpec.required |
| tests/unit/test_state_migration.py | New unit coverage for registry, builder, and error carriage |
| tests/unit/test_observability_otel.py | Two new tests for migrate-span presence and version-match absence |
| tests/unit/test_checkpoint.py | Updates for new default schema_version + opaque SQLite round-trip |
| tests/test_smoke.py | Spec version smoke updated |
| docs/concepts/checkpointing.md | New "State migrations" concept section |
| CHANGELOG.md | Documents the additive surface, the schema_version semantic shift, and the spec bump |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- compiled.py: align save-side schema_version read on self.state_cls (was type(post_state)); _maybe_save_checkpoint drops @staticmethod so it can access self. Symmetric with the resume-side check; subclass schema_versions no longer shadow the declared graph schema. - compiled.py _apply_migration_step: re-raise CheckpointError subclasses unchanged before the bare-Exception wrap, so a canonical-category raise from a user migration propagates without being squashed as CheckpointStateMigrationFailed. Trim the wrap-message to the from/to identity + 'raised while migrating <label>' — __cause__ already carries the underlying exception; embedding type(exc).__name__: str(exc) duplicates info Python's traceback formatter shows anyway. - builder.py with_state_migrations: pre-validate the full input list (against existing registry + against earlier entries in the same call) before mutating. A duplicate-pair raise in the middle of the list no longer leaves earlier entries half-registered. Update the singular with_state_migration docstring to reflect the canonical raise type (CheckpointStateMigrationChainAmbiguous, not ValueError) + flag the empty-to_version ValueError. - migration.py register: reject empty to_version (un-declared sentinel is not a valid chain target per spec §10.2). Empty from_version stays valid for the documented bridging case. - migration.py resolve_chain: raise CheckpointStateMigrationChainAmbiguous directly on multi-shortest-path detection (was ValueError). The registry's ambiguity contract is now one type regardless of when it surfaces (register vs resolve). compiled.py's wrap removed. - protocol.py Checkpointer: clarify the supports_state_migration docstring on the Protocol-class-body vs runtime-attribute semantic — class-body '= False' is a typing-level signal, not a runtime guarantee. Engine uses getattr-with-default; backends SHOULD declare for Pyright honesty. - events.py NodeEvent: document the synthetic-phases conventions (step=-1, dotted node_name, non-State pre_state on checkpoint_migrated) in the class docstring so third-party observer authors don't need to read the engine source. - observer.py KNOWN_PHASES: expand the comment about the synthetic-phase opt-in and cross-reference the NodeEvent docstring. - tests/conformance/test_state_migration.py: range docstring updated from 039-046 to 039-047 with a one-paragraph note on the chain-ambiguity surface 047 covers. - tests/unit/test_state_migration.py: existing ambiguous-shortest-paths test asserts the canonical category (was ValueError); two new tests for the empty-to_version rejection + empty-from_version bridging acceptance; two new tests for with_state_migrations atomic pre-validation (no partial-state on internal-list and against-existing-registry duplicate cases).
Consolidated release for the five-PR batch: - Structured output (proposal 0016, PR #42) - Image content blocks (proposal 0015, PR #44) - Prompt management (proposal 0017, PR #45) - State migration for checkpoints (proposal 0014, PR #46) - Parallel branches (proposal 0011, PR #47) Bumps: - ``pyproject.toml`` project.version: 0.5.0 → 0.6.0 - ``__version__`` in src/openarmature/__init__.py - ``uv.lock`` editable package version - ``tests/test_smoke.py`` version assertion Flips CHANGELOG ``[Unreleased]`` to ``[0.6.0] — 2026-05-16``, drops the release-gate Notes entry, and tightens the pre-1.0 MINOR note to list the two behavioral changes (retry-MW attempt-index propagation, CheckpointRecord.schema_version semantic shift) instead of the structured-output-specific note carried over from PR-1. Pinned spec stays at v0.16.1 (set in PR #47).
Summary
openarmature.checkpoint+openarmature.graph. PR-4 of the five-PR batch following PR-1 (feat(llm): structured output (proposal 0016) #42), PR-2 (feat(llm): image content blocks (proposal 0015) #44), PR-3 (feat(prompts): prompt-management core (proposal 0017) #45).CheckpointRecord.schema_versionfrom the prior backend-internal record-shape role to a user-facing state-schema identifier per spec §10.2. Saved records carry whatever the user declared ontype(state).schema_versionat save time; on resume, version mismatches route through a registered migration chain instead of failing.openarmature.checkpoint:StateMigration(frozen dataclass —from_version/to_version/migratecallable) andMigrationRegistry(BFS chain resolution + duplicate-edge AND multi-shortest-path ambiguity detection per §10.12.2).GraphBuilder.with_state_migration(from, to, migrate)andwith_state_migrations(*migrations)populate the registry;compile()threads it through to theCompiledGraph. On resume, the engine readsrecord.schema_version, compares againsttype(state_cls).schema_version, and applies the resolved chain (or raises one of two new error categories).CheckpointStateMigrationMissing(no chain bridges the versions, or chain ambiguous) andCheckpointStateMigrationFailed(a migration function raised). Both non-transient. Post-migration deserialization failures still surface asCheckpointRecordInvalidper §10.12.4 — the three categories are mutually exclusive on any given resume.Checkpointer.supports_state_migration: boolmarks whether a backend can expose a structural intermediate form of the loaded state (the dict the migration registry consumes).SQLiteCheckpointer(serialization="json")opts in;SQLiteCheckpointer(serialization="pickle")andInMemoryCheckpointeropt out. Mismatched version against a non-migration-eligible backend raisesCheckpointRecordInvalidper §10.12.1.What's new
StateMigration,MigrationRegistryopenarmature.checkpoint. Registry'sresolve_chainruns BFS and raisesValueErroron ambiguous-shortest-paths per §10.12.2's load-time-detection allowance.CheckpointStateMigrationMissing,CheckpointStateMigrationFailedfrom_version/to_version/registered_migrations_count/registry_descriptionfor actionable diagnostics.State.schema_version: ClassVar[str] = ""GraphBuilder.with_state_migration/with_state_migrations(from, to)pairs raiseValueErrorat registration.Checkpointer.supports_state_migrationClassVar) so SQLite can compute it per-instance based on serialization mode.CompiledGraph._migrate_recordparent_statesin lockstep with the outer state per §10.12.2. Code comment records the spec-mandated equivalence so future contributors don't add per-parent metadata without a follow-on proposal.CHECKPOINT_SCHEMA_VERSIONschema_version. If future record-shape evolution needs versioning, backends add their own private field.Release gate
PR-4 of a five-PR batch (
0016→0015→0017→0014→0011). Do not tag a release until all five land — the CHANGELOG[Unreleased]Notes section carries the gate from PR-1. PR-5 (proposal 0011, parallel branches) is the only remaining piece.Commits
feat(checkpoint): state migration registry, types, errors, builder surfacetest(conformance): drive 0014 state-migration fixtures 039-046test(unit): state migration + docs(concepts) state migrations section + CHANGELOGNotable implementation details
schema_versionsemantic shift is a clean break. Pre-PR-4 records carrying the old"1"constant get reinterpreted as user-facing v1 identifiers. Per Chris's note, no compat shim — single-user dev, no production migration. CHANGELOG flags the shift underChanged.supports_state_migrationis an instance attribute, notClassVar. Started asClassVar[bool]per the original plan, butSQLiteCheckpointer's answer depends on the constructor'sserializationarg. The Protocol declares it asbool = False; backends override either at the class level (InMemoryCheckpointer = False) or per-instance (SQLite computed in__init__). Pyright accepts both forms.parent_states[i]before advancing to the next step. End state is identical to sequential ordering given pure migrations (spec §10.12.2 mandates purity), but lockstep gives cleaner failure attribution (step k failed on outer-or-parent-i) and sets up per-migration spans for a follow-on. Documented in the engine path's code comment.InMemoryCheckpointeris no longer migration-eligible. It holds live typed-state references; there's no class-independent intermediate. Mismatched-version resume against in-memory backend raisesCheckpointRecordInvalidper §10.12.1 — same category as before, slightly different reason.Conformance fixtures (039–046)
All 8 fixtures pass:
v1 → v2migration populates a new field; one migration ranv1 → v2 → v3chain runs in orderCheckpointStateMigrationMissingwithregistered_migrations_count: 0should_not_runmock never calledparent_states[i]CheckpointRecordInvalid(per §10.12.4)CheckpointStateMigrationMissingwithregistered_migrations_count: 1and the registered set inregistry_descriptionKeyError→CheckpointStateMigrationFailedwith theKeyErrorpreserved as__cause__Harness extensions
tests/conformance/test_state_migration.py:add_new_field_default,add_v2_field,add_v3_field,identity_passthrough,raises_keyerror,should_not_run,irrelevant. Each fixture'smigrate: <name>resolves through the library._MigrationTracewraps every mock to capture invocation order, supportingmigrations_run/migration_count/single_migration_invocation/migration_order_matches_chainassertions across all 8 fixtures._seed_recordpersists aCheckpointRecordmatching the fixture'sseeded_record:block beforeinvoke(resume_invocation=...)so the resume path has data to load.StateSchemamodel intests/conformance/harness/directives.pygains optionalschema_version(default"") and arequired: boolknob for fixture 044'srequired_v2_fieldno-default field._DEFERRED_FIXTURESintest_fixture_parsing.pyloses the 039-046 rows; the existingCasesFixturepermissive-extras path picks up the new shape unchanged.Test plan
uv run pytest— 644 pass, 64 skipped (down from 79; 8 fixtures × 2 = 16 fewer skips, minus 1 new docs example for the state-migrations concept section), 0 failed.uv run pyright— clean.uv run ruff check+uv run ruff format— clean.uv run --group docs mkdocs build --strict— clean.tests/unit/test_state_migration.pycovering BFS edge cases, multi-shortest-path ambiguity, GraphBuilder ergonomics, and__cause__preservation.docs/concepts/checkpointing.md#state-migrationsviamkdocs serveto confirm rendering.Pre-1.0 SemVer
Breaking-ish:
CheckpointRecord.schema_versionsemantic shift means existing records with the prior"1"constant value get reinterpreted as user-facing v1 identifiers. TheCHECKPOINT_SCHEMA_VERSIONmodule constant is removed. Pre-1.0 batch + single-user dev means no compatibility shim; CHANGELOG flags the shift underChanged.Additive otherwise: new types, new errors, new builder method, new state ClassVar, new Protocol attribute. Existing free-form callers (no
schema_versiondeclared on their state, no migrations registered) see no behavior change — records carry""and the engine never consults the registry.