Skip to content

feat(llm): structured output (proposal 0016)#42

Merged
chris-colinsky merged 24 commits into
mainfrom
feature/0016-structured-output
May 15, 2026
Merged

feat(llm): structured output (proposal 0016)#42
chris-colinsky merged 24 commits into
mainfrom
feature/0016-structured-output

Conversation

@chris-colinsky
Copy link
Copy Markdown
Member

@chris-colinsky chris-colinsky commented May 15, 2026

Summary

  • Implements spec proposal 0016 (LLM provider structured output) in openarmature.llm: response_schema parameter on Provider.complete(), Response.parsed field, StructuredOutputInvalid non-transient error category, OpenAI native response_format wire path with strict: true heuristic, prompt-augmentation fallback, and a Pydantic-class overload (class-in → BaseModel instance out).
  • Spec submodule bumps v0.10.0 → v0.15.0 under skip-ahead governance, covering the full 5-proposal batch. Fixtures from proposals 0011 / 0014 / 0015 / 0017 are marked deferred-skip in the conformance suite and unmark as each subsequent PR lands.
  • Conformance harness helpers (match_wire_body with "*" wildcards, assert_response_format_absent, assert_system_references_schema, assert_error_carries) land as capability-agnostic infrastructure under tests/conformance/harness/wire.py so upcoming 0014 / 0015 / 0017 PRs reuse without refactoring.

Release gate

This is PR-1 of a five-PR batch (0016 → 0015 → 0017 → 0014 → 0011). Do not tag a release until all five PRs land — the v0.15.0 submodule pin presumes the full batch will ship. The constraint is also recorded in the CHANGELOG [Unreleased] Notes section.

What's new

Surface Change
Provider.complete() New response_schema: dict | type[BaseModel] | None parameter. Defaults to None; the v0.4.0 free-form contract is preserved exactly.
Response.parsed New dict | BaseModel | None field. Populated when response_schema is supplied and the model returned structured content; absent on tool-call responses regardless.
StructuredOutputInvalid New error category. Non-transient by default — NOT in TRANSIENT_CATEGORIES. Carries response_schema, raw_content, failure_description.
validate_response_schema, strict_mode_supported Provider-agnostic helpers in openarmature.llm.provider. The strict-mode heuristic walks anyOf/oneOf/allOf branches and follows $ref with cycle protection; unresolvable refs conservatively return False.
OpenAIProvider constructor New force_prompt_augmentation_fallback: bool = False flag + uses_prompt_augmentation_fallback read-only property. Switches structured-output calls to the fallback path for OpenAI-compatible servers that reject or silently ignore response_format.
pyproject.toml Adds jsonschema>=4.0 runtime dep; spec_version bumped to 0.15.0.

Commits

The PR is reviewable commit-by-commit. Each commit independently builds and passes its targeted subset.

  1. chore: bump spec to v0.15.0; add jsonschema; skip deferred fixtures
  2. feat(llm): add StructuredOutputInvalid error category
  3. feat(llm): add Response.parsed field
  4. feat(llm): validate_response_schema + strict_mode_supported helpers
  5. feat(llm): Provider Protocol gains response_schema parameter
  6. feat(llm/openai): native response_format wire path + Pydantic overload
  7. feat(llm/openai): prompt-augmentation fallback + inspect property
  8. test(conformance): capability-agnostic harness helpers for wire + carries
  9. test: drive 0016 fixtures 021-028 + add structured-output unit tests
  10. docs: changelog entry for proposal 0016 under [Unreleased]

Test plan

  • uv run pytest tests/conformance/test_llm_provider.py — 16 pass, 12 skipped (0015 multimodal, lands in PR-2)
  • uv run pytest tests/unit/test_structured_output.py — 25 pass
  • uv run pytest — 483 pass, 77 skipped, 0 failed
  • uv run pyright — clean
  • uv run ruff check + uv run ruff format — clean
  • Manual: structured-output call against a live OpenAI-compatible endpoint (dict schema + Pydantic class) with Response.parsed verified end-to-end.

Pre-1.0 SemVer

Additive change. Free-form callers (no response_schema) see no behavior change — the new parameter defaults to None, the wire body omits response_format, and Response.parsed remains absent.

Spec submodule moves from v0.10.0 to v0.15.0 — covers the full
5-proposal batch (0011, 0014, 0015, 0016, 0017) in one bump per
the skip-ahead governance principle. spec_version in pyproject.toml
bumped to match.

Adds jsonschema>=4.0 as a runtime dependency (used by the
forthcoming structured-output validation path on the dict-schema
side; Pydantic-class path uses its own validator).

Adds skip markers to the conformance test files for fixtures whose
runtime support lands in a later PR of the batch:

- llm-provider 009-020 → 0015 multimodal (PR-2)
- llm-provider 021-028 → 0016 structured output (this PR, wired up
  in a later commit)
- pipeline-utilities 032-038 → 0011 parallel branches (PR-5)
- pipeline-utilities 039-046 → 0014 state migration (PR-4)
- graph-engine 021-observer-branch-name → 0011 parallel branches
  (PR-5)

Skip markers also apply to test_fixture_parsing.py for the same set
— the typed harness models in tests/conformance/harness/ don't yet
know about the new directive shapes (state_migration, parallel
branches state-schema variation, NodeEvent.branch_name); each
deferring PR drops its own skip rows when it lands the harness
work.
Adds the structured_output_invalid canonical category. Raised when a
complete() call requested a response_schema and the provider's
returned content could not be parsed as JSON OR did not validate
against the schema. The exception carries response_schema,
raw_content, and failure_description attributes for caller
introspection.

Non-transient by default — NOT added to TRANSIENT_CATEGORIES. The
default RetryMiddleware classifier will not retry this category;
callers wanting retry-on-validation-failure can include the category
in a custom classifier's transient set.
Adds the parsed field to the Response record. Default None, populated
by structured-output calls (response_schema set on complete() and the
model returned structured content). The runtime type is a discriminated
union over dict (when the caller passed a JSON-Schema dict) and
BaseModel instance (when the caller passed a Pydantic class).

Pydantic Response config now allows arbitrary types so a BaseModel
instance can sit in the parsed slot. No public surface change for
free-form callers — parsed defaults to None and remains None when
response_schema is not supplied.
Adds two provider-agnostic helpers in openarmature.llm.provider used
by structured-output Provider implementations:

- validate_response_schema(schema) — pre-send structural check that
  the value is a dict and its top-level type is "object". Raises
  ProviderInvalidRequest on failure.

- strict_mode_supported(schema) — whether the schema satisfies the
  strict-mode constraint set (additionalProperties not true,
  properties fully covered by required) across the full schema tree.
  Walks anyOf/oneOf/allOf branches and follows $ref targets with
  cycle protection. An unresolvable $ref or unknown shape returns
  False (conservative fail).

Both are exported from openarmature.llm so OpenAI-compatible
providers and any future Anthropic/Gemini provider share the same
constraint heuristic.
Extends the Provider Protocol's complete() method signature to accept
an optional response_schema parameter. Accepts either a JSON Schema
dict or a Pydantic BaseModel subclass; the implementation converts
the class form to a JSON Schema at the boundary.

Free-form callers (response_schema=None or absent) see no behavior
change — the parameter defaults to None and the v0.4.0 contract is
preserved.

OpenAIProvider's complete() still has the v0.4.0 signature; the next
commit wires the response_schema parameter through it.
Threads response_schema through OpenAIProvider.complete() → _do_complete()
→ _parse_response(). Accepts either a JSON Schema dict OR a Pydantic
BaseModel subclass; the latter is converted via model_json_schema()
at the boundary.

Native wire path: when response_schema is supplied, the request body
includes response_format: { type: "json_schema", json_schema: { name,
schema, strict } }. The name field comes from schema.title when
non-empty, otherwise a deterministic sha256 hash of the schema. The
strict flag is set per strict_mode_supported() — true only when the
schema cleanly satisfies the constraints across the full tree.

Post-receive: parses message.content as JSON, then validates against
the schema. Dict-input path validates with jsonschema and returns a
dict. BaseModel-class-input path validates with model.model_validate()
and returns a BaseModel instance. Either way, JSON parse failure or
schema validation failure raises StructuredOutputInvalid carrying the
schema, raw content, and failure description.

parsed is absent on tool-call responses regardless of whether
response_schema was supplied (mutually exclusive paths). Free-form
calls (response_schema=None) see no behavior change — body omits
response_format, parsed stays None.

The prompt-augmentation fallback path is the next commit.
Adds the prompt-augmentation fallback for OpenAI-compatible servers
that don't implement response_format (older vLLM, some LM Studio
releases, llama.cpp variants).

Constructor:
  force_prompt_augmentation_fallback: bool = False

When True, structured-output calls build the wire body by augmenting
the message list with a system directive that includes the serialized
JSON Schema, and omit response_format entirely. Native path is the
default (False).

Inspect property:
  uses_prompt_augmentation_fallback -> bool

Read-only; lets callers verify which wire path is active without
poking private state.

_augment_messages_with_schema_directive returns a fresh list. When
the first message is system, its content is extended with the schema
directive (preserving caller intent); otherwise a new system message
is prepended. The caller's original messages list is NOT mutated —
Message instances are reused unchanged (immutable Pydantic models).

Response parsing is unchanged from the native path: parse + validate
post-receive raise StructuredOutputInvalid on failure. parsed is
populated identically whether the wire took the native or fallback
route.
…ries

Adds tests/conformance/harness/wire.py with helpers used by structured-
output and content-block fixtures (and any future capability fixtures
that need the same shapes):

- match_wire_body(actual, expected) — recursive deep-equal with "*"
  wildcard support for string slots.
- assert_response_format_absent(body) — asserts the wire body has no
  response_format key.
- assert_system_references_schema(body, schema) — asserts the first
  message in the body is a system message whose content contains the
  canonical-JSON form of the schema as a substring.
- assert_error_carries(exc, carries) — introspects a raised
  exception's attributes against an expected_carries block; supports
  _present / _mentions / literal-equal forms; handles the
  raw_response_content → raw_content fixture-vs-impl naming alias.

Extends test_llm_provider.py to drive these from the existing
fixture loop:

- response_schema is read from call_spec and threaded through
  provider.complete().
- expected_wire_request literal compare + expected_wire_request_checks
  sibling checks fire after each captured chat-completions request.
- caller_messages_unmodified takes a model_dump snapshot pre-call and
  asserts byte-equality post-call.
- expected.response.parsed is compared for equality.
- expected.raises.carries is fed to assert_error_carries.
- retry_middleware: block wraps the call in a default-classifier
  retry simulator (transient = TRANSIENT_CATEGORIES membership); the
  captured-request count provides provider_call_count.
- mock_provider.capabilities.supports_native_response_format: false
  constructs the provider with force_prompt_augmentation_fallback=True.

The 0016 structured-output fixtures (021–028) remain skipped at this
commit. The next commit removes their skip markers.
Removes the deferred-fixture skip markers for the 8 structured-output
conformance fixtures (021–028). All pass against the OpenAIProvider
+ harness extensions landed in earlier commits.

Adds tests/unit/test_structured_output.py covering bits the
conformance fixtures don't exercise directly:

- validate_response_schema edge cases: non-dict, non-object top-level,
  missing type.
- strict_mode_supported: required-coverage rule, additionalProperties
  true, nested-object violation, anyOf branch violation, internal $ref
  resolution, unresolvable $ref, $ref cycle (self-referential schema).
- _derive_schema_name: title-when-present, hash-fallback, determinism,
  empty-title behavior.
- _augment_messages_with_schema_directive: prepend-when-no-system,
  extend-existing-system, caller-list-not-mutated, serialized-schema-
  substring.
- Pydantic-class overload: class-in returns validated BaseModel
  instance; pydantic ValidationError wraps in StructuredOutputInvalid;
  wire body produced from class equals wire body produced from the
  equivalent .model_json_schema() dict.
- uses_prompt_augmentation_fallback inspect property: False by
  default, True when constructor flag is set.
Documents the structured-output surface added in this PR: the
response_schema parameter, Response.parsed field, StructuredOutputInvalid
error category, OpenAIProvider native + fallback wire paths, the
provider-agnostic schema helpers, the capability-agnostic conformance
harness extensions, and the jsonschema runtime dependency.

Also records:
- Spec pin bump 0.10.0 → 0.15.0 (skip-ahead governance) with
  per-proposal deferred-skip in the conformance suite until each PR
  lands.
- Release gate: do not tag the consolidated release until all five
  PRs of the batch (0011, 0014, 0015, 0016, 0017) are merged.
Copilot AI review requested due to automatic review settings May 15, 2026 15:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements structured-output support for the LLM provider surface, including schema-aware completion requests, parsed responses, structured-output validation errors, OpenAI native response_format, prompt-augmentation fallback, and conformance/unit coverage for proposal 0016.

Changes:

  • Adds response_schema handling, Response.parsed, schema validation helpers, and StructuredOutputInvalid.
  • Extends OpenAIProvider with native structured-output requests and fallback prompt augmentation.
  • Adds conformance harness utilities, structured-output unit tests, dependency updates, and changelog/spec-version updates.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
uv.lock Locks jsonschema and transitive dependencies.
tests/unit/test_structured_output.py Adds focused tests for schema validation, strict-mode heuristics, fallback helpers, and Pydantic parsing.
tests/conformance/test_pipeline_utilities.py Defers later proposal fixtures in pipeline conformance.
tests/conformance/test_llm_provider.py Adds structured-output fixture support, retry simulation, wire assertions, and deferred multimodal skips.
tests/conformance/test_fixture_parsing.py Skips parser checks for deferred fixture shapes.
tests/conformance/test_conformance.py Defers graph-engine fixture requiring later proposal support.
tests/conformance/harness/wire.py Adds reusable wire-body and error-carry assertion helpers.
tests/conformance/harness/__init__.py Exports new conformance wire helpers.
src/openarmature/llm/response.py Adds ParsedValue and Response.parsed.
src/openarmature/llm/providers/openai.py Implements structured-output request/response handling and fallback augmentation.
src/openarmature/llm/provider.py Extends provider protocol and adds schema/strict-mode helpers.
src/openarmature/llm/errors.py Adds StructuredOutputInvalid category and exception class.
src/openarmature/llm/__init__.py Exports new structured-output APIs.
pyproject.toml Adds jsonschema dependency and bumps spec version.
CHANGELOG.md Documents structured-output feature and release gate.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/providers/openai.py Outdated
Comment thread src/openarmature/llm/provider.py
Comment thread src/openarmature/llm/providers/openai.py
Comment thread src/openarmature/llm/providers/openai.py Outdated
Comment thread src/openarmature/llm/providers/openai.py Outdated
Comment thread tests/conformance/harness/wire.py Outdated
Comment thread tests/conformance/test_llm_provider.py
Addresses the 8 CoPilot review threads on the structured-output PR:

- strict_mode_supported now requires additionalProperties to be
  EXPLICITLY false (not just missing-or-false). Missing implies the
  JSON Schema default of permitting extras, which OpenAI's strict
  mode rejects. Pydantic's .model_json_schema() omits the key by
  default, so the class-input path would have 400ed against OpenAI
  even with conformance fixtures passing.
- _normalize_response_schema now raises ProviderInvalidRequest when
  the class form is not a BaseModel subclass, instead of letting
  AttributeError leak from model_json_schema.
- validate_response_schema now runs jsonschema.Draft202012Validator
  .check_schema() at the boundary, wrapping SchemaError as
  ProviderInvalidRequest. Malformed schemas now fail at the API
  boundary instead of escaping at decode time.
- _derive_schema_name now regex-checks the title against OpenAI's
  name constraint (^[a-zA-Z0-9_-]{1,64}$) and falls back to the
  hashed name when the title doesn't match. Sanitizing-in-place
  would silently mutate user intent; the hash is a more honest
  fallback.
- Two comments claiming Message instances are immutable Pydantic
  models were updated. The models are not configured with
  frozen=True; the safety actually comes from the helpers not
  modifying them in place.
- match_wire_body now fails on extra keys in actual. The previous
  permissive default defeated the point of expected_wire_request
  being a literal compare; partial assertions continue to live in
  the sibling expected_wire_request_checks block.
- _iter_calls now propagates expected_wire_request,
  expected_wire_request_checks, response_schema, and
  retry_middleware from sibling-of-call into the call dict. Only
  expected was being copied before. Cases-form fixtures with
  case-level wire expectations were silently running without those
  assertions.

The _iter_calls fix surfaced two pre-existing gaps in the harness's
handling of cases-shape fixtures, fixed inline:
- The harness was never wiring config from the call spec into
  provider.complete(); fixture 005's runtime_config_passthrough
  case was effectively a no-op.
- OpenAIProvider was using json.dumps default formatting for
  tool_call.function.arguments (with spaces after colons), which
  doesn't match the canonical compact form OpenAI emits or the
  spec's fixture 005 expectations. Switched to compact form.

New unit tests cover the missing-additionalProperties strict-mode
case, the non-BaseModel class rejection, the malformed JSON Schema
rejection, and the title-falls-back hash cases.
Replaces the no-LLM hello-world in README.md with a version that
makes a real LLM call via OpenAIProvider and uses a Pydantic class
as the response_schema. The resulting Response.parsed flows through
state as a typed Classification instance and drives the conditional
edge that routes between research and summarize.

Defaults to OpenAI public API (gpt-4o-mini) with env-var config:
LLM_BASE_URL, LLM_MODEL, LLM_API_KEY. A trailing line in the README
calls out OpenRouter, vLLM, LM Studio, llama.cpp as drop-in swaps
via base_url/model.

The example also lands as a runnable file at
examples/00-hello-world/main.py and is added to the smoke test
suite. examples/README.md gets a corresponding entry.
Copilot AI review requested due to automatic review settings May 15, 2026 18:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 12 comments.

Comments suppressed due to low confidence (1)

tests/unit/test_structured_output.py:470

  • This test also leaves the provider’s httpx.AsyncClient open. Please close the provider after the assertion (consistent with the other OpenAIProvider tests) so the test suite does not accumulate unclosed clients.
    provider = OpenAIProvider(
        base_url="http://mock-llm.test",
        model="test-model",
        api_key="test-key",
        force_prompt_augmentation_fallback=True,

Comment thread examples/00-hello-world/main.py Outdated
Comment thread README.md Outdated
Comment thread src/openarmature/llm/errors.py
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/providers/openai.py
Comment thread src/openarmature/llm/provider.py
Comment thread tests/unit/test_structured_output.py
Comment thread examples/00-hello-world/main.py
Comment thread README.md
Two bugs surfaced during live validation against OpenAI:

- The default LLM_BASE_URL was https://api.openai.com/v1, but our
  OpenAIProvider's wire path posts to /v1/chat/completions itself.
  httpx URL join produced https://api.openai.com/v1/v1/chat/completions
  → 404. Convention is base_url = host root; impl adds /v1. Default
  now matches; doc-string + README comment make it explicit.
- The observer trace fired on the OpenAIProvider LLM-span event
  (sentinel namespace, post_state=None) and crashed accessing .sources.
  Added a post_state is not None guard.
The hello-world's research and summarize nodes were returning
hard-coded source lists. Replaces both with real provider.complete()
calls that emit typed structured output, so the example demonstrates
the value of a structured-output pipeline end-to-end instead of just
the framework's plumbing.

The example now exercises both response_schema forms in one demo:
- classify and summarize use Pydantic classes (Classification,
  Summary); Response.parsed comes back as a validated instance.
- research uses a raw JSON Schema dict; Response.parsed comes back
  as a plain dict.

State gains two intermediate-artifact fields (research_plan,
summary). Final output prints whichever branch fired, in addition
to the existing sources/metadata. The reducer-policy story stays
intact (last_write_wins on the LLM outputs, append on sources,
merge on metadata).

Live-validated against OpenAI gpt-4o-mini; both branches verified
(structured class instance + structured dict on Response.parsed).
Copilot AI review requested due to automatic review settings May 15, 2026 18:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

tests/unit/test_structured_output.py:472

  • This test also creates an OpenAIProvider without closing it. Please close the provider after the assertion (or make the test async and use await provider.aclose()) to avoid unclosed-client resource warnings.
def test_inspect_property_fallback_when_forced() -> None:
    provider = OpenAIProvider(
        base_url="http://mock-llm.test",
        model="test-model",
        api_key="test-key",
        force_prompt_augmentation_fallback=True,
    )
    assert provider.uses_prompt_augmentation_fallback is True

Comment thread src/openarmature/llm/providers/openai.py
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread examples/00-hello-world/main.py
Comment thread README.md
Comment thread tests/unit/test_structured_output.py Outdated
Comment thread src/openarmature/llm/provider.py Outdated
Adds docs/concepts/llms.md covering how LLM calls fit into the graph
model: LLM calls as async IO inside nodes, structured output (both
response_schema forms + native/fallback wire paths + strict mode),
routing on parsed fields, and errors at the LLM boundary. Nav entry
added to mkdocs.yml's Concepts section; concepts/index.md TOC
extended.

Updates docs/model-providers/index.md: Protocol signature now shows
the response_schema parameter; errors table adds StructuredOutputInvalid;
new Structured output section walks through both response_schema
forms, the native/fallback wire paths, and strict-mode constraints.

Updates docs/model-providers/authoring.md: skeleton's complete()
signature now matches the Protocol (response_schema parameter); a
new "Structured output" entry in Beyond the skeleton points custom-
provider authors at validate_response_schema and strict_mode_supported.

mkdocs builds clean in strict mode; the runnable example in the new
Structured output section is verified by tests/test_docs_examples.py.
The Returns block on Provider.complete started with "A :class:Response
carrying ...", which mkdocstrings' Google-parser misread as a
name-type pair: it pulled out "A" as the Name column entry and split
the multi-line description across three table rows.

Moving the return-value sentence into the prose summary at the top of
the docstring (matching the pattern OpenAIProvider.complete already
uses) renders cleanly: no spurious Name column entry, single
description block.
Copilot AI review requested due to automatic review settings May 15, 2026 19:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/unit/test_structured_output.py:472

  • This test also leaves the OpenAIProvider's underlying httpx.AsyncClient unclosed. Make the test close the provider after the assertion to keep the unit suite free of leaked async-client resources.
def test_inspect_property_fallback_when_forced() -> None:
    provider = OpenAIProvider(
        base_url="http://mock-llm.test",
        model="test-model",
        api_key="test-key",
        force_prompt_augmentation_fallback=True,
    )
    assert provider.uses_prompt_augmentation_fallback is True

Comment thread tests/conformance/test_llm_provider.py
Comment thread docs/model-providers/authoring.md Outdated
Comment thread examples/00-hello-world/main.py
Comment thread tests/unit/test_structured_output.py Outdated
Comment thread docs/concepts/llms.md Outdated
Addresses 19 review threads from the second CoPilot pass; about half
were duplicates of the same underlying issue:

- examples/00-hello-world/main.py + README hello-world: api_key now
  uses `os.environ.get("LLM_API_KEY") or None` so an exported-but-
  empty env var falls through to no-auth (matters for local servers
  that reject an empty bearer header).

- Both examples now close the OpenAIProvider in the finally block
  alongside graph.drain(). Long-running consumers that copy the
  snippet had been leaking the underlying httpx.AsyncClient.

- errors.py header dropped the hard-coded "seven canonical
  categories" count after StructuredOutputInvalid landed.

- strict_mode_supported docstring and the surrounding spec-anchor
  comment block both updated to match the implementation:
  additionalProperties must be EXPLICITLY false (an omitted key
  counts as non-strict, since JSON Schema's default permits extras).

- _resolve_ref now handles ref == "#" as the document root before
  rejecting external refs. Root-recursive schemas that use the bare
  JSON-Pointer-root form now resolve correctly. Unit test added.

- _strict_mode_check tightened to return False on unrecognized
  shapes (empty {}, const-only, enum-only, unknown keywords) instead
  of falling through to True. Primitive types (string/integer/
  number/boolean/null) classified as terminal-strict-compatible.
  Two unit tests added.

- _build_request_body now explicitly strips response_format from the
  body when the provider is in fallback mode. RuntimeConfig is
  extra="allow", so a caller could have piped response_format
  through the extras loop past the include_response_format gate.

- provider.py module docstring's summary signature line updated to
  match the Protocol's response_schema parameter.

- validate_response_schema's spec-anchor comment updated to reflect
  that JSON Schema validity is now checked at the boundary via
  Draft202012Validator.check_schema(), not delegated to parse time.

- test_pydantic_class_wire_body_matches_dict_form: widened the
  assertion from response_format-only to full body equality, so any
  regression in the class-input wire mapping (not just
  response_format) gets caught.

- test_inspect_property_native_default and
  test_inspect_property_fallback_when_forced converted to async
  with try/finally + aclose() to match the rest of the file's
  provider-lifecycle pattern.
Addresses 5 remaining review threads (3 substantive, 2 stale on
already-fixed code):

- LlmProviderResponseAssertion (the typed assertion model in
  harness/expectations.py) now lists `parsed: Any | None`. The
  runtime assertion in test_llm_provider.py already handled it, but
  the typed parser had it under extra="forbid" and would have
  rejected any future case-shape LLM fixture using `parsed`. The
  021-028 fixtures slip past today on `calls:` form's permissive
  `LlmCallSpec.expected: dict[str, Any]`; this lines the two paths
  up.

- docs/model-providers/authoring.md skeleton comment tightened:
  removed the "ignore it and return free-form text" option from
  the response_schema guidance. A provider that silently drops the
  parameter violates the Protocol contract; callers expect either
  Response.parsed populated or StructuredOutputInvalid raised. Now
  only two valid options surfaced: raise ProviderInvalidRequest
  until implemented, or wire it through.

- docs/concepts/llms.md softened the static-typing claim in the
  Pydantic-class form section. Response.parsed is
  `dict[str, Any] | BaseModel | None`, so a type checker won't
  narrow from `response_schema=Classification` alone. The page now
  separates the runtime guarantee (validated instance) from static
  access (requires cast/isinstance/typed assignment); generic
  Response[T] flagged as a follow-up.

The two stale threads (examples/00-hello-world/main.py provider
cleanup, test_structured_output.py provider cleanup) were already
fixed in commit 8ed334c; replies sent + threads resolved without
code changes.
Copilot AI review requested due to automatic review settings May 15, 2026 20:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 26 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (2)

tests/unit/test_structured_output.py:150

  • The root object is missing additionalProperties: false, so this assertion would still pass even if the anyOf walker were removed or broken. Make the root object strict-compatible and keep the failing condition only inside the anyOf branch to ensure this test covers the intended combinator behavior.
    schema = {
        "type": "object",
        "properties": {
            "x": {
                "anyOf": [
                    {"type": "string"},
                    {"type": "object", "properties": {"y": {"type": "string"}}},  # no required
                ]
            },
        },
        "required": ["x"],
    }

tests/unit/test_structured_output.py:177

  • This test is meant to prove external $ref targets make strict mode unsupported, but the root schema already fails because it lacks additionalProperties: false. Add the root strict fields here so the only failing condition is the unresolvable reference.
    schema = {
        "type": "object",
        "properties": {"x": {"$ref": "https://example.com/external-schema.json"}},
        "required": ["x"],
    }

Comment thread src/openarmature/llm/providers/openai.py Outdated
Comment thread docs/model-providers/authoring.md
Comment thread docs/concepts/llms.md
Comment thread src/openarmature/llm/provider.py
Comment thread pyproject.toml
Comment thread tests/unit/test_structured_output.py
Addresses 6 review threads, several of which surfaced second-order
issues from previous rounds:

- openai.py complete(): the fallback flag was driving
  include_response_format=False for every call, including free-form
  ones. That triggered the response_format strip on calls that
  weren't structured-output at all, clobbering caller-supplied
  RuntimeConfig extras. Gating the flag on schema_dict being set so
  free-form calls preserve extras. Unit test added.

- src/openarmature/__init__.py + tests/test_smoke.py: bumped
  __spec_version__ from "0.10.0" to "0.15.0" to match the
  pyproject.toml [tool.openarmature].spec_version bump. AGENTS.md
  flags these three values as required to stay in sync; the
  submodule-bump commit missed the runtime sources.

- _strict_mode_check array branch: {"type": "array"} without
  `items` no longer returns True. Unconstrained array content is
  the array analog of an object with no additionalProperties: false:
  the walker can't statically verify nested shapes, so strict mode
  rejects. Unit test added.

- docs/model-providers/authoring.md: skeleton's complete() now
  actually enforces what its comment promised. Added
  `if response_schema is not None: raise ProviderInvalidRequest`
  to the body and surfaced the exception in the import list, so a
  provider copied from the skeleton can't silently violate the
  Protocol contract.

- docs/concepts/llms.md Pydantic-class snippet: added
  `from typing import Literal` so the example is copy-paste-
  runnable (the snippet uses Literal in the class but only imported
  BaseModel).

- tests/unit/test_structured_output.py nested-recursion tests:
  test_strict_mode_recurses_into_nested_object and
  test_strict_mode_anyof_branch_must_satisfy were short-circuiting
  at the root because the root schema itself failed strict rules.
  Tightened both root schemas so the recursive walk actually fires;
  the tests now guard the recursion they claim to.
Captures two follow-ups surfaced by the four CoPilot review rounds:

- docs/concepts/llms.md "Strict mode" section expanded into the
  full constraint list. After four rounds of tightening the
  strict_mode_supported heuristic, the rule set is stable and the
  user-facing surface should list it directly rather than make
  callers read provider.py. The page frames the list as the
  authoritative set: anything not on it trips to non-strict.

- docs/model-providers/index.md "Strict mode" subsection trimmed
  and now links into the concepts page for the full list,
  following the established split (concepts/ owns the deep-dive,
  model-providers/ stays terse).

- tests/test_smoke.py adds test_spec_version_matches_pyproject:
  reads pyproject.toml's [tool.openarmature].spec_version and
  asserts it equals openarmature.__spec_version__. AGENTS.md
  flags these as required to stay in sync; the previous smoke
  test only checked internal consistency between __spec_version__
  and its asserted value, so the pyproject side could drift
  silently (and did, in the original submodule-bump commit).
Copilot AI review requested due to automatic review settings May 15, 2026 21:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.

Comment thread tests/test_examples_smoke.py
Comment thread src/openarmature/llm/providers/openai.py
Comment thread tests/unit/test_structured_output.py
Comment thread tests/test_smoke.py Outdated
Addresses 4 review threads:

- examples/00-hello-world/main.py: provider construction moved
  from module level to a lazy _get_provider() helper backed by a
  module global. Avoids opening an httpx.AsyncClient when tooling
  imports the module without running main() — the smoke test now
  doesn't trigger construction across 6 example loads. main()'s
  finally only closes when the cached instance is set.

- src/openarmature/llm/provider.py: validate_response_schema now
  walks all $ref values via _check_refs_resolvable and raises
  ProviderInvalidRequest for any non-internal-resolvable ref.
  Draft202012Validator.check_schema doesn't traverse refs, so
  previously an external ref slipped past the boundary and
  surfaced as a raw referencing-library exception at validate
  time. Pre-validation surfaces the clean category at the API
  boundary.

- src/openarmature/llm/providers/openai.py: _parse_and_validate
  now also catches jsonschema.SchemaError and maps it to
  StructuredOutputInvalid. Safety net for any schema-side
  exception (including ref-resolution failures) that pre-
  validation might miss.

- tests/unit/test_structured_output.py:
  - test_strict_mode_unresolvable_ref_fails: root tightened with
    additionalProperties: false so the walk reaches the $ref
    branch (was short-circuiting at the root).
  - Added test_validate_response_schema_rejects_external_ref
    covering the new pre-validation path.

- tests/test_smoke.py: added test_spec_version_matches_submodule_pin
  shelling to `git -C openarmature-spec describe --tags
  --exact-match HEAD` and asserting it equals
  v{__spec_version__}. Skips cleanly when the submodule isn't a
  git checkout (installed-package CI lanes). Completes the
  three-place drift check from AGENTS.md
  (__spec_version__ ↔ pyproject ↔ submodule pin).
Comment thread tests/test_smoke.py Fixed
The git-describe-based submodule check from the previous commit
passed locally but failed in CI because actions/checkout pins the
submodule to its recorded SHA without fetching the spec repo's
tags. `git describe --tags --exact-match` then finds nothing and
the test fails with "submodule HEAD is not at any tag."

Switching to parsing openarmature-spec/CHANGELOG.md: the spec
follows Keep a Changelog, so the first non-[Unreleased]
`## [X.Y.Z]` heading is the version at the pinned commit. This
works regardless of CI tag-fetch state and catches the same drift
class (submodule moved to a different release).

Skips cleanly when CHANGELOG.md isn't present (installed-package
lanes that don't ship the submodule checkout).
Copilot AI review requested due to automatic review settings May 15, 2026 22:18
Comment thread tests/test_smoke.py Fixed
CodeQL flagged the for/else: pytest.fail() pattern as a
potentially-uninitialized-local-variable warning because it
doesn't model pytest.fail as NoReturn — the analyzer sees a path
where submodule_latest is referenced after the loop without ever
being bound.

Pulling the parse into _read_latest_spec_version_from_changelog
that explicitly returns the version or raises AssertionError.
Eliminates the unreachable-after-fail pattern and reads cleaner.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.

Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread src/openarmature/llm/provider.py Outdated
Comment thread docs/concepts/llms.md Outdated
Comment thread src/openarmature/llm/providers/openai.py
Six second-order correctness fixes surfaced by the round-7 review,
mostly hardening _resolve_ref, _check_refs_resolvable, and the
Pydantic-class validation path.

- _resolve_ref now distinguishes "unresolvable" (path doesn't exist
  / external ref) from "resolved to non-dict" via a module-level
  _UNRESOLVABLE sentinel. Boolean schemas (true/false) are valid
  JSON Schema subschemas; a $ref to one was being incorrectly
  rejected as ProviderInvalidRequest. Now resolves cleanly and
  strict-mode still returns False on bool targets (the correct
  conservative answer).

- validate_response_schema's metaschema check now uses
  jsonschema.validators.validator_for(schema) instead of the
  hard-coded Draft 2020-12. A valid draft-07 schema (e.g. tuple-
  form items, common in tooling) was being rejected at the
  boundary but accepted at runtime. Boundary and runtime now agree.

- _resolve_ref percent-decodes JSON Pointer tokens before applying
  the ~1 / ~0 unescape pair. Per RFC 6901 §6, a JSON Pointer in a
  URI fragment is percent-encoded; refs like
  #/$defs/Name%20With%20Spaces now resolve correctly.

- _check_refs_resolvable now walks only known subschema-bearing
  keywords (properties, patternProperties, additionalProperties,
  items, prefixItems, contains, if/then/else, allOf/anyOf/oneOf/not,
  $defs/definitions, dependentSchemas, propertyNames,
  unevaluatedItems, unevaluatedProperties). A "$ref" key under data
  positions (default, const, enum, $comment, x-* extensions) is
  data, not a schema reference, and is no longer incorrectly
  resolved.

- docs/concepts/llms.md "LLM calls are async IO inside a node"
  section reframed: module-level provider construction leaks the
  httpx.AsyncClient in tooling/test/docs-build imports. The page
  now documents application-startup / lifecycle-managed
  construction (lazy on-first-use plus aclose in finally / shutdown
  hook), matching the pattern the hello-world example was made
  lazy for.

- _parse_and_validate's Pydantic-class path now runs
  jsonschema.validate against the generated JSON Schema BEFORE
  calling model_validate. Pydantic's default model_validate is
  coercive (accepts "30" for an int field), which diverged from
  the strict dict-schema path. Both paths now apply the same
  jsonschema check first; model_validate then constructs the
  typed instance.

- jsonschema.ValidationError's failure description now includes
  exc.json_path (e.g. "$.age: '30' is not of type 'integer'"). The
  bare exc.message lost the field name, breaking caller diagnostics
  for the missing-field / wrong-type-at-path cases.

Five new unit tests cover the bool-ref, draft-07, percent-encoded
ref, ref-under-data, and Pydantic-coercion-rejection cases.
@chris-colinsky chris-colinsky merged commit 2ecb7b1 into main May 15, 2026
6 checks passed
@chris-colinsky chris-colinsky deleted the feature/0016-structured-output branch May 15, 2026 22:44
chris-colinsky added a commit that referenced this pull request May 17, 2026
Consolidated release for the five-PR batch:

- Structured output (proposal 0016, PR #42)
- Image content blocks (proposal 0015, PR #44)
- Prompt management (proposal 0017, PR #45)
- State migration for checkpoints (proposal 0014, PR #46)
- Parallel branches (proposal 0011, PR #47)

Bumps:

- ``pyproject.toml`` project.version: 0.5.0 → 0.6.0
- ``__version__`` in src/openarmature/__init__.py
- ``uv.lock`` editable package version
- ``tests/test_smoke.py`` version assertion

Flips CHANGELOG ``[Unreleased]`` to ``[0.6.0] — 2026-05-16``, drops
the release-gate Notes entry, and tightens the pre-1.0 MINOR note to
list the two behavioral changes (retry-MW attempt-index propagation,
CheckpointRecord.schema_version semantic shift) instead of the
structured-output-specific note carried over from PR-1.

Pinned spec stays at v0.16.1 (set in PR #47).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants