fix(memory): self-correct when LLM returns non-JSON during memory extraction (#1541) #1833
voidborne-d wants to merge 1 commit into volcengine:main
Conversation
…correct (volcengine#1541)

When the LLM returns plain prose instead of the operations schema (a known failure mode for weaker models when `tool_choice="auto"` lets the model pick text generation), `_call_llm` logged the warning and returned `(None, None)` without appending the failed response to `messages`. The next iteration then saw the *exact same* prompt and repeated the same drift; after `max_iterations` the run ended with "Extracted 0 memories", the symptom reported in volcengine#1541.

This change persists the failed assistant content plus a corrective user message that re-states the schema and forbids prose, so the next call gives the model concrete context to recover from. The same path now also catches `_validate_operations` ValueErrors (operations parsed but URIs disallowed), so iteration N+1 can correct invalid URIs instead of silently dropping the response.

The force-final-iteration prompt is also reworded to embed the JSON schema directly, so the last attempt has a strong, schema-anchored instruction even when no prior failures were appended.
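A minimal sketch of where that correction step lands, for orientation only. It is not the actual patch: `self._chat` and `self.schema_str` are placeholder names, `_append_invalid_response_correction` is shown as a module-level helper for brevity (a fuller sketch of it appears further down), while `_call_llm`, `parse_json_with_stability`, and `_validate_operations` are the names this PR uses.

```python
# Illustrative fragment of the extraction loop class, not the real diff.
def _call_llm(self, messages: list[dict]):
    content = self._chat(messages, tool_choice="auto")       # placeholder backend call
    operations, error = parse_json_with_stability(content)   # existing project utility
    if operations is None:
        # NEW: persist the failed turn + a corrective user message so the
        # next iteration does not replay the exact same prompt.
        _append_invalid_response_correction(messages, content, error, self.schema_str)
        return None, None
    try:
        self._validate_operations(operations)
    except ValueError as exc:
        # NEW: operations parsed but used disallowed URIs -> same correction path.
        _append_invalid_response_correction(messages, content, str(exc), self.schema_str)
        return None, None
    return operations, content
```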
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:
PR Code Suggestions ✨

No code suggestions found for the PR.
Thanks for your contribution. I have a question: `schema_str` is already in the context and it's very large. Is it necessary to write `schema_str` into the context again when retrying?
Good question. Short answer: it's defensive duplication for a recency reason, but I agree it's worth making conditional. Here's the rationale and a proposed tweak.

Why the corrective message restates the schema instead of just referencing it

The schema lives in the system message at index 0. The corrective injection is co-located with three things the model has to combine to recover: (a) its own bad output it just emitted, (b) the parse error, (c) the shape it should have produced. Splitting these across a long context relies on the model to retrieve and re-anchor on the system-prompt schema, which is precisely what failed. Inlining keeps "what went wrong" and "what right looks like" adjacent.

The cost concern is fair, though

For typical extraction schemas (a few hundred bytes) the re-injection is small relative to the conversation tail. For deeply nested enterprise schemas (10KB+ with descriptions / examples / …) the duplication is no longer negligible.

Proposal: schema-size threshold + reference fallback

````python
_MAX_INLINE_SCHEMA_CHARS = 4000  # class attribute, configurable

def _build_correction_body(self, error, schema_str):
    if len(schema_str) <= self._MAX_INLINE_SCHEMA_CHARS:
        # Small/medium schema: inline it next to the error for recency.
        return (
            f"...could not be parsed: {error}\n"
            f"Return ONLY a JSON object matching:\n```json\n{schema_str}\n```"
        )
    # Very large schema: point back to the system prompt instead of duplicating it.
    return (
        f"...could not be parsed: {error}\n"
        "Return ONLY a JSON object matching the schema in the system prompt "
        "— no prose, no markdown fences."
    )
````

Same logic for the last-iteration message. Defaults to inlining for small/medium schemas (where cost is negligible and recency benefit is highest) and degrades to a reference for very large schemas. I can push this as a follow-up commit if that direction works for you. Happy to also expose …
Summary
Fixes #1541: `mem commit` extracting 0 memories because the LLM returned plain conversational text instead of the schema-required JSON, then every retry exhausted `max_iterations` repeating the same drift.

The flow that breaks today:
1. `_call_llm` calls the model with `tool_choice="auto"`. Weak models (the reporter's `LongCat-Flash-Thinking`, but I've seen the same with smaller open models) sometimes ignore the system-prompt schema and respond with prose.
2. `parse_json_with_stability` returns `(None, "Expected dict after parsing, got <class 'str'>")`, the exact error from the bug report.
3. `_call_llm` logs the warning and returns `(None, None)` without appending the failed content to `messages`.
4. The next iteration sees the exact same prompt and repeats the drift; after `max_iterations`, extraction completes with 0 memories.

What this PR changes
- On parse failure, the failed assistant content plus a corrective user message re-stating the schema (and forbidding prose) is appended to `messages` before the next iteration (see the sketch after this list). (`extract_loop.py:438-448`)
- `_validate_operations(...)` raising `ValueError` (operations parsed cleanly but pointed at disallowed URIs) now goes through the same correction path so the model can fix the URIs in iteration N+1. (`extract_loop.py:444-448`)
- The force-final-iteration prompt now embeds the JSON schema directly, so the last attempt has a schema-anchored instruction even when no prior failures were appended. (`extract_loop.py:191-203`)
- `_MAX_FAILED_CONTENT_CHARS = 1500` caps how much of the failed response we copy into the next prompt; models occasionally produce multi-KB ramblings.

This addresses solution #3 from the issue ("validation loop that asks LLM to retry if non-JSON response is detected") without invasive changes to the VLM backends. Solution #1 (`response_format`) and solution #2 (force tool calling) would each touch every backend; persistence of failure context is a smaller, surgical fix that works regardless of provider.
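For reviewers, a hedged sketch of the correction helper (the real implementation lives in `extract_loop.py` and may be a method rather than a module-level function; the prompt wording below is illustrative):

```python
_MAX_FAILED_CONTENT_CHARS = 1500  # cap from the PR description

def _append_invalid_response_correction(
    messages: list[dict], failed_content: str | None, error: str, schema_str: str
) -> None:
    """Persist the failed assistant turn plus a corrective user turn."""
    # Cap multi-KB ramblings so the carried-forward failure stays cheap.
    failed = (failed_content or "")[:_MAX_FAILED_CONTENT_CHARS]
    # 1) the model's own bad output, so it has concrete context to correct...
    messages.append({"role": "assistant", "content": failed})
    # 2) ...followed by a corrective user turn that restates the contract.
    messages.append({
        "role": "user",
        "content": (
            f"Your previous response could not be parsed: {error}\n"
            "Return ONLY a JSON object matching this schema, with no prose and "
            f"no markdown fences:\n{schema_str}"
        ),
    })
```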
Tests

New `tests/session/memory/test_memory_react_invalid_response.py` (8 tests, all green):

- `_append_invalid_response_correction`: 2-message append shape, schema embedded in the correction, prose/markdown forbidden, oversized-content truncation (sketched below), short content left untouched, empty/`None` content handled, embedded schema round-trips through `json.loads`.
- `_call_llm`: a mock VLM returning plain prose triggers the failure-persistence path, leaving `messages` with the failed assistant content + corrective user message ready for iteration N+1.
- `ruff check` and `ruff format --check` are clean on the changed files.
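A hedged sketch of the truncation case, assuming the module-level helper shape sketched above (the real test file will differ in setup and naming):

```python
def test_oversized_failed_content_is_truncated():
    # Illustrative only: setup and names are assumptions, not the PR's tests.
    messages = [{"role": "system", "content": "...schema lives here..."}]
    huge = "x" * 10_000  # simulate a multi-KB rambling
    _append_invalid_response_correction(
        messages, huge, "Expected dict after parsing, got <class 'str'>", "{}"
    )
    assistant_turn, user_turn = messages[-2], messages[-1]
    assert assistant_turn["role"] == "assistant"
    assert len(assistant_turn["content"]) <= 1500  # _MAX_FAILED_CONTENT_CHARS
    assert user_turn["role"] == "user"
```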
Out of scope

- Adding `response_format` support to VLM backends (issue solution #1): separate PR if maintainers want it.
- Making `tool_choice="required"` viable (issue solution #2): bigger refactor; happy to follow up if you'd prefer that direction.

Backwards compatibility
The new `messages` entries appear only on the recovery path, so token usage is only affected for runs that were already failing.