
fix(memory): self-correct when LLM returns non-JSON during memory extraction (#1541) #1833

Open

voidborne-d wants to merge 1 commit into volcengine:main from voidborne-d:fix/extract-loop-non-json-recovery

Conversation

@voidborne-d

Summary

Fixes #1541: `memcommit` extracting 0 memories because the LLM returned plain conversational text instead of the schema-required JSON, after which every retry exhausted `max_iterations` repeating the same drift.

The flow that breaks today:

  1. _call_llm calls the model with tool_choice="auto". Weak models (the reporter's LongCat-Flash-Thinking, but I've seen the same with smaller open models) sometimes ignore the system-prompt schema and respond with prose.
  2. parse_json_with_stability returns (None, "Expected dict after parsing, got <class 'str'>") — the exact error from the bug report.
  3. The current code logs a warning and returns (None, None) without appending the failed content to messages.
  4. Iteration N+1 sees the exact same prompt and produces the same plain-text drift. After max_iterations, extraction completes with 0 memories.

What this PR changes

  • Persist failure context. When parse fails, append the failed assistant content + a corrective user message that re-states the JSON schema and forbids prose. The next iteration now has concrete feedback to recover from instead of silently re-running the same prompt. (extract_loop.py:438-448)
  • Same recovery on URI validation errors. _validate_operations(...) raising ValueError (operations parsed cleanly but pointed at disallowed URIs) now goes through the same correction path so the model can fix the URIs in iteration N+1. (extract_loop.py:444-448)
  • Schema-anchored last-iteration prompt. The "max iterations reached" message now embeds the JSON schema directly and explicitly forbids prose / markdown fences / explanation, so even when no prior failures were appended the final attempt has a strong, schema-grounded instruction. (extract_loop.py:191-203)
  • Truncation safety. _MAX_FAILED_CONTENT_CHARS = 1500 caps how much of the failed response we copy into the next prompt — models occasionally produce multi-KB ramblings.
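
As a minimal sketch of how the correction helper described above could append failure context (the helper name and signature come from the PR; the message-dict shape, the wrapper class, and the `schema_str` attribute are assumptions for illustration, so the merged extract_loop.py code may differ):

```python
_MAX_FAILED_CONTENT_CHARS = 1500  # cap on how much failed output is copied into the next prompt


class _CorrectionSketch:
    """Illustration only; stands in for the real extractor class."""

    def __init__(self, schema_str: str):
        self.schema_str = schema_str  # JSON schema text, assumed available on the extractor

    def _append_invalid_response_correction(self, messages, content, error):
        # Keep the (truncated) failed assistant output so the model sees what it emitted.
        failed = (content or "")[:_MAX_FAILED_CONTENT_CHARS]
        messages.append({"role": "assistant", "content": failed})
        # Follow with a corrective user message that restates the schema and forbids prose.
        messages.append({
            "role": "user",
            "content": (
                f"Your previous response could not be parsed: {error}\n"
                "Return ONLY a JSON object matching this schema, with no prose, "
                "no markdown fences, and no explanation:\n"
                f"{self.schema_str}"
            ),
        })
```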

This addresses solution #3 from the issue ("validation loop that asks LLM to retry if non-JSON response is detected") without invasive changes to the VLM backends. Solution #1 (response_format) and solution #2 (force tool calling) would each touch every backend; persistence of failure context is a smaller, surgical fix that works regardless of provider.

Tests

New tests/session/memory/test_memory_react_invalid_response.py (8 tests, all green):

  • 7 unit tests on _append_invalid_response_correction: 2-message append shape, schema embedded in correction, prose/markdown forbidden, oversized-content truncation, short-content untouched, empty/None content handled, embedded schema round-trips through json.loads.
  • 1 integration test on _call_llm: a mock VLM returning plain prose triggers the failure-persistence path, leaving messages with the failed assistant content + corrective user message ready for iteration N+1.
```
$ PYTHONPATH=. pytest tests/session/memory/test_memory_react_invalid_response.py --noconftest --no-cov -q
8 passed, 4 warnings in 1.96s
```
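
To give a flavour of the unit tests, a hypothetical example in the same style, written against the sketch helper above rather than the real extractor class (which the actual tests exercise):

```python
def test_correction_appends_two_messages_and_truncates():
    helper = _CorrectionSketch(schema_str='{"type": "object"}')  # sketch class from above
    messages = []
    long_prose = "x" * 5000  # well beyond _MAX_FAILED_CONTENT_CHARS

    helper._append_invalid_response_correction(
        messages, long_prose, "Expected dict after parsing, got <class 'str'>"
    )

    assert len(messages) == 2                    # failed assistant + corrective user message
    assert messages[0]["role"] == "assistant"
    assert len(messages[0]["content"]) == 1500   # oversized content is truncated
    assert messages[1]["role"] == "user"
    assert "no prose" in messages[1]["content"]  # prose explicitly forbidden
```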

ruff check and ruff format --check are clean on the changed files.

Out of scope

Backwards compatibility

  • Behaviour unchanged when the LLM returns valid JSON on the first try (which is the common path).
  • The new messages entries appear only on the recovery path, so token usage is only affected for runs that were already failing.

…correct (volcengine#1541)

When the LLM returns plain prose instead of the operations schema (a
known failure mode for weaker models when `tool_choice="auto"` lets the
model pick text generation), `_call_llm` logged the warning and returned
`(None, None)` without appending the failed response to `messages`.

The next iteration then saw the *exact same* prompt and repeated the
same drift; after `max_iterations` the run ended with "Extracted 0
memories" — the symptom reported in volcengine#1541.

This change persists the failed assistant content + a corrective user
message that re-states the schema and forbids prose, so the next call
gives the model concrete context to recover from. The same path now
also catches `_validate_operations` ValueErrors (operations parsed but
URIs disallowed) so iteration N+1 can correct invalid URIs instead of
silently dropping the response.

The force-final-iteration prompt is also reworded to embed the JSON
schema directly, so the last attempt has a strong, schema-anchored
instruction even when no prior failures were appended.
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

1541 - Fully compliant

Compliant requirements:

  • Fix memory extraction failure when LLM returns plain text instead of JSON
  • Handle case where parse_json_with_stability() fails to get a dict
  • Prevent retry loop from exhausting iterations with same failure
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Debug print statements

Uses print() statements instead of structured logging (loguru), which is inconsistent with the rest of the codebase and may leave debug output in production.

            print(f"content={content}")
            logger.warning(f"Failed to parse memory operations: {error}")
            # Persist the failed assistant content + a corrective user message
            # so the next iteration sees the failure and can self-correct
            # instead of repeating the same plain-text drift (fixes #1541).
            self._append_invalid_response_correction(messages, content, error)
            return (None, None)

        # Validate that all URIs are allowed
        try:
            self._validate_operations(operations)
        except ValueError as ve:
            self._append_invalid_response_correction(messages, content, str(ve))
            return (None, None)
        return (None, operations)
    except Exception as e:
        logger.exception(f"Error parsing operations: {e}")

# Case 3: No tool calls and no parsable operations
print("No tool calls or operations parsed")

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@chenjw chenjw self-requested a review May 6, 2026 12:55
@chenjw
Collaborator

chenjw commented May 7, 2026

Thanks for your contribution. I have a question: schema_str is already in the context and it’s very large. Is it necessary to write schema_str into the context again when retrying?

@voidborne-d
Author

Good question — short answer: it's defensive duplication for a recency reason, but I agree it's worth making conditional. Here's the rationale and a proposed tweak.

Why the corrective message restates the schema instead of just referencing it

The schema lives in the system message at index 0 (extract_loop.py:162-172), immediately followed by _mark_cache_breakpoint. After the first prefetch() plus N tool-call rounds, that schema text can be hundreds of messages away from the LLM's attention window tail. That's also the exact reason the bug exists: the model already had the schema in context when it drifted to prose. Just having the schema in context wasn't enough — the failure happened despite it.

The corrective injection is co-located with three things the model has to combine to recover: (a) its own bad output it just emitted, (b) the parse error, (c) the shape it should have produced. Splitting these across long context relies on the model to retrieve and re-anchor on the system prompt schema — which is precisely what failed. Inlining keeps "what went wrong" and "what right looks like" adjacent.

The cost concern is fair though

For typical extraction schemas (a few hundred bytes) the re-injection is small relative to the conversation tail. For deeply nested enterprise schemas (10KB+ with descriptions / examples / $defs), three re-injections inside a 5-iteration loop add real bloat — and worse, break the cache after the breakpoint.

Proposal: schema-size threshold + reference fallback

````python
_MAX_INLINE_SCHEMA_CHARS = 4000  # configurable class attribute

def _build_correction_body(self, error, schema_str):
    if len(schema_str) <= self._MAX_INLINE_SCHEMA_CHARS:
        return f"...could not be parsed: {error}\nReturn ONLY a JSON object matching:\n```json\n{schema_str}\n```"
    return f"...could not be parsed: {error}\nReturn ONLY a JSON object matching the schema in the system prompt — no prose, no markdown fences."
````

Same logic for the last-iteration message. Defaults to inlining for small/medium schemas (where cost is negligible and recency benefit is highest) and degrades to a reference for very large schemas.

I can push this as a follow-up commit if that direction works for you. Happy to also expose _MAX_INLINE_SCHEMA_CHARS as a class attribute so downstream users with unusual schema sizes can tune it.

