
fix(memory): self-correct when LLM returns non-JSON during memory extraction (#1541) #1833

Open

voidborne-d wants to merge 1 commit into volcengine:main from voidborne-d:fix/extract-loop-non-json-recovery

Conversation

@voidborne-d

Summary

Fixes #1541: `memcommit` extracting 0 memories because the LLM returned plain conversational text instead of the schema-required JSON, after which every retry exhausted `max_iterations` repeating the same drift.

The flow that breaks today:

  1. _call_llm calls the model with tool_choice="auto". Weak models (the reporter's LongCat-Flash-Thinking, but I've seen the same with smaller open models) sometimes ignore the system-prompt schema and respond with prose.
  2. parse_json_with_stability returns (None, "Expected dict after parsing, got <class 'str'>") — the exact error from the bug report.
  3. The current code logs a warning and returns (None, None) without appending the failed content to messages.
  4. Iteration N+1 sees the exact same prompt and produces the same plain-text drift. After max_iterations, extraction completes with 0 memories.

What this PR changes

  • Persist failure context. When parse fails, append the failed assistant content + a corrective user message that re-states the JSON schema and forbids prose. The next iteration now has concrete feedback to recover from instead of silently re-running the same prompt. (extract_loop.py:438-448)
  • Same recovery on URI validation errors. _validate_operations(...) raising ValueError (operations parsed cleanly but pointed at disallowed URIs) now goes through the same correction path so the model can fix the URIs in iteration N+1. (extract_loop.py:444-448)
  • Schema-anchored last-iteration prompt. The "max iterations reached" message now embeds the JSON schema directly and explicitly forbids prose / markdown fences / explanation, so even when no prior failures were appended the final attempt has a strong, schema-grounded instruction. (extract_loop.py:191-203)
  • Truncation safety. _MAX_FAILED_CONTENT_CHARS = 1500 caps how much of the failed response we copy into the next prompt — models occasionally produce multi-KB ramblings.
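
As a minimal sketch of how the correction helper described above could append failure context (the helper name and signature come from the PR; the message-dict shape, the wrapper class, and the `schema_str` attribute are assumptions for illustration, so the merged extract_loop.py code may differ):

```python
_MAX_FAILED_CONTENT_CHARS = 1500  # cap on how much failed output is copied into the next prompt


class _CorrectionSketch:
    """Illustration only; stands in for the real extractor class."""

    def __init__(self, schema_str: str):
        self.schema_str = schema_str  # JSON schema text, assumed available on the extractor

    def _append_invalid_response_correction(self, messages, content, error):
        # Keep the (truncated) failed assistant output so the model sees what it emitted.
        failed = (content or "")[:_MAX_FAILED_CONTENT_CHARS]
        messages.append({"role": "assistant", "content": failed})
        # Follow with a corrective user message that restates the schema and forbids prose.
        messages.append({
            "role": "user",
            "content": (
                f"Your previous response could not be parsed: {error}\n"
                "Return ONLY a JSON object matching this schema, with no prose, "
                "no markdown fences, and no explanation:\n"
                f"{self.schema_str}"
            ),
        })
```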

This addresses solution #3 from the issue ("validation loop that asks LLM to retry if non-JSON response is detected") without invasive changes to the VLM backends. Solution #1 (response_format) and solution #2 (force tool calling) would each touch every backend; persistence of failure context is a smaller, surgical fix that works regardless of provider.

Tests

New tests/session/memory/test_memory_react_invalid_response.py (8 tests, all green):

  • 7 unit tests on _append_invalid_response_correction: 2-message append shape, schema embedded in correction, prose/markdown forbidden, oversized-content truncation, short-content untouched, empty/None content handled, embedded schema round-trips through json.loads.
  • 1 integration test on _call_llm: a mock VLM returning plain prose triggers the failure-persistence path, leaving messages with the failed assistant content + corrective user message ready for iteration N+1.
```
$ PYTHONPATH=. pytest tests/session/memory/test_memory_react_invalid_response.py --noconftest --no-cov -q
8 passed, 4 warnings in 1.96s
```
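
To give a flavour of the unit tests, a hypothetical example in the same style, written against the sketch helper above rather than the real extractor class (which the actual tests exercise):

```python
def test_correction_appends_two_messages_and_truncates():
    helper = _CorrectionSketch(schema_str='{"type": "object"}')  # sketch class from above
    messages = []
    long_prose = "x" * 5000  # well beyond _MAX_FAILED_CONTENT_CHARS

    helper._append_invalid_response_correction(
        messages, long_prose, "Expected dict after parsing, got <class 'str'>"
    )

    assert len(messages) == 2                    # failed assistant + corrective user message
    assert messages[0]["role"] == "assistant"
    assert len(messages[0]["content"]) == 1500   # oversized content is truncated
    assert messages[1]["role"] == "user"
    assert "no prose" in messages[1]["content"]  # prose explicitly forbidden
```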

ruff check and ruff format --check are clean on the changed files.

Out of scope

Backwards compatibility

  • Behaviour unchanged when the LLM returns valid JSON on the first try (which is the common path).
  • The new messages entries appear only on the recovery path, so token usage is only affected for runs that were already failing.

…correct (volcengine#1541)

When the LLM returns plain prose instead of the operations schema (a
known failure mode for weaker models when `tool_choice="auto"` lets the
model pick text generation), `_call_llm` logged the warning and returned
`(None, None)` without appending the failed response to `messages`.

The next iteration then saw the *exact same* prompt and repeated the
same drift; after `max_iterations` the run ended with "Extracted 0
memories" — the symptom reported in volcengine#1541.

This change persists the failed assistant content + a corrective user
message that re-states the schema and forbids prose, so the next call
gives the model concrete context to recover from. The same path now
also catches `_validate_operations` ValueErrors (operations parsed but
URIs disallowed) so iteration N+1 can correct invalid URIs instead of
silently dropping the response.

The force-final-iteration prompt is also reworded to embed the JSON
schema directly, so the last attempt has a strong, schema-anchored
instruction even when no prior failures were appended.
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

1541 - Fully compliant

Compliant requirements:

  • Fix memory extraction failure when LLM returns plain text instead of JSON
  • Handle case where parse_json_with_stability() fails to get a dict
  • Prevent retry loop from exhausting iterations with same failure
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Debug print statements

Uses print() statements instead of structured logging (loguru), which is inconsistent with the rest of the codebase and may leave debug output in production.

            print(f"content={content}")
            logger.warning(f"Failed to parse memory operations: {error}")
            # Persist the failed assistant content + a corrective user message
            # so the next iteration sees the failure and can self-correct
            # instead of repeating the same plain-text drift (fixes #1541).
            self._append_invalid_response_correction(messages, content, error)
            return (None, None)

        # Validate that all URIs are allowed
        try:
            self._validate_operations(operations)
        except ValueError as ve:
            self._append_invalid_response_correction(messages, content, str(ve))
            return (None, None)
        return (None, operations)
    except Exception as e:
        logger.exception(f"Error parsing operations: {e}")

# Case 3: No tool calls and no parsable operations
print("No tool calls or operations parsed")

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@chenjw chenjw self-requested a review May 6, 2026 12:55
@chenjw
Collaborator

chenjw commented May 7, 2026

Thanks for your contribution. I have a question: schema_str is already in the context and it’s very large. Is it necessary to write schema_str into the context again when retrying?

@voidborne-d
Author

Good question — short answer: it's defensive duplication for a recency reason, but I agree it's worth making conditional. Here's the rationale and a proposed tweak.

Why the corrective message restates the schema instead of just referencing it

The schema lives in the system message at index 0 (extract_loop.py:162-172), immediately followed by _mark_cache_breakpoint. After the first prefetch() plus N tool-call rounds, that schema text can be hundreds of messages away from the LLM's attention window tail. That's also the exact reason the bug exists: the model already had the schema in context when it drifted to prose. Just having the schema in context wasn't enough — the failure happened despite it.

The corrective injection is co-located with three things the model has to combine to recover: (a) its own bad output it just emitted, (b) the parse error, (c) the shape it should have produced. Splitting these across long context relies on the model to retrieve and re-anchor on the system prompt schema — which is precisely what failed. Inlining keeps "what went wrong" and "what right looks like" adjacent.

The cost concern is fair though

For typical extraction schemas (a few hundred bytes) the re-injection is small relative to the conversation tail. For deeply nested enterprise schemas (10KB+ with descriptions / examples / $defs), three re-injections inside a 5-iteration loop add real bloat — and worse, break the cache after the breakpoint.

Proposal: schema-size threshold + reference fallback

````python
_MAX_INLINE_SCHEMA_CHARS = 4000  # configurable class attribute

def _build_correction_body(self, error, schema_str):
    if len(schema_str) <= self._MAX_INLINE_SCHEMA_CHARS:
        return f"...could not be parsed: {error}\nReturn ONLY a JSON object matching:\n```json\n{schema_str}\n```"
    return f"...could not be parsed: {error}\nReturn ONLY a JSON object matching the schema in the system prompt — no prose, no markdown fences."
````

Same logic for the last-iteration message. Defaults to inlining for small/medium schemas (where cost is negligible and recency benefit is highest) and degrades to a reference for very large schemas.

I can push this as a follow-up commit if that direction works for you. Happy to also expose _MAX_INLINE_SCHEMA_CHARS as a class attribute so downstream users with unusual schema sizes can tune it.

