Input Schema: Non-ASCII characters are being escaped to Unicode #6279

Description

@ofirEdi

Describe the Bug:
When an Agent is defined with input_schema to serialize a Pydantic object, non-Latin characters are escaped into \uXXXX Unicode escape sequences. This bloats the prompt token count, slows LLM responses, and in some cases the LLM cannot return a readable response and instead returns symbols mixed with the original language of the prompt. In my case, for a RAG solution, I sent context in Hebrew to an Agent and each letter was replaced with a six-character ASCII escape sequence, which led to a high number of prompt tokens and inconsistent LLM responses.
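The escaping described above is the default behavior of Python's standard json.dumps, independent of ADK. A minimal sketch, using a made-up payload rather than the real request, shows the difference the ensure_ascii flag makes:

```python
import json

# Hypothetical payload; the real request carries the serialized input_schema object.
payload = {"text": "שלום"}

# Default: ensure_ascii=True, so every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(payload)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(payload, ensure_ascii=False)

print(escaped)   # {"text": "\u05e9\u05dc\u05d5\u05dd"}
print(readable)  # {"text": "שלום"}
```

Both strings decode to the same object with json.loads, so the flag changes only the wire representation, not the data.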

Steps to Reproduce:

  • pip install google-adk==2.2.0 google-genai==2.8.0
  • Sample Python code that triggers the issue (I added a before_model_callback to verify that the data sent is escaped). Any Agent using input_schema reproduces it; the code below is for reference.
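from typing import Literal

from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext
from google.adk.models import LlmRequest
from google.genai import types
from pydantic import BaseModel, Field


class SearchAttempt(BaseModel):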
    query: str
    surrounding_chunks: int
    result_count: int
    error_message: str | None = None

class NormalizedChunk(BaseModel):
    retrieval_index: int
    document_bucket: str
    document_file_name: str
    chunk_index: int
    distance: float | None = None
    title: str | None = None
    publication_date: str | None = None
    section_title: str | None = None
    section_summary: str | None = None
    text: str | None = None

class SearchResult(BaseModel):
    status: Literal["success", "error"]
    query: str
    surrounding_chunks: int
    normalized_chunks: list[NormalizedChunk] = Field(default_factory=list)
    error_message: str | None = None
    results_count: int

class EvaluationInput(BaseModel):
    original_query: str
    search_result: SearchResult
    attempts: list[SearchAttempt]



def debug_rag_evaluator_request(
    callback_context: CallbackContext,
    llm_request: LlmRequest,
) -> None:
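    # Print the text parts exactly as they are sent to the model,
    # so that any \uXXXX escaping is visible.
    print("========== raw parts ==========")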
    for content in llm_request.contents or []:
        for part in content.parts or []:
            if part.text:
                print(part.text[:5000])
    print("========== end raw parts ==========")

# RetrievalDecision (the output schema) and _agent_model() are defined elsewhere and omitted here.
rag_evaluator_agent = Agent(
    name="rag_evaluator_agent",
    model=_agent_model(),
    input_schema=EvaluationInput,
    output_schema=RetrievalDecision,
    include_contents="none",
    before_model_callback=debug_rag_evaluator_request,
    generate_content_config=types.GenerateContentConfig(
        response_mime_type="application/json",
        temperature=0.0,
        max_output_tokens=8192
    ),
    instruction="""
......
""",
)

Expected Behavior:
It seems this issue was fixed for response_schema in #2936 by passing ensure_ascii=False in the output schema dump logic. The same should be done for the input schema, or users should at least be able to choose whether to use that flag.
The bad behavior arises from _node_input_to_content, which calls json.dumps and model_dump_json without ensure_ascii=False. I ended up monkey-patching the function in my service, which makes ADK send the request with proper Hebrew letters, but of course this is discouraged.

Observed Behavior:
The request to the LLM becomes bloated (~6 times more tokens than with unescaped letters). LLM responses were very slow and unpredictable (responses were huge, with many repeated symbols, signs mixed with Hebrew letters, etc.).
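The size increase can be checked at the character level with a short sketch. It uses a made-up Hebrew string and counts characters only; actual token counts depend on the model's tokenizer and are not measured here.

```python
import json

text = "שלום עולם"  # 8 Hebrew letters and 1 space

escaped = json.dumps(text)                       # each Hebrew letter -> 6 ASCII characters
readable = json.dumps(text, ensure_ascii=False)  # each Hebrew letter stays 1 character

print(len(escaped))   # 51 = 8 letters * 6 + 1 space + 2 quotes
print(len(readable))  # 11 = 8 letters + 1 space + 2 quotes
```

For text that is mostly Hebrew letters, the escaped form approaches six times the character count of the readable form, which matches the order of magnitude reported above.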

Environment Details:

  • ADK Library Version (pip show google-adk): 2.2.0
  • Desktop OS: Linux (WSL)
  • Python Version (python -V): 3.12

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used: gemini-2.5-flash

How often has this issue occurred?:

  • Always (100%)
