JudgeAgent AttributeError when LLM returns criteria as string instead of dict #161

## Summary

`JudgeAgent` raises `AttributeError: 'str' object has no attribute 'values'` when the LLM returns the `criteria` field as a JSON-encoded string instead of a JSON object. The failure is intermittent and appears to occur when the LLM misinterprets the function-calling schema.

## Environment

- **Package**: `langwatch-scenario`
- **Version**: 0.7.14
- **Python Version**: 3.13.4
- **OS**: macOS

## Description

When using `JudgeAgent` with a list of criteria, the agent sometimes fails with:

```
AttributeError: 'str' object has no attribute 'values'
```

This occurs in `scenario/judge_agent.py` at line 200 (and 205) when the code attempts to call `criteria.values()` on what it expects to be a dictionary, but is actually a string.

## Root Cause

The issue occurs in the `JudgeAgent.call()` method when parsing the LLM's function call response:

```python
# Line 193: Parse tool call arguments
args = json.loads(tool_call.function.arguments)
criteria = args.get("criteria", {})

# Line 200: Fails if criteria is a string
for idx, criterion in enumerate(criteria.values()):  # ❌ AttributeError if criteria is str
    ...
```

**Expected behavior**: The LLM should return `criteria` as a dictionary object:
```json
{
  "criteria": {
    "criterion_0": "true",
    "criterion_1": "false"
  }
}
```

**Actual behavior (when bug occurs)**: The LLM sometimes returns `criteria` as a JSON string:
```json
{
  "criteria": "{\"criterion_0\": \"true\", \"criterion_1\": \"false\"}"
}
```

`json.loads()` decodes only the outer JSON, so the double-encoded `criteria` value is left as a string rather than being parsed into a dictionary.
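
The failure can be reproduced without calling an LLM. The snippet below is a minimal sketch that builds the double-encoded payload by hand (the payload is an illustration modeled on the example above, not a captured response) and shows that a second `json.loads()` recovers the dictionary:

```python
import json

# Tool-call arguments as they arrive in the failing case: the inner
# object has been serialized a second time, so it is a JSON string.
double_encoded = json.dumps(
    {"criteria": json.dumps({"criterion_0": "true", "criterion_1": "false"})}
)

args = json.loads(double_encoded)
criteria = args.get("criteria", {})

# The outer decode leaves criteria as a str, so .values() fails.
try:
    criteria.values()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Decoding the string once more yields the expected dictionary.
recovered = json.loads(criteria)
```

Here `error_message` holds the same text as the reported exception, and `recovered` is the dictionary the judge code expects.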

## Why This Happens

1. **Complex Dynamic Schema**: The function schema uses dynamically generated property names (sanitized criterion text, truncated to 70 chars), which can confuse the LLM
2. **Schema Ambiguity**: With many criteria, the nested object structure may be misinterpreted
3. **LLM Behavior**: Some models serialize nested objects as JSON strings when uncertain about the schema format
4. **No Validation**: The code doesn't validate the type of `criteria` before calling `.values()`

## Steps to Reproduce

1. Create a `JudgeAgent` with multiple criteria:
```python
import scenario

judge = scenario.JudgeAgent(
    criteria=[
        "Agent must provide accurate information",
        "Agent must not show error messages",
        "If data is unavailable, Agent must acknowledge this explicitly",
    ]
)
```

2. Use the judge in a scenario that runs multiple times
3. The error occurs intermittently when the LLM returns `criteria` as a string

## Expected Behavior

The code should handle both cases:
- When `criteria` is a dictionary (normal case)
- When `criteria` is a JSON string (edge case that needs parsing)

## Proposed Fix

Add defensive parsing in `scenario/judge_agent.py` around line 196:

```python
criteria = args.get("criteria", {})

# Defensive parsing: some responses double-encode criteria as a JSON string
if isinstance(criteria, str):
    try:
        criteria = json.loads(criteria)
    except json.JSONDecodeError:
        criteria = {}  # Not valid JSON: fall back to an empty dict
if not isinstance(criteria, dict):
    criteria = {}  # Covers non-dict values, including a decoded list or scalar

# Now safely use criteria.values()
for idx, criterion in enumerate(criteria.values()):
    ...
```
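
To test the guard in isolation, the same logic could be factored into a small helper. This is only a sketch: `coerce_criteria` is a hypothetical name and is not part of `langwatch-scenario`.

```python
import json
from typing import Any


def coerce_criteria(raw: Any) -> dict:
    """Return raw as a dict, decoding it first if it is a JSON string.

    Hypothetical helper for illustration; anything that cannot be
    interpreted as a dict becomes an empty dict.
    """
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return {}
    # A JSON string can decode to a list or scalar, so check again here.
    return raw if isinstance(raw, dict) else {}
```

Falling back to an empty dict avoids the crash but also means no criteria get evaluated for that response, so logging a warning or retrying the call in that branch may be preferable.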

## Impact

- **Severity**: Medium - Causes test failures but is intermittent
- **Frequency**: Intermittent - depends on LLM response format
- **Workaround**: None currently available (would require monkey-patching the library)

## Additional Context

The schema definition for `criteria` uses a dictionary comprehension to create dynamic properties:

```python
"criteria": {
    "type": "object",
    "properties": {
        criteria_names[idx]: {
            "type": "string",
            "enum": ["true", "false", "inconclusive"],
            "description": criterion,
        }
        for idx, criterion in enumerate(self.criteria)
    },
    "required": criteria_names,
    "additionalProperties": False,
    "description": "Strict verdict for each criterion",
}
```

The dynamic property names (sanitized and truncated criterion text) may contribute to the LLM's confusion about the expected format.

## Related Code Location

- File: `scenario/judge_agent.py`
- Method: `JudgeAgent.call()`
- Lines: ~193-200, ~205

