fix(evals): improve kubevirt eval assertions and isolation by lyarwood · Pull Request #975 · containers/kubernetes-mcp-server

lyarwood · 2026-03-26T14:02:09Z

Summary

- Fix troubleshoot-vm verify to actually fail when the VM doesn't become ready (was silently passing)
- Remove redundant kubevirt-specific eval configs (openai-agent deprecated, claude-code covered by root evals)
- Add promptsUsed assertion for troubleshoot-vm to verify the agent uses the vm-troubleshoot prompt
- Add expected-tool labels and per-tool assertions for all kubevirt tasks (e.g. vm_create, vm_clone, vm_lifecycle, resources_delete, resources_create_or_update)
- Use unique namespaces per task (kvt-<task-name>) to prevent collisions during parallel execution

The verify step exited 0 even when the VM never reached the Ready condition, meaning the eval could pass without the agent's fix actually working. Replace the permissive polling loop with kubectl wait that fails the eval if the VM is not ready within 150 seconds.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
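The verify change described in this commit could look roughly like the following. This is a minimal sketch, not the script from the PR: the function name `verify_vm_ready`, its arguments, and the example VM name are assumptions. Only the use of `kubectl wait`, the Ready condition, and the 150-second limit come from the commit message.

```shell
# Hypothetical verify helper; the function name and arguments are illustrative.
# $1 = VM name, $2 = namespace, $3 = optional timeout (defaults to 150s).
#
# kubectl wait exits non-zero when the condition is not met before the timeout,
# so the failure propagates to the caller instead of being swallowed by a
# polling loop that ends with exit status 0.
verify_vm_ready() {
  kubectl wait --for=condition=Ready "virtualmachine/$1" \
    --namespace "$2" --timeout="${3:-150s}"
}
```

A verify step would then call something like `verify_vm_ready my-vm kvt-troubleshoot-vm` as its last command, so the step's exit status is that of `kubectl wait`. Both names in that call are placeholders.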

Remove the deprecated openai-agent and the redundant claude-code eval configs under evals/tasks/kubevirt/ since both are already covered by the root eval configs in evals/claude-code/ and evals/openai-agent/.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

The troubleshoot-vm task instructs the agent to use the vm-troubleshoot prompt but the eval assertions did not verify this. Add a promptsUsed assertion to ensure the agent actually uses the expected prompt.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

Add expected-tool labels to each kubevirt task and corresponding taskSets that assert the agent uses the correct tool for each operation (e.g. vm_create for creation tasks, vm_clone for cloning, vm_lifecycle for pause, resources_delete for deletion, and resources_create_or_update for snapshots, restores, and updates).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
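For reference, the operation-to-tool mapping listed in this commit message can be written out as a lookup. This is only an illustrative sketch: the function `expected_tool` and the operation keys (`create`, `clone`, and so on) are hypothetical and are not the label values used in the eval configs. The tool names are the ones given in the commit message.

```shell
# Hypothetical lookup: which tool an eval would expect for each kind of task.
# The operation keys are made up for illustration; the tool names are from the PR.
expected_tool() {
  case "$1" in
    create)                  echo vm_create ;;
    clone)                   echo vm_clone ;;
    pause)                   echo vm_lifecycle ;;
    delete)                  echo resources_delete ;;
    snapshot|restore|update) echo resources_create_or_update ;;
    *)                       echo "unknown operation: $1" >&2; return 1 ;;
  esac
}
```

Unknown operations return a non-zero status rather than falling back to a default tool, mirroring the intent of the per-tool assertions: a task without a known expected tool should not pass silently.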

Give each kubevirt eval task its own namespace (kvt-<task-name>) instead of sharing vm-test across all tasks. This prevents namespace collisions if tasks are ever run in parallel.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
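The naming scheme is simple enough to express as a one-line helper. The sketch below assumes a hypothetical function `task_namespace`. The PR only states the `kvt-<task-name>` pattern and does not say how the names are generated in the task files.

```shell
# Hypothetical helper: derive the per-task namespace from a task name,
# following the kvt-<task-name> pattern described above.
task_namespace() {
  printf 'kvt-%s\n' "$1"
}
```

For example, `task_namespace troubleshoot-vm` prints `kvt-troubleshoot-vm`, so a setup step could run `kubectl create namespace "$(task_namespace troubleshoot-vm)"`. Because each task gets a distinct name, two tasks never create or delete resources in the same namespace.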

lyarwood · 2026-03-26T14:04:23Z

evals/claude-code/eval.yaml

            toolPattern: ".*"
        minToolCalls: 1
        maxToolCalls: 20
+    - glob: ../tasks/kubevirt/*/*.yaml


Ah, sorry, this was a mistake: this glob double-runs everything. Somehow I missed this in my local test run.

lyarwood added 5 commits March 26, 2026 14:00

lyarwood marked this pull request as draft March 26, 2026 14:04

lyarwood wants to merge 5 commits into containers:main from lyarwood:evals/kubevirt-improvements
