fix(evals): improve kubevirt eval assertions and isolation #975

Draft
lyarwood wants to merge 5 commits into containers:main from lyarwood:evals/kubevirt-improvements

@lyarwood
Contributor

Summary

  • Fix troubleshoot-vm verify to actually fail when the VM doesn't become ready (was silently passing)
  • Remove redundant kubevirt-specific eval configs (openai-agent deprecated, claude-code covered by root evals)
  • Add promptsUsed assertion for troubleshoot-vm to verify the agent uses the vm-troubleshoot prompt
  • Add expected-tool labels and per-tool assertions for all kubevirt tasks (e.g. vm_create, vm_clone, vm_lifecycle, resources_delete, resources_create_or_update)
  • Use unique namespaces per task (kvt-<task-name>) to prevent collisions during parallel execution

The verify step exited 0 even when the VM never reached the Ready
condition, meaning the eval could pass without the agent's fix actually
working. Replace the permissive polling loop with kubectl wait that
fails the eval if the VM is not ready within 150 seconds.
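
A minimal sketch of the strict verify step described above, assuming a POSIX shell verify script. The function name, VM name, and namespace are hypothetical; only the use of kubectl wait and the 150-second limit come from the commit message.

```shell
#!/bin/sh
# Sketch of a strict verify step. The VM name, namespace, and function name
# are hypothetical; only the 150-second limit comes from the commit message.
verify_vm_ready() {
  vm_name="$1"
  namespace="$2"
  # kubectl wait exits non-zero when the condition is not met before the
  # timeout. As the last command, its status becomes the function's status,
  # so a VM that never becomes Ready fails the verify step.
  kubectl wait --for=condition=Ready "vm/${vm_name}" \
    --namespace "${namespace}" --timeout=150s
}
```

A verify script could then end with something like `verify_vm_ready broken-vm kvt-troubleshoot-vm || exit 1`, where both arguments are placeholders. The point of the design is that no code path reaches a successful exit without the Ready condition having been observed.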

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

Remove the deprecated openai-agent and the redundant claude-code eval
configs under evals/tasks/kubevirt/ since both are already covered by
the root eval configs in evals/claude-code/ and evals/openai-agent/.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

The troubleshoot-vm task instructs the agent to use the vm-troubleshoot
prompt but the eval assertions did not verify this. Add a promptsUsed
assertion to ensure the agent actually uses the expected prompt.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

Add expected-tool labels to each kubevirt task and corresponding
taskSets that assert the agent uses the correct tool for each
operation (e.g. vm_create for creation tasks, vm_clone for cloning,
vm_lifecycle for pause, resources_delete for deletion, and
resources_create_or_update for snapshots, restores, and updates).
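
The operation-to-tool pairing listed above can be illustrated with a small shell sketch. This is not the eval framework's assertion syntax, which lives in the eval YAML and is not shown here. The function names and the operation keys (create, clone, pause, and so on) are hypothetical; the tool names are the ones given in the commit message.

```shell
#!/bin/sh
# Hypothetical lookup: which tool an operation is expected to use.
# Tool names come from the commit message; the operation keys are illustrative.
expected_tool_for() {
  case "$1" in
    create)                  echo vm_create ;;
    clone)                   echo vm_clone ;;
    pause)                   echo vm_lifecycle ;;
    delete)                  echo resources_delete ;;
    snapshot|restore|update) echo resources_create_or_update ;;
    *)                       return 1 ;;
  esac
}

# Succeeds only if the expected tool ($1) appears among the recorded
# tool calls (remaining arguments).
tool_was_used() {
  expected="$1"
  shift
  for called in "$@"; do
    if [ "$called" = "$expected" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, `tool_was_used "$(expected_tool_for clone)" resources_get vm_clone` succeeds, while the same check against a run that only called `resources_get` fails. A per-tool assertion catches an agent that reaches the right end state through an unintended tool.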

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

Give each kubevirt eval task its own namespace (kvt-<task-name>) instead
of sharing vm-test across all tasks. This prevents namespace collisions
if tasks are ever run in parallel.
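
The naming convention can be sketched as a pair of shell helpers. These helpers are hypothetical and only illustrate the kvt-<task-name> pattern; the PR itself may simply hardcode the namespace in each task file.

```shell
#!/bin/sh
# Hypothetical helper: derive a namespace from a task name using the
# kvt-<task-name> convention from the commit message.
task_namespace() {
  printf 'kvt-%s\n' "$1"
}

# Hypothetical cleanup: delete only this task's namespace, so a task that
# finishes early cannot remove resources another task still needs.
cleanup_task_namespace() {
  kubectl delete namespace "$(task_namespace "$1")" --ignore-not-found
}
```

With a shared vm-test namespace, one task's cleanup could delete VMs that another task is still verifying. With per-task names such as `kvt-troubleshoot-vm`, setup and cleanup touch only resources that the task owns.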

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>

toolPattern: ".*"
minToolCalls: 1
maxToolCalls: 20
- glob: ../tasks/kubevirt/*/*.yaml
Contributor Author

Ah, sorry, this was a mistake: this glob double-runs everything. I somehow missed it in my local test run.

@lyarwood lyarwood marked this pull request as draft March 26, 2026 14:04