fix(evals): improve kubevirt eval assertions and isolation #975
Draft
lyarwood wants to merge 5 commits into containers:main from
The verify step exited 0 even when the VM never reached the Ready condition, meaning the eval could pass without the agent's fix actually working. Replace the permissive polling loop with kubectl wait, which fails the eval if the VM is not ready within 150 seconds.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
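The stricter check described in this commit could look roughly like the sketch below. The function name `verify_vm_ready` and its arguments are illustrative and not taken from the PR. Only `kubectl wait` with `--for=condition=Ready` and a 150 second timeout comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and arguments are assumptions, not from the PR).
# Usage: verify_vm_ready NAMESPACE VM_NAME
verify_vm_ready() {
  local namespace="$1"
  local vm="$2"

  # kubectl wait returns non-zero if the condition is not met before the
  # timeout, so the function's exit status carries the failure to the caller
  # instead of always exiting 0.
  kubectl wait --for=condition=Ready "virtualmachine/${vm}" \
    --namespace "${namespace}" \
    --timeout=150s
}
```

Because the `kubectl wait` call is the last command in the function, a verify script that ends with `verify_vm_ready "$ns" "$vm"` fails whenever the VM does not become Ready in time.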
Remove the deprecated openai-agent and the redundant claude-code eval configs under evals/tasks/kubevirt/ since both are already covered by the root eval configs in evals/claude-code/ and evals/openai-agent/.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
The troubleshoot-vm task instructs the agent to use the vm-troubleshoot prompt, but the eval assertions did not verify this. Add a promptsUsed assertion to ensure the agent actually uses the expected prompt.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
Add expected-tool labels to each kubevirt task and corresponding taskSets that assert the agent uses the correct tool for each operation (e.g. vm_create for creation tasks, vm_clone for cloning, vm_lifecycle for pause, resources_delete for deletion, and resources_create_or_update for snapshots, restores, and updates).
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
Give each kubevirt eval task its own namespace (kvt-<task-name>) instead of sharing vm-test across all tasks. This prevents namespace collisions if tasks are ever run in parallel.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
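To illustrate the naming scheme, here is a minimal sketch of deriving a per-task namespace from a task name. Both helper names are hypothetical, and the PR may set the namespace directly in each task file instead. Only the `kvt-<task-name>` pattern comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the PR) showing the kvt-<task-name> scheme.

# Print the namespace for a task, e.g. troubleshoot-vm -> kvt-troubleshoot-vm.
task_namespace() {
  local task="$1"
  printf 'kvt-%s\n' "${task}"
}

# Create the task's namespace if it does not exist yet. Piping a client-side
# dry run into `kubectl apply` keeps the call idempotent across reruns.
ensure_task_namespace() {
  local ns
  ns="$(task_namespace "$1")"
  kubectl create namespace "${ns}" --dry-run=client -o yaml | kubectl apply -f -
}
```

Since every task resolves to a different namespace, two tasks that create a VM with the same name no longer touch the same object.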
lyarwood commented Mar 26, 2026
toolPattern: ".*"
minToolCalls: 1
maxToolCalls: 20
- glob: ../tasks/kubevirt/*/*.yaml
Ah, sorry, this was a mistake: this glob runs everything twice. I somehow missed it in my local test run.
Summary
- Add promptsUsed assertion for troubleshoot-vm to verify the agent uses the vm-troubleshoot prompt
- Add expected-tool labels and per-tool assertions for all kubevirt tasks (e.g. vm_create, vm_clone, vm_lifecycle, resources_delete, resources_create_or_update)
- Give each task its own namespace (kvt-<task-name>) to prevent collisions during parallel execution