Add agentic skills for fixing Jetlag GH issues and testing them in Prow jobs #788
Open
mcornea wants to merge 4 commits into redhat-performance:main
Four independent Claude Code skills that automate the full loop from reading a GitHub issue to implementing a fix, creating a PR, triggering Prow CI tests, and analyzing failures:
- /jetlag-issue-fix: orchestrator driving the full 5-phase loop (understand, plan, implement, test, evaluate) with human gates
- /jetlag-test-map: maps changed files to appropriate Prow test jobs with cross-repo detection for openshift/release changes
- /jetlag-prow-trigger: triggers tests on PRs, polls for results with 5-min intervals and 3-hour timeout
- /jetlag-prow-analyze: analyzes failed Prow job logs, correlates failures with PR changes, classifies root causes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document that make ci-operator-config and make jobs must be run before pushing openshift/release companion PRs to regenerate config metadata and prow job files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: …
Needs approval from an approver in each of these files: …
The jetlag-issue-fix orchestrator now explicitly references /jetlag-test-map, /jetlag-prow-trigger, and /jetlag-prow-analyze via the Skill tool instead of inlining their logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR adds Claude skills that automate the full loop: reading a GitHub issue, raising a PR with the fix, and testing it in existing Prow jobs on the Jetlag PR. When the Jetlag change is not covered by an existing job, the skills can also create a new Prow job.
As an example, Claude created these PRs to address #765: #787 and openshift/release#76604.
Below is a more detailed explanation of what each skill does:
Architecture
/jetlag-issue-fix <issue#> -- Orchestrator
File: `.claude/commands/jetlag-issue-fix.md`

The main driver skill that takes a GitHub issue number and walks through 5 phases with mandatory human confirmation gates at each step.
Phase 1: UNDERSTAND
Fetches the issue from `redhat-performance/jetlag` via `gh issue view`, extracts error messages, affected cluster type (MNO/SNO/VMNO/hybrid), lab environment (scalelab/performancelab/ibmcloud), hardware type (r750/r660/r650/r640/etc.), and referenced files. Explores the code paths in the repo, checks `git blame` for recent changes, and presents an analysis for user review.

Human Gate 1: User must confirm before proceeding to planning.
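The extraction step in Phase 1 could look roughly like the sketch below. It is not the skill's actual logic (the skill is a prompt file, not a script): the function names, the keyword matching, and the hardware regex are all assumptions made for illustration. Only the `gh issue view` call and the keyword lists come from the description above.

```python
import json
import re
import subprocess

# Keyword lists taken from the Phase 1 description; the matching strategy is an assumption.
CLUSTER_TYPES = ("mno", "sno", "vmno", "hybrid")
LABS = ("scalelab", "performancelab", "ibmcloud")
HARDWARE = re.compile(r"\br\d{3}\b", re.IGNORECASE)  # r750, r660, r650, r640, ...


def fetch_issue(number: int, repo: str = "redhat-performance/jetlag") -> dict:
    """Fetch the issue title and body with the GitHub CLI (needs an authenticated `gh`)."""
    out = subprocess.run(
        ["gh", "issue", "view", str(number), "--repo", repo, "--json", "title,body"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def extract_context(text: str) -> dict:
    """Pull cluster type, lab, and hardware hints out of free-form issue text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {
        "cluster_types": [c for c in CLUSTER_TYPES if c in words],
        "labs": [lab for lab in LABS if lab in words],
        "hardware": sorted({m.lower() for m in HARDWARE.findall(text)}),
    }
```

For example, `extract_context(fetch_issue(765)["body"])` would give the orchestrator a small structured summary to present at Human Gate 1.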
Phase 2: PLAN
Identifies files to change and applies the test-map logic to determine which Prow jobs cover the change. This includes:
- … `openshift/release` changes)

Presents a fix plan with code changes, test plan, and branch name.
Human Gate 2: User must confirm before implementation begins.
Phase 3: IMPLEMENT
- … `main` (`fix/issue-<N>-<short-desc>`)
- … `Fixes #<N>` reference and co-author attribution
- … `redhat-performance/jetlag` with a test plan checklist in the body

Phase 4: TEST
Two flows depending on whether existing tests cover the change:
Standard Flow:
- … `/test <job>` PR comment
- … `gh api` commit statuses every 5 minutes

Cross-Repo Flow (when a new test job is needed in `openshift/release`):
- … `openshift/release` from upstream (not from a potentially stale fork)
- … `ci-operator/config/redhat-performance/jetlag/redhat-performance-jetlag-main.yaml` (test entry)
- … `ci-operator/step-registry/openshift-qe/installer/bm/deploy/openshift-qe-installer-bm-deploy-ref.yaml` (env var declaration)
- … `ci-operator/step-registry/openshift-qe/installer/bm/deploy/openshift-qe-installer-bm-deploy-commands.sh` (script logic)
- … `make ci-operator-config` and `make jobs` to regenerate required config metadata and prow job files (these checks fail without this step)
- … `JETLAG_PR` set to test the jetlag branch
- … `openshift/release` PR

Human Gate 3: User must confirm before any test is triggered (bare-metal CI is expensive, ~1-2 hours per job).
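The standard flow (comment `/test <job>`, then poll commit statuses every 5 minutes) can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, the `ci/prow/<job>` status context naming is an assumption about how Prow reports to GitHub, and the 3-hour timeout is taken from the commit message. The `gh pr comment` and `gh api` calls are standard GitHub CLI commands.

```python
from __future__ import annotations

import json
import subprocess
import time

POLL_INTERVAL_S = 5 * 60   # poll every 5 minutes
TIMEOUT_S = 3 * 60 * 60    # give up after 3 hours
REPO = "redhat-performance/jetlag"


def trigger_command(pr: int, job: str, repo: str = REPO) -> list[str]:
    """Build the `gh` command that posts the `/test <job>` comment on the PR."""
    return ["gh", "pr", "comment", str(pr), "--repo", repo, "--body", f"/test {job}"]


def fetch_statuses(sha: str, repo: str = REPO) -> list[dict]:
    """Read commit statuses; GitHub returns them newest first."""
    out = subprocess.run(
        ["gh", "api", f"repos/{repo}/commits/{sha}/statuses"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def job_state(statuses: list[dict], job: str) -> str | None:
    """Return the newest state for `job`, assuming contexts look like `ci/prow/<job>`."""
    for status in statuses:
        if status.get("context", "").split("/")[-1] == job:
            return status.get("state")
    return None


def wait_for_job(sha: str, job: str, fetch=fetch_statuses, sleep=time.sleep) -> str:
    """Poll until the job reports success/failure/error or the timeout expires."""
    start = time.monotonic()
    while time.monotonic() - start < TIMEOUT_S:
        state = job_state(fetch(sha), job)
        if state in ("success", "failure", "error"):
            return state
        elapsed_min = int((time.monotonic() - start) // 60)
        print(f"Test still running... (elapsed: {elapsed_min}m)")
        sleep(POLL_INTERVAL_S)
    return "timeout"
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without touching GitHub or waiting for real time to pass.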
Phase 5: EVALUATE
- … `/lgtm` + `/approve` for humans.
- … `/retest`. Does NOT count toward the 3-iteration limit.

Human Gate 4: User must confirm before each fix iteration.
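One way to read the iteration rule above is sketched below: infrastructure flakes lead to a retest without consuming the budget, while code failures consume one of the three fix iterations. The function name, return values, and the exact stop behaviour are assumptions for illustration; the source only states the 3-iteration limit and that retests do not count toward it.

```python
from __future__ import annotations

MAX_FIX_ITERATIONS = 3  # limit named in the Phase 5 description


def next_action(state: str, infra_flake: bool, fix_iterations: int) -> tuple[str, int]:
    """Decide the next step after a test run and return (action, updated iteration count)."""
    if state == "success":
        # Approval stays with humans (/lgtm + /approve).
        return ("handoff", fix_iterations)
    if infra_flake:
        # A /retest does not count toward the fix-iteration limit.
        return ("retest", fix_iterations)
    if fix_iterations >= MAX_FIX_ITERATIONS:
        return ("stop", fix_iterations)
    # A new fix iteration still requires user confirmation (Human Gate 4).
    return ("fix", fix_iterations + 1)
```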
Safety Rails
- … `/lgtm` + `/approve`

/jetlag-test-map [PR#] -- Test Selection Engine
File: `.claude/commands/jetlag-test-map.md`

Maps changed files to the appropriate Prow CI test jobs. Can be used standalone or called inline by the orchestrator.
Input
- … `gh pr diff --name-only`
- … `git diff --name-only main` from the current branch

Core Mapping Table
| Changed files | Prow jobs |
| --- | --- |
| `ansible/roles/bastion-*/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/create-inventory/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/validate-vars/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/install-cluster/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/boot-iso/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/wait-hosts-discovered/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/create-ai-cluster/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/generate-discovery-iso/**` | `deploy-sno`, `deploy-mno` |
| `ansible/roles/sno-post-cluster-install/**` | `deploy-sno` |
| `ansible/roles/mno-post-cluster-install/**` | `deploy-mno`, `deploy-cmno` |
| `ansible/roles/hv-*/**` | `deploy-vmno`, `deploy-hmno` |
| `ansible/roles/ocp-scale-out*/**` | `deploy-mno-scaleout`, `deploy-sno-scaleout` |
| `ansible/roles/badfish/**` | `deploy-sno`, `deploy-mno` |
| `ansible/mno-deploy.yml` | `deploy-mno`, `deploy-cmno` |
| `ansible/sno-deploy.yml` | `deploy-sno` |
| `ansible/hv-setup.yml/hv-vm-create.yml` | `deploy-vmno`, `deploy-hmno` |
| `ansible/ocp-scale-out.yml` | `deploy-mno-scaleout` |
| `ansible/vars/lab.yml` | `deploy-sno`, `deploy-mno` |
| `ansible/vars/all.sample.yml` | `deploy-sno`, `deploy-mno` |
| `docs/**`, `*.md`, `CLAUDE.md` | … |
| `bootstrap.sh` | … |

Feature Modifiers
Scans the actual diff content (not just filenames) for patterns and adds variant jobs:
- `enable_bond` or bond logic -> adds `*-private-bond` variants
- `public_vlan` logic -> adds `*-private` variants
- `hybrid_worker_count` -> adds `deploy-hmno`
- `cluster_type == "vmno"` or vmno logic -> adds `deploy-vmno`
- … `*-fips` variants if they exist

Cross-Repo Detection
Checks if new variables are added to `all.sample.yml` that aren't in the known set of CI env vars: `TYPE`, `NUM_WORKER_NODES`, `NUM_HYBRID_WORKER_NODES`, `PUBLIC_VLAN`, `BOND`, `FIPS`, `JETLAG_PR`, `JETLAG_LATEST`, `JETLAG_BRANCH`, `OCP_BUILD`, `OCP_VERSION`, `DISCONNECTED`

If a new gating variable is found:
- … `openshift/release` companion PR
- … `make ci-operator-config` and `make jobs` must be run before pushing (requires `podman` or `docker`)

Output
Structured recommendation with sections for minimum tests, extra coverage, feature-specific tests, and cross-repo notes.
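The core mapping and the feature modifiers described above can be sketched as a small lookup. This is only an illustration of the idea: the table here is a subset of the mapping, the function name and output keys are hypothetical, the regexes are assumptions about what "bond logic" or "vmno logic" might look like in a diff, and `fnmatchcase` treats `*` as matching `/`, which is looser than real glob semantics.

```python
import re
from fnmatch import fnmatchcase

# Subset of the core mapping table (path pattern -> Prow jobs).
CORE_MAP = [
    ("ansible/roles/bastion-*/**", ["deploy-sno", "deploy-mno"]),
    ("ansible/roles/sno-post-cluster-install/**", ["deploy-sno"]),
    ("ansible/roles/mno-post-cluster-install/**", ["deploy-mno", "deploy-cmno"]),
    ("ansible/roles/hv-*/**", ["deploy-vmno", "deploy-hmno"]),
    ("ansible/sno-deploy.yml", ["deploy-sno"]),
]

# Diff-content patterns -> extra jobs or job-name patterns (regexes are assumptions).
FEATURE_MODIFIERS = [
    (r"enable_bond", "*-private-bond"),
    (r"public_vlan", "*-private"),
    (r"hybrid_worker_count", "deploy-hmno"),
    (r'cluster_type\s*==\s*"vmno"', "deploy-vmno"),
]


def recommend(changed_files: list, diff_text: str) -> dict:
    """Return core jobs for the changed paths plus feature-specific additions."""
    core = sorted({
        job
        for path in changed_files
        for pattern, jobs in CORE_MAP
        if fnmatchcase(path, pattern)
        for job in jobs
    })
    feature = [target for regex, target in FEATURE_MODIFIERS if re.search(regex, diff_text)]
    return {"core": core, "feature": feature}
```

The file list would come from `gh pr diff --name-only` and the diff text from `gh pr diff`; entries such as `*-private-bond` still have to be resolved against the jobs that actually exist.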
/jetlag-prow-trigger <PR#> [job-name] -- Test Trigger + Monitor
File: `.claude/commands/jetlag-prow-trigger.md`

Triggers Prow CI tests on PRs and polls until completion with inline progress reporting.
Input Formats
- `123 deploy-sno` -- trigger a specific job on jetlag PR #123
- `123` -- auto-detect jobs via test-map logic, ask user to confirm
- `--release-pr 456 deploy-foo` -- trigger on an `openshift/release` PR

When targeting `openshift/release` PRs, the PR must have passed the `ci-operator-config-metadata` and `generated-config` checks first (requires `make ci-operator-config` and `make jobs`).

Pre-flight Checks
- … `gh pr view --json headRefOid`
- … `ok-to-test` label (required for non-org members)

Trigger
Always asks for user confirmation first -- bare-metal CI is expensive and takes 1-2 hours per job. Comments `/test <job>` on the PR. Only triggers one job at a time due to hardware constraints (single bare-metal allocation shared across all jetlag CI).

Monitor
- … `gh api repos/.../commits/<SHA>/statuses`
- `pending` -> "Test still running... (elapsed: Xm)"
- `success` -> "Test PASSED"
- `failure` -> "Test FAILED -- analyzing..."
- `error` -> "Test ERROR (infra issue)"

On Completion
Success: Reports pass, asks about triggering additional recommended tests from the test-map.
Failure:
- … `target_url`
- … `finished.json` and `build-log.txt` of the failed step
- … `/jetlag-prow-analyze`

Infra Error: Classifies failures in ping/poweroff/BMC access steps as hardware flakes, suggests `/retest <job>`.

/jetlag-prow-analyze <prow-url> -- Failure Analysis
File: `.claude/commands/jetlag-prow-analyze.md`

Analyzes failed Prow job logs to identify root causes and suggest fixes. Can be used standalone on any Prow job URL.
Step 1: Parse URL
Extracts from the Prow URL:
- … `deploy-sno`, `deploy-mno`)

Handles both PR jobs (`pr-logs/pull/`) and periodic/batch jobs (`logs/`).

Step 2: Fetch Artifacts
Downloads via WebFetch from the GCS artifacts bucket:
- `finished.json` -- overall result and timestamp
- `build-log.txt` -- full build log
- … `artifacts/<step-name>/build-log.txt` (e.g., `openshift-qe-installer-bm-deploy`)

Step 3: Identify Failure Point
Scans for Ansible failure patterns in priority order:
- `fatal: [hostname]: FAILED!` -- extracts task name (from `TASK [role : task name]` line), error `msg:` field, and failed host
- `PLAY RECAP` with `failed>0` -- identifies which host(s) failed and at which play
- `MODULE FAILURE` -- module-level crash with traceback
- `ERROR!` -- Ansible-level errors (syntax, undefined variable, unreachable host)
- … `error:`/`Error:` lines, Python tracebacks, shell command failures (exit code != 0)

Step 4: Correlate with PR Changes
For PR jobs:
- … `gh pr diff --name-only`

Step 5: Classify and Recommend
Five classification categories:
- … `/retest` to confirm unrelated
- … `unreachable`/`timed out`/`connection refused` errors
- … `/retest <job>` -- don't count as code failure
- … `/retest`

Step 6: Leverage Existing Prow Skills
For deeper OCP-level analysis beyond Ansible failures, can delegate to the `prow-job:*` skills from openshift-eng/ai-helpers (if installed):
- `/prow-job:analyze-test-failure` -- structured test failure analysis from JUnit and console logs
- `/prow-job:extract-must-gather` -- extract and browse must-gather archives from job artifacts
- `/prow-job:analyze-resource` -- Kubernetes resource lifecycle analysis from audit/pod logs

These are generic OpenShift CI tools, not Ansible-aware. Use them when the failure is in OCP cluster operations (operator failures, resource issues) rather than in Jetlag's Ansible playbooks.
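For the Ansible side, the priority-ordered scan from Step 3 can be sketched as below. The pattern names, regexes, and return shape are assumptions chosen for illustration; only the pattern order and the strings being searched for come from the step description. The regexes are unanchored on purpose, since CI logs often prefix lines with timestamps.

```python
from __future__ import annotations

import re

# Checked in priority order; the first pattern with any hit wins.
PATTERNS = [
    ("task_failure", re.compile(r"fatal: \[(?P<host>[^\]]+)\]: FAILED!")),
    ("play_recap", re.compile(r"(?P<host>\S+)\s+:\s+ok=\d+.*\bfailed=[1-9]\d*")),
    ("module_failure", re.compile(r"MODULE FAILURE")),
    ("ansible_error", re.compile(r"ERROR!")),
    ("generic_error", re.compile(r"\b[Ee]rror:|Traceback \(most recent call last\)")),
]
TASK_RE = re.compile(r"TASK \[(?P<task>.+?)\]")


def find_failure(log: str) -> dict | None:
    """Return the highest-priority failure in an Ansible build log, or None."""
    lines = log.splitlines()
    for kind, pattern in PATTERNS:
        task = None  # most recent TASK [...] header seen before the match
        for number, line in enumerate(lines, start=1):
            task_match = TASK_RE.search(line)
            if task_match:
                task = task_match.group("task")
            match = pattern.search(line)
            if match:
                return {
                    "kind": kind,
                    "line": number,
                    "task": task,
                    "host": match.groupdict().get("host"),
                }
    return None
```

The returned task name is what Step 4 would compare against the files listed by `gh pr diff --name-only` to decide whether the PR plausibly caused the failure.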
Output
Structured report: