Add agentic skills for fixing Jetlag GH issues and testing them in Prow jobs by mcornea · Pull Request #788 · redhat-performance/jetlag

mcornea · 2026-03-20T15:19:23Z

This PR adds Claude skills that automate the full loop: reading a GitHub issue, raising a PR with the fix, and testing it in existing Prow jobs on the Jetlag PR. When the Jetlag change is not covered by existing jobs, the skills can also create new Prow jobs.

For example, Claude created #787 and openshift/release#76604 to address #765.

Below is a more detailed explanation of what each skill does:

Architecture

/jetlag-issue-fix <issue#>          (orchestrator - drives the full loop)
    |
    +-- /jetlag-test-map             (maps changed files to Prow test jobs)
    +-- /jetlag-prow-trigger <PR#>   (triggers tests, monitors, fetches results)
    +-- /jetlag-prow-analyze <URL>   (analyzes failed Prow job logs)

Each skill works standalone:
  /jetlag-test-map                  -> give it a PR# or file list, get test recommendations
  /jetlag-prow-trigger 123          -> trigger + poll a test on PR #123
  /jetlag-prow-analyze <prow-url>   -> analyze any failed Prow job

`/jetlag-issue-fix <issue#>` -- Orchestrator

File: .claude/commands/jetlag-issue-fix.md

The main driver skill. It takes a GitHub issue number and walks through 5 phases, with 4 mandatory human confirmation gates along the way.

Phase 1: UNDERSTAND

Fetches the issue from redhat-performance/jetlag via gh issue view, extracts error messages, affected cluster type (MNO/SNO/VMNO/hybrid), lab environment (scalelab/performancelab/ibmcloud), hardware type (r750/r660/r650/r640/etc.), and referenced files. Explores the code paths in the repo, checks git blame for recent changes, and presents an analysis for user review.

Human Gate 1: User must confirm before proceeding to planning.

Phase 2: PLAN

Identifies files to change and applies the test-map logic to determine which Prow jobs cover the change. This includes:

Mapping changed files to test jobs using the path-to-job table
Checking for feature modifiers (bond, public_vlan, hybrid, vmno)
Detecting cross-repo needs (new variables requiring openshift/release changes)

Presents a fix plan with code changes, test plan, and branch name.

Human Gate 2: User must confirm before implementation begins.

Phase 3: IMPLEMENT

Creates a branch from main (fix/issue-<N>-<short-desc>)
Makes the code changes as planned
Commits with Fixes #<N> reference and co-author attribution
Pushes to origin
Creates a PR against redhat-performance/jetlag with a test plan checklist in the body
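
The branch naming convention used in Phase 3 could be sketched as follows. This is only an illustration, not the skill's actual logic: the helper name `branch_name`, the slug rules (lowercase, hyphen-separated) and the 40-character limit are all assumptions.

```python
import re


def branch_name(issue_number: int, short_desc: str, max_len: int = 40) -> str:
    """Build fix/issue-<N>-<short-desc> from an issue number and a free-text description."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_desc.lower()).strip("-")
    # Truncate long descriptions (hypothetical limit) without leaving a trailing hyphen.
    slug = slug[:max_len].rstrip("-")
    return f"fix/issue-{issue_number}-{slug}"
```

For instance, `branch_name(123, "Bond config fails!")` returns `fix/issue-123-bond-config-fails`.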

Phase 4: TEST

Two flows depending on whether existing tests cover the change:

Standard Flow:

Triggers tests via /test <job> PR comment
Polls gh api commit statuses every 5 minutes
3-hour timeout per job
Reports progress inline

Cross-Repo Flow (when a new test job is needed in openshift/release):

Creates the Jetlag PR first
Clones openshift/release from upstream (not from a potentially stale fork)
Makes changes to 3 files:
- ci-operator/config/redhat-performance/jetlag/redhat-performance-jetlag-main.yaml (test entry)
- ci-operator/step-registry/openshift-qe/installer/bm/deploy/openshift-qe-installer-bm-deploy-ref.yaml (env var declaration)
- ci-operator/step-registry/openshift-qe/installer/bm/deploy/openshift-qe-installer-bm-deploy-commands.sh (script logic)
Runs make ci-operator-config and make jobs to regenerate required config metadata and prow job files (these checks fail without this step)
Creates a companion PR with JETLAG_PR set to test the jetlag branch
Triggers tests from the openshift/release PR

Human Gate 3: User must confirm before any test is triggered (bare-metal CI is expensive, ~1-2 hours per job).

Phase 5: EVALUATE

Pass: Checks off tests in the PR body, asks about triggering additional recommended tests. When all pass, declares success but never merges -- leaves /lgtm + /approve for humans.
Infra Error: Classifies failures in ping/poweroff/BMC steps as hardware flakes. Suggests /retest. Does NOT count toward the 3-iteration limit.
Code Failure: Analyzes logs, proposes a fix, asks for user confirmation before pushing a new commit (always a new commit, never force-push or amend). Resumes polling after re-trigger.
Escalation: After 3 failed iterations, presents the full iteration history and stops.

Human Gate 4: User must confirm before each fix iteration.

Safety Rails

Max 3 fix iterations before escalating to human
Max 3-hour timeout per test job
Never trigger >1 test simultaneously (bare-metal hardware constraint)
Always add new commits (never force-push, never amend published commits)
Always reference issue number in commits and PR body
Never merges -- presents results for human /lgtm + /approve
Never skips human gates
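
The iteration accounting described in Phase 5 and the safety rails might look roughly like this. It is a sketch under assumptions: the class name `FixLoop`, the outcome labels and the returned action names are hypothetical; only the limit of 3 and the rule that infra errors do not count come from the description above.

```python
from dataclasses import dataclass, field

MAX_FIX_ITERATIONS = 3  # from the safety rails


@dataclass
class FixLoop:
    code_failures: int = 0
    history: list = field(default_factory=list)

    def record(self, outcome: str) -> str:
        """Record a test outcome and return the next action.

        outcome is one of "pass", "infra_error", "code_failure" (hypothetical labels).
        """
        self.history.append(outcome)
        if outcome == "pass":
            return "report-success"  # never merge; humans /lgtm + /approve
        if outcome == "infra_error":
            return "suggest-retest"  # does not count toward the limit
        if outcome == "code_failure":
            self.code_failures += 1
            if self.code_failures >= MAX_FIX_ITERATIONS:
                return "escalate"  # present the iteration history and stop
            return "propose-fix"  # still requires user confirmation (Human Gate 4)
        raise ValueError(f"unknown outcome: {outcome}")
```

Keeping the counter separate from the history makes it explicit that retests after hardware flakes are recorded but never push the loop toward escalation.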

`/jetlag-test-map [PR#]` -- Test Selection Engine

File: .claude/commands/jetlag-test-map.md

Maps changed files to the appropriate Prow CI test jobs. Can be used standalone or called inline by the orchestrator.

Input

A PR number: fetches diff via gh pr diff --name-only
No argument: uses git diff --name-only main from the current branch
Comma-separated file list: uses those files directly

Core Mapping Table

Changed Path Pattern	Minimum Test	Extra Coverage
`ansible/roles/bastion-*/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/create-inventory/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/validate-vars/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/install-cluster/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/boot-iso/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/wait-hosts-discovered/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/create-ai-cluster/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/generate-discovery-iso/**`	`deploy-sno`	`deploy-mno`
`ansible/roles/sno-post-cluster-install/**`	`deploy-sno`	--
`ansible/roles/mno-post-cluster-install/**`	`deploy-mno`	`deploy-cmno`
`ansible/roles/hv-*/**`	`deploy-vmno`	`deploy-hmno`
`ansible/roles/ocp-scale-out/*`	`deploy-mno-scaleout`	`deploy-sno-scaleout`
`ansible/roles/badfish/**`	`deploy-sno`	`deploy-mno`
`ansible/mno-deploy.yml`	`deploy-mno`	`deploy-cmno`
`ansible/sno-deploy.yml`	`deploy-sno`	--
`ansible/hv-setup.yml` / `hv-vm-create.yml`	`deploy-vmno`	`deploy-hmno`
`ansible/ocp-scale-out.yml`	`deploy-mno-scaleout`	--
`ansible/vars/lab.yml`	`deploy-sno`	`deploy-mno`
`ansible/vars/all.sample.yml`	`deploy-sno`	`deploy-mno`
`docs/**`, `*.md`, `CLAUDE.md`	No test needed	--
`bootstrap.sh`	No test needed	--
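
A path-to-job lookup over this table could be sketched with glob matching. This is not the skill's implementation: the function name `recommend` and the use of `fnmatch` are assumptions, and only a handful of rows are copied from the table for brevity.

```python
from fnmatch import fnmatch

# A few rows copied from the table above: (pattern, minimum test, extra coverage).
# None stands for "No test needed" or "--".
MAPPING = [
    ("ansible/roles/create-inventory/**", "deploy-sno", "deploy-mno"),
    ("ansible/roles/sno-post-cluster-install/**", "deploy-sno", None),
    ("ansible/roles/mno-post-cluster-install/**", "deploy-mno", "deploy-cmno"),
    ("ansible/mno-deploy.yml", "deploy-mno", "deploy-cmno"),
    ("bootstrap.sh", None, None),
]


def recommend(changed_files):
    """Return (minimum, extra) sets of job names for the changed files."""
    minimum, extra = set(), set()
    for path in changed_files:
        for pattern, min_job, extra_job in MAPPING:
            # fnmatch's "*" also matches "/", so "**" acts like a recursive glob here.
            if fnmatch(path, pattern):
                if min_job:
                    minimum.add(min_job)
                if extra_job:
                    extra.add(extra_job)
    # A job already required as a minimum test is not listed again as extra coverage.
    return minimum, extra - minimum
```

Calling `recommend(["ansible/mno-deploy.yml", "ansible/roles/create-inventory/tasks/main.yml"])` yields `deploy-sno` and `deploy-mno` as minimum tests and `deploy-cmno` as extra coverage.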

Feature Modifiers

Scans the actual diff content (not just filenames) for patterns and adds variant jobs:

enable_bond or bond logic -> adds *-private-bond variants
public_vlan logic -> adds *-private variants
hybrid_worker_count -> adds deploy-hmno
cluster_type == "vmno" or vmno logic -> adds deploy-vmno
FIPS-related logic -> adds *-fips variants if they exist
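
Scanning the diff content for feature modifiers could work along these lines. The regular expressions are guesses at what "bond logic", "vmno logic" and "FIPS-related logic" look like in practice, and the function name `feature_variants` is hypothetical.

```python
import re

# (pattern, variant job) pairs; the patterns are assumptions, the job names come from the list above.
MODIFIERS = [
    (re.compile(r"enable_bond|\bbond\b"), "*-private-bond"),
    (re.compile(r"public_vlan"), "*-private"),
    (re.compile(r"hybrid_worker_count"), "deploy-hmno"),
    (re.compile(r"cluster_type\s*==\s*[\"']vmno[\"']|\bvmno\b"), "deploy-vmno"),
    (re.compile(r"fips", re.IGNORECASE), "*-fips"),
]


def feature_variants(diff_text: str) -> list:
    """Return the variant job patterns triggered by changed lines in a unified diff."""
    # Keep only added/removed lines; skip the "+++"/"---" file headers and context lines.
    changed = "\n".join(
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return [job for pattern, job in MODIFIERS if pattern.search(changed)]
```

Filtering out file headers and context lines means a file that merely has "bond" in its path, or an untouched line mentioning FIPS, does not add a variant job.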

Cross-Repo Detection

Checks if new variables are added to all.sample.yml that aren't in the known set of CI env vars:

Known: TYPE, NUM_WORKER_NODES, NUM_HYBRID_WORKER_NODES, PUBLIC_VLAN, BOND, FIPS, JETLAG_PR, JETLAG_LATEST, JETLAG_BRANCH, OCP_BUILD, OCP_VERSION, DISCONNECTED

If a new gating variable is found:

Flags the need for an openshift/release companion PR
Provides draft YAML snippets for the test entry, env var declaration, and script modification
Notes that make ci-operator-config and make jobs must be run before pushing (requires podman or docker)
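
The known-variable check reduces to a set difference. In this sketch the function name `needs_companion_pr` is hypothetical, and the caller is assumed to have already worked out which CI env vars a change requires; how `all.sample.yml` variables map to env var names is not specified here.

```python
KNOWN_CI_ENV_VARS = frozenset({
    "TYPE", "NUM_WORKER_NODES", "NUM_HYBRID_WORKER_NODES", "PUBLIC_VLAN", "BOND",
    "FIPS", "JETLAG_PR", "JETLAG_LATEST", "JETLAG_BRANCH", "OCP_BUILD",
    "OCP_VERSION", "DISCONNECTED",
})


def needs_companion_pr(required_env_vars):
    """Return the env vars CI does not know yet; a non-empty result means a companion PR is needed."""
    return sorted(set(required_env_vars) - KNOWN_CI_ENV_VARS)
```

An empty list means the existing openshift/release configuration already exposes everything the change needs.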

Output

Structured recommendation with sections for minimum tests, extra coverage, feature-specific tests, and cross-repo notes.

`/jetlag-prow-trigger <PR#> [job-name]` -- Test Trigger + Monitor

File: .claude/commands/jetlag-prow-trigger.md

Triggers Prow CI tests on PRs and polls until completion with inline progress reporting.

Input Formats

123 deploy-sno -- trigger a specific job on jetlag PR #123
123 -- auto-detect jobs via test-map logic, ask user to confirm
--release-pr 456 deploy-foo -- trigger on an openshift/release PR

When targeting openshift/release PRs, the PR must have passed ci-operator-config-metadata and generated-config checks first (requires make ci-operator-config and make jobs).

Pre-flight Checks

Gets the HEAD SHA via gh pr view --json headRefOid
Checks for ok-to-test label (required for non-org members)
Checks if a test is already running or completed on that SHA
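
The first two pre-flight checks could be served by a single `gh pr view` call. The sketch below only builds the command and parses its JSON output, so it can be read without a GitHub token. Requesting the `labels` field alongside `headRefOid` is an assumption about how the label check is done, and both function names are hypothetical.

```python
import json


def gh_pr_view_args(pr_number: int, repo: str = "redhat-performance/jetlag") -> list:
    """Build the gh invocation; pass the result to subprocess.run to execute it."""
    return ["gh", "pr", "view", str(pr_number), "--repo", repo,
            "--json", "headRefOid,labels"]


def parse_preflight(gh_json: str) -> dict:
    """Extract the HEAD SHA and the ok-to-test state from `gh pr view --json` output."""
    data = json.loads(gh_json)
    label_names = {label["name"] for label in data.get("labels", [])}
    return {"sha": data["headRefOid"], "ok_to_test": "ok-to-test" in label_names}
```

The third check (a test already running or completed on that SHA) would need the commit statuses for the returned SHA and is left out here.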

Trigger

Always asks for user confirmation first -- bare-metal CI is expensive and takes 1-2 hours per job. Comments /test <job> on the PR. Only triggers one job at a time due to hardware constraints (single bare-metal allocation shared across all jetlag CI).

Monitor

Waits 2 minutes for initial status to appear
Polls every 5 minutes via gh api repos/.../commits/<SHA>/statuses
Reports each poll result inline:
- pending -> "Test still running... (elapsed: Xm)"
- success -> "Test PASSED"
- failure -> "Test FAILED -- analyzing..."
- error -> "Test ERROR (infra issue)"
Times out after 3 hours
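
The monitor loop above could be sketched as follows. `fetch_state` is a hypothetical callable that wraps the `gh api` status query and returns `pending`, `success`, `failure` or `error`; the sleep, clock and report functions are injectable so the loop can be exercised without waiting. Any state other than the three terminal ones is treated as still pending, which is an assumption.

```python
import time

INITIAL_WAIT_S = 2 * 60    # wait 2 minutes for the first status to appear
POLL_INTERVAL_S = 5 * 60   # then poll every 5 minutes
TIMEOUT_S = 3 * 60 * 60    # give up after 3 hours

MESSAGES = {
    "success": "Test PASSED",
    "failure": "Test FAILED -- analyzing...",
    "error": "Test ERROR (infra issue)",
}


def monitor(fetch_state, sleep=time.sleep, clock=time.monotonic, report=print):
    """Poll fetch_state() until a terminal state or the timeout; return the final state."""
    start = clock()
    sleep(INITIAL_WAIT_S)
    while True:
        state = fetch_state()
        elapsed = clock() - start
        if state in MESSAGES:
            report(MESSAGES[state])
            return state
        report(f"Test still running... (elapsed: {int(elapsed // 60)}m)")
        # Stop if the next poll would land past the 3-hour limit.
        if elapsed + POLL_INTERVAL_S > TIMEOUT_S:
            report("Timed out after 3 hours")
            return "timeout"
        sleep(POLL_INTERVAL_S)
```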

On Completion

Success: Reports pass, asks about triggering additional recommended tests from the test-map.

Failure:

Extracts Prow URL from the status target_url

Derives the GCS artifacts path:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/
pr-logs/pull/redhat-performance_jetlag/<PR>/
pull-ci-redhat-performance-jetlag-main-<job>/<run-id>/

Fetches finished.json and build-log.txt of the failed step
Presents a summary with failed step, log excerpt, and pointer to /jetlag-prow-analyze

Infra Error: Classifies failures in ping/poweroff/BMC access steps as hardware flakes, suggests /retest <job>.
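
The GCS artifacts path derived in the failure case can be assembled from the PR number, job name and run ID. A minimal sketch, assuming the three path fragments shown above are joined without any separator; the function name `artifacts_url` is hypothetical.

```python
GCS_BASE = "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results"


def artifacts_url(pr: int, job: str, run_id: str) -> str:
    """Build the GCS artifacts directory URL for one run of a jetlag PR job."""
    return (
        f"{GCS_BASE}/pr-logs/pull/redhat-performance_jetlag/{pr}/"
        f"pull-ci-redhat-performance-jetlag-main-{job}/{run_id}/"
    )
```

Appending `finished.json` to the returned URL gives the location of the overall result file.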

`/jetlag-prow-analyze <prow-url>` -- Failure Analysis

File: .claude/commands/jetlag-prow-analyze.md

Analyzes failed Prow job logs to identify root causes and suggest fixes. Can be used standalone on any Prow job URL.

Step 1: Parse URL

Extracts from the Prow URL:

PR number (from path)
Job name (e.g., deploy-sno, deploy-mno)
Run ID (build number)
GCS base path

Handles both PR jobs (pr-logs/pull/) and periodic/batch jobs (logs/).
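
URL parsing for both job kinds could be done with two regular expressions. This sketch assumes the URL ends with `<job>/<run-id>` and that PR job names carry the `pull-ci-redhat-performance-jetlag-main-` prefix seen in the artifacts path; the function name `parse_prow_url` and the returned keys are hypothetical. It needs Python 3.9+ for `str.removeprefix`.

```python
import re

PR_JOB_RE = re.compile(
    r"/pr-logs/pull/(?P<org_repo>[^/]+)/(?P<pr>\d+)/(?P<job>[^/]+)/(?P<run_id>\d+)/?$"
)
PERIODIC_JOB_RE = re.compile(r"/logs/(?P<job>[^/]+)/(?P<run_id>\d+)/?$")
JOB_PREFIX = "pull-ci-redhat-performance-jetlag-main-"


def parse_prow_url(url: str) -> dict:
    """Split a Prow job URL into PR number, job name and run ID."""
    match = PR_JOB_RE.search(url)
    if match:
        info = match.groupdict()
        info["pr"] = int(info["pr"])
        # Short name such as deploy-sno, used when commenting /test <job>.
        info["short_job"] = info["job"].removeprefix(JOB_PREFIX)
        return info
    match = PERIODIC_JOB_RE.search(url)
    if match:
        # Periodic/batch jobs are not tied to a PR.
        return {"pr": None, **match.groupdict()}
    raise ValueError(f"unrecognized Prow URL: {url}")
```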

Step 2: Fetch Artifacts

Downloads via WebFetch from the GCS artifacts bucket:

finished.json -- overall result and timestamp
build-log.txt -- full build log
Step-level logs at artifacts/<step-name>/build-log.txt (e.g., openshift-qe-installer-bm-deploy)

Step 3: Identify Failure Point

Scans for Ansible failure patterns in priority order:

fatal: [hostname]: FAILED! -- extracts task name (from TASK [role : task name] line), error msg: field, and failed host
PLAY RECAP with failed>0 -- identifies which host(s) failed and at which play
MODULE FAILURE -- module-level crash with traceback
ERROR! -- Ansible-level errors (syntax, undefined variable, unreachable host)
Non-Ansible failures -- error:/Error: lines, Python tracebacks, shell command failures (exit code != 0)
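
The highest-priority pattern, `fatal: [hostname]: FAILED!`, could be extracted as sketched below. It assumes Ansible's default output, where the `TASK [...]` header precedes the result lines and the failure details sit on the same line as `fatal:`; logs with timestamp prefixes would need the anchors relaxed. The function name `first_fatal` is hypothetical, and the other four patterns are not covered.

```python
import re

TASK_RE = re.compile(r"^TASK \[(?:(?P<role>[^\]:]+?) : )?(?P<task>[^\]]+)\]")
FATAL_RE = re.compile(r"^fatal: \[(?P<host>[^\]]+)\]: FAILED!")
MSG_RE = re.compile(r'"msg": "(?P<msg>(?:[^"\\]|\\.)*)"')


def first_fatal(log_text: str):
    """Return role, task, host and msg for the first `fatal: ... FAILED!` line, or None."""
    role = task = None
    for line in log_text.splitlines():
        task_match = TASK_RE.match(line)
        if task_match:
            # Remember the most recent task header; role is None for play-level tasks.
            role, task = task_match["role"], task_match["task"]
            continue
        fatal_match = FATAL_RE.match(line)
        if fatal_match:
            msg_match = MSG_RE.search(line)
            return {
                "role": role,
                "task": task,
                "host": fatal_match["host"],
                "msg": msg_match["msg"] if msg_match else None,
            }
    return None
```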

Step 4: Correlate with PR Changes

For PR jobs:

Gets the PR's changed files via gh pr diff --name-only
Compares the failed task/role against the changed files
Classifies:
- Changed code failure: failed task/role is in PR's modified files -> likely our bug
- Unrelated failure: failed task/role is NOT in modified files -> pre-existing issue or infra
- Infra failure: failure in hardware/network operations -> flake

Step 5: Classify and Recommend

Five classification categories:

Category	Indicators	Recommendation
Code Bug	Failure in changed code	Show relevant diff, suggest specific fix, push fix commit
Pre-existing Bug	Failure in unchanged code	File separate issue, `/retest` to confirm unrelated
Infra Flake	ping/poweroff/ipmitool/racadm/badfish failures; `unreachable`/`timed out`/`connection refused` errors	`/retest <job>` -- don't count as code failure
Timeout	Cluster install or host discovery exceeded limits	Check if PR introduces slower operations, or `/retest`
Configuration Error	Undefined variable, missing file, bad YAML syntax	Fix the syntax/variable issue
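
Three of the five categories can be told apart from the failed role, the task name, the error text and the PR's changed files. The sketch below is an assumption-heavy illustration: checking infra indicators before code correlation is a design guess, the substring matching is crude (for example, "ping" also matches "mapping"), and Timeout and Configuration Error are not handled. The function name `classify` is hypothetical.

```python
# Indicators copied from the Infra Flake row of the table above.
INFRA_STEPS = ("ping", "poweroff", "ipmitool", "racadm", "badfish")
INFRA_ERRORS = ("unreachable", "timed out", "connection refused")


def classify(failed_role, task_name, error_text, changed_files):
    """Return "Infra Flake", "Code Bug" or "Pre-existing Bug"."""
    where = f"{failed_role or ''} {task_name or ''}".lower()
    error = (error_text or "").lower()
    if any(s in where for s in INFRA_STEPS) or any(e in error for e in INFRA_ERRORS):
        return "Infra Flake"
    prefix = f"ansible/roles/{failed_role}/"
    if failed_role and any(path.startswith(prefix) for path in changed_files):
        return "Code Bug"  # failure in code the PR touched
    return "Pre-existing Bug"  # failure in unchanged code
```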

Step 6: Leverage Existing Prow Skills

For deeper OCP-level analysis beyond Ansible failures, can delegate to the prow-job:* skills from openshift-eng/ai-helpers (if installed):

/prow-job:analyze-test-failure -- structured test failure analysis from JUnit and console logs
/prow-job:extract-must-gather -- extract and browse must-gather archives from job artifacts
/prow-job:analyze-resource -- Kubernetes resource lifecycle analysis from audit/pod logs

These are generic OpenShift CI tools, not Ansible-aware. Use them when the failure is in OCP cluster operations (operator failures, resource issues) rather than in Jetlag's Ansible playbooks.

Output

Structured report:

## Prow Job Failure Analysis

**Job**: <job-name>
**PR**: #<PR#>
**Run**: <run-id>
**Result**: FAILED

### Root Cause
<Classification>: <one-line summary>

### Failed Task
- **Role**: <role-name>
- **Task**: <task-name>
- **Host**: <hostname>
- **Error**: <error message>

### Log Excerpt
(relevant 20-30 lines around the failure)

### Correlation with PR
<Whether the failure is in changed code, unrelated code, or infra>

### Recommendation
<Specific action: fix code, retest, file issue, etc.>
<If code fix: show the suggested change>

Four independent Claude Code skills that automate the full loop from reading a GitHub issue to implementing a fix, creating a PR, triggering Prow CI tests, and analyzing failures:

- /jetlag-issue-fix: orchestrator driving the full 5-phase loop (understand, plan, implement, test, evaluate) with human gates
- /jetlag-test-map: maps changed files to appropriate Prow test jobs with cross-repo detection for openshift/release changes
- /jetlag-prow-trigger: triggers tests on PRs, polls for results with 5-min intervals and 3-hour timeout
- /jetlag-prow-analyze: analyzes failed Prow job logs, correlates failures with PR changes, classifies root causes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Document that make ci-operator-config and make jobs must be run before pushing openshift/release companion PRs to regenerate config metadata and prow job files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-03-20T15:19:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rsevilla87 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The jetlag-issue-fix orchestrator now explicitly references /jetlag-test-map, /jetlag-prow-trigger, and /jetlag-prow-analyze via the Skill tool instead of inlining their logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mcornea and others added 3 commits March 20, 2026 14:53

Reference openshift-eng/ai-helpers as source of prow-job skills

8f38f0c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci bot requested review from jtaleric and rsevilla87 March 20, 2026 15:19

mcornea wants to merge 4 commits into redhat-performance:main from mcornea:agentic-issue-fix
