Add agentic skills for fixing Jetlag GH issues and testing them in Prow jobs #788

Open
mcornea wants to merge 4 commits into redhat-performance:main from mcornea:agentic-issue-fix

@mcornea (Collaborator) commented Mar 20, 2026

This PR adds Claude skills that automate the full loop: reading a GitHub issue, raising PRs with fixes, and testing them in existing Prow jobs on the Jetlag PR. When the Jetlag change is not covered by existing jobs, the skills can also create new Prow jobs.

As an example, Claude created #787 and openshift/release#76604 to address #765.

Below is a more detailed explanation of what each skill does:

Architecture

/jetlag-issue-fix <issue#>          (orchestrator - drives the full loop)
    |
    +-- /jetlag-test-map             (maps changed files to Prow test jobs)
    +-- /jetlag-prow-trigger <PR#>   (triggers tests, monitors, fetches results)
    +-- /jetlag-prow-analyze <URL>   (analyzes failed Prow job logs)

Each skill works standalone:
  /jetlag-test-map                  -> give it a PR# or file list, get test recommendations
  /jetlag-prow-trigger 123          -> trigger + poll a test on PR #123
  /jetlag-prow-analyze <prow-url>   -> analyze any failed Prow job

/jetlag-issue-fix <issue#> -- Orchestrator

File: .claude/commands/jetlag-issue-fix.md

The main driver skill. It takes a GitHub issue number and walks through 5 phases, pausing at mandatory human confirmation gates before planning, implementation, test triggering, and each fix iteration.

Phase 1: UNDERSTAND

Fetches the issue from redhat-performance/jetlag via gh issue view, extracts error messages, affected cluster type (MNO/SNO/VMNO/hybrid), lab environment (scalelab/performancelab/ibmcloud), hardware type (r750/r660/r650/r640/etc.), and referenced files. Explores the code paths in the repo, checks git blame for recent changes, and presents an analysis for user review.
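
The field extraction in Phase 1 could look roughly like the sketch below. The vocabularies come from the description above; the whole-word matching rule and the names `first_match` and `summarize_issue` are assumptions made for illustration, not what the skill actually runs.

```python
import re
from typing import Optional

# Vocabularies taken from the Phase 1 description; matching rules are assumed.
CLUSTER_TYPES = ("mno", "sno", "vmno", "hybrid")
LABS = ("scalelab", "performancelab", "ibmcloud")
HARDWARE = ("r750", "r660", "r650", "r640")


def first_match(text: str, candidates: tuple) -> Optional[str]:
    """Return the first candidate that occurs as a whole word (case-insensitive)."""
    for candidate in candidates:
        if re.search(rf"\b{re.escape(candidate)}\b", text, re.IGNORECASE):
            return candidate
    return None


def summarize_issue(body: str) -> dict:
    """Pull the triage fields out of a free-form issue body."""
    return {
        "cluster_type": first_match(body, CLUSTER_TYPES),
        "lab": first_match(body, LABS),
        "hardware": first_match(body, HARDWARE),
    }
```

A field that is not mentioned in the issue comes back as `None`, which is a natural prompt for the analysis to ask the user before Human Gate 1.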

Human Gate 1: User must confirm before proceeding to planning.

Phase 2: PLAN

Identifies files to change and applies the test-map logic to determine which Prow jobs cover the change. This includes:

  • Mapping changed files to test jobs using the path-to-job table
  • Checking for feature modifiers (bond, public_vlan, hybrid, vmno)
  • Detecting cross-repo needs (new variables requiring openshift/release changes)

Presents a fix plan with code changes, test plan, and branch name.

Human Gate 2: User must confirm before implementation begins.

Phase 3: IMPLEMENT

  • Creates a branch from main (fix/issue-<N>-<short-desc>)
  • Makes the code changes as planned
  • Commits with Fixes #<N> reference and co-author attribution
  • Pushes to origin
  • Creates a PR against redhat-performance/jetlag with a test plan checklist in the body
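
As a small illustration of the naming conventions in Phase 3, the sketch below builds a branch name of the form fix/issue-<N>-<short-desc> and a commit message with the Fixes reference and co-author trailer. How the skill derives <short-desc> is not stated, so the slug rule (first four lowercase words of the title) and both function names are assumptions.

```python
import re


def branch_name(issue_number: int, title: str, max_words: int = 4) -> str:
    """Build fix/issue-<N>-<short-desc>; the slug rule is an assumption."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return f"fix/issue-{issue_number}-" + "-".join(words[:max_words])


def commit_message(summary: str, issue_number: int, co_author: str) -> str:
    """Compose a commit message with the issue reference and co-author trailer."""
    return f"{summary}\n\nFixes #{issue_number}\n\nCo-Authored-By: {co_author}\n"
```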

Phase 4: TEST

Two flows depending on whether existing tests cover the change:

Standard Flow:

  • Triggers tests via /test <job> PR comment
  • Polls gh api commit statuses every 5 minutes
  • 3-hour timeout per job
  • Reports progress inline

Cross-Repo Flow (when a new test job is needed in openshift/release):

  • Creates the Jetlag PR first
  • Clones openshift/release from upstream (not from a potentially stale fork)
  • Makes changes to 3 files:
    • ci-operator/config/redhat-performance/jetlag/redhat-performance-jetlag-main.yaml (test entry)
    • ci-operator/step-registry/openshift-qe/installer/bm/deploy/openshift-qe-installer-bm-deploy-ref.yaml (env var declaration)
    • ci-operator/step-registry/openshift-qe/installer/bm/deploy/openshift-qe-installer-bm-deploy-commands.sh (script logic)
  • Runs make ci-operator-config and make jobs to regenerate required config metadata and prow job files (these checks fail without this step)
  • Creates a companion PR with JETLAG_PR set to test the jetlag branch
  • Triggers tests from the openshift/release PR

Human Gate 3: User must confirm before any test is triggered (bare-metal CI is expensive, ~1-2 hours per job).

Phase 5: EVALUATE

  • Pass: Checks off tests in the PR body, asks about triggering additional recommended tests. When all pass, declares success but never merges -- leaves /lgtm + /approve for humans.
  • Infra Error: Classifies failures in ping/poweroff/BMC steps as hardware flakes. Suggests /retest. Does NOT count toward the 3-iteration limit.
  • Code Failure: Analyzes logs, proposes a fix, asks for user confirmation before pushing a new commit (always a new commit, never force-push or amend). Resumes polling after re-trigger.
  • Escalation: After 3 failed iterations, presents the full iteration history and stops.

Human Gate 4: User must confirm before each fix iteration.
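
The Phase 5 bookkeeping can be sketched as a small state object. This is a minimal illustration, assuming that escalation happens on the third code failure (the description leaves the exact off-by-one open); `EvaluationState` and the action strings are hypothetical names.

```python
from dataclasses import dataclass, field

MAX_FIX_ITERATIONS = 3  # limit stated in the PR description


@dataclass
class EvaluationState:
    fix_iterations: int = 0
    history: list = field(default_factory=list)

    def record(self, outcome: str) -> str:
        """Record one test outcome and return the next action."""
        self.history.append(outcome)
        if outcome == "pass":
            return "report-success"
        if outcome == "infra-error":
            # Hardware flakes are retested and do not consume an iteration.
            return "retest"
        if outcome == "code-failure":
            self.fix_iterations += 1
            if self.fix_iterations >= MAX_FIX_ITERATIONS:
                return "escalate"  # present the history and stop
            return "propose-fix"  # still subject to Human Gate 4
        raise ValueError(f"unknown outcome: {outcome}")
```

Keeping the full outcome history on the object makes it straightforward to present the iteration history at escalation time.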

Safety Rails

  • Max 3 fix iterations before escalating to human
  • Max 3-hour timeout per test job
  • Never trigger >1 test simultaneously (bare-metal hardware constraint)
  • Always add new commits (never force-push, never amend published commits)
  • Always reference issue number in commits and PR body
  • Never merges -- presents results for human /lgtm + /approve
  • Never skips human gates

/jetlag-test-map [PR#] -- Test Selection Engine

File: .claude/commands/jetlag-test-map.md

Maps changed files to the appropriate Prow CI test jobs. Can be used standalone or called inline by the orchestrator.

Input

  • A PR number: fetches diff via gh pr diff --name-only
  • No argument: uses git diff --name-only main from the current branch
  • Comma-separated file list: uses those files directly

Core Mapping Table

| Changed Path Pattern | Minimum Test | Extra Coverage |
|---|---|---|
| ansible/roles/bastion-*/** | deploy-sno | deploy-mno |
| ansible/roles/create-inventory/** | deploy-sno | deploy-mno |
| ansible/roles/validate-vars/** | deploy-sno | deploy-mno |
| ansible/roles/install-cluster/** | deploy-sno | deploy-mno |
| ansible/roles/boot-iso/** | deploy-sno | deploy-mno |
| ansible/roles/wait-hosts-discovered/** | deploy-sno | deploy-mno |
| ansible/roles/create-ai-cluster/** | deploy-sno | deploy-mno |
| ansible/roles/generate-discovery-iso/** | deploy-sno | deploy-mno |
| ansible/roles/sno-post-cluster-install/** | deploy-sno | -- |
| ansible/roles/mno-post-cluster-install/** | deploy-mno | deploy-cmno |
| ansible/roles/hv-*/** | deploy-vmno | deploy-hmno |
| ansible/roles/ocp-scale-out*/** | deploy-mno-scaleout | deploy-sno-scaleout |
| ansible/roles/badfish/** | deploy-sno | deploy-mno |
| ansible/mno-deploy.yml | deploy-mno | deploy-cmno |
| ansible/sno-deploy.yml | deploy-sno | -- |
| ansible/hv-setup.yml / hv-vm-create.yml | deploy-vmno | deploy-hmno |
| ansible/ocp-scale-out.yml | deploy-mno-scaleout | -- |
| ansible/vars/lab.yml | deploy-sno | deploy-mno |
| ansible/vars/all.sample.yml | deploy-sno | deploy-mno |
| docs/**, *.md, CLAUDE.md | No test needed | -- |
| bootstrap.sh | No test needed | -- |
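
A path-to-job lookup over this table could be sketched with glob matching, as below. Only a subset of the rows is included, and the choices to union all matching rows and to use Python's `fnmatch` semantics (where `*` also matches `/`) are assumptions; the skill itself is a prompt file, not this code.

```python
from fnmatch import fnmatchcase

# (pattern, minimum test, extra coverage): a subset of the table rows.
RULES = [
    ("ansible/roles/bastion-*/**", "deploy-sno", "deploy-mno"),
    ("ansible/roles/sno-post-cluster-install/**", "deploy-sno", None),
    ("ansible/roles/mno-post-cluster-install/**", "deploy-mno", "deploy-cmno"),
    ("ansible/roles/hv-*/**", "deploy-vmno", "deploy-hmno"),
    ("ansible/mno-deploy.yml", "deploy-mno", "deploy-cmno"),
    ("docs/**", None, None),  # no test needed
    ("*.md", None, None),     # no test needed
]


def recommend(changed_files: list) -> dict:
    """Union the jobs of every rule that matches any changed file."""
    minimum, extra = set(), set()
    for path in changed_files:
        for pattern, min_job, extra_job in RULES:
            if fnmatchcase(path, pattern):
                if min_job:
                    minimum.add(min_job)
                if extra_job:
                    extra.add(extra_job)
    # A job already required as a minimum test is not repeated as extra coverage.
    return {"minimum": sorted(minimum), "extra": sorted(extra - minimum)}
```

For example, `recommend(["ansible/roles/hv-install/tasks/main.yml", "docs/foo.md"])` yields `deploy-vmno` as the minimum test and `deploy-hmno` as extra coverage; the `hv-install` role name here is made up for the example.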

Feature Modifiers

Scans the actual diff content (not just filenames) for patterns and adds variant jobs:

  • enable_bond or bond logic -> adds *-private-bond variants
  • public_vlan logic -> adds *-private variants
  • hybrid_worker_count -> adds deploy-hmno
  • cluster_type == "vmno" or vmno logic -> adds deploy-vmno
  • FIPS-related logic -> adds *-fips variants if they exist
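
Scanning diff content rather than filenames might look like the following sketch. The regular expressions are assumed approximations of "bond logic", "public_vlan logic", and so on; the FIPS rule is left out because it depends on which variants exist. `feature_variants` is a hypothetical name.

```python
import re

# (pattern over changed diff lines, variant to add); the patterns are assumptions.
MODIFIERS = [
    (re.compile(r"\benable_bond\b"), "*-private-bond"),
    (re.compile(r"\bpublic_vlan\b"), "*-private"),
    (re.compile(r"\bhybrid_worker_count\b"), "deploy-hmno"),
    (re.compile(r"cluster_type\s*==\s*[\"']vmno[\"']"), "deploy-vmno"),
]


def feature_variants(diff_text: str) -> list:
    """Return variants triggered by added or removed lines of a unified diff."""
    changed = [
        line[1:]
        for line in diff_text.splitlines()
        # Keep +/- lines but skip the "+++ b/file" and "--- a/file" headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    body = "\n".join(changed)
    return [variant for pattern, variant in MODIFIERS if pattern.search(body)]
```

Context lines (those starting with a space) are ignored, so a file that merely mentions `public_vlan` near the change does not pull in the extra variants.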

Cross-Repo Detection

Checks if new variables are added to all.sample.yml that aren't in the known set of CI env vars:

  • Known: TYPE, NUM_WORKER_NODES, NUM_HYBRID_WORKER_NODES, PUBLIC_VLAN, BOND, FIPS, JETLAG_PR, JETLAG_LATEST, JETLAG_BRANCH, OCP_BUILD, OCP_VERSION, DISCONNECTED

If a new gating variable is found:

  • Flags the need for an openshift/release companion PR
  • Provides draft YAML snippets for the test entry, env var declaration, and script modification
  • Notes that make ci-operator-config and make jobs must be run before pushing (requires podman or docker)
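
The cross-repo check reduces to a set difference against the known CI env vars. In the sketch below, the known set is copied from the list above; the regex for spotting newly added top-level keys in an all.sample.yml diff is an assumption, and the mapping from a YAML variable to its CI env var name is deliberately left to the caller because the description does not specify it.

```python
import re

KNOWN_CI_ENV_VARS = frozenset({
    "TYPE", "NUM_WORKER_NODES", "NUM_HYBRID_WORKER_NODES", "PUBLIC_VLAN",
    "BOND", "FIPS", "JETLAG_PR", "JETLAG_LATEST", "JETLAG_BRANCH",
    "OCP_BUILD", "OCP_VERSION", "DISCONNECTED",
})


def new_sample_variables(diff_text: str) -> list:
    """Return top-level keys added in a unified diff of all.sample.yml."""
    # "+key:" at column 0; "+++ b/..." headers and indented keys do not match.
    return re.findall(r"^\+([A-Za-z_]\w*):", diff_text, re.MULTILINE)


def needs_release_pr(env_vars_required: list) -> list:
    """Return the env vars that openshift/release does not declare yet."""
    return sorted(set(env_vars_required) - KNOWN_CI_ENV_VARS)
```

A non-empty result from `needs_release_pr` is the signal to flag a companion PR in openshift/release.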

Output

Structured recommendation with sections for minimum tests, extra coverage, feature-specific tests, and cross-repo notes.


/jetlag-prow-trigger <PR#> [job-name] -- Test Trigger + Monitor

File: .claude/commands/jetlag-prow-trigger.md

Triggers Prow CI tests on PRs and polls until completion with inline progress reporting.

Input Formats

  • 123 deploy-sno -- trigger a specific job on jetlag PR #123
  • 123 -- auto-detect jobs via test-map logic, ask user to confirm
  • --release-pr 456 deploy-foo -- trigger on an openshift/release PR

When targeting openshift/release PRs, the PR must have passed ci-operator-config-metadata and generated-config checks first (requires make ci-operator-config and make jobs).

Pre-flight Checks

  1. Gets the HEAD SHA via gh pr view --json headRefOid
  2. Checks for ok-to-test label (required for non-org members)
  3. Checks if a test is already running or completed on that SHA
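
The pre-flight checks can be expressed as a pure function over data that has already been fetched. This sketch assumes `pr` is the parsed output of `gh pr view --json headRefOid,labels`, that `statuses` is the commit status list ordered newest first, and that a job's status context ends in `/<job>`; the function name and messages are hypothetical.

```python
from typing import Any


def preflight(pr: dict, statuses: list, job: str, is_org_member: bool = False) -> list:
    """Return reasons not to trigger `job` yet; an empty list means go ahead."""
    problems = []
    label_names = {label["name"] for label in pr.get("labels", [])}
    if not is_org_member and "ok-to-test" not in label_names:
        problems.append("missing ok-to-test label")
    # Assumed newest-first ordering: the first match is the latest run.
    for status in statuses:
        context: Any = status["context"]
        if context.endswith("/" + job):
            verb = "running" if status["state"] == "pending" else "completed"
            problems.append(f"{job} already {verb} on {pr['headRefOid'][:7]}")
            break
    return problems
```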

Trigger

Always asks for user confirmation first -- bare-metal CI is expensive and takes 1-2 hours per job. Comments /test <job> on the PR. Only triggers one job at a time due to hardware constraints (single bare-metal allocation shared across all jetlag CI).

Monitor

  1. Waits 2 minutes for initial status to appear
  2. Polls every 5 minutes via gh api repos/.../commits/<SHA>/statuses
  3. Reports each poll result inline:
    • pending -> "Test still running... (elapsed: Xm)"
    • success -> "Test PASSED"
    • failure -> "Test FAILED -- analyzing..."
    • error -> "Test ERROR (infra issue)"
  4. Times out after 3 hours
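
The monitor loop above can be sketched as follows. The intervals, timeout, and messages come from the list; `fetch_state` is a hypothetical callable standing in for the `gh api` status lookup, and `sleep` and `report` are injectable so the loop can be exercised without waiting.

```python
import time

INITIAL_WAIT_S = 2 * 60
POLL_INTERVAL_S = 5 * 60
TIMEOUT_S = 3 * 60 * 60

FINAL_MESSAGES = {
    "success": "Test PASSED",
    "failure": "Test FAILED -- analyzing...",
    "error": "Test ERROR (infra issue)",
}


def monitor(fetch_state, sleep=time.sleep, report=print) -> str:
    """Poll until the job reaches a final state or the timeout is hit."""
    sleep(INITIAL_WAIT_S)
    # Elapsed time is tracked by counting sleeps, not by reading a clock.
    elapsed = INITIAL_WAIT_S
    while elapsed <= TIMEOUT_S:
        state = fetch_state()
        if state in FINAL_MESSAGES:
            report(FINAL_MESSAGES[state])
            return state
        report(f"Test still running... (elapsed: {elapsed // 60}m)")
        sleep(POLL_INTERVAL_S)
        elapsed += POLL_INTERVAL_S
    report("Timed out after 3 hours")
    return "timeout"
```

Passing `sleep=lambda seconds: None` and a canned sequence of states is enough to check the reporting logic in a unit test.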

On Completion

Success: Reports pass, asks about triggering additional recommended tests from the test-map.

Failure:

  1. Extracts Prow URL from the status target_url
  2. Derives the GCS artifacts path:
    https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/
    pr-logs/pull/redhat-performance_jetlag/<PR>/
    pull-ci-redhat-performance-jetlag-main-<job>/<run-id>/
    
  3. Fetches finished.json and build-log.txt of the failed step
  4. Presents a summary with failed step, log excerpt, and pointer to /jetlag-prow-analyze
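
Deriving the artifacts location is plain string assembly from the template in step 2. A minimal sketch, with hypothetical function names:

```python
GCS_BASE = (
    "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com"
    "/gcs/test-platform-results"
)


def artifacts_url(pr: int, job: str, run_id: str) -> str:
    """Build the GCS artifacts directory URL for one run of a jetlag PR job."""
    return (
        f"{GCS_BASE}/pr-logs/pull/redhat-performance_jetlag/{pr}/"
        f"pull-ci-redhat-performance-jetlag-main-{job}/{run_id}/"
    )


def finished_json_url(pr: int, job: str, run_id: str) -> str:
    """finished.json sits at the root of the run directory."""
    return artifacts_url(pr, job, run_id) + "finished.json"
```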

Infra Error: Classifies failures in ping/poweroff/BMC access steps as hardware flakes, suggests /retest <job>.


/jetlag-prow-analyze <prow-url> -- Failure Analysis

File: .claude/commands/jetlag-prow-analyze.md

Analyzes failed Prow job logs to identify root causes and suggest fixes. Can be used standalone on any Prow job URL.

Step 1: Parse URL

Extracts from the Prow URL:

  • PR number (from path)
  • Job name (e.g., deploy-sno, deploy-mno)
  • Run ID (build number)
  • GCS base path

Handles both PR jobs (pr-logs/pull/) and periodic/batch jobs (logs/).
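
URL parsing for both layouts could be done with two regular expressions, as in this sketch. It assumes the Prow URL contains the same `pr-logs/pull/<org>_<repo>/<PR>/<job>/<run-id>` or `logs/<job>/<run-id>` path segments as the GCS bucket and that run IDs are numeric; `parse_prow_url` is a hypothetical name.

```python
import re

PR_JOB = re.compile(
    r"/pr-logs/pull/(?P<org_repo>[^/]+)/(?P<pr>\d+)/(?P<job>[^/]+)/(?P<run_id>\d+)"
)
PERIODIC_JOB = re.compile(r"/logs/(?P<job>[^/]+)/(?P<run_id>\d+)")


def parse_prow_url(url: str) -> dict:
    """Extract PR number, job name, run ID and GCS path from a Prow job URL."""
    match = PR_JOB.search(url) or PERIODIC_JOB.search(url)
    if match is None:
        raise ValueError(f"not a recognized Prow job URL: {url}")
    info = match.groupdict()
    # Periodic and batch jobs have no PR or repository component.
    info.setdefault("pr", None)
    info.setdefault("org_repo", None)
    info["gcs_path"] = match.group(0).lstrip("/")
    return info
```

The job name is returned in its full form (for example with the `pull-ci-redhat-performance-jetlag-main-` prefix); stripping it down to `deploy-sno` is left out of the sketch.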

Step 2: Fetch Artifacts

Downloads via WebFetch from the GCS artifacts bucket:

  1. finished.json -- overall result and timestamp
  2. build-log.txt -- full build log
  3. Step-level logs at artifacts/<step-name>/build-log.txt (e.g., openshift-qe-installer-bm-deploy)

Step 3: Identify Failure Point

Scans for Ansible failure patterns in priority order:

  1. fatal: [hostname]: FAILED! -- extracts task name (from TASK [role : task name] line), error msg: field, and failed host
  2. PLAY RECAP with failed>0 -- identifies which host(s) failed and at which play
  3. MODULE FAILURE -- module-level crash with traceback
  4. ERROR! -- Ansible-level errors (syntax, undefined variable, unreachable host)
  5. Non-Ansible failures -- error:/Error: lines, Python tracebacks, shell command failures (exit code != 0)
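
The highest-priority pattern, a `fatal: [hostname]: FAILED!` line preceded by its `TASK [...]` header, can be located with a single pass over the log. The sketch below covers only that first pattern and assumes standard Ansible console output with a JSON-style `"msg"` field on the fatal line; `first_fatal` is a hypothetical name.

```python
import re
from typing import Optional

TASK_RE = re.compile(r"TASK \[(?P<task>.+?)\]")
FATAL_RE = re.compile(r"fatal: \[(?P<host>[^\]]+)\]: FAILED!")
MSG_RE = re.compile(r'"msg": "((?:[^"\\]|\\.)*)"')


def first_fatal(log: str) -> Optional[dict]:
    """Return role, task, host and msg for the first fatal line, or None."""
    current_task = None
    for line in log.splitlines():
        task = TASK_RE.search(line)
        if task:
            current_task = task.group("task")
            continue
        fatal = FATAL_RE.search(line)
        if fatal:
            # "role : task name" when the task lives in a role, else just the name.
            role, sep, name = (current_task or "").partition(" : ")
            msg = MSG_RE.search(line)
            return {
                "role": role if sep else None,
                "task": name if sep else current_task,
                "host": fatal.group("host"),
                "msg": msg.group(1) if msg else None,
            }
    return None
```

Using `search` rather than `match` tolerates timestamp prefixes that CI systems sometimes add to each log line.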

Step 4: Correlate with PR Changes

For PR jobs:

  1. Gets the PR's changed files via gh pr diff --name-only
  2. Compares the failed task/role against the changed files
  3. Classifies:
    • Changed code failure: failed task/role is in PR's modified files -> likely our bug
    • Unrelated failure: failed task/role is NOT in modified files -> pre-existing issue or infra
    • Infra failure: failure in hardware/network operations -> flake

Step 5: Classify and Recommend

Five classification categories:

| Category | Indicators | Recommendation |
|---|---|---|
| Code Bug | Failure in changed code | Show relevant diff, suggest specific fix, push fix commit |
| Pre-existing Bug | Failure in unchanged code | File separate issue, /retest to confirm unrelated |
| Infra Flake | ping/poweroff/ipmitool/racadm/badfish failures; unreachable/timed out/connection refused errors | /retest <job> -- don't count as code failure |
| Timeout | Cluster install or host discovery exceeded limits | Check if PR introduces slower operations, or /retest |
| Configuration Error | Undefined variable, missing file, bad YAML syntax | Fix the syntax/variable issue |
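
Steps 4 and 5 can be sketched together as one classifier. The infra keywords come from the table; the configuration-error phrases are assumed approximations of Ansible's wording, the check order (infra, then configuration, then changed-code correlation) is an assumption, and the Timeout category is omitted because the description does not say how exceeded limits are detected.

```python
import re

INFRA_RE = re.compile(
    r"\b(ping|poweroff|ipmitool|racadm|badfish|unreachable|timed out|connection refused)\b",
    re.IGNORECASE,
)
# Assumed message fragments for undefined variables, missing files, bad YAML.
CONFIG_RE = re.compile(
    r"is undefined|could not find or access|syntax error", re.IGNORECASE
)


def classify(failed_role: str, task_name: str, error_msg: str, changed_files: list) -> str:
    """Map one failed task to a category from the classification table."""
    evidence = f"{task_name}\n{error_msg}"
    if INFRA_RE.search(evidence):
        return "Infra Flake"
    if CONFIG_RE.search(evidence):
        return "Configuration Error"
    role_prefix = f"ansible/roles/{failed_role}/"
    if any(path.startswith(role_prefix) for path in changed_files):
        return "Code Bug"
    return "Pre-existing Bug"
```

Whole-word matching keeps words such as "mapping" from being mistaken for a `ping` failure.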

Step 6: Leverage Existing Prow Skills

For deeper OCP-level analysis beyond Ansible failures, can delegate to the prow-job:* skills from openshift-eng/ai-helpers (if installed):

  • /prow-job:analyze-test-failure -- structured test failure analysis from JUnit and console logs
  • /prow-job:extract-must-gather -- extract and browse must-gather archives from job artifacts
  • /prow-job:analyze-resource -- Kubernetes resource lifecycle analysis from audit/pod logs

These are generic OpenShift CI tools, not Ansible-aware. Use them when the failure is in OCP cluster operations (operator failures, resource issues) rather than in Jetlag's Ansible playbooks.

Output

Structured report:

## Prow Job Failure Analysis

**Job**: <job-name>
**PR**: #<PR#>
**Run**: <run-id>
**Result**: FAILED

### Root Cause
<Classification>: <one-line summary>

### Failed Task
- **Role**: <role-name>
- **Task**: <task-name>
- **Host**: <hostname>
- **Error**: <error message>

### Log Excerpt
(relevant 20-30 lines around the failure)

### Correlation with PR
<Whether the failure is in changed code, unrelated code, or infra>

### Recommendation
<Specific action: fix code, retest, file issue, etc.>
<If code fix: show the suggested change>

mcornea and others added 3 commits March 20, 2026 14:53
Four independent Claude Code skills that automate the full loop from
reading a GitHub issue to implementing a fix, creating a PR, triggering
Prow CI tests, and analyzing failures:

- /jetlag-issue-fix: orchestrator driving the full 5-phase loop
  (understand, plan, implement, test, evaluate) with human gates
- /jetlag-test-map: maps changed files to appropriate Prow test jobs
  with cross-repo detection for openshift/release changes
- /jetlag-prow-trigger: triggers tests on PRs, polls for results
  with 5-min intervals and 3-hour timeout
- /jetlag-prow-analyze: analyzes failed Prow job logs, correlates
  failures with PR changes, classifies root causes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document that make ci-operator-config and make jobs must be run before
pushing openshift/release companion PRs to regenerate config metadata
and prow job files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci openshift-ci bot requested review from jtaleric and rsevilla87 March 20, 2026 15:19
openshift-ci bot commented Mar 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rsevilla87 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The jetlag-issue-fix orchestrator now explicitly references
/jetlag-test-map, /jetlag-prow-trigger, and /jetlag-prow-analyze
via the Skill tool instead of inlining their logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>