ci: add GB200 functional test infrastructure #3365

Draft
ko3n1g wants to merge 10 commits into `main` from `ko3n1g/feat/gb200-functional-tests`

@ko3n1g ko3n1g commented Apr 17, 2026

Summary

  • Restructures `launch_scripts/` from `<gpu_type>[-flaky]/` to `<gpu_type>/<active|flaky>/` so all hardware targets share the same top-level directory
  • Adds `h100/active/`, `h100/flaky/`, `gb200/active/`, `gb200/flaky/` — scripts moved accordingly
  • Extends `cicd-main.yml` with a parallel GB200 CI track using the same matrix build pattern as MLM:
    • `cicd-compute-build-matrix`: runs on `ubuntu-latest`, outputs a JSON matrix with AWS (AMD64) and GCP (ARM64) entries
    • `cicd-container-build` (matrix): each runner builds for its native architecture and pushes to its native registry — AMD64 → ECR on the AWS runner, ARM64 → GAR on `nemo-ci-gcp-gpu-x2`
    • `generate-gb200-test-matrix` / `cicd-functional-tests-gb200-l{0,1,2}` / `cicd-functional-tests-gb200-flaky`: run on `nemo-ci-gcp-gpu-x2`, pulling the ARM64 image from GAR
  • Updates `test-template/action.yml` default `script_dir` to `h100/active`

Per-hardware brokenness workflow:

  • Breaks on GB200 only → move wrapper from `gb200/active/` to `gb200/flaky/`; H100 keeps running from `h100/active/`
  • Breaks on both → move the H100 script to `h100/flaky/` (existing convention) and remove its wrapper from `gb200/active/`

Directory layout

```
tests/functional_tests/launch_scripts/
├── h100/
│   ├── active/   ← H100 tests (run normally)
│   └── flaky/    ← H100 tests known to be flaky
└── gb200/
    ├── active/   ← GB200 wrapper scripts (delegate to h100/active/ via exec)
    └── flaky/    ← GB200 tests known to be flaky
```
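
To make the "delegate via exec" idea concrete, here is a minimal sketch of what a GB200 wrapper could look like. It is not taken from the PR diff: the wrapper body, the stand-in H100 script, and the `--smoke` argument are all assumptions for illustration. The demo builds a throwaway tree so it can be run anywhere.

```bash
#!/usr/bin/env bash
# Sketch only: builds a temporary tree and shows a GB200 wrapper delegating via exec.
set -euo pipefail

root="$(mktemp -d)"
scripts="$root/tests/functional_tests/launch_scripts"
mkdir -p "$scripts/h100/active" "$scripts/gb200/active"

# Stand-in for a real H100 launch script (contents are hypothetical).
cat > "$scripts/h100/active/L0_Launch_recipes_llama_1b.sh" <<'EOF'
#!/usr/bin/env bash
echo "running $(basename "$0") with args: $*"
EOF

# Assumed wrapper shape: resolve own directory, then exec the same-named H100 script.
cat > "$scripts/gb200/active/L0_Launch_recipes_llama_1b.sh" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
here="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
exec bash "$here/../../h100/active/$(basename "${BASH_SOURCE[0]}")" "$@"
EOF

# Run the wrapper; exec replaces the wrapper process with the H100 script.
wrapper_output="$(bash "$scripts/gb200/active/L0_Launch_recipes_llama_1b.sh" --smoke)"
echo "$wrapper_output"
rm -rf "$root"
```

Because the path is relative to the wrapper's own directory and `active/` and `flaky/` sit at the same depth, a wrapper written this way would keep resolving after being moved to `gb200/flaky/`. It would stop resolving if the H100 script left `h100/active/`, which is consistent with removing the wrapper when a test breaks on both targets.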

Build matrix

| Entry | Runner | Registry | Platform |
|---|---|---|---|
| aws | `{runner_prefix}-gpu-x2` | ECR | linux/amd64 |
| gcp | `nemo-ci-gcp-gpu-x2` | GAR | linux/arm64 |
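
As a rough illustration of how a job like `cicd-compute-build-matrix` might emit these two entries, the sketch below prints a JSON matrix from a shell step. The JSON field names (`entry`, `runner`, `registry`, `platform`), the `RUNNER_PREFIX` variable, and its fallback value are assumptions; the actual workflow may structure its output differently.

```bash
#!/usr/bin/env bash
# Sketch only: emit a two-entry build matrix as a single-line JSON job output.
set -euo pipefail

# Placeholder prefix; the real workflow derives runner_prefix elsewhere.
runner_prefix="${RUNNER_PREFIX:-example-prefix}"

# Field names mirror the table columns and are assumptions, not the PR's schema.
matrix="$(printf '{"include":[{"entry":"aws","runner":"%s-gpu-x2","registry":"ecr","platform":"linux/amd64"},{"entry":"gcp","runner":"nemo-ci-gcp-gpu-x2","registry":"gar","platform":"linux/arm64"}]}' "$runner_prefix")"

# GitHub Actions reads step outputs as key=value lines from the file in $GITHUB_OUTPUT.
# Outside of Actions the variable is unset, so fall back to printing.
if [ -n "${GITHUB_OUTPUT:-}" ]; then
  echo "matrix=$matrix" >> "$GITHUB_OUTPUT"
else
  echo "matrix=$matrix"
fi
```

A downstream matrix job would typically consume such an output with the `fromJSON` expression in its `strategy.matrix`, so each entry selects its own runner and registry.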

Example: marking a test GB200-only broken

```bash
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Launch_recipes_llama_1b.sh \
  tests/functional_tests/launch_scripts/gb200/flaky/
```

H100 continues to run `h100/active/L0_Launch_recipes_llama_1b.sh` unaffected.

Test plan

  • Trigger `workflow_dispatch` (test_suite=`all`) — confirm H100 jobs use ECR image on AWS runner, GB200 jobs use GAR image on `nemo-ci-gcp-gpu-x2`
  • Move one wrapper to `gb200/flaky/` and confirm H100 still runs it while GB200 skips it in the active tier
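
For the second check, a small helper could report which GB200 tier each H100 script currently maps to. This is a sketch under assumptions: the function name `report_gb200_tiers` is hypothetical, `L1_example.sh` is a made-up script name, and the demo runs against a throwaway tree rather than the repository.

```bash
#!/usr/bin/env bash
# Sketch only: list each h100/active script and the tier of its GB200 wrapper.
set -euo pipefail

report_gb200_tiers() {
  local scripts_dir="$1" script name
  for script in "$scripts_dir"/h100/active/*.sh; do
    [ -e "$script" ] || continue   # glob matched nothing
    name="$(basename "$script")"
    if [ -e "$scripts_dir/gb200/active/$name" ]; then
      echo "$name gb200=active"
    elif [ -e "$scripts_dir/gb200/flaky/$name" ]; then
      echo "$name gb200=flaky"
    else
      echo "$name gb200=missing"
    fi
  done
}

# Demo tree: one wrapper already moved to flaky, one still active.
demo="$(mktemp -d)"
mkdir -p "$demo/h100/active" "$demo/gb200/active" "$demo/gb200/flaky"
touch "$demo/h100/active/L0_Launch_recipes_llama_1b.sh" "$demo/h100/active/L1_example.sh"
touch "$demo/gb200/flaky/L0_Launch_recipes_llama_1b.sh" "$demo/gb200/active/L1_example.sh"

report="$(report_gb200_tiers "$demo")"
echo "$report"
rm -rf "$demo"
```

In the repository, the same function could be pointed at `tests/functional_tests/launch_scripts` before and after the `git mv` to confirm that only the GB200 tier changed.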

🤖 Generated with Claude Code

Rename launch_scripts/{active → h100}/ and {flaky → h100-flaky}/ so all
directories are named after their target hardware. Add a parallel GB200
track that runs the same tests on {runner_prefix}-gb200-x2 runners.

- launch_scripts/gb200/: thin wrapper scripts that exec into h100/; one
  per h100/ script for full L0/L1/L2 parity at launch
- launch_scripts/gb200-flaky/: empty placeholder; move a GB200 wrapper
  here when it breaks on GB200 but not H100
- cicd-main.yml: generate-gb200-test-matrix job, three
  cicd-functional-tests-gb200-l{0,1,2} jobs and a gb200-flaky job using
  {runner_prefix}-gb200-x2; all gated on vars.GB200_RUNNER_PREFIX being
  set so environments without GB200 runners skip cleanly
- configure: propagates expect_gb200_l{0,1,2} outputs; Nemo_CICD_Test
  validates them the same way as H100 tiers
- test-template action: default script_dir updated from "active" to "h100"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g commented Apr 17, 2026

/ok to test

The variable gate caused GB200 jobs to be silently skipped when
GB200_RUNNER_PREFIX was not set as a repo variable. Since GB200 should
always run with the same trigger conditions as H100 (using the same
runner_prefix but -gb200-x2 suffix), remove the gate entirely.

Also simplify Nemo_CICD_Test: GB200 skip-checks reuse EXPECT_L{0,1,2}
rather than a separate EXPECT_GB200_L{0,1,2} set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

Updated PR description and workflow to use the literal runner label `nemo-ci-gcp-gpu-x2` for all GB200 jobs (replaces the previous `{runner_prefix}-gb200-x2` template). The `GB200_RUNNER_PREFIX` prerequisite no longer applies.

ko3n1g commented Apr 17, 2026

/ok to test

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test

ko3n1g and others added 2 commits April 17, 2026 10:11
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test 4ee0e61

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test adf09fe

…ility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test a6c0a9a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 17, 2026

/ok to test 417129d
