ci: add GB200 functional test infrastructure by ko3n1g · Pull Request #3365 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-04-17T00:12:10Z

Summary

- Restructures `launch_scripts/` from `<gpu_type>[-flaky]/` to `<gpu_type>/<active|flaky>/` so all hardware targets share the same top-level directory
- Adds `h100/active/`, `h100/flaky/`, `gb200/active/`, `gb200/flaky/` and moves the scripts accordingly
- Extends `cicd-main.yml` with a parallel GB200 CI track using the same matrix build pattern as MLM:
  - `cicd-compute-build-matrix`: runs on `ubuntu-latest`, outputs a JSON matrix with AWS (AMD64) and GCP (ARM64) entries
  - `cicd-container-build` (matrix): each runner builds for its native architecture and pushes to its native registry: AMD64 → ECR on the AWS runner, ARM64 → GAR on `nemo-ci-gcp-gpu-x2`
  - `generate-gb200-test-matrix` / `cicd-functional-tests-gb200-l{0,1,2}` / `cicd-functional-tests-gb200-flaky`: run on `nemo-ci-gcp-gpu-x2`, pulling the ARM64 image from GAR
- Updates `test-template/action.yml` default `script_dir` to `h100/active`

Per-hardware brokenness workflow:

- Breaks on GB200 only → move the wrapper from `gb200/active/` to `gb200/flaky/`; H100 keeps running from `h100/active/`
- Breaks on both → move the script to `h100/flaky/` (existing convention) and remove the wrapper from `gb200/active/`

Directory layout

```
tests/functional_tests/launch_scripts/
├── h100/
│   ├── active/   ← H100 tests (run normally)
│   └── flaky/    ← H100 tests known to be flaky
└── gb200/
    ├── active/   ← GB200 wrapper scripts (delegate to h100/active/ via exec)
    └── flaky/    ← GB200 tests known to be flaky
```
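The wrapper delegation noted in the layout could look roughly like the sketch below. It is an illustration under assumptions, not the scripts from this PR: the helper name `write_gb200_wrapper` and the path resolution via `BASH_SOURCE` are hypothetical; only the directory names and the use of `exec` come from the description.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): writes a thin GB200 wrapper that
# execs the H100 script of the same name. The real wrappers may differ.
set -eu

write_gb200_wrapper() {
  local root="$1"   # path to .../launch_scripts
  local name="$2"   # script file name, e.g. L0_Launch_recipes_llama_1b.sh
  mkdir -p "$root/gb200/active"
  cat > "$root/gb200/active/$name" <<'EOF'
#!/usr/bin/env bash
# Thin wrapper: replace this process with the H100 script of the same name.
set -euo pipefail
here="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
exec bash "$here/../../h100/active/$(basename "${BASH_SOURCE[0]}")" "$@"
EOF
  chmod +x "$root/gb200/active/$name"
}
```

Because the sketch resolves the target relative to the wrapper's own location, a wrapper moved from `gb200/active/` to `gb200/flaky/` would still find `h100/active/` at the same relative depth.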

Build matrix

| Entry | Runner | Registry | Platform |
| --- | --- | --- | --- |
| aws | `{runner_prefix}-gpu-x2` | ECR | linux/amd64 |
| gcp | `nemo-ci-gcp-gpu-x2` | GAR | linux/arm64 |
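As a rough sketch of how a job like `cicd-compute-build-matrix` might emit this table as JSON: the function name `build_matrix_json`, the JSON field names (`entry`, `runner`, `registry`, `platform`), and the `RUNNER_PREFIX` variable are assumptions for illustration; the actual workflow may structure its output differently. The `include` key and `$GITHUB_OUTPUT` are standard GitHub Actions conventions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: emit the two-entry build matrix as a JSON string.
set -eu

build_matrix_json() {
  local runner_prefix="$1"   # prefix for the AWS runner label
  printf '{"include":[{"entry":"aws","runner":"%s-gpu-x2","registry":"ECR","platform":"linux/amd64"},{"entry":"gcp","runner":"nemo-ci-gcp-gpu-x2","registry":"GAR","platform":"linux/arm64"}]}\n' "$runner_prefix"
}

# Inside a GitHub Actions step, the result would be exposed as a job output.
if [ -n "${GITHUB_OUTPUT:-}" ]; then
  echo "matrix=$(build_matrix_json "${RUNNER_PREFIX:-example}")" >> "$GITHUB_OUTPUT"
fi
```

A downstream matrix job could then consume the output with `fromJSON(...)` in its `strategy.matrix`.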

Example: marking a test GB200-only broken

```bash
# Move the GB200 wrapper out of the active tier; the H100 script is untouched.
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Launch_recipes_llama_1b.sh \
  tests/functional_tests/launch_scripts/gb200/flaky/
```

H100 continues to run `h100/active/L0_Launch_recipes_llama_1b.sh` unaffected.

Test plan

- Trigger `workflow_dispatch` (test_suite=`all`) and confirm that H100 jobs use the ECR image on the AWS runner and GB200 jobs use the GAR image on `nemo-ci-gcp-gpu-x2`
- Move one wrapper to `gb200/flaky/` and confirm H100 still runs it while GB200 skips it in the active tier
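Before triggering CI, the second check can be approximated locally by comparing the tiers on disk. This is a sketch only: the helper names `list_tier` and `h100_only_active` are hypothetical and not part of the PR, and the assumption that test scripts end in `.sh` is inferred from the example above.

```bash
#!/usr/bin/env bash
# Hypothetical local check: which H100-active scripts lack a GB200-active wrapper?
set -eu

# Print the script names in one hardware tier, sorted.
list_tier() {
  local root="$1" hw="$2" tier="$3"
  local dir="$root/$hw/$tier"
  [ -d "$dir" ] || return 0
  find "$dir" -maxdepth 1 -type f -name '*.sh' -exec basename {} \; | sort
}

# Print every script in h100/active that has no counterpart in gb200/active.
h100_only_active() {
  local root="$1"
  list_tier "$root" h100 active | while IFS= read -r name; do
    [ -e "$root/gb200/active/$name" ] || printf '%s\n' "$name"
  done
}
```

After moving a wrapper to `gb200/flaky/`, calling `h100_only_active tests/functional_tests/launch_scripts` should list that script, while `list_tier ... gb200 flaky` should show where it went.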

🤖 Generated with Claude Code

Rename launch_scripts/{active → h100}/ and {flaky → h100-flaky}/ so all directories are named after their target hardware. Add a parallel GB200 track that runs the same tests on {runner_prefix}-gb200-x2 runners.

- launch_scripts/gb200/: thin wrapper scripts that exec into h100/; one per h100/ script for full L0/L1/L2 parity at launch
- launch_scripts/gb200-flaky/: empty placeholder; move a GB200 wrapper here when it breaks on GB200 but not H100
- cicd-main.yml: generate-gb200-test-matrix job, three cicd-functional-tests-gb200-l{0,1,2} jobs and a gb200-flaky job using {runner_prefix}-gb200-x2; all gated on vars.GB200_RUNNER_PREFIX being set so environments without GB200 runners skip cleanly
- configure: propagates expect_gb200_l{0,1,2} outputs; Nemo_CICD_Test validates them the same way as H100 tiers
- test-template action: default script_dir updated from "active" to "h100"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot · 2026-04-17T00:12:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g · 2026-04-17T00:12:47Z

/ok to test

The variable gate caused GB200 jobs to be silently skipped when GB200_RUNNER_PREFIX was not set as a repo variable. Since GB200 should always run with the same trigger conditions as H100 (using the same runner_prefix but -gb200-x2 suffix), remove the gate entirely.

Also simplify Nemo_CICD_Test: GB200 skip-checks reuse EXPECT_L{0,1,2} rather than a separate EXPECT_GB200_L{0,1,2} set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g · 2026-04-17T00:29:45Z

/ok to test

ko3n1g · 2026-04-17T09:22:17Z

Updated PR description and workflow to use the literal runner label nemo-ci-gcp-gpu-x2 for all GB200 jobs (replaces the previous {runner_prefix}-gb200-x2 template). The GB200_RUNNER_PREFIX prerequisite no longer applies.

ko3n1g · 2026-04-17T09:23:31Z

/ok to test

ko3n1g · 2026-04-17T09:32:30Z

/ok to test

ko3n1g · 2026-04-17T10:19:50Z

/ok to test

ko3n1g · 2026-04-17T11:11:51Z

/ok to test 4ee0e61

ko3n1g · 2026-04-17T11:31:56Z

/ok to test adf09fe

ko3n1g · 2026-04-17T12:16:11Z

/ok to test a6c0a9a

ko3n1g · 2026-04-17T12:17:33Z

/ok to test 417129d

ko3n1g added the full-test-suite label Apr 17, 2026

copy-pr-bot bot temporarily deployed to test April 17, 2026 00:13 Inactive

copy-pr-bot bot temporarily deployed to test April 17, 2026 00:30 Inactive

ci: use nemo-ci-gcp-gpu-x2 label for GB200 runners

34fee18

Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to test April 17, 2026 09:24 Inactive

refactor: restructure launch_scripts to <gpu_type>/<active|flaky> layout

00fc3e2

Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to test April 17, 2026 09:34 Inactive

ko3n1g and others added 2 commits April 17, 2026 10:11

ci: add ARM64 container build on GCP runner and wire GB200 tests to GAR

18bcaff

Signed-off-by: oliver könig <okoenig@nvidia.com>

ci: replace standalone arm64 build job with matrix build pattern

0337d9b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to test April 17, 2026 10:20 Inactive

fix(ci): update GB200 wrapper paths after launch_scripts/ restructure

4ee0e61

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to test April 17, 2026 11:13 Inactive

fix(ci): restore execute permissions on h100 launch scripts

adf09fe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to test April 17, 2026 11:33 Inactive

build: set MAMBA_FORCE_BUILD=TRUE in Dockerfile.ci for ARM64 compatibility

a6c0a9a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

build: also set CAUSAL_CONV1D_FORCE_BUILD=TRUE for ARM64 compatibility

417129d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot deployed to test April 17, 2026 12:18 Active

ko3n1g wants to merge 10 commits into `main` from `ko3n1g/feat/gb200-functional-tests`
