fix: multi-GPU support for Newton physics with camera rendering #5236

ClawLabby wants to merge 1 commit into isaac-sim:develop from
Conversation
Force-pushed from 391bfc2 to 5202601
Greptile Summary

This PR fixes three bugs in multi-GPU distributed training with Newton physics and camera rendering: device assignment under `CUDA_VISIBLE_DEVICES` restriction, the missing Newton entry in the camera renderer preset, and Kit-camera detection for `PresetCfg` renderers.
Confidence Score: 5/5

Safe to merge after addressing the missing CHANGELOG entries and the minor test/preset quality issues. All findings are P2: shared mutable reference on unannotated preset aliases, a vacuous unit test, a redundant skipif decorator, and an incomplete copyright header. No production code has a definite runtime bug; the core device-assignment and camera-detection fixes are logically correct.

Important Files Changed: `source/isaaclab_tasks/isaaclab_tasks/utils/presets.py` (unannotated alias fields) and `source/isaaclab/test/test_multigpu_newton.py` (test quality issues).
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[torchrun spawns N processes] --> B[train.py: parse args]
    B --> C[launch_simulation context manager]
    C --> D{needs_kit?}
    D -->|Newton + RTX cameras| E[AppLauncher.__init__]
    E --> F[_resolve_device_settings]
    F --> G{num_visible_gpus >= world_size?}
    G -->|Yes| H[device_id = local_rank]
    G -->|No - CUDA_VISIBLE_DEVICES restricted| I[device_id = 0]
    H & I --> J[torch.cuda.set_device early]
    D -->|Newton-only no Kit cameras| K[No AppLauncher]
    K --> L[yield into with block]
    J --> L
    L --> M{args_cli.distributed?}
    M -->|Yes| N[Re-compute device same logic]
    N --> O[torch.cuda.set_device again]
    O --> P[gym.make - Newton/Warp init on correct GPU]
    M -->|No| P
```
Reviews (1): Last reviewed commit: "test: add multi-GPU Newton physics tests"
```python
isaacsim_rtx_renderer = default
physx = default
```
Unannotated aliases share a single mutable object
isaacsim_rtx_renderer = default and physx = default are not annotated, so @configclass does not copy them per instance — every MultiBackendRendererCfg() object shares the same class-level IsaacRtxRendererCfg() instance. If any call site or preset resolver ever mutates a field on one of these aliases, the mutation is visible through all three names (default, isaacsim_rtx_renderer, physx). Add type annotations so configclass gives each instance its own copy:
```diff
- isaacsim_rtx_renderer = default
- physx = default
+ isaacsim_rtx_renderer: IsaacRtxRendererCfg = IsaacRtxRendererCfg()
+ physx: IsaacRtxRendererCfg = IsaacRtxRendererCfg()
```
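The sharing described here is ordinary Python class-attribute semantics and can be demonstrated without `@configclass` at all; `SubCfg` and `Holder` below are hypothetical stand-ins for the renderer config classes, not the real Isaac Lab types:

```python
class SubCfg:
    """Stand-in for IsaacRtxRendererCfg: one mutable field."""
    def __init__(self):
        self.resolution = (1280, 720)


class Holder:
    """Stand-in for a config class with an unannotated alias.

    `default` and `alias` are class-level names bound to the SAME
    SubCfg instance; no per-instance copy is ever made.
    """
    default = SubCfg()
    alias = default  # unannotated alias: shares the object above


a = Holder()
b = Holder()
# Mutating through one name is visible through every other name,
# on every "instance" of Holder.
a.alias.resolution = (64, 64)
print(b.default.resolution)  # (64, 64): the mutation leaked
```

Annotated fields, by contrast, let `@configclass` (like `dataclasses`) materialize a fresh value per instance, which is why the suggested fix adds the type annotations.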
```python
def test_device_assignment_logic(self):
    """Test that device assignment logic works correctly for different scenarios."""
    # Scenario 1: 4 visible GPUs, world_size=4 → each rank gets its own GPU
    num_visible = 4
    world_size = 4
    for local_rank in range(4):
        if num_visible >= world_size:
            expected_device = f"cuda:{local_rank}"
        else:
            expected_device = "cuda:0"
        assert expected_device == f"cuda:{local_rank}"

    # Scenario 2: 1 visible GPU (CUDA_VISIBLE_DEVICES restricted), world_size=4
    # → all ranks use cuda:0 (they each see only their assigned GPU)
    num_visible = 1
    world_size = 4
    for local_rank in range(4):
        if num_visible >= world_size:
            expected_device = f"cuda:{local_rank}"
        else:
            expected_device = "cuda:0"
        assert expected_device == "cuda:0"
```
Test only verifies its own inline logic, not the implementation
test_device_assignment_logic hard-codes num_visible and world_size locally and then asserts a condition computed with those same local values — it never calls AppLauncher._resolve_device_settings or the equivalent logic in train.py. A bug introduced in either of those functions (e.g. flipping >= to >) would leave this test green. The test should import and invoke the actual helper, or at minimum mock the environment variables and assert the resulting device_id.
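A minimal sketch of a non-vacuous version of this test, assuming the device logic has been extracted into a standalone helper; `resolve_cuda_device` here is hypothetical (not the PR's actual function) and takes an injectable GPU counter so no CUDA runtime is needed:

```python
# Hypothetical extracted helper mirroring the logic under review.
# `device_count` is injectable so tests can run without a GPU; in
# production the default would be torch.cuda.device_count.
def resolve_cuda_device(local_rank, world_size, device_count):
    if device_count() >= world_size:
        return f"cuda:{local_rank}"
    return "cuda:0"


def test_full_visibility():
    # 4 visible GPUs, 4 ranks: each rank keeps its own GPU
    for rank in range(4):
        assert resolve_cuda_device(rank, 4, lambda: 4) == f"cuda:{rank}"


def test_restricted_visibility():
    # CUDA_VISIBLE_DEVICES gives each process 1 GPU: everyone uses cuda:0
    for rank in range(4):
        assert resolve_cuda_device(rank, 4, lambda: 1) == "cuda:0"


test_full_visibility()
test_restricted_visibility()
```

Unlike the original, a bug in the helper (e.g. `>=` flipped to `>`) would now fail the first test.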
```python
@pytest.mark.skipif(_get_num_gpus() < 2, reason="Requires at least 2 GPUs")
def test_cartpole_newton_multigpu(self):
```
Redundant skipif on method already guarded by class decorator

TestMultiGPUDeviceAssignment is already decorated with `@pytest.mark.skipif(_get_num_gpus() < 2, ...)`, so test_cartpole_newton_multigpu inherits that skip condition. The identical `@pytest.mark.skipif` on the method itself is noise and can be removed.
```diff
- @pytest.mark.skipif(_get_num_gpus() < 2, reason="Requires at least 2 GPUs")
  def test_cartpole_newton_multigpu(self):
```
```python
# Copyright (c) 2022-2026, The Isaac Lab Project Developers.
# SPDX-License-Identifier: BSD-3-Clause
"""Test multi-GPU device assignment for Newton physics."""
```
New files must use the full project copyright header (AGENTS.md §"File headers and copyright"). The current header is missing the contributors URL and the "All rights reserved." line:
```diff
- # Copyright (c) 2022-2026, The Isaac Lab Project Developers.
+ # Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+ # All rights reserved.
+ #
  # SPDX-License-Identifier: BSD-3-Clause
  """Test multi-GPU device assignment for Newton physics."""
```
Review: multi-GPU Newton physics with camera rendering
Good catch on a real production pain point — multi-GPU training with CUDA_VISIBLE_DEVICES restriction is a common deployment pattern (SLURM, Kubernetes) and this was genuinely broken. The three-bug diagnosis is solid and the fixes are coherent.
What works well
- The `num_visible >= world_size` heuristic is the right approach for detecting per-process GPU restriction
- Early `torch.cuda.set_device()` is essential for Newton/Warp — physics backends that allocate on the "current" device during init would otherwise all pile onto cuda:0
- The `PresetCfg` escape hatch in `_is_kit_camera()` is logically correct — preset renderers resolve to match the physics backend, so assuming non-Kit is the right default
Issues to address
See inline comments. Summary:
- Duplicated device logic between `train.py` and `app_launcher.py` — should be factored out
- The `physx = default` alias shares a mutable instance (Greptile flagged this too)
- `torch.cuda.set_device()` called twice in the distributed path — once in `app_launcher.py`, once again in `train.py`. The second call is redundant if AppLauncher already ran, or both are needed if Newton bypasses AppLauncher. The intent should be documented.
- Test `test_device_assignment_logic` tests its own local variables, not the actual implementation
- Missing CHANGELOG entries (as Greptile noted)
Overall: the production code changes are correct and well-motivated. The test and preset alias issues should be cleaned up before merge.
```python
    env_cfg.sim.device = f"cuda:{local_rank}"
else:
    env_cfg.sim.device = "cuda:0"
agent_cfg.device = env_cfg.sim.device
```
The device-selection logic here (num_visible >= world_size → cuda:local_rank, else cuda:0) is duplicated verbatim from app_launcher.py. This creates a maintenance risk — if someone updates one path and forgets the other, they'll silently diverge.
Consider extracting this into a shared utility, e.g.:

```python
# isaaclab/utils/distributed.py
import os

import torch


def resolve_cuda_device(local_rank: int) -> str:
    num_visible = torch.cuda.device_count()
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    if num_visible >= world_size:
        return f"cuda:{local_rank}"
    return "cuda:0"
```

Both `train.py` and `app_launcher.py` can then call the same function. This also makes the logic unit-testable without mocking the entire AppLauncher.
```python
# (e.g. Newton/Warp) that use the current device get the correct GPU.
if "cuda" in env_cfg.sim.device:
    torch.cuda.set_device(env_cfg.sim.device)
```
This torch.cuda.set_device() call may be redundant if AppLauncher._resolve_device_settings() already ran earlier (which also calls set_device). In the Newton-without-Kit path where AppLauncher is skipped, this is the only place it happens — which is critical.
Worth adding a comment clarifying the intent:

```python
# Set device here because in the kitless Newton path, AppLauncher
# may not have run (launch_simulation skips it), so this is the
# only place the current CUDA device gets set.
```

```python
# When CUDA_VISIBLE_DEVICES restricts each process to a single GPU,
# local_rank may exceed the visible device count. Fall back to cuda:0
# so the process uses the one GPU it can see.
import torch
```
Importing torch inside this method is fine for the distributed branch (it's only hit when distributed=True), but the second import torch at line 941 is in the unconditional path — meaning every single-GPU AppLauncher init now imports torch at this point too. That's probably harmless in practice (torch is already imported by then), but the double import is a bit messy. Consider moving both to a single conditional import at the top of the method, or using the module-level import that likely already exists.
```python
default: IsaacRtxRendererCfg = IsaacRtxRendererCfg()
newton: NewtonWarpRendererCfg = NewtonWarpRendererCfg()
newton_renderer: NewtonWarpRendererCfg = NewtonWarpRendererCfg()
ovrtx_renderer: OVRTXRendererCfg = OVRTXRendererCfg()
```
+1 to Greptile's finding here. physx = default creates a class-level alias that points to the same IsaacRtxRendererCfg() instance as default. If @configclass doesn't deep-copy unannotated attributes, mutating one mutates all.
Also, is physx the right name for an alias to IsaacRtxRendererCfg? The PhysX backend doesn't inherently require RTX rendering — the name suggests "use whatever renderer is appropriate for PhysX" but it hard-wires RTX. If this is intentional (PhysX always uses RTX in Isaac Lab), a comment explaining the mapping would help.
```python
# PresetCfg renderers (e.g. MultiBackendRendererCfg) are resolved later;
# assume they will match the physics backend, so not necessarily Kit.
from isaaclab_tasks.utils import PresetCfg

if isinstance(renderer_cfg, PresetCfg):
```
This import is inside a function that could be called frequently during config scanning (_scan_config visits every node). The import is cached by Python after the first call, so there's no real performance issue, but it's worth considering whether this should be a module-level import instead.
Also: the fallback return True on line 132 (unchanged) means any unknown renderer type is treated as Kit. That's the conservative choice (forces Kit launch), but it means new renderer types added in the future will silently require Kit unless they're explicitly handled here. Worth a logger.debug() or comment noting this design choice.
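One way to make that conservative fallback self-documenting is sketched below, with `RendererCfg` and `PresetCfg` reduced to hypothetical placeholder classes (not the real Isaac Lab types):

```python
import logging

logger = logging.getLogger(__name__)


class RendererCfg:
    """Placeholder for an explicitly typed renderer config."""
    def __init__(self, renderer_type):
        self.renderer_type = renderer_type


class PresetCfg:
    """Placeholder for the deferred-resolution preset type."""


def is_kit_camera(renderer_cfg) -> bool:
    if isinstance(renderer_cfg, RendererCfg):
        return renderer_cfg.renderer_type in ("default", "isaac_rtx")
    if isinstance(renderer_cfg, PresetCfg):
        # Presets resolve later to match the physics backend, so non-Kit.
        return False
    # Conservative default: unknown renderer types force a Kit launch.
    # New renderer types must be handled explicitly above, or they will
    # silently require Kit; log so this is visible during debugging.
    logger.debug("Unknown renderer type %s; assuming Kit camera",
                 type(renderer_cfg).__name__)
    return True
```

The `logger.debug` call makes the "unknown means Kit" design choice observable without changing behavior.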
```python
assert expected_device == f"cuda:{local_rank}"

# Scenario 2: 1 visible GPU (CUDA_VISIBLE_DEVICES restricted), world_size=4
# → all ranks use cuda:0 (they each see only their assigned GPU)
```
This test doesn't actually test the implementation — it re-implements the same conditional logic inline and asserts against its own local variables. A bug in app_launcher.py (e.g., flipping >= to >) wouldn't be caught.
To make this a real unit test, either:
- Extract the device logic into a shared function (as suggested on `train.py`) and test that directly
- Mock `torch.cuda.device_count()` and `os.environ` and call `AppLauncher._resolve_device_settings()` (harder due to side effects)
- At minimum, import and call the same code path used in production
As-is, this test is giving false confidence.
```python
test_dir = os.path.dirname(os.path.abspath(__file__))  # .../source/isaaclab/test
isaaclab_root = os.path.dirname(os.path.dirname(os.path.dirname(test_dir)))  # .../IsaacLab

result = subprocess.run(
```
The `subprocess.run` integration test has a 300s timeout, which is generous. A few concerns:

1. Hard-coded script path `scripts/reinforcement_learning/rsl_rl/train.py` — this assumes the test is run from the IsaacLab root. The `cwd` computation at lines 83-84 tries to handle this, but it's fragile (counts parent directories).
2. `NCCL_P2P_DISABLE=1` and `NCCL_IB_DISABLE=1` — these are reasonable for CI but will mask P2P/IB-related failures. Worth a comment explaining why.
3. The assertion on line 94 checks for `"Learning iteration 1"` in stdout OR `returncode == 0`. The `or` is too permissive — a process that exits 0 without actually training (e.g., an early skip) would pass. Consider requiring both conditions, or at least checking that stderr doesn't contain CUDA errors.
Fixes three interrelated bugs that prevented multi-GPU distributed training with Newton physics and camera observations:

1. Device assignment: When CUDA_VISIBLE_DEVICES restricts each process to a single GPU, local_rank may exceed the visible device count. Added resolve_cuda_device() utility that falls back to cuda:0 in this case.
2. Camera renderer preset: MultiBackendRendererCfg was missing a 'newton' field, so the preset system could not auto-select NewtonWarpRendererCfg. Also fixed unannotated alias fields (physx, isaacsim_rtx_renderer) that shared a single mutable instance.
3. Fabric mode detection: _is_kit_camera() did not recognize PresetCfg renderer configs, incorrectly flagging Newton cameras as Kit cameras.

Additionally, torch.cuda.set_device() is now called early in AppLauncher and train.py so physics backends that allocate on the 'current' CUDA device get the correct GPU.

Includes unit tests for resolve_cuda_device() and the preset/camera fixes.
Force-pushed from 5202601 to f3005fd
```python
@configclass
class MultiBackendRendererCfg(PresetCfg):
    default: IsaacRtxRendererCfg = IsaacRtxRendererCfg()
    newton: NewtonWarpRendererCfg = NewtonWarpRendererCfg()
```
The additions in this file are all unnecessary.
```python
if isinstance(renderer_cfg, RendererCfg):
    return renderer_cfg.renderer_type in ("default", "isaac_rtx")
# PresetCfg renderers (e.g. MultiBackendRendererCfg) are resolved later;
# assume they will match the physics backend, so not necessarily Kit.
```
This fix doesn't seem right; the renderer should be resolved correctly rather than patched here.
🤖 Isaac Lab Review Bot
Summary
This PR fixes three interrelated bugs preventing multi-GPU distributed training with Newton physics and camera rendering. The core approach — centralizing device resolution, adding early torch.cuda.set_device(), adding PresetCfg recognition, and adding the missing newton preset field — is architecturally sound. However, the new resolve_cuda_device() utility has a correctness bug that would break multi-node training, and the fix was only applied to one of four training scripts that share the same pattern.
Design Assessment
Approach is sound with one flaw. Creating a centralized resolve_cuda_device() utility is the right abstraction — the device resolution logic was duplicated between app_launcher.py and train.py, and this deduplicates it. The PresetCfg escape hatch in _is_kit_camera() correctly handles deferred renderer resolution. The isaacsim_rtx_renderer field annotation fix (from bare assignment to typed field) is a good catch that prevents shared-mutable-instance bugs with @configclass.
Findings
🔴 Critical: resolve_cuda_device() breaks multi-node training — source/isaaclab/isaaclab/utils/distributed.py:31
The function compares num_visible against WORLD_SIZE (global process count), but WORLD_SIZE spans all nodes. In a multi-node setup (e.g. 2 nodes × 4 GPUs), WORLD_SIZE=8 while each node sees 4 GPUs. The check num_visible(4) >= world_size(8) evaluates to False, causing every rank on a node to fall back to cuda:0 instead of using its local_rank.
The correct comparison is against local_rank directly:
```python
num_visible = torch.cuda.device_count()
if local_rank < num_visible:
    device_id = local_rank
else:
    device_id = 0
return f"cuda:{device_id}", device_id
```
This correctly handles all three scenarios:

- Full visibility (4 GPUs, local_rank 0–3): each rank gets its own GPU ✓
- Restricted (1 GPU per process via `CUDA_VISIBLE_DEVICES`): local_rank ≥ 1 falls back to `cuda:0` ✓
- Multi-node (4 visible, 8 global): local_rank 0–3 < 4, each gets its own GPU ✓
The WORLD_SIZE variable and its os.getenv call can be removed entirely.
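The corrected rule can be checked mechanically; the sketch below re-implements only the `local_rank < num_visible` comparison for illustration, without any CUDA dependency:

```python
def pick_device(local_rank: int, num_visible: int) -> str:
    # Corrected rule: compare local_rank against the node-local GPU
    # count, not the global WORLD_SIZE.
    return f"cuda:{local_rank}" if local_rank < num_visible else "cuda:0"


# Full visibility: 4 GPUs, ranks 0-3 each get their own device
assert [pick_device(r, 4) for r in range(4)] == ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]

# Restricted: each process sees 1 GPU, every rank falls back to cuda:0
assert [pick_device(r, 1) for r in range(4)] == ["cuda:0"] * 4

# Multi-node: 2 nodes x 4 GPUs (WORLD_SIZE=8). Each node still sees 4 GPUs,
# so node-local rank 2 correctly gets cuda:2, not cuda:0.
assert pick_device(2, 4) == "cuda:2"
```

The WORLD_SIZE-based variant would fail the multi-node case, since 4 visible GPUs compared against a global world size of 8 forces every rank onto cuda:0.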
🟡 Warning: Same bug exists in rl_games/train.py and skrl/train.py — scripts/reinforcement_learning/

Only `rsl_rl/train.py` was updated to use `resolve_cuda_device()`. The other two RL training scripts still use the old `f"cuda:{local_rank}"` pattern:

- `scripts/reinforcement_learning/rl_games/train.py:115-118` uses `f"cuda:{local_rank}"` for both `agent_cfg` and `env_cfg.sim.device`
- `scripts/reinforcement_learning/skrl/train.py:132` uses `f"cuda:{local_rank}"` for `env_cfg.sim.device`

Neither calls `torch.cuda.set_device()` either. These should be updated to use `resolve_cuda_device()` for consistency and to get the same CUDA-visible-devices fix. If intentionally left for a follow-up, a tracking issue or TODO comment would help.
🟡 Warning: resolve_cuda_device() does not validate local_rank against visible devices — source/isaaclab/isaaclab/utils/distributed.py
When num_visible >= world_size (or with the suggested fix, local_rank < num_visible), the function trusts local_rank without bounds-checking against num_visible. If local_rank ≥ num_visible in a misconfigured environment, this silently produces an invalid device string (e.g. cuda:4 when only 2 GPUs are visible), which would only fail later with an opaque CUDA error.
Consider adding a guard:

```python
if device_id >= num_visible:
    raise RuntimeError(
        f"local_rank={local_rank} exceeds visible CUDA devices ({num_visible}). "
        f"Check CUDA_VISIBLE_DEVICES and torchrun --nproc_per_node."
    )
```

🔵 Suggestion: Redundant `torch.cuda.set_device()` calls could use a comment — `source/isaaclab/isaaclab/app/app_launcher.py:928` and `scripts/reinforcement_learning/rsl_rl/train.py:143`

The training script's comment explains the Newton kitless path clearly. However, when AppLauncher does run, `set_device()` is called twice (once in `app_launcher.py`, once in `train.py`). This is harmless, but a brief note at the AppLauncher call like "Also called in training scripts for the kitless Newton path" would make the redundancy obviously intentional and prevent a future contributor from removing one call thinking it's dead code.
Test Coverage
- Bug fix PR: Has regression test? Yes — `test_multigpu_newton.py` covers all three fix areas with proper mocking.
- Test quality: Good — `test_resolve_cuda_device_full_visibility` and `test_resolve_cuda_device_restricted_visibility` directly exercise the fixed function with mocked GPU counts. `test_preset_renderer_matching` verifies the `newton` field exists. `test_camera_not_kit_with_preset_renderer` verifies the `_is_kit_camera` fix.
- Integration test (`test_cartpole_newton_multigpu`) appropriately requires 2+ GPUs with `pytest.mark.skipif` and has a 300s timeout. The `NCCL_P2P_DISABLE`/`NCCL_IB_DISABLE` environment vars are good practice for container compatibility.
- Gap: No test covers the multi-node scenario (the critical bug above). Add a test case where `device_count=4` and `WORLD_SIZE=8` to verify `local_rank=2` maps to `cuda:2`, not `cuda:0`.
CI Status
Only the labeler check is visible (pass). No lint, build, or test CI results are shown — this may be expected for the repo's CI configuration, but the absence of automated test runs on this PR is worth noting.
Verdict
COMMENT
The device resolution centralization and the three bug fixes are well-motivated and mostly correct. The critical multi-node bug in resolve_cuda_device() should be fixed before merge — it's a one-line change (compare against local_rank instead of WORLD_SIZE). The unfixed RL training scripts are a secondary concern that could be addressed in a follow-up.
| """ | ||
| num_visible = torch.cuda.device_count() | ||
| world_size = int(os.getenv("WORLD_SIZE", "1")) | ||
| if num_visible >= world_size: |
There was a problem hiding this comment.
🔴 Critical: Breaks multi-node training. This compares against WORLD_SIZE (global process count across all nodes), not local process count. In a 2-node × 4-GPU setup (WORLD_SIZE=8, num_visible=4), this evaluates to False and every rank falls back to cuda:0.
The simplest correct check is against local_rank directly:
```diff
- if num_visible >= world_size:
+ if local_rank < num_visible:
```

This makes the WORLD_SIZE env var unnecessary — line 30 can be removed entirely.
Description
Fixes three interrelated bugs that prevented multi-GPU distributed training with Newton physics and camera observations:
1. Device assignment: When `torchrun` sets `CUDA_VISIBLE_DEVICES` per process (each rank sees 1 GPU), `cuda:{local_rank}` points to a non-existent device. Fix: detect the visible GPU count and fall back to `cuda:0` when each process is restricted to a single GPU.

2. Camera renderer preset: `MultiBackendRendererCfg` was missing a `newton` field, so the preset system could not auto-select `NewtonWarpRendererCfg` when using `presets=newton`. This caused cameras to default to Kit/RTX rendering, which conflicts with Newton physics in headless multi-GPU mode. Also adds a `physx` alias pointing to the default RTX renderer.

3. Fabric mode detection: `_is_kit_camera()` in `sim_launcher.py` did not recognize `PresetCfg` renderer configs, incorrectly flagging Newton cameras as Kit cameras and forcing `--enable_cameras` to enable fabric — which causes race conditions on non-cuda:0 GPUs with Newton. Fix: treat `PresetCfg` renderers as non-Kit (they resolve to match the physics backend).

Additionally, `torch.cuda.set_device()` is now called early in both `AppLauncher` and the training script so that physics backends (Newton/Warp) that allocate on the "current" CUDA device during initialization get the correct GPU.

Testing
- `source/isaaclab/test/test_multigpu_newton.py`

Type of Change

Checklist

- Ran `pre-commit` checks with `./isaaclab.sh --format`