
[Bug] RNG states from multiple backends (e.g. CUDA + HPU) are saved but only one is restored on load_state #3960

@iavinas

Description

System Info

- `Accelerate` version: 1.13.0
- Platform: macOS-26.3-arm64-arm-64bit-Mach-O
- `accelerate` bash location: /Users/xxxxx/miniconda3/envs/accelerate/bin/accelerate
- Python version: 3.14.2
- Numpy version: 2.4.1
- PyTorch version: 2.10.0
- PyTorch accelerator: N/A
- System RAM: 16.00 GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The bug is visible by comparing the source code of save_accelerator_state and load_accelerator_state in src/accelerate/checkpointing.py.

Current behavior

  • save_accelerator_state saves the RNG state for every detected backend.
  • load_accelerator_state restores only one of them, because the backend checks form an elif chain.

This breaks reproducibility on resume whenever more than one backend reports as available (e.g. both CUDA and HPU): dropout, data shuffling, augmentation randomness, etc. diverge from the saved state.
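The asymmetry can be illustrated without any accelerator hardware. A minimal sketch (the `save_rng`/`load_rng` helpers and availability flags below are hypothetical stand-ins, not Accelerate's API) simulating the independent-if save path versus the elif load path:

```python
# Hypothetical simulation of the save/load asymmetry described above.
AVAILABLE = {"cuda": True, "hpu": True}  # two backends detected at once

def save_rng(states_store):
    # Mirrors save_accelerator_state: an independent `if` per backend,
    # so every available backend's RNG state is captured.
    if AVAILABLE.get("cuda"):
        states_store["torch_cuda_manual_seed"] = "cuda-rng-state"
    if AVAILABLE.get("hpu"):
        states_store["torch_hpu_manual_seed"] = "hpu-rng-state"

def load_rng(states_store):
    # Mirrors load_accelerator_state: an `elif` chain, so the first
    # available backend wins and the rest are silently skipped.
    restored = []
    if AVAILABLE.get("hpu"):
        restored.append("hpu")
    elif AVAILABLE.get("cuda"):
        restored.append("cuda")
    return restored

states = {}
save_rng(states)
print(sorted(states))    # ['torch_cuda_manual_seed', 'torch_hpu_manual_seed']
print(load_rng(states))  # ['hpu'] -- the saved CUDA state is never restored
```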

Expected behavior

When accelerator.save_state() is called and multiple accelerator backends report as available (e.g. both is_cuda_available() and is_hpu_available() return True), RNG states for all such backends should be saved into the checkpoint (which they currently are, via independent if statements in save_accelerator_state).

On accelerator.load_state(), all of those saved RNG states should be restored, so that resuming training continues with exactly the same random number generator states as when the checkpoint was created.

This ensures full reproducibility across checkpoint/resume cycles, even in environments where multiple hardware backends are detectable (mixed-driver containers, cloud images with several vendor packages pre-installed, research setups, etc.).
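The reproducibility requirement itself is easy to demonstrate with Python's standard `random` module as a stand-in for a backend RNG: restoring the saved generator state makes the post-resume random stream identical to the original one.

```python
import random

random.seed(42)
random.random()                 # advance the generator a bit
checkpoint = random.getstate()  # analogous to save_state() capturing RNG state

# Draws made after the "checkpoint" in the original run.
expected = [random.random() for _ in range(3)]

random.setstate(checkpoint)     # analogous to load_state() restoring RNG state
resumed = [random.random() for _ in range(3)]

assert resumed == expected      # identical stream after resume
```

A backend whose state is saved but never restored behaves like skipping the `setstate` call: its stream after resume no longer matches the original run.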

Minimal fix
Replace the elif chain in load_accelerator_state with independent if statements (matching the structure used in save_accelerator_state), e.g.:

 if is_mlu_available():
     ...
 elif is_sdaa_available():   # keep inner chain for the mutually exclusive MLU/SDAA/MUSA family
     ...
 elif is_musa_available():
     ...
-elif is_hpu_available():
+if is_hpu_available():
     torch.hpu.set_rng_state_all(states["torch_hpu_manual_seed"])
-elif is_neuron_available():
+if is_neuron_available():
     torch.neuron.set_rng_state_all(states["torch_neuron_manual_seed"])
-else:
+if is_cuda_available():
     torch.cuda.set_rng_state_all(states["torch_cuda_manual_seed"])
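With the elif chain split into independent ifs, the earlier sketch restores every saved backend. A hypothetical fixed `load_rng` (again using illustrative availability flags and state keys, not Accelerate's real function body):

```python
# Hypothetical sketch of the proposed fix: independent `if` statements
# restore every backend that is both available and present in the checkpoint.
AVAILABLE = {"cuda": True, "hpu": True}

def load_rng_fixed(states_store):
    restored = []
    if AVAILABLE.get("hpu") and "torch_hpu_manual_seed" in states_store:
        restored.append("hpu")
    if AVAILABLE.get("cuda") and "torch_cuda_manual_seed" in states_store:
        restored.append("cuda")
    return restored

states = {
    "torch_hpu_manual_seed": "hpu-rng-state",
    "torch_cuda_manual_seed": "cuda-rng-state",
}
print(load_rng_fixed(states))  # ['hpu', 'cuda'] -- both backends restored
```

Guarding each restore on the key's presence in the checkpoint also keeps loading backward compatible with checkpoints written when only one backend was available.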
