
[Bug] RNG states from multiple backends (e.g. CUDA + HPU) are saved but only one is restored on load_state #3960

@iavinas

Description

System Info

- `Accelerate` version: 1.13.0
- Platform: macOS-26.3-arm64-arm-64bit-Mach-O
- `accelerate` bash location: /Users/xxxxx/miniconda3/envs/accelerate/bin/accelerate
- Python version: 3.14.2
- Numpy version: 2.4.1
- PyTorch version: 2.10.0
- PyTorch accelerator: N/A
- System RAM: 16.00 GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The bug is visible by comparing the source code of save_accelerator_state and load_accelerator_state in src/accelerate/checkpointing.py.

Current behavior

  • save_accelerator_state saves the RNG state for every detected backend.
  • load_accelerator_state restores only one of them, because the backend checks form an elif chain.

This breaks reproducibility on resume whenever more than one backend reports as available (e.g. both CUDA and HPU): dropout, data shuffling, augmentation randomness, etc. diverge from the saved state.
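The asymmetry can be illustrated without any accelerator hardware. A minimal sketch (the `save_rng`/`load_rng` helpers and availability flags below are hypothetical stand-ins, not Accelerate's API) simulating the independent-if save path versus the elif load path:

```python
# Hypothetical simulation of the save/load asymmetry described above.
AVAILABLE = {"cuda": True, "hpu": True}  # two backends detected at once

def save_rng(states_store):
    # Mirrors save_accelerator_state: an independent `if` per backend,
    # so every available backend's RNG state is captured.
    if AVAILABLE.get("cuda"):
        states_store["torch_cuda_manual_seed"] = "cuda-rng-state"
    if AVAILABLE.get("hpu"):
        states_store["torch_hpu_manual_seed"] = "hpu-rng-state"

def load_rng(states_store):
    # Mirrors load_accelerator_state: an `elif` chain, so the first
    # available backend wins and the rest are silently skipped.
    restored = []
    if AVAILABLE.get("hpu"):
        restored.append("hpu")
    elif AVAILABLE.get("cuda"):
        restored.append("cuda")
    return restored

states = {}
save_rng(states)
print(sorted(states))    # ['torch_cuda_manual_seed', 'torch_hpu_manual_seed']
print(load_rng(states))  # ['hpu'] -- the saved CUDA state is never restored
```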

Expected behavior

When accelerator.save_state() is called and multiple accelerator backends report as available (e.g. both is_cuda_available() and is_hpu_available() return True), RNG states for all such backends should be saved into the checkpoint (which they currently are, via independent if statements in save_accelerator_state).

On accelerator.load_state(), all of those saved RNG states should be restored, so that resuming training continues with exactly the same random number generator states as when the checkpoint was created.

This ensures full reproducibility across checkpoint/resume cycles, even in environments where multiple hardware backends are detectable (mixed-driver containers, cloud images with several vendor packages pre-installed, research setups, etc.).
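The reproducibility requirement itself is easy to demonstrate with Python's standard `random` module as a stand-in for a backend RNG: restoring the saved generator state makes the post-resume random stream identical to the original one.

```python
import random

random.seed(42)
random.random()                 # advance the generator a bit
checkpoint = random.getstate()  # analogous to save_state() capturing RNG state

# Draws made after the "checkpoint" in the original run.
expected = [random.random() for _ in range(3)]

random.setstate(checkpoint)     # analogous to load_state() restoring RNG state
resumed = [random.random() for _ in range(3)]

assert resumed == expected      # identical stream after resume
```

A backend whose state is saved but never restored behaves like skipping the `setstate` call: its stream after resume no longer matches the original run.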

Minimal fix
Replace the elif chain in load_accelerator_state with independent if statements (matching the structure used in save_accelerator_state), e.g.:

 if is_mlu_available():
     ...
 elif is_sdaa_available():   # keep inner chain for the mutually exclusive MLU/SDAA/MUSA family
     ...
 elif is_musa_available():
     ...
-elif is_hpu_available():
+if is_hpu_available():
     torch.hpu.set_rng_state_all(states["torch_hpu_manual_seed"])
-elif is_neuron_available():
+if is_neuron_available():
     torch.neuron.set_rng_state_all(states["torch_neuron_manual_seed"])
-else:
+if is_cuda_available():
     torch.cuda.set_rng_state_all(states["torch_cuda_manual_seed"])
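With the elif chain split into independent ifs, the earlier sketch restores every saved backend. A hypothetical fixed `load_rng` (again using illustrative availability flags and state keys, not Accelerate's real function body):

```python
# Hypothetical sketch of the proposed fix: independent `if` statements
# restore every backend that is both available and present in the checkpoint.
AVAILABLE = {"cuda": True, "hpu": True}

def load_rng_fixed(states_store):
    restored = []
    if AVAILABLE.get("hpu") and "torch_hpu_manual_seed" in states_store:
        restored.append("hpu")
    if AVAILABLE.get("cuda") and "torch_cuda_manual_seed" in states_store:
        restored.append("cuda")
    return restored

states = {
    "torch_hpu_manual_seed": "hpu-rng-state",
    "torch_cuda_manual_seed": "cuda-rng-state",
}
print(load_rng_fixed(states))  # ['hpu', 'cuda'] -- both backends restored
```

Guarding each restore on the key's presence in the checkpoint also keeps loading backward compatible with checkpoints written when only one backend was available.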
