System Info
- `Accelerate` version: 1.13.0
- Platform: macOS-26.3-arm64-arm-64bit-Mach-O
- `accelerate` bash location: /Users/xxxxx/miniconda3/envs/accelerate/bin/accelerate
- Python version: 3.14.2
- Numpy version: 2.4.1
- PyTorch version: 2.10.0
- PyTorch accelerator: N/A
- System RAM: 16.00 GB
- `Accelerate` default config:
	Not found
Information
Tasks
- A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
The bug is visible by comparing the source code of `save_accelerator_state` and `load_accelerator_state` in `src/accelerate/checkpointing.py`:
- `save_accelerator_state` uses independent `if` statements for each backend, so it saves RNG states for all backends that report as available: https://github.com/huggingface/accelerate/blob/main/src/accelerate/checkpointing.py#L161-L176
- `load_accelerator_state` uses a chain of `elif` (with `else` for CUDA), so only one branch executes and only one RNG state is restored; the others are silently ignored: https://github.com/huggingface/accelerate/blob/main/src/accelerate/checkpointing.py#L298-L313
Current behavior
`save_accelerator_state` saves RNG state for every detected backend.
`load_accelerator_state` restores only one of them due to the `elif` chain.
This breaks reproducibility on resume whenever more than one backend is available (e.g. both CUDA and HPU report as available), because dropout, data shuffling, augmentation randomness, etc. will diverge from the saved state.
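The asymmetry can be illustrated with a minimal, self-contained sketch. This is a toy model, not Accelerate's actual code: availability is passed as plain booleans and RNG states are stand-in strings, but the control-flow difference between the save path and the load path is the same.

```python
# Toy illustration of the save/load asymmetry (stand-in flags and states,
# not Accelerate's real helpers).

def save_rng_states(cuda_available, hpu_available):
    """Mirrors save_accelerator_state: independent ifs save every backend."""
    states = {}
    if cuda_available:
        states["torch_cuda_manual_seed"] = "cuda-state"
    if hpu_available:
        states["torch_hpu_manual_seed"] = "hpu-state"
    return states

def load_rng_states_buggy(states, cuda_available, hpu_available):
    """Mirrors load_accelerator_state: elif chain, first match wins."""
    restored = {}
    if hpu_available:
        restored["hpu"] = states["torch_hpu_manual_seed"]
    elif cuda_available:
        restored["cuda"] = states["torch_cuda_manual_seed"]
    return restored

def load_rng_states_fixed(states, cuda_available, hpu_available):
    """Independent ifs mirror the save path, so nothing is dropped."""
    restored = {}
    if hpu_available:
        restored["hpu"] = states["torch_hpu_manual_seed"]
    if cuda_available:
        restored["cuda"] = states["torch_cuda_manual_seed"]
    return restored

states = save_rng_states(cuda_available=True, hpu_available=True)
print(sorted(states))                                  # both states saved
print(load_rng_states_buggy(states, True, True))       # CUDA state silently dropped
print(load_rng_states_fixed(states, True, True))       # both restored
```

With both backends "available", the buggy loader restores only the HPU state even though the checkpoint contains both, which is exactly the divergence described above.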
Expected behavior
When `accelerator.save_state()` is called and multiple accelerator backends report as available (e.g. both `is_cuda_available()` and `is_hpu_available()` return `True`), RNG states for all such backends should be saved into the checkpoint (which they currently are, thanks to the independent `if` statements in `save_accelerator_state`).
On `accelerator.load_state()`, all of those saved RNG states should be restored, so that resuming training continues with exactly the same random number generator states as when the checkpoint was created.
This ensures full reproducibility across checkpoint/resume cycles, even in environments where multiple hardware backends are detectable (mixed-driver containers, cloud images with several vendor packages pre-installed, research setups, etc.).
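The reproducibility contract at stake here can be sketched with the standard-library `random` module (a stand-in for the per-backend `torch.*` RNGs, not Accelerate's API): restoring a saved RNG state makes the post-resume random stream identical to the pre-checkpoint one, which is what breaks when a backend's state is never restored.

```python
import random

# Sketch of the checkpoint/resume contract using Python's stdlib RNG as a
# stand-in for a backend RNG.
rng = random.Random(0)
checkpoint_state = rng.getstate()        # analogous to accelerator.save_state()
run_a = [rng.random() for _ in range(3)]  # draws after the checkpoint

rng.setstate(checkpoint_state)           # analogous to accelerator.load_state()
run_b = [rng.random() for _ in range(3)]  # draws after resuming

print(run_a == run_b)  # True: the streams line up only if the state is restored
```

If `setstate` were skipped for one backend, as the `elif` chain effectively does, the corresponding stream would continue from an unrelated state and dropout masks, shuffles, and augmentations would diverge.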
Minimal fix
Replace the `elif` chain in `load_accelerator_state` with independent `if` statements (matching the structure used in `save_accelerator_state`), e.g.:
```diff
 if is_mlu_available():
     ...
 elif is_sdaa_available():  # keep inner chain for mutually exclusive MLU/SDAA/MUSA
     ...
 elif is_musa_available():
     ...
-elif is_hpu_available():
+if is_hpu_available():
     torch.hpu.set_rng_state_all(states["torch_hpu_manual_seed"])
-elif is_neuron_available():
+if is_neuron_available():
     torch.neuron.set_rng_state_all(states["torch_neuron_manual_seed"])
-else:
+if is_cuda_available():
     torch.cuda.set_rng_state_all(states["torch_cuda_manual_seed"])
```