@boeschf boeschf commented Dec 9, 2025

PyTorch 2.9.1 and friends

This uenv recipe is designed to run Megatron-LM based pre-training workloads
out of the box. It therefore comes with batteries included rather than
providing only the bare PyTorch framework.

Main differences from v2.8.0

  • Updated torch libraries (2.9.1)
  • Cutting-edge communication stack (latest libfabric/nccl/aws-ofi-nccl, with patches for CXI RDMA)

Contents (selection)

  • aws-ofi-nccl 1.18.0
  • cassini-headers git.release/shs-13.0.0=13.0.0
  • cray-mpich 9.0.0
  • cuda 12.9.1
  • cxi-driver git.release/shs-13.0.0=13.0.0
  • gcc 14.2.0
  • gdrcopy 2.5.1
  • libfabric 2.4.0-dev
  • nccl 2.29.2-1
  • nccl-tests 2.17.6
  • nvshmem 3.4.5
  • osu-micro-benchmarks 7.5.2
  • python 3.12.12

Full list of the Python packages:

$ pip list
Package                 Version
----------------------- -----------------
absl-py                 1.4.0
annotated-types         0.7.0
apex                    0.1
certifi                 2025.7.14
charset-normalizer      3.4.4
cuda-bindings           12.9.1
cuda-core               0.2.0
cuda-pathfinder         1.2.3
cuda-python             12.9.1
Cython                  3.2.3
einops                  0.8.1
faiss                   1.8.0
filelock                3.19.1
fsspec                  2025.9.0
grpcio                  1.75.0
hf-xet                  1.2.0
huggingface_hub         0.36.0
idna                    3.10
importlib_metadata      8.7.0
iniconfig               2.1.0
Jinja2                  3.1.6
lightning-utilities     0.11.2
Markdown                3.4.1
MarkupSafe              3.0.2
meson                   1.8.5
ml_dtypes               0.5.3
mpi4py                  4.1.1
mpmath                  1.3.0
nanobind                2.8.0
networkx                3.5
numpy                   2.4.1
nvshmem4py-cu12         0.1.2
nvtx                    0.2.12
onnx                    1.20.0
onnx-ir                 0.1.12
onnxscript              0.5.6.dev20260122
packaging               25.0
pillow                  12.1.0
pip                     25.1.1
pluggy                  1.6.0
protobuf                6.33.1
pybind11                3.0.1
pyclibrary              0.2.2
pydantic                2.12.4
pydantic_core           2.41.5
Pygments                2.19.2
pyparsing               3.2.5
pytest                  9.0.0
PyYAML                  6.0.3
regex                   2025.11.3
requests                2.32.5
safetensors             0.6.2
setuptools              79.0.1
six                     1.17.0
sympy                   1.14.0
tensorboard             2.20.0
tensorboard_data_server 0.7.0
tokenizers              0.22.1
torch                   2.9.1
torchaudio              2.9.1+a224ab2
torchmetrics            1.8.2
torchvision             0.24.1
tqdm                    4.67.1
transformer_engine      2.11.0+c188b53
transformers            4.57.0
triton                  3.5.1
typing_extensions       4.15.0
typing-inspection       0.4.2
urllib3                 2.5.0
Werkzeug                3.1.3
zipp                    3.23.0
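
To check that a running environment matches the list above, the pins can be compared against the installed package metadata. The following is a minimal sketch, not part of the recipe: `EXPECTED` and `find_mismatches` are hypothetical names, and the pins are a small selection copied from the `pip list` output.

```python
from importlib import metadata

# Pins copied from the pip list above (selection only; extend as needed).
EXPECTED = {
    "torch": "2.9.1",
    "torchvision": "0.24.1",
    "triton": "3.5.1",
    "transformers": "4.57.0",
    "numpy": "2.4.1",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `found` is None when the package is not installed. `lookup` is
    injectable so the function can be exercised without the real packages.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

Run inside the uenv, the script prints nothing when all selected pins match.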

Unidirectional Bandwidth H H

Intra-node

MPI

$ srun -N1 --ntasks-per-node=2 osu_bw --type mpi_float --message-size 32:2^22:2 --tail-lat H H

# OSU MPI Bandwidth Test v7.5.2
# Datatype: MPI_FLOAT.
# Size      Bandwidth (MB/s) P50 Tail BW(MB/s) P90 Tail BW(MB/s) P99 Tail BW(MB/s)
32                     60.41             60.44             60.78             61.07
64                    119.10            119.63            120.53            120.76
128                   238.48            238.37            239.71            241.07
256                   471.04            471.47            474.10            475.86
512                   828.12            829.17            834.58            838.68
1024                  830.27            827.17            847.70            855.86
2048                 1666.09           1660.38           1694.03           1708.16
4096                 3338.81           3346.53           3375.49           3427.75
8192                 6735.56           6737.10           6795.70           6875.64
16384               13411.31          13386.13          13563.44          13614.16
32768               20999.50          21032.94          21216.57          21230.53
65536               30841.59          30827.10          30943.54          30950.62
131072              36438.04          36506.65          36613.57          36649.40
262144              41196.43          41232.10          41470.07          41470.07
524288              29939.40          29905.72          30573.60          30587.90
1048576             36697.16          36716.81          36994.03          37021.45
2097152             32766.71          33145.18          33657.70          33664.46
4194304             27278.15          27202.37          27631.37          27922.65

Inter-node

MPI

$ srun -N2 --ntasks-per-node=1 osu_bw --type mpi_float --message-size 32:2^22:2 --tail-lat H H

# OSU MPI Bandwidth Test v7.5.2
# Datatype: MPI_FLOAT.
# Size      Bandwidth (MB/s) P50 Tail BW(MB/s) P90 Tail BW(MB/s) P99 Tail BW(MB/s)
32                     35.73             35.92             36.26             36.59
64                     71.42             71.39             72.56             73.44
128                   142.84            143.18            144.23            145.54
256                   293.67            294.94            299.25            303.14
512                   584.69            585.49            596.41            602.02
1024                 1099.91           1190.05           1206.17           1215.48
2048                 2325.20           2327.36           2394.01           2429.51
4096                 4530.90           4543.70           4631.03           4768.51
8192                 8241.97           8325.47           8875.71           8987.69
16384               13328.09          13342.53          13585.93          13670.96
32768               17778.13          17589.43          18811.24          18936.25
65536               21944.99          22022.20          22062.98          22085.28
131072              23013.11          23048.41          23109.37          23109.43
262144              23487.86          23487.24          23498.82          23500.92
524288              23714.36          23701.17          23762.40          23766.71
1048576             23866.77          23855.37          23891.52          23893.70
2097152             23941.36          23948.13          23951.82          23952.37
4194304             23979.73          23978.94          23982.36          23982.50
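
For comparing runs, the `osu_bw` tables above can be parsed with a few lines of Python. This is a rough sketch under two assumptions: the helper names (`parse_osu_bw`, `peak`, `to_gbit_per_s`) are made up for illustration, and the conversion assumes that OSU reports MB as 10^6 bytes. The sample is the tail of the inter-node table above.

```python
def parse_osu_bw(text):
    """Parse osu_bw output into a list of (size_bytes, bandwidth_MB_per_s)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, '#' header lines, and the '$ srun ...' command line.
        if not fields or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), float(fields[1])))
    return rows


def peak(rows):
    """Return the (size, bandwidth) pair with the highest average bandwidth."""
    return max(rows, key=lambda row: row[1])


def to_gbit_per_s(mb_per_s):
    """Convert MB/s to Gbit/s, assuming 1 MB = 10**6 bytes."""
    return mb_per_s * 8 / 1000


SAMPLE = """\
# OSU MPI Bandwidth Test v7.5.2
# Size      Bandwidth (MB/s) P50 Tail BW(MB/s) P90 Tail BW(MB/s) P99 Tail BW(MB/s)
1048576             23866.77          23855.37          23891.52          23893.70
2097152             23941.36          23948.13          23951.82          23952.37
4194304             23979.73          23978.94          23982.36          23982.50
"""

size, bw = peak(parse_osu_bw(SAMPLE))
print(size, bw, round(to_gbit_per_s(bw), 1))  # 4194304 23979.73 191.8
```

Only the first bandwidth column is read; the tail-bandwidth columns can be picked up the same way from `fields[2:]`.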

Unidirectional Bandwidth D D

Intra-node

MPI

$ srun -N1 --ntasks-per-node=2 osu_bw --type mpi_float --message-size 32:2^22:2 --tail-lat --accelerator cuda D D

# OSU MPI-CUDA Bandwidth Test v7.5.2
# Datatype: MPI_FLOAT.
# Size      Bandwidth (MB/s) P50 Tail BW(MB/s) P90 Tail BW(MB/s) P99 Tail BW(MB/s)
32                     56.45             56.49             56.94             57.20
64                    112.38            112.38            113.48            114.09
128                   221.24            221.84            224.57            226.76
256                   439.95            440.63            445.23            449.92
512                    50.50             50.56             50.67             50.77
1024                  346.49            347.54            350.46            351.48
2048                  698.45            699.48            701.15            703.44
4096                 1397.95           1399.91           1403.99           1405.67
8192                 2787.41           2790.28           2796.47           2800.77
16384                5557.42           5567.29           5583.41           5583.41
32768               11066.52          11068.75          11096.86          11096.86
65536               21786.50          21788.03          21835.21          21849.77
131072              40471.26          40468.18          40618.67          40631.06
262144              58951.77          58923.87          59169.92          59283.87
524288              82693.42          82684.88          82808.94          82815.69
1048576             98171.41          98198.23          98295.02          98322.53
2097152            112012.63         112099.86         112150.82         112156.82
4194304            121503.27         121500.57         121530.49         121535.77

Inter-node

MPI

$ srun -N2 --ntasks-per-node=1 osu_bw --type mpi_float --message-size 32:2^22:2 --tail-lat --accelerator cuda D D
# OSU MPI-CUDA Bandwidth Test v7.5.2
# Datatype: MPI_FLOAT.
# Size      Bandwidth (MB/s) P50 Tail BW(MB/s) P90 Tail BW(MB/s) P99 Tail BW(MB/s)
32                     54.82             55.22             56.04             56.29
64                    110.16            110.54            111.60            112.28
128                   215.34            218.62            221.08            222.61
256                   447.97            456.34            462.94            465.47
512                   900.68            903.02            918.41            929.25
1024                 1598.23           1820.49           1850.10           1878.95
2048                 3533.37           3552.57           3644.23           3683.56
4096                 6490.05           6512.25           6770.42           7142.30
8192                 6179.04           6299.27           6440.49           6569.53
16384                9834.20           9816.93           9981.40          10092.17
32768               17905.61          19264.67          20128.34          20196.77
65536               21912.67          21985.26          22051.84          22077.84
131072              23094.27          23095.12          23117.46          23123.57
262144              22651.96          23504.02          23533.56          23533.59
524288              23738.26          23722.58          23783.92          23794.72
1048576             23882.89          23888.49          23904.83          23907.01
2097152             23957.94          23958.36          23964.79          23965.34
4194304             23987.95          23987.68          23989.67          23992.34

All-to-All (D)

Intra-node

MPI

$ srun -N1 --ntasks-per-node=4 osu_alltoall --type mpi_float --message-size 32:2^22:2 --full --tail-lat --accelerator cuda

# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5.2
# Datatype: MPI_FLOAT.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
32                     15.25             14.31             16.46        1000             15.22             16.17             16.87
64                     16.08             14.69             17.87        1000             15.55             18.36             19.66
128                    15.53             14.61             16.94        1000             15.49             16.47             17.33
256                    16.38             14.93             18.15        1000             16.09             18.53             19.97
512                    69.70             68.64             72.44        1000             69.59             71.81             73.93
1024                   49.85             46.85             51.07        1000             51.14             53.99             55.45
2048                   49.28             46.64             51.94        1000             50.37             52.29             53.73
4096                   51.08             48.02             52.36        1000             52.30             55.09             56.87
8192                   50.15             47.36             52.85        1000             51.14             53.19             55.90
16384                  58.21             54.70             59.65         100             57.46             62.49             65.71
32768                  75.66             73.52             78.35         100             75.39             76.64             81.51
65536                  60.52             58.09             63.40         100             60.49             61.61             63.79
131072                 62.68             59.95             65.83         100             62.66             64.14             67.01
262144                 66.70             64.05             69.78         100             66.38             68.11             72.96
524288                 72.18             69.52             75.29         100             71.90             74.22             76.09
1048576                85.34             82.65             88.53         100             85.21             86.82             88.73
2097152               111.47            108.85            114.54         100            111.16            113.56            115.64
4194304               162.00            159.58            164.91         100            161.86            163.36            166.18

NCCL

$ srun -N1 --ntasks-per-node=4 alltoall_perf -b 32 -e 4M -f 2 -g 1
# nccl-tests version 2.17.6 nccl-headers=22902 nccl-library=22902
# Collective test starting: alltoall_perf
# nThread 1 nGpus 1 minBytes 32 maxBytes 4194304 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 135184 on  nid005667 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 135185 on  nid005667 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 135186 on  nid005667 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 135187 on  nid005667 device  3 [0039:01:00] NVIDIA GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong                     
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)                             
           0             0     float    none      -1     0.23    0.00    0.00       0     0.15    0.00    0.00    N/A
          64             4     float    none      -1     7.01    0.01    0.01       0     7.06    0.01    0.01    N/A
         128             8     float    none      -1     9.20    0.01    0.01       0     7.14    0.02    0.01    N/A
         256            16     float    none      -1     8.62    0.03    0.02       0     7.10    0.04    0.03    N/A
         512            32     float    none      -1     8.50    0.06    0.05       0     7.00    0.07    0.05    N/A
        1024            64     float    none      -1     8.54    0.12    0.09       0     7.06    0.15    0.11    N/A
        2048           128     float    none      -1     8.80    0.23    0.17       0     7.06    0.29    0.22    N/A
        4096           256     float    none      -1     8.37    0.49    0.37       0     7.22    0.57    0.43    N/A
        8192           512     float    none      -1     7.54    1.09    0.81       0     7.26    1.13    0.85    N/A
       16384          1024     float    none      -1     8.18    2.00    1.50       0     7.86    2.08    1.56    N/A
       32768          2048     float    none      -1   211.76    0.15    0.12       0     8.53    3.84    2.88    N/A
       65536          4096     float    none      -1    10.45    6.27    4.70       0     9.84    6.66    5.00    N/A
      131072          8192     float    none      -1    12.08   10.85    8.14       0    11.82   11.09    8.32    N/A
      262144         16384     float    none      -1    13.94   18.80   14.10       0    13.71   19.13   14.34    N/A
      524288         32768     float    none      -1    14.24   36.81   27.60       0    14.23   36.83   27.63    N/A
     1048576         65536     float    none      -1    14.93   70.25   52.69       0    14.75   71.09   53.31    N/A
     2097152        131072     float    none      -1    17.35  120.84   90.63       0    17.05  123.02   92.27    N/A
     4194304        262144     float    none      -1    23.09  181.64  136.23       0    23.05  181.98  136.49    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 18.9095 
#
# Collective test concluded: alltoall_perf
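
The `algbw` and `busbw` columns above are related in a simple way: for all-to-all, nccl-tests reports the algorithm bandwidth as size divided by time and scales it by (n-1)/n to obtain the bus bandwidth. The sketch below reproduces the last out-of-place row (4194304 B, 23.09 us, 4 ranks); the function name is hypothetical, and because the printed time is rounded the result can differ from the table in the last digit.

```python
def alltoall_bandwidths(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-to-all measurement.

    size_bytes: value of the 'size' column
    time_us:    value of the 'time' column in microseconds
    n_ranks:    number of participating ranks
    """
    # bytes per microsecond -> GB/s, with 1 GB = 1e9 bytes
    algbw = size_bytes / time_us / 1e3
    # all-to-all correction factor: each rank keeps 1/n of the data local
    busbw = algbw * (n_ranks - 1) / n_ranks
    return algbw, busbw


algbw, busbw = alltoall_bandwidths(4194304, 23.09, 4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The table lists 181.64 and 136.23 GB/s for this row, which agrees with the computed values to within the rounding of the time column.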

NVSHMEM

$ srun -N1 --mpi=pmi2 --ntasks-per-node=4 /user-environment/env/default/bin/perftest/device/coll/alltoall_latency --datatype float -b 32 -e 4194304 -f 2
Runtime options after parsing command line arguments 
min_size: 32, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: float, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    thread    6.326400          0.005         0.004       
64          16        32-bit    thread    6.243200          0.010         0.008       
128         32        32-bit    thread    6.265600          0.020         0.015       
256         64        32-bit    thread    7.052800          0.036         0.027       
512         128       32-bit    thread    8.720000          0.059         0.044       
1024        256       32-bit    thread    11.894400         0.086         0.065       
2048        512       32-bit    thread    17.497601         0.117         0.088       
4096        1024      32-bit    thread    29.183999         0.140         0.105       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    warp      6.998400          0.005         0.003       
64          16        32-bit    warp      6.940800          0.009         0.007       
128         32        32-bit    warp      6.995200          0.018         0.014       
256         64        32-bit    warp      7.036800          0.036         0.027       
512         128       32-bit    warp      6.880000          0.074         0.056       
1024        256       32-bit    warp      7.030400          0.146         0.109       
2048        512       32-bit    warp      7.062400          0.290         0.217       
4096        1024      32-bit    warp      7.856000          0.521         0.391       
8192        2048      32-bit    warp      9.376000          0.874         0.655       
16384       4096      32-bit    warp      11.820800         1.386         1.040       
32768       8192      32-bit    warp      17.673600         1.854         1.391       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    block     6.288000          0.005         0.004       
64          16        32-bit    block     6.048000          0.011         0.008       
128         32        32-bit    block     6.156800          0.021         0.016       
256         64        32-bit    block     6.220800          0.041         0.031       
512         128       32-bit    block     6.419200          0.080         0.060       
1024        256       32-bit    block     6.969600          0.147         0.110       
2048        512       32-bit    block     6.924800          0.296         0.222       
4096        1024      32-bit    block     6.857600          0.597         0.448       
8192        2048      32-bit    block     6.137600          1.335         1.001       
16384       4096      32-bit    block     6.915200          2.369         1.777       
32768       8192      32-bit    block     7.833600          4.183         3.137       
65536       16384     32-bit    block     9.379200          6.987         5.241       
131072      32768     32-bit    block     12.681600         10.336        7.752       
262144      65536     32-bit    block     18.508799         14.163        10.622      
524288      131072    32-bit    block     31.523201         16.632        12.474      
1048576     262144    32-bit    block     55.539203         18.880        14.160      
2097152     524288    32-bit    block     105.180800        19.939        14.954      
4194304     1048576   32-bit    block     203.609610        20.600        15.450      
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          4         64-bit    thread    6.070400          0.005         0.004       
64          8         64-bit    thread    6.118400          0.010         0.008       
128         16        64-bit    thread    6.947200          0.018         0.014       
256         32        64-bit    thread    6.956800          0.037         0.028       
512         64        64-bit    thread    8.579200          0.060         0.045       
1024        128       64-bit    thread    11.952000         0.086         0.064       
2048        256       64-bit    thread    17.680000         0.116         0.087       
4096        512       64-bit    thread    29.161599         0.140         0.105       
8192        1024      64-bit    thread    52.060801         0.157         0.118       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          4         64-bit    warp      6.940800          0.005         0.003       
64          8         64-bit    warp      6.956800          0.009         0.007       
128         16        64-bit    warp      6.956800          0.018         0.014       
256         32        64-bit    warp      6.963200          0.037         0.028       
512         64        64-bit    warp      6.902400          0.074         0.056       
1024        128       64-bit    warp      6.912000          0.148         0.111       
2048        256       64-bit    warp      7.017600          0.292         0.219       
4096        512       64-bit    warp      7.734400          0.530         0.397       
8192        1024      64-bit    warp      9.344000          0.877         0.658       
16384       2048      64-bit    warp      11.820800         1.386         1.040       
32768       4096      64-bit    warp      17.651200         1.856         1.392       
65536       8192      64-bit    warp      29.078400         2.254         1.690       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          4         64-bit    block     6.192000          0.005         0.004       
64          8         64-bit    block     6.118400          0.010         0.008       
128         16        64-bit    block     6.083200          0.021         0.016       
256         32        64-bit    block     7.001600          0.037         0.027       
512         64        64-bit    block     6.099200          0.084         0.063       
1024        128       64-bit    block     6.131200          0.167         0.125       
2048        256       64-bit    block     6.070400          0.337         0.253       
4096        512       64-bit    block     6.128000          0.668         0.501       
8192        1024      64-bit    block     6.892800          1.188         0.891       
16384       2048      64-bit    block     7.020800          2.334         1.750       
32768       4096      64-bit    block     7.676800          4.268         3.201       
65536       8192      64-bit    block     9.449600          6.935         5.201       
131072      16384     64-bit    block     12.633599         10.375        7.781       
262144      32768     64-bit    block     18.358400         14.279        10.709      
524288      65536     64-bit    block     30.732799         17.060        12.795      
1048576     131072    64-bit    block     55.337602         18.949        14.212      
2097152     262144    64-bit    block     105.212796        19.932        14.949      
4194304     524288    64-bit    block     203.596807        20.601        15.451

Inter-node

MPI

$ srun -N2 --ntasks-per-node=4 osu_alltoall --type mpi_float --message-size 32:2^22:2 --full --tail-lat --accelerator cuda

# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5.2
# Datatype: MPI_FLOAT.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
32                     22.28             21.38             23.47        1000             22.45             23.44             24.38
64                     23.54             21.51             25.81        1000             23.48             24.71             26.72
128                    22.69             21.49             23.88        1000             22.32             24.46             27.03
256                    23.52             20.25             28.17        1000             23.48             24.18             27.85
512                    77.04             74.06             82.02        1000             76.80             78.17             81.78
1024                   58.84             57.27             61.24        1000             58.56             62.21             64.34
2048                   59.13             57.47             61.63        1000             58.35             61.99             66.19
4096                   58.88             56.83             61.52        1000             58.73             61.99             64.62
8192                   61.58             60.13             63.84        1000             61.32             64.33             67.07
16384                  64.65             61.80             67.21         100             64.63             67.01             68.93
32768                  61.13             59.79             62.67         100             60.69             64.45             66.66
65536                  63.45             60.34             64.62         100             62.66             66.41             70.89
131072                 73.05             68.47             75.09         100             72.49             76.65             78.53
262144                 98.37             96.39             99.49         100             98.22            101.90            103.79
524288                229.08            206.31            238.37         100            228.44            233.12            236.24
1048576               385.98            333.97            405.76         100            385.95            388.89            391.36
2097152               693.74            580.29            726.60         100            693.05            697.43            702.69
4194304              1318.74           1088.47           1378.24         100           1319.10           1324.51           1332.15

NCCL

$ srun -N2 --ntasks-per-node=4 alltoall_perf -b 32 -e 4M -f 2 -g 1
# nccl-tests version 2.17.6 nccl-headers=22902 nccl-library=22902
# Collective test starting: alltoall_perf
# nThread 1 nGpus 1 minBytes 32 maxBytes 4194304 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 287862 on  nid005997 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 287863 on  nid005997 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 287864 on  nid005997 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 287865 on  nid005997 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 158509 on  nid006000 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 158510 on  nid006000 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 158511 on  nid006000 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 158513 on  nid006000 device  3 [0039:01:00] NVIDIA GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong                     
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)                             
           0             0     float    none      -1     0.45    0.00    0.00       0     0.24    0.00    0.00    N/A
           0             0     float    none      -1     0.24    0.00    0.00       0     0.24    0.00    0.00    N/A
         128             4     float    none      -1    30.00    0.00    0.00       0    28.12    0.00    0.00    N/A
         256             8     float    none      -1    28.97    0.01    0.01       0    29.64    0.01    0.01    N/A
         512            16     float    none      -1    31.66    0.02    0.01       0    31.72    0.02    0.01    N/A
        1024            32     float    none      -1    30.11    0.03    0.03       0    28.22    0.04    0.03    N/A
        2048            64     float    none      -1    28.63    0.07    0.06       0    28.77    0.07    0.06    N/A
        4096           128     float    none      -1    29.44    0.14    0.12       0    28.50    0.14    0.13    N/A
        8192           256     float    none      -1    29.41    0.28    0.24       0    28.29    0.29    0.25    N/A
       16384           512     float    none      -1    31.79    0.52    0.45       0    31.57    0.52    0.45    N/A
       32768          1024     float    none      -1    32.09    1.02    0.89       0    32.32    1.01    0.89    N/A
       65536          2048     float    none      -1    33.09    1.98    1.73       0    33.99    1.93    1.69    N/A
      131072          4096     float    none      -1    38.58    3.40    2.97       0    39.07    3.36    2.94    N/A
      262144          8192     float    none      -1    43.21    6.07    5.31       0   267.68    0.98    0.86    N/A
      524288         16384     float    none      -1    56.33    9.31    8.14       0    56.28    9.32    8.15    N/A
     1048576         32768     float    none      -1   169.77    6.18    5.40       0    68.42   15.33   13.41    N/A
     2097152         65536     float    none      -1    94.60   22.17   19.40       0    94.31   22.24   19.46    N/A
     4194304        131072     float    none      -1   144.42   29.04   25.41       0   145.75   28.78   25.18    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.99221 
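
Two timings in the table above break the otherwise smooth trend: 169.77 us out-of-place at 1048576 B and 267.68 us in-place at 262144 B. The log does not say why; with only 20 timed iterations per size, a single slow iteration can dominate a row. A small helper like the following (a sketch with a hypothetical name and an arbitrary threshold) can flag such rows when scanning many logs.

```python
def find_spikes(rows, factor=1.5):
    """Return the sizes whose time exceeds `factor` times both neighbours.

    rows: list of (size_bytes, time_us) in increasing size order.
    The first and last rows are never flagged because they have only
    one neighbour.
    """
    spikes = []
    for i in range(1, len(rows) - 1):
        size, time_us = rows[i]
        larger_neighbour = max(rows[i - 1][1], rows[i + 1][1])
        if time_us > factor * larger_neighbour:
            spikes.append(size)
    return spikes


# Rows copied from the inter-node alltoall_perf table above.
OUT_OF_PLACE = [(262144, 43.21), (524288, 56.33), (1048576, 169.77),
                (2097152, 94.60), (4194304, 144.42)]
IN_PLACE = [(131072, 39.07), (262144, 267.68), (524288, 56.28),
            (1048576, 68.42), (2097152, 94.31)]

print(find_spikes(OUT_OF_PLACE))  # [1048576]
print(find_spikes(IN_PLACE))      # [262144]
```

Rows flagged this way are candidates for a rerun with more iterations before drawing conclusions from them.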

NVSHMEM

$ srun -N2 --mpi=pmi2 --ntasks-per-node=4 /user-environment/env/default/bin/perftest/device/coll/alltoall_latency --datatype float -b 32 -e 4194304 -f 2
Runtime options after parsing command line arguments 
min_size: 32, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: float, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    thread    111.270404        0.000         0.000       
64          16        32-bit    thread    112.592006        0.001         0.000       
128         32        32-bit    thread    111.945605        0.001         0.001       
256         64        32-bit    thread    111.088002        0.002         0.002       
512         128       32-bit    thread    111.020803        0.005         0.004       
1024        256       32-bit    thread    110.992002        0.009         0.008       
2048        512       32-bit    thread    110.272002        0.019         0.016       
4096        1024      32-bit    thread    110.204804        0.037         0.033       
8192        2048      32-bit    thread    110.185599        0.074         0.065       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    warp      92.220801         0.000         0.000       
64          16        32-bit    warp      91.321599         0.001         0.001       
128         32        32-bit    warp      89.769602         0.001         0.001       
256         64        32-bit    warp      90.495998         0.003         0.002       
512         128       32-bit    warp      88.931203         0.006         0.005       
1024        256       32-bit    warp      89.708799         0.011         0.010       
2048        512       32-bit    warp      89.715201         0.023         0.020       
4096        1024      32-bit    warp      89.004803         0.046         0.040       
8192        2048      32-bit    warp      89.769602         0.091         0.080       
16384       4096      32-bit    warp      90.553600         0.181         0.158       
32768       8192      32-bit    warp      89.779198         0.365         0.319       
65536       16384     32-bit    warp      90.521598         0.724         0.633       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    block     91.411197         0.000         0.000       
64          16        32-bit    block     89.654398         0.001         0.001       
128         32        32-bit    block     89.679998         0.001         0.001       
256         64        32-bit    block     88.995200         0.003         0.003       
512         128       32-bit    block     88.889599         0.006         0.005       
1024        256       32-bit    block     88.809597         0.012         0.010       
2048        512       32-bit    block     88.153601         0.023         0.020       
4096        1024      32-bit    block     88.095999         0.046         0.041       
8192        2048      32-bit    block     88.041598         0.093         0.081       
16384       4096      32-bit    block     90.451199         0.181         0.158       
32768       8192      32-bit    block     88.908798         0.369         0.322       
65536       16384     32-bit    block     88.083202         0.744         0.651       
131072      32768     32-bit    block     88.934398         1.474         1.290       
262144      65536     32-bit    block     91.379201         2.869         2.510       
524288      131072    32-bit    block     101.139200        5.184         4.536       
1048576     262144    32-bit    block     115.872002        9.049         7.918       
2097152     524288    32-bit    block     138.835204        15.105        13.217      
4194304     1048576   32-bit    block     187.180805        22.408        19.607      
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
64          8         64-bit    thread    110.265601        0.001         0.001       
128         16        64-bit    thread    111.865604        0.001         0.001       
256         32        64-bit    thread    109.404802        0.002         0.002       
512         64        64-bit    thread    108.643198        0.005         0.004       
1024        128       64-bit    thread    109.443200        0.009         0.008       
2048        256       64-bit    thread    108.694398        0.019         0.016       
4096        512       64-bit    thread    110.223997        0.037         0.033       
8192        1024      64-bit    thread    109.289598        0.075         0.066       
16384       2048      64-bit    thread    110.195196        0.149         0.130       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
64          8         64-bit    warp      90.620798         0.001         0.001       
128         16        64-bit    warp      89.724803         0.001         0.001       
256         32        64-bit    warp      89.724803         0.003         0.002       
512         64        64-bit    warp      89.699203         0.006         0.005       
1024        128       64-bit    warp      89.779198         0.011         0.010       
2048        256       64-bit    warp      89.631999         0.023         0.020       
4096        512       64-bit    warp      89.737600         0.046         0.040       
8192        1024      64-bit    warp      90.464002         0.091         0.079       
16384       2048      64-bit    warp      89.712000         0.183         0.160       
32768       4096      64-bit    warp      90.486401         0.362         0.317       
65536       8192      64-bit    warp      89.708799         0.731         0.639       
131072      16384     64-bit    warp      90.553600         1.447         1.267       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
64          8         64-bit    block     90.627199         0.001         0.001       
128         16        64-bit    block     91.353601         0.001         0.001       
256         32        64-bit    block     88.956797         0.003         0.003       
512         64        64-bit    block     88.172799         0.006         0.005       
1024        128       64-bit    block     88.169599         0.012         0.010       
2048        256       64-bit    block     88.047999         0.023         0.020       
4096        512       64-bit    block     88.063997         0.047         0.041       
8192        1024      64-bit    block     89.033598         0.092         0.081       
16384       2048      64-bit    block     88.134402         0.186         0.163       
32768       4096      64-bit    block     88.908798         0.369         0.322       
65536       8192      64-bit    block     89.641601         0.731         0.640       
131072      16384     64-bit    block     88.876802         1.475         1.290       
262144      32768     64-bit    block     92.915201         2.821         2.469       
524288      65536     64-bit    block     100.486398        5.218         4.565       
1048576     131072    64-bit    block     112.611198        9.311         8.148       
2097152     262144    64-bit    block     137.993598        15.197        13.298      
4194304     524288    64-bit    block     186.435199        22.497        19.685
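
The bandwidth columns in the tables above can be cross-checked from the size and latency columns. The sketch below is a minimal illustration, not part of the NVSHMEM perftest. It assumes 8 PEs (from `-N2 --ntasks-per-node=4`) and that bus bandwidth follows the `(n-1)/n` all-to-all scaling used by nccl-tests, which matches the reported numbers. The function name `alltoall_bandwidth` is hypothetical.

```python
def alltoall_bandwidth(size_bytes: int, latency_us: float, n_pes: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-to-all measurement.

    Assumption: busbw = algbw * (n_pes - 1) / n_pes, the nccl-tests
    convention for all-to-all; it is inferred from the logged values.
    """
    algbw = size_bytes / (latency_us * 1e-6) / 1e9  # bytes per second -> GB/s
    busbw = algbw * (n_pes - 1) / n_pes
    return algbw, busbw


# Last row of the 32-bit, block-scope table: 4194304 B in 187.180805 us.
algbw, busbw = alltoall_bandwidth(4194304, 187.180805, n_pes=8)
print(f"algbw={algbw:.3f} GB/s  busbw={busbw:.3f} GB/s")
```

For the small-message rows the latency stays near 90 to 110 us regardless of size, so the reported bandwidth there is bounded by launch and synchronization latency rather than by the link.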

@boeschf
Contributor Author

boeschf commented Dec 9, 2025

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf boeschf marked this pull request as draft December 9, 2025 15:52
@boeschf boeschf marked this pull request as ready for review December 10, 2025 09:35
@boeschf
Contributor Author

boeschf commented Dec 10, 2025

Tested with Apertus 8B pretraining.

@boeschf
Contributor Author

boeschf commented Dec 10, 2025

cscs-ci run alps;system=clariden;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Dec 10, 2025

cscs-ci run alps;system=santis;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Dec 22, 2025

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Dec 23, 2025

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Jan 22, 2026

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Jan 23, 2026

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Jan 26, 2026

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Feb 2, 2026

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

Collaborator

@msimberg msimberg left a comment

Looks good to me. You might want to pin the spack commit for better reproducibility.

In general, it'd be nice to understand what the patches and custom commits are needed for, since they might benefit others. Likewise, it'd be nice to see the changes upstreamed, but I won't block on these.

@boeschf
Contributor Author

boeschf commented Feb 4, 2026

cscs-ci run alps;system=daint;uarch=gh200;uenv=pytorch:v2.9.1

@boeschf
Contributor Author

boeschf commented Feb 4, 2026

cscs-ci run alps;system=clariden;uarch=gh200;uenv=pytorch:v2.9.1
