Hi,
after running a memory trace on the End-2-End example notebook with the techniques here, I am finding that the PyTorch graph compiler is much less likely than TensorFlow's to eliminate broadcast tensors from being materialized by fusing expressions. In other words, the simulation scenario sizes feasible in the SLS are restricted in ways I haven't observed with TensorFlow and Sionna 1.x.
In that End-2-End downlink example, I am seeing two places in the code of this notebook that result in huge allocation spikes, using torch 2.10.0:
- src/sionna/phy/channel/utils.py line 297
h_f = h * e
The broadcast product h * e is materialized on the GPU at P times the size of the reduced result h_f = h_f.sum(dim=-3), where P is the number of paths used by the channel model (e.g. often 24 for UMa).
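As a possible workaround for this first case, the multiply-and-reduce can be expressed as a single einsum contraction, which avoids asking for the full broadcast intermediate in the first place. A minimal sketch with made-up stand-in shapes (the real tensors in utils.py carry more dimensions, but the pattern is the same, with P at dim=-3):

```python
import torch

# Hypothetical stand-in shapes: P = 24 paths at dim=-3, as in the
# UMa case described above. The extra dims are placeholders.
h = torch.randn(2, 24, 3, 16)   # (batch, P, sym, freq)
e = torch.randn(2, 24, 3, 16)

# Baseline: the broadcast product is fully materialized at P times
# the size of the reduced result before the sum collapses it.
h_f_ref = (h * e).sum(dim=-3)

# Fused alternative: one contraction over the path dimension, so
# there is no expression for the framework to fail to fuse.
h_f = torch.einsum('bpsf,bpsf->bsf', h, e)

assert torch.allclose(h_f_ref, h_f, atol=1e-5)
```

Whether einsum actually avoids the intermediate depends on the backend's contraction path, but it at least states the intent explicitly instead of relying on the compiler to fuse two separate ops.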
- src/sionna/phy/ofdm/precoding.py line 352
h_eff = h @ g
For num_ofdm_symbols = 14 in that End-2-End example, the input and output tensors have these shapes:
h shape: (1, 210, 21, 14, 128, 1, 12) ~0.707 GiB
g shape: (1, 1, 21, 14, 128, 12, 10) ~0.034 GiB
h_eff shape: (1, 210, 21, 14, 128, 1, 10) ~0.589 GiB
h_eff shape: (1, 210, 1, 21, 10, 14, 128) ~0.589 GiB (after permute)
The broadcast performed during the matrix multiplication is fully materialized on the GPU, even in code running under @torch.compile, and requires 210 times the size of the input tensor g (because g is broadcast across the 210-sized batch dimension), resulting in a ~7 GiB allocation.
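For this second case, one workaround is to exploit the fact that h's second-to-last dimension is 1: the dimension of size 210 can be folded into the matmul's row dimension, so g no longer needs to be broadcast at all. A scaled-down sketch with hypothetical small shapes standing in for the ones listed above:

```python
import torch

# Scaled-down stand-ins for the shapes in the issue:
# h: (1, 210, 21, 14, 128, 1, 12), g: (1, 1, 21, 14, 128, 12, 10).
B, U, S, N, F = 1, 8, 3, 2, 4
h = torch.randn(B, U, S, N, F, 1, 12)
g = torch.randn(B, 1, S, N, F, 12, 10)

# Baseline: matmul broadcasts g across the U dimension,
# materializing U copies of it.
ref = h @ g                                        # (B, U, S, N, F, 1, 10)

# Workaround sketch: move U into the matmul's row dimension so the
# batch dims of h and g match exactly and no broadcast of g occurs.
h2 = h.squeeze(-2).permute(0, 2, 3, 4, 1, 5)       # (B, S, N, F, U, 12)
out = h2 @ g.squeeze(1)                            # (B, S, N, F, U, 10)
out = out.permute(0, 4, 1, 2, 3, 5).unsqueeze(-2)  # (B, U, S, N, F, 1, 10)

assert torch.allclose(ref, out, atol=1e-5)
```

The permutes are views, so the extra cost is bounded by the matmul's own output; the U copies of g are never allocated.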
Would the Sionna team consider deploying workarounds to prevent the materialization of such broadcasts, or do you consider this a problem of PyTorch that should be fixed on their end?