Optimize FFT logic for complex strided input to avoid oversized memory allocation#2939
Optimize FFT logic for complex strided input to avoid oversized memory allocation#2939vlad-perevezentsev wants to merge 5 commits into
Conversation
|
View rendered docs @ https://intelpython.github.io/dpnp/pull/2939/index.html |
|
Array API standard conformance tests for dpnp=0.21.0dev0=py313h509198e_60 ran successfully. |
| # If so, copy to contiguous to avoid oversized allocation | ||
| # for the output array and unnecessary copy to contiguous | ||
| # after oneMKL FFT | ||
| _strides = dpnp.get_usm_ndarray(a).strides |
There was a problem hiding this comment.
Why do we need to call get_usm_ndarray here?
| _shape = a.shape | ||
| # Max element displacement reachable by the strides. | ||
| # Negative strides are handled by _copy_array, so only | ||
| # positive strides are possible here |
There was a problem hiding this comment.
There is another regression also, which can be handled separately:
import dpnp as np
a = np.arange(4, dtype='c8')
b = np.broadcast_to(a, (3, 4))
np.fft.fft(b, axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 1
----> 1 np.fft.fft(b, axis=0)
File ~/code/dpnp/dpnp/fft/dpnp_iface_fft.py:122, in fft(a, n, axis, norm, out)
47 """
48 Compute the one-dimensional discrete Fourier Transform.
49
(...) 118
119 """
121 dpnp.check_supported_arrays_type(a)
--> 122 return dpnp_fft(
123 a, forward=True, real=False, n=n, axis=axis, norm=norm, out=out
124 )
File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:664, in dpnp_fft(a, forward, real, n, axis, norm, out)
658 if c2r:
659 # input array should be Hermitian for c2r FFT
660 a = _make_array_hermitian(
661 a, axis, dpnp.are_same_logical_tensors(a, a_orig)
662 )
--> 664 return _fft(
665 a,
666 norm=norm,
667 out=out,
668 forward=forward,
669 # TODO: currently in-place is only implemented for c2c, see SAT-7154
670 in_place=in_place and c2c,
671 c2c=c2c,
672 axes=axis,
673 batch_fft=a_ndim != 1,
674 )
File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:447, in _fft(a, norm, out, forward, in_place, c2c, axes, batch_fft)
443 a_strides = _standardize_strides_to_nonzero(strides, a.shape)
444 dsc, out_strides = _commit_descriptor(
445 a, forward, in_place, c2c, a_strides, index, batch_fft
446 )
--> 447 res = _compute_result(dsc, a, out, forward, c2c, out_strides)
448 res = _scale_result(res, a.shape, norm, forward, index)
450 # Revert swapped axes
File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:239, in _compute_result(dsc, a, out, forward, c2c, out_strides)
231 result = dpnp_array(
232 out_shape,
233 dtype=out_dtype,
(...) 236 sycl_queue=exec_q,
237 )
238 res_usm = result.get_array()
--> 239 ht_fft_event, fft_event = fi._fft_out_of_place(
240 dsc, a_usm, res_usm, forward, depends=dep_evs
241 )
242 _manager.add_event_pair(ht_fft_event, fft_event)
244 if not isinstance(result, dpnp_array):
ValueError: Memory addressed by the output array is not sufficiently ample.| _shape = a.shape | ||
| # Max element displacement reachable by the strides. | ||
| # Negative strides are handled by _copy_array, so only | ||
| # positive strides are possible here |
There was a problem hiding this comment.
zero strides in case of broadcasting will also go that path, so the comment is not fully correct
| max_disp = sum( | ||
| st * (sh - 1) for st, sh in zip(_strides, _shape) if st > 0 | ||
| ) | ||
| if (max_disp + 1) > a.size: |
There was a problem hiding this comment.
It'd be helpful to add dedicated tests covering both copy path and no-copy path with transposed/F-contig complex input.
| if ( | ||
| dpnp.is_cuda_backend(a) and not a.flags.c_contiguous | ||
| ): # pragma: no cover | ||
| if dpnp.is_cuda_backend(a): # pragma: no cover |
There was a problem hiding this comment.
Previously there was no copy for CUDA branch if batch_fft=False.
I guess it's indented change, based on the above comment that C-contig is a requirement for cuFFT without any exception. Then we need to update the PR description at least, mentioning that.
This PR fixes an issue discovered while implementing #2927 where several FFT tests started failing after adding validation for stride configurations with oversized memory footprints.
The problem was that FFT could allocate output arrays using the same strided layout as a non-contiguous input. For inputs such as
a[::2, this resulted in oversized allocations followed by an additional copy to a contiguous array.This fix checks whether the memory footprint implied by the input strides exceeds the number of elements in the array. If so, the input is copied to a contiguous layout before configuring the FFT descriptor allowing oneMKL FFT to produce a contiguous output directly
By avoiding oversized allocations and the extra copy this significantly improves the performance of all FFT operations in dpnp with complex strided inputs