Skip to content

Optimize FFT logic for complex strided input to avoid oversized memory allocation#2939

Open
vlad-perevezentsev wants to merge 5 commits into
masterfrom
fix_fft_logic
Open

Optimize FFT logic for complex strided input to avoid oversized memory allocation#2939
vlad-perevezentsev wants to merge 5 commits into
masterfrom
fix_fft_logic

Conversation

@vlad-perevezentsev

Copy link
Copy Markdown
Contributor

This PR fixes an issue discovered while implementing #2927 where several FFT tests started failing after adding validation for stride configurations with oversized memory footprints.

The problem was that FFT could allocate output arrays using the same strided layout as a non-contiguous input. For inputs such as a[::2, this resulted in oversized allocations followed by an additional copy to a contiguous array.

This fix checks whether the memory footprint implied by the input strides exceeds the number of elements in the array. If so, the input is copied to a contiguous layout before configuring the FFT descriptor allowing oneMKL FFT to produce a contiguous output directly

By avoiding oversized allocations and the extra copy this significantly improves the performance of all FFT operations in dpnp with complex strided inputs

fft_perf
  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?

@vlad-perevezentsev vlad-perevezentsev added this to the 0.21.0 release milestone Jun 1, 2026
@vlad-perevezentsev vlad-perevezentsev self-assigned this Jun 1, 2026
@vlad-perevezentsev vlad-perevezentsev changed the title Optimize FFT logic for strided input to avoid oversized memory allocation Optimize FFT logic for complex strided input to avoid oversized memory allocation Jun 1, 2026
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

View rendered docs @ https://intelpython.github.io/dpnp/pull/2939/index.html

@coveralls

coveralls commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Coverage Status

coverage: 78.255% (+0.007%) from 78.248% — fix_fft_logic into master

Comment thread dpnp/fft/dpnp_utils_fft.py
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Array API standard conformance tests for dpnp=0.21.0dev0=py313h509198e_60 ran successfully.
Passed: 1356
Failed: 4
Skipped: 16

# If so, copy to contiguous to avoid oversized allocation
# for the output array and unnecessary copy to contiguous
# after oneMKL FFT
_strides = dpnp.get_usm_ndarray(a).strides

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to call get_usm_ndarray here?

_shape = a.shape
# Max element displacement reachable by the strides.
# Negative strides are handled by _copy_array, so only
# positive strides are possible here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another regression also, which can be handled separately:

import dpnp as np

a = np.arange(4, dtype='c8')
b = np.broadcast_to(a, (3, 4))

np.fft.fft(b, axis=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 np.fft.fft(b, axis=0)

File ~/code/dpnp/dpnp/fft/dpnp_iface_fft.py:122, in fft(a, n, axis, norm, out)
     47 """
     48 Compute the one-dimensional discrete Fourier Transform.
     49
   (...)    118
    119 """
    121 dpnp.check_supported_arrays_type(a)
--> 122 return dpnp_fft(
    123     a, forward=True, real=False, n=n, axis=axis, norm=norm, out=out
    124 )

File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:664, in dpnp_fft(a, forward, real, n, axis, norm, out)
    658 if c2r:
    659     # input array should be Hermitian for c2r FFT
    660     a = _make_array_hermitian(
    661         a, axis, dpnp.are_same_logical_tensors(a, a_orig)
    662     )
--> 664 return _fft(
    665     a,
    666     norm=norm,
    667     out=out,
    668     forward=forward,
    669     # TODO: currently in-place is only implemented for c2c, see SAT-7154
    670     in_place=in_place and c2c,
    671     c2c=c2c,
    672     axes=axis,
    673     batch_fft=a_ndim != 1,
    674 )

File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:447, in _fft(a, norm, out, forward, in_place, c2c, axes, batch_fft)
    443 a_strides = _standardize_strides_to_nonzero(strides, a.shape)
    444 dsc, out_strides = _commit_descriptor(
    445     a, forward, in_place, c2c, a_strides, index, batch_fft
    446 )
--> 447 res = _compute_result(dsc, a, out, forward, c2c, out_strides)
    448 res = _scale_result(res, a.shape, norm, forward, index)
    450 # Revert swapped axes

File ~/code/dpnp/dpnp/fft/dpnp_utils_fft.py:239, in _compute_result(dsc, a, out, forward, c2c, out_strides)
    231         result = dpnp_array(
    232             out_shape,
    233             dtype=out_dtype,
   (...)    236             sycl_queue=exec_q,
    237         )
    238         res_usm = result.get_array()
--> 239     ht_fft_event, fft_event = fi._fft_out_of_place(
    240         dsc, a_usm, res_usm, forward, depends=dep_evs
    241     )
    242 _manager.add_event_pair(ht_fft_event, fft_event)
    244 if not isinstance(result, dpnp_array):

ValueError: Memory addressed by the output array is not sufficiently ample.

_shape = a.shape
# Max element displacement reachable by the strides.
# Negative strides are handled by _copy_array, so only
# positive strides are possible here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zero strides in case of broadcasting will also go that path, so the comment is not fully correct

max_disp = sum(
st * (sh - 1) for st, sh in zip(_strides, _shape) if st > 0
)
if (max_disp + 1) > a.size:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be helpful to add dedicated tests covering both copy path and no-copy path with transposed/F-contig complex input.

if (
dpnp.is_cuda_backend(a) and not a.flags.c_contiguous
): # pragma: no cover
if dpnp.is_cuda_backend(a): # pragma: no cover

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously there was no copy for CUDA branch if batch_fft=False.
I guess it's indented change, based on the above comment that C-contig is a requirement for cuFFT without any exception. Then we need to update the PR description at least, mentioning that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants