
host: Add AVX2 support for uhd::convert #789

Closed
anilgurses wants to merge 9 commits into EttusResearch:master from anilgurses:avx2-support

Conversation

@anilgurses (Contributor) commented Sep 29, 2024

Pull Request Details

Description

AVX2 support is implemented for uhd::convert, which was previously limited to SSE2. This provides performance improvements for data type conversions.

Related Issue

N/A

Which devices/areas does this affect?

Affects the uhd::convert data conversion performance.

Testing Done

Testing was done using the tests previously written by UHD developers. All existing tests pass, and no new tests are needed.

Checklist

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project. See CODING.md.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes, and all previous tests pass.
  • I have checked all compat numbers if they need updating (FPGA compat,
    MPM compat, noc_shell, specific RFNoC block, ...)

github-actions bot commented Mar 24, 2025

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@anilgurses (Author) commented:

I have read the CLA Document and I hereby sign the CLA

@anilgurses (Author) commented:

Hi! Is there anything else needed for this PR?

@mbr0wn (Contributor) commented Nov 4, 2025

Hey @anilgurses, sorry for never responding here. The problem is that AVX2 support is not ubiquitous, and we need a way to only deploy it on demand. Something like a glibc conditional dispatch.

I was also thinking of merging this, but leaving it disabled unless explicitly enabled at compile time (this would not, for example, be the case for .deb files we distribute). But that's also work.

########################################################################

# Check for SSE2 support
check_cxx_compiler_flag("-msse2" SSE2_SUPPORTED)
Contributor review comment:

All of this assumes the compiling machine has the same arch as the executing machine.

# Check for AVX2 support
check_cxx_compiler_flag("-mavx512" AVX512_SUPPORTED)
if(AVX512_SUPPORTED)
message(STATUS "AVX512 is supported")
Contributor review comment:

This means AVX512 is supported by the compiler, not that it's also supported by the CPU.

@anilgurses (Author) commented:

Thanks for the feedback! You are right. Let me check if I can find time to implement on-demand AVX512. I'll update this PR once it's ready.

Copilot AI review requested due to automatic review settings February 3, 2026 05:44
@anilgurses (Author) commented:

Sorry for the delay. I've worked on it and developed runtime dispatch of SIMD functions for the converters. I've also added a converter benchmark tool under tests, since I was wondering how much performance gain is achieved with each instruction set; I'm putting the table below. The baseline for comparison is the compiler-optimized generic converter. It is not surprising that AVX2 performs better on bigger packets. I might implement AVX512 and test with that as well. I've been using the previous version of this code for a year without encountering any issues, but I realized I made a big mistake in SIMD_PRIORITY, which I have fixed. I ran my tests on a Xeon Gold 6240. Let me know if I am missing something or if I can improve my PR.

================================================================================
Summary: Fastest Converter for Each Type
================================================================================
                         Conversion  Best Priority      ns/sample Speedup vs Gen
--------------------------------------------------------------------------------
             fc32 -> sc12_item32_le     SSE2/SSSE3          0.683          2.13x
                  fc32 -> sc16_chdr           AVX2          0.489          2.20x
             fc32 -> sc16_item32_be           AVX2          0.491          4.05x
             fc32 -> sc16_item32_le           AVX2          0.490          2.05x
              fc32 -> sc8_item32_le           AVX2          0.420          3.20x
                  fc64 -> sc16_chdr           AVX2          0.748          2.23x
             fc64 -> sc16_item32_le           AVX2          0.747          2.44x
             sc12_item32_le -> fc32     SSE2/SSSE3          0.530          2.51x
             sc12_item32_le -> sc16     SSE2/SSSE3          0.331          2.24x
             sc16 -> sc12_item32_le     SSE2/SSSE3          0.390          2.15x
             sc16 -> sc16_item32_be           AVX2          0.288          1.80x
             sc16 -> sc16_item32_le           AVX2          0.287          1.39x
                  sc16_chdr -> fc32     SSE2/SSSE3          0.440          1.00x
                  sc16_chdr -> fc64     SSE2/SSSE3          0.736          1.41x
             sc16_item32_be -> fc32     SSE2/SSSE3          0.442          2.17x
             sc16_item32_be -> sc16     SSE2/SSSE3          0.291          2.36x
             sc16_item32_le -> fc32           AVX2          0.487          1.31x
             sc16_item32_le -> fc64     SSE2/SSSE3          0.737          1.55x
             sc16_item32_le -> sc16     SSE2/SSSE3          0.288          1.37x
              sc8_item32_le -> fc32     SSE2/SSSE3          0.369          2.09x

Copilot AI left a comment:

Pull request overview

Adds AVX2-backed converter implementations and introduces runtime SIMD feature detection so UHD can select the best available uhd::convert implementation on a given CPU.

Changes:

  • Added runtime CPU SIMD feature detection (SSE2/SSSE3/AVX2/AVX512F) and new SIMD priority levels.
  • Implemented multiple AVX2 converters and updated existing SSE2/SSSE3 converters to register conditionally.
  • Updated build system and tests/examples to account for new priorities and benchmarking.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
host/tests/convert_test.cpp Updates priority list and changes benchmark test decorator behavior.
host/lib/convert/ssse3_unpack_sc12.cpp Adds runtime SSSE3 check to avoid registering on unsupported CPUs.
host/lib/convert/ssse3_pack_sc12.cpp Adds runtime SSSE3 check to avoid registering on unsupported CPUs.
host/lib/convert/sse2_sc8_to_fc64.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc8_to_fc32.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc16_to_sc16.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc16_to_fc64.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc16_to_fc32.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc64_to_sc8.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc64_to_sc16.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc32_to_sc8.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc32_to_sc16.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/simd_features.hpp New header for runtime SIMD detection + logging helpers.
host/lib/convert/convert_impl.cpp Logs SIMD capabilities during converter/item-size registration.
host/lib/convert/convert_common.hpp Adds SIMD converter macros with runtime gating and AVX2/AVX512 priorities.
host/lib/convert/avx2_sc8_to_fc32.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_sc16_to_sc16.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_sc16_to_fc64.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_sc16_to_fc32.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc64_to_sc8.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc64_to_sc16.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc32_to_sc8.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc32_to_sc16.cpp New AVX2 converter implementation(s).
host/lib/convert/CMakeLists.txt Adds compiler-flag-based SIMD build detection and includes AVX2 sources.
host/examples/convert_benchmark.cpp Adds a standalone converter benchmarking example.
host/examples/CMakeLists.txt Builds the new convert_benchmark example.


@anilgurses (Author) commented:

I don't know how this AI review thing got into my PR, but most of its comments were nonsense; I've disabled it entirely on my GitHub. I've corrected one of my mistakes, though. It's working fine on my test setup. Let me know if anything else is needed.

anilgurses and others added 3 commits February 24, 2026 10:13
Note: This commit does not yet register the converters.
Add runtime dispatch for avx2 converters and converter benchmark

Signed-off-by: Anıl Gürses <anilgurses98@gmail.com>
@mbr0wn (Contributor) commented Feb 24, 2026

@anilgurses OK I started a more in-depth review. I force-pushed to your branch to give you a clue where this is going. You did some great work here!

First, we need to split the converters from the dispatch logic. I don't think we're going to use that -- too many unknowns. There are better dispatch mechanisms (like glibc's dynamic dispatch based on available hardware) that we might look at. But basically, the dispatch logic is too invasive for my taste and our current bandwidth of reviewing and incorporating.

In the branch I pushed there are now 3 commits. First, just the converters. This commit doesn't do much by itself. Second, a commit by me that includes the AVX2 converters if you specify -DENABLE_AVX2. It's very light-weight, not a lot of dispatch logic. It simply replaces the SSE2 converters with AVX2. The third commit is everything else from you.

My main issue with this PR is that the AVX2 converters have very little performance gain over the SSE2 converters. I saw a few percent increase. Now that's not nothing -- when we're streaming, we'll take everything. It's just not enough for me to justify spending a lot of time creating a safe dispatch logic (remember, we need to run on Linux, Windows, Mac, but people also run on other OSes like *BSD, we support all bunch of distro versions, many architectures, etc.).

A couple of things:

  • I had to apply clang-format to make our CI happy.
  • There were a bunch of stale comments (looks like you copied the sse2 converters and modified them -- that makes a lot of sense, but you left a lot of SSE2 comments in there).
  • I'm not sure the alignment logic is correct everywhere. I think I caught an incorrect 16-byte alignment and fixed it, but I'm not sure it's good everywhere.

@anilgurses (Author) commented:

Thanks for the comments.

First, we need to split the converters from the dispatch logic. I don't think we're going to use that -- too many unknowns. There are better dispatch mechanisms (like glibc's dynamic dispatch based on available hardware) that we might look at.

Sure, let me check that. I will replace the current one with glibc's dynamic dispatch.

I had to apply clang-format to make our CI happy.

Sorry for that, I think I was using my local clang-format config. I will check this before I push it again.

There were a bunch of stale comments (looks like you copied the sse2 converters and modified them -- that makes a lot of sense, but you left a lot of SSE2 comments in there).

Yep, that's how I wrote it :) This one is easy to fix. I will fix them quickly.

I'm not sure the alignment logic is correct everywhere. I think I caught an incorrect 16-byte alignment and fixed it, but I'm not sure it's good everywhere.

I will double check before I ask you to review it again so you don't waste your time.

This feedback is very useful. I will try to finish them once I get the chance.

@mbr0wn (Contributor) commented Feb 25, 2026

Sure, let me check that. I will replace the current one with glibc's dynamic dispatch.

@anilgurses Please don't -- this is something that takes some major coordination, including with our partners who help us package UHD for the various Linux distros. And it also doesn't help with Windows, if we even care. And what if we want to dispatch different implementations of functions for other things than the converters?

I really appreciate your efforts and enthusiasm, but a solid, stable, portable, and future-proof dynamic dispatch of algorithms based on the available hardware is a major very invasive change to UHD. I'm sure you're skilled enough to implement that, but we wouldn't have the bandwidth (right now) to even properly review such a change, let alone roll it out for the upcoming UHD release.

I would like to suggest that we focus on the converters, and the "dumb" way of enabling them through -DENABLE_AVX2=ON for now, so we can close out this PR. I'm sorry if this is not as satisfying as auto-enabling this feature for everyone. I think between that, and using -march=native, people will get plenty of optimizations out of just the compilers. And I realize you put a lot of work into the dispatch mechanism, but please understand that we cannot accept all code changes that come through this pipeline.

@anilgurses (Author) commented:

I would like to suggest that we focus on the converters, and the "dumb" way of enabling them through -DENABLE_AVX2=ON for now, so we can close out this PR. I'm sorry if this is not as satisfying as auto-enabling this feature for everyone. I think between that, and using -march=native, people will get plenty of optimizations out of just the compilers. And I realize you put a lot of work into the dispatch mechanism, but please understand that we cannot accept all code changes that come through this pipeline.

I totally understand. I will remove my dispatch mechanism and I'll reorganize it in a way to be only compiled with -DENABLE_AVX2=ON. Thanks for the feedback.

@mbr0wn (Contributor) commented Feb 27, 2026

BTW, what kind of improvements did you see for AVX2 vs. SSE2?

@anilgurses (Author) commented Mar 2, 2026

If you are asking about application-specific results: I got (relatively) fewer overflows with srsRAN and OpenAirInterface, though I don't have any quantitative results for that, unfortunately. We currently use this branch with some additional changes. If you are asking for numbers from the latest benchmark:

================================================================================
                         Conversion  Best Priority      ns/sample Speedup vs Gen
--------------------------------------------------------------------------------

                  fc32 -> sc16_chdr           AVX2          0.503          2.18x
             fc32 -> sc16_item32_be           AVX2          0.505          4.05x
             fc32 -> sc16_item32_le           AVX2          0.501          2.17x
              fc32 -> sc8_item32_le           AVX2          0.431          3.08x
             sc16_item32_be -> fc32           AVX2          0.451          2.29x
             sc16_item32_be -> sc16           AVX2          0.299          2.42x
             sc16_item32_le -> fc32     SSE2/SSSE3          0.448          1.37x
             sc16_item32_le -> fc64           AVX2          0.757          1.56x
             sc16_item32_le -> sc16     SSE2/SSSE3          0.298          1.40x

These are the conversion types I use mostly.

PS. I will push new changes based on your comment next week.

@mbr0wn (Contributor) commented Mar 13, 2026

@anilgurses are you still working on this? If not, I would merge it in its current state. If you wanted to check out the alignment, then I'd wait for it.

@anilgurses (Author) commented:

@mbr0wn Yes, I am. I'll push my latest changes today.

@anilgurses (Author) commented Mar 15, 2026

Ok, it's done now. I also removed the unaligned vs. aligned handling for AVX2: it seems the `_mm256_loadu` family can load already-aligned data with almost no penalty, so removing the split makes the code much easier to understand. Also, as a reference:
https://stackoverflow.com/questions/32612190/how-to-solve-the-32-byte-alignment-issue-for-avx-load-store-operations

@anilgurses (Author) commented:

Also, I've deleted the convert_benchmark under examples, which was wrong from the beginning. I've added batch functionality into converter_benchmark.cpp (let me know if it's OK structure-wise). It can be executed as follows:

$./utils/converter_benchmark --batch --compare 
================================================================================
Summary: Fastest Converter for Each Type
================================================================================
                         Conversion  Best Priority      ns/sample Speedup vs Gen
--------------------------------------------------------------------------------
             fc32 -> sc12_item32_le     SSE2/SSSE3          0.491          2.03x
                  fc32 -> sc16_chdr           AVX2          0.239          2.91x
             fc32 -> sc16_item32_be           AVX2          0.244          6.53x
             fc32 -> sc16_item32_le           AVX2          0.252          3.01x
              fc32 -> sc8_item32_le           AVX2          0.207          4.80x
                  fc64 -> sc16_chdr           AVX2          0.342          2.93x
             fc64 -> sc16_item32_le           AVX2          0.345          3.29x
             sc12_item32_le -> fc32     SSE2/SSSE3          0.355          2.61x
             sc12_item32_le -> sc16     SSE2/SSSE3          0.176          2.76x
             sc16 -> sc12_item32_le     SSE2/SSSE3          0.253          2.15x
             sc16 -> sc16_item32_be           AVX2          0.144          2.34x
             sc16 -> sc16_item32_le           AVX2          0.141          1.31x
                  sc16_chdr -> fc32           AVX2          0.239          1.04x
                  sc16_chdr -> fc64     SSE2/SSSE3          0.462          1.33x
             sc16_item32_be -> fc32           AVX2          0.251          2.84x
             sc16_item32_be -> sc16           AVX2          0.147          2.36x
             sc16_item32_le -> fc32           AVX2          0.254          1.32x
             sc16_item32_le -> fc64           AVX2          0.447          1.33x
             sc16_item32_le -> sc16           AVX2          0.140          1.37x
              sc8_item32_le -> fc32           AVX2          0.213          2.45x

@mbr0wn (Contributor) commented Mar 16, 2026

OK, many thanks @anilgurses! We'll take it from here. It'll probably take a week or so until you see this PR closed; the code landing on public master could take even longer than that.

@anilgurses (Author) commented:

Great! Thanks for your feedback along the way.

@mbr0wn (Contributor) commented Mar 18, 2026

Great! Thanks for your feedback along the way.

You're very welcome. FYI, I will close this PR (even though our internal testing and review is not yet done), just so we can lock down the state of this. If we're lucky, like I said, this might make it into the next release (if not, the one after that).

Thanks again!

@mbr0wn mbr0wn closed this Mar 18, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Mar 18, 2026