host: Add AVX2 support for uhd::convert#789
host: Add AVX2 support for uhd::convert#789anilgurses wants to merge 9 commits intoEttusResearch:masterfrom
Conversation
a060a42 to
db32fa6
Compare
|
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
db32fa6 to
1b78ab6
Compare
|
Hi! Is there anything else needed for this PR? |
|
Hey @anilgurses, sorry for never responding here. The problem is that AVX2 support is not ubiquitous, and we need a way to only deploy it on demand. Something like a glibc conditional dispatch. I was also thinking of merging this, but leaving it disabled unless explicitly enabled at compile time (this would not, for example, be the case for .deb files we distribute). But that's also work. |
host/lib/convert/CMakeLists.txt
Outdated
| ######################################################################## | ||
|
|
||
| # Check for SSE2 support | ||
| check_cxx_compiler_flag("-msse2" SSE2_SUPPORTED) |
There was a problem hiding this comment.
All of this assumes the compiling machine has the same arch as the executing machine.
host/lib/convert/CMakeLists.txt
Outdated
| # Check for AVX2 support | ||
| check_cxx_compiler_flag("-mavx512" AVX512_SUPPORTED) | ||
| if(AVX512_SUPPORTED) | ||
| message(STATUS "AVX512 is supported") |
There was a problem hiding this comment.
This means AVX512 is supported by the compiler, not that it's also supported by the CPU.
|
Thanks for the feedback! You are right. Let me check if I can find time to implement on-demand AVX512. I'll update this PR once it's ready. |
|
Sorry for the delay, I've worked on it and developed runtime dispatch of SIMD functions for converters. I've also added a converter benchmark tool under tests. I was wondering how much performance gain is achieved on which instruction. I'm putting the table below. The baseline for comparison is compiler optimizations for the generic converter. It is not surprising that avx2 performs better at bigger packets. I might implement avx512 and test it with that as well. I've been using the previous version of the code for a year. I didn't encounter any issues but I realized that I made a big mistake on the SIMD_PRIORITY, which I have fixed. I ran my tests on Xeon Gold 6240. Let me know if I am missing something or I can improve my PR. |
There was a problem hiding this comment.
Pull request overview
Adds AVX2-backed converter implementations and introduces runtime SIMD feature detection so UHD can select the best available uhd::convert implementation on a given CPU.
Changes:
- Added runtime CPU SIMD feature detection (SSE2/SSSE3/AVX2/AVX512F) and new SIMD priority levels.
- Implemented multiple AVX2 converters and updated existing SSE2/SSSE3 converters to register conditionally.
- Updated build system and tests/examples to account for new priorities and benchmarking.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| host/tests/convert_test.cpp | Updates priority list and changes benchmark test decorator behavior. |
| host/lib/convert/ssse3_unpack_sc12.cpp | Adds runtime SSSE3 check to avoid registering on unsupported CPUs. |
| host/lib/convert/ssse3_pack_sc12.cpp | Adds runtime SSSE3 check to avoid registering on unsupported CPUs. |
| host/lib/convert/sse2_sc8_to_fc64.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_sc8_to_fc32.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_sc16_to_sc16.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_sc16_to_fc64.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_sc16_to_fc32.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_fc64_to_sc8.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_fc64_to_sc16.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_fc32_to_sc8.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/sse2_fc32_to_sc16.cpp | Switches to SSE2 runtime-gated converter declaration macro. |
| host/lib/convert/simd_features.hpp | New header for runtime SIMD detection + logging helpers. |
| host/lib/convert/convert_impl.cpp | Logs SIMD capabilities during converter/item-size registration. |
| host/lib/convert/convert_common.hpp | Adds SIMD converter macros with runtime gating and AVX2/AVX512 priorities. |
| host/lib/convert/avx2_sc8_to_fc32.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_sc16_to_sc16.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_sc16_to_fc64.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_sc16_to_fc32.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_fc64_to_sc8.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_fc64_to_sc16.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_fc32_to_sc8.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/avx2_fc32_to_sc16.cpp | New AVX2 converter implementation(s). |
| host/lib/convert/CMakeLists.txt | Adds compiler-flag-based SIMD build detection and includes AVX2 sources. |
| host/examples/convert_benchmark.cpp | Adds a standalone converter benchmarking example. |
| host/examples/CMakeLists.txt | Builds the new convert_benchmark example. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Idk how this AI review thing got into my PR but most of them were nonsense. I disabled it entirely from my github. I've corrected one of my mistake though. It's working fine on my test setup. Let me know if anything else is needed. |
Note: This commit does not yet register the converters.
Add runtime dispatch for avx2 converters and converter benchmark Signed-off-by: Anıl Gürses <anilgurses98@gmail.com>
|
@anilgurses OK I started a more in-depth review. I force-pushed to your branch to give you a clue where this is going. You did some great work here! First, we need to split the converters from the dispatch logic. I don't think we're going to use that -- too many unknowns. There are better dispatch mechanisms (like glibc's dynamic dispatch based on available hardware) that we might look at. But basically, the dispatch logic is too invasive for my taste and our current bandwidth of reviewing and incorporating. In the branch I pushed there are now 3 commits. First, just the converters. This commit doesn't do much by itself. Second, a commit by me that includes the AVX2 converters if you specify My main issue with this PR is that the AVX2 converters have very little performance gain over the SSE2 converters. I saw a few percent increase. Now that's not nothing -- when we're streaming, we'll take everything. It's just not enough for me to justify spending a lot of time creating a safe dispatch logic (remember, we need to run on Linux, Windows, Mac, but people also run on other OSes like *BSD, we support all bunch of distro versions, many architectures, etc.). A couple of things:
|
|
Thanks for the comments.
Sure, let me check that. I will replace the current one with glibc's dynamic dispatch.
Sorry for that, I think I was using my local clang-format config. I will check this before I push it again.
Yep, that's how I wrote it :) This one is easy to fix. I will fix them quickly.
I will double check before I ask you to review it again so you don't waste your time. This feedback is very useful. I will try to finish them once I get the chance. |
@anilgurses Please don't -- this is something that takes some major coordination, including with our partners who help us package UHD for the various Linux distros. And it also doesn't help with Windows, if we even care. And what if we want to dispatch different implementations of functions for other things than the converters? I really appreciate your efforts and enthusiasm, but a solid, stable, portable, and future-proof dynamic dispatch of algorithms based on the available hardware is a major very invasive change to UHD. I'm sure you're skilled enough to implement that, but we wouldn't have the bandwidth (right now) to even properly review such a change, let alone roll it out for the upcoming UHD release. I would like to suggest that we focus on the converters, and the "dumb" way of enabling them through |
I totally understand. I will remove my dispatch mechanism and I'll reorganize it in a way to be only compiled with -DENABLE_AVX2=ON. Thanks for the feedback. |
|
BTW, what kind of improvements did you see for AVX2 vs. SSE2? |
|
If you are asking for application specific, I got fewer (relatively) overflows with srsRAN and OpenAirInterface. I don't have any quantitative results for that unfortunately. We currently use this branch and with some additional changes. If you are asking for numbers for the latest benchmark, These are the conversion types I use mostly. PS. I will push new changes based on your comment next week. |
|
@anilgurses are you still working on this? If not, I would merge it in its current state. If you wanted to check out the alignment, then I'd wait for it. |
|
@mbr0wn Yes, I am. I'll push my latest changes today. |
|
Ok, it's done now. I also removed unaligned vs aligned handling for avx2. It seems like mm256_loadu can load aligned data almost with no penalty. Therefore, I removed it and it's much easier to understand the code. Also as a ref, |
|
Also, I've deleted convert_benchmark until examples, which was wrong from the beginning. I've added batch functionality into converter_benchmark.cpp (let me know if it's ok structurewise). It can be executed as follows |
|
OK many thanks @anilgurses! We'll take it from here. It'll probably take a week or so until you see this PR be closed. The code on the public master could be even later than that. |
|
Great! Thanks for your feedback along the way. |
You're very welcome. FYI, I will close this PR (even though our internal testing and review is not yet done), just so we can lock down the state of this. If we're lucky, like I said, this might make it into the next release (if not, the one after that). Thanks again! |
Pull Request Details
Description
AVX2 support is implemented for uhd::convert. It was previously limited with sse2. It provides performance improvements for data type conversion.
Related Issue
N/A
Which devices/areas does this affect?
Affects the uhd::convert data conversion performance.
Testing Done
Testing is done using the tests written previously by UHD developers. It passes all previous tests and there is no need for new tests.
Checklist
MPM compat, noc_shell, specific RFNoC block, ...)