Enable AVX512VL + AVX512DQ #5694
|
CI is happy. Nothing bad so far. |
|
It seems it tests CPU support twice (as seen below):
configure: Trying to force avx512bw using default method (--enable-simd=avx512bw).
checking if gcc supports -mavx512bw -mavx512vl -mavx512dq w/ linking... yes
checking for extra ASFLAGS... None needed
checking for X32 ABI... no
checking special compiler flags... Intel x86
configure: Testing tool-chain's CPU support with given options
checking for MMX... yes
checking for SSE2... yes
checking for SSSE3... yes
checking for SSE4.1... yes
checking for SSE4.2... yes
checking for AVX... yes
checking for XOP... no
checking for AVX2... yes
checking for AVX512BW + AVX512VL + AVX512DQ... yes
checking if gcc supports -maes -mpclmul... yes
It doesn't hurt. |
doc/NEWS
Outdated
| - Add Oubliette Password Manager support (two formats and oubliette2john.py). | ||
| [DavideDG; 2025] | ||
|
|
||
| - Turn AVX512 into AVX512BW + AVX512VL + AVX512DQ. [Claudio André; 2025] |
That's confusing. I suggest:
- Use AVX512VL XOP-like bit rotates for scrypt's Salsa20. [Solar; 2025]
- When we use AVX512BW, also enable usage of AVX512VL and AVX512DQ. [Claudio André; 2025]
src/configure.ac
Outdated
| done | ||
| else | ||
| CPU_BEST_FLAGS_MAIN=-DJOHN_$(echo ${SIMD_NAME} | tr .a-z _A-Z) | ||
| fi |
I doubt we need this complication. Can't we just continue with JOHN_AVX512BW alone, but understand that it implies VL and DQ? I also don't know whether the += syntax works with other shells.
This is possible if everyone understands that BW implies the rest.
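On the portability doubt: `+=` for appending to a variable is an extension found in bash, ksh and zsh and is not part of POSIX sh, so plain reassignment is the safer form in a configure script. A minimal sketch, assuming a hypothetical helper name `simd_define` (not in the PR), that reuses the `tr` mapping from the quoted configure.ac line together with a portable append:

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): derive the -DJOHN_* define from a
# SIMD name the same way the quoted configure.ac line does, mapping "." to
# "_" and lowercase letters to uppercase.
simd_define() {
    printf '%s\n' "-DJOHN_$(printf '%s' "$1" | tr '.a-z' '_A-Z')"
}

# Portable append: plain reassignment works in every POSIX shell,
# whereas VAR+=value is an extension of bash, ksh and zsh.
CPU_BEST_FLAGS_MAIN=""
for name in AVX512BW "SSE4.1"; do
    CPU_BEST_FLAGS_MAIN="$CPU_BEST_FLAGS_MAIN $(simd_define "$name")"
done
echo "$CPU_BEST_FLAGS_MAIN"
```

Keeping `JOHN_AVX512BW` as the only define, as suggested above, would make the loop unnecessary; the sketch only illustrates what the more elaborate variant needs to stay portable.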
src/configure.ac
Outdated
| CPU_NAME="$host_cpu AVX512BW" | ||
| else | ||
| CPU_NAME="$host_cpu $SIMD_NAME" | ||
| fi |
Looks like unneeded complication as well.
src/m4/jtr_x86_logic.m4
Outdated
| AS_IF([test "x$CPU_NOTFOUND" = x0], | ||
| [ | ||
| CFLAGS="$CFLAGS_BACKUP -mavx512f -P $EXTRA_AS_FLAGS $CPPFLAGS $CFLAGS_EXTRA $CPUID_ASM" | ||
| CFLAGS="$CFLAGS_BACKUP -mavx512bw -mavx512vl -mavx512dq -P $EXTRA_AS_FLAGS $CPPFLAGS $CFLAGS_EXTRA $CPUID_ASM" |
If we're not implementing the full reverse order of checks + optimization, then maybe let's not reorder F vs. BW here? If we were checking F first, then continue to check it first. This PR's changes would be smaller then.
src/m4/jtr_x86_logic.m4
Outdated
| [CPU_BEST_FLAGS="-mavx512f"] | ||
| [SIMD_NAME="AVX512F"] | ||
| [CPU_BEST_FLAGS="-mavx512bw -mavx512vl -mavx512dq"] | ||
| [SIMD_NAME="AVX512(BW+VL+DQ)"] |
Maybe continue to say just AVX512BW here.
src/m4/jtr_x86_logic.m4
Outdated
| #include <stdio.h> | ||
| extern void exit(int); | ||
| int main(){__m512i t, t1;*((long long*)&t)=1;t1=t;t=_mm512_mul_epi32(t1,t);if((*(long long*)&t)==88)printf(".");exit(0);}]] | ||
| int main(){__m128i t, t1;*((long long*)&t)=1;t1=t;t=_mm_rol_epi32(t1,1);if((*(long long*)&t)==88)printf(".");exit(0);}]] |
I did suggest using the same intrinsic we actually use, but I didn't mean to test it instead of testing any 512-bit BW intrinsic. I think we should either revert this change entirely or test both _mm_rol_epi32 and _mm512_mul_epi32.
While there are no current nor planned CPUs that have BW without VL nor vice versa, there may be future CPUs supporting AVX10/256 where the 128-bit VL intrinsic would compile and run yet this wouldn't imply support for 512-bit BW. Such future CPUs wouldn't set the CPUID bit corresponding to VL, but here we're not checking CPUID at all.
Oh, I didn't realize you previously got the _mm512_mul_epi32 from the section for F, not for BW. Then revert to what we were checking for BW, please.
> there may be future CPUs supporting AVX10/256 where the 128-bit VL intrinsic would compile and run yet this wouldn't imply support for 512-bit BW. Such future CPUs wouldn't set the CPUID bit corresponding to VL
Upon a second thought, actually maybe they would set that CPUID bit. It's no problem, and no reason to change anything in this PR - I am just correcting what I wrote for the sake of it. We may want to add AVX10/256 support later, with a separate PR, and maybe when such CPUs actually appear and can be tested. As a guess, maybe we'll be checking for VL alone as a separate configure test from BW+VL+DQ, and would need to treat it differently in code (in many ways, including CPUID check and non-usage of 512-bit vectors).
| #define CPU_NAME "AVX512BW" | ||
| #define CPU_REQ_AVX512VL 1 | ||
| #define CPU_REQ_AVX512DQ 1 | ||
| #define CPU_NAME "AVX512(BW+VL+DQ)" |
We can keep all 3 mentioned in CPU_NAME, for reporting in the "Sorry" line. (No further change is needed here.)
The addition of CPU_REQ_AVX512VL (+DQ) is also required. At least desired.
solardiz left a comment
This looks almost good enough to merge, with only trivial cleanups maybe left. Thank you, @claudioandre-br!
| CFLAGS="$CFLAGS_BACKUP -mavx512bw -mavx512vl -mavx512dq -P $EXTRA_AS_FLAGS $CPPFLAGS $CFLAGS_EXTRA $CPUID_ASM" | ||
|
|
||
| AC_MSG_CHECKING([for AVX512BW]) | ||
| AC_MSG_CHECKING([for AVX512BW + AVX512VL + AVX512DQ]) |
Strictly speaking, the test program we run only checks BW and VL, and then we assume DQ is implied. So we could want to make it just for AVX512BW + AVX512VL here.
Now it gets confusing.
- The else part (runs when --enable-simd=avx512bw) has a test program that only tests AVX512BW + AVX512VL.
- The if part (runs when --native-tests=true) does not use a test program. It uses CPU_detect().
- In any case, both use the `-mavx512dq` flag.

Calling `CPU_detect` without setting a value for CPU_REQ_* seems wrong to me. It should be like this:
I added a #define CPU_REQ_AVX512BW 1
#define CPU_REQ_AVX512BW 1
extern int CPU_detect(void); extern char CPU_req_name[];
unsigned int nt_buffer8x[4], output8x[4];
int main(int argc, char **argv) { return !CPU_detect(); }

Anyway, should I remove the DQ string? Is a new commit with a fix for cpu_detection required?
Oh. I think it's OK to leave this as you have it for this PR, no further change needed. Thank you!
| AS_IF([test "x$CPU_NOTFOUND" = x0], | ||
| [ | ||
| AC_MSG_CHECKING([for AVX512BW]) | ||
| AC_MSG_CHECKING([for AVX512BW + AVX512VL + AVX512DQ]) |
... and here.
(I don't get why we have this in two places.)
> I don't get why we have this in two places.
The first half of m4/jtr_x86_logic.m4 checks what the build host supports (using cpuid), unless cross compiling. The second half of it is (only) for cross compiling [eg. fallbacks], so it just checks what the toolchain can do.
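The toolchain-only half boils down to asking whether the compiler accepts and links with the given flags, without running anything on the build host. A rough sketch of that idea in plain shell, which is not the actual `JTR_FLAG_CHECK_LINK` macro; the function name `toolchain_supports` and the use of `${CC:-cc}` are assumptions made for illustration:

```shell
#!/bin/sh
# Hypothetical stand-in for a configure-style "does the compiler accept
# these flags w/ linking" probe. Returns 0 on success, non-zero otherwise.
toolchain_supports() {
    tmpdir=$(mktemp -d) || return 2
    printf 'int main(void) { return 0; }\n' > "$tmpdir/t.c"
    # Compile and link only; the result is never executed, so this also
    # works when cross compiling.
    ${CC:-cc} "$@" -o "$tmpdir/t" "$tmpdir/t.c" >/dev/null 2>&1
    rc=$?
    rm -rf "$tmpdir"
    return $rc
}

if toolchain_supports -mavx512bw -mavx512vl -mavx512dq; then
    echo "toolchain accepts the AVX512BW + AVX512VL + AVX512DQ flags"
else
    echo "toolchain does not accept these flags"
fi
```

A probe like this says nothing about the CPU the binary will eventually run on, which is why the native half relies on cpuid instead.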
| #define C7_AVX512F $0x00010000 | ||
| #define C7_AVX512BW $0x40010000 /* AVX512BW + AVX512F */ | ||
| #define C7_AVX512VL $0xC0010000 /* AVX512BW + AVX512VL + AVX512F */ | ||
| #define C7_AVX512DQ $0xC0030000 /* AVX512BW + AVX512DQ + AVX512VL + AVX512F */ |
(I didn't review the specific bitmasks against the documentation. I just hope they're correct.)
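The masks can be cross-checked against the CPUID leaf 7 (EAX=7, ECX=0) EBX bit positions documented by Intel: AVX512F is bit 16, AVX512DQ bit 17, AVX512BW bit 30 and AVX512VL bit 31. A small sketch that recomputes each cumulative mask; the helper name `c7_mask` is made up for illustration, and the shell is assumed to use 64-bit arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: OR together the given EBX bit positions and print
# the result as an 8-digit hex mask.
# Assumes the shell's arithmetic is 64-bit, so that 1 << 31 stays positive.
c7_mask() {
    m=0
    for bit in "$@"; do
        m=$(( m | (1 << bit) ))
    done
    printf '0x%08X\n' "$m"
}

# CPUID.(EAX=7,ECX=0):EBX bit positions
F=16
DQ=17
BW=30
VL=31

c7_mask $F               # compare with C7_AVX512F
c7_mask $F $BW           # compare with C7_AVX512BW
c7_mask $F $BW $VL       # compare with C7_AVX512VL
c7_mask $F $BW $VL $DQ   # compare with C7_AVX512DQ
```

Under that assumption the four lines print 0x00010000, 0x40010000, 0xC0010000 and 0xC0030000, which agree with the defines in the diff.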
Binary john needs AVX512VL's XOP-like bit rotates for faster Salsa20 in yescrypt. Without `VL` enabled, compilers don't use mnemonics at all.

As it stands now, the possible binaries are:
- AVX512BW + AVX512VL + AVX512DQ
- AVX512F
- AVX2
- And so on.

There is no AVX512BW only binary.

See: #5691.

Signed-off-by: Claudio André <dev@claudioandre.slmail.me>
|
So far, everything seems to be fine.
Version: 1.9.0-jumbo-1+bleeding-7146e4c827 2025-03-12 05:14:17 +0100
Build: cygwin 64-bit x86_64 AVX512BW AC OMP OPENCL
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
AES hardware acceleration: AES-NI
CPU tests: AVX512(BW+VL+DQ)
CPU fallback binary: john-avx2-omp
OMP fallback binary: john-avx512bw
[...]
Cygwin version: 3.5.7-1.x86_64, 2025-01-29 19:46 UTC
Will run 2 OpenMP threads
Testing: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (2xOMP) PASS
Testing: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 512/512 AVX512F]... (2xOMP) PASS
Testing: md5crypt, crypt(3) $1$ (and variants) [MD5 512/512 AVX512BW 16x3]... (2xOMP) PASS
Testing: md5crypt-long, crypt(3) $1$ (and variants) [MD5 32/64]... (2xOMP) PASS
Testing: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X3]... (2xOMP) PASS
Testing: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX512VL]... (2xOMP) PASS
[...] |
That's because you forced AVX512BW on the command line... I guess the second test is redundant then. That forcing stuff was added later. |
|
So I'm now seeing this:

It says "AVX512BW" twice (which is oddly specific, so I was led to believe it was literally just that), but then AVX512(BW+VL+DQ) in the "CPU tests" line. I was worried I had ended up with a Frankenstein build that didn't actually have VL or DQ instructions, but the scrypt format does say VL so I guess all is fine. This output is confusing but I'm not sure how to make it better.

Another problem is that I should apparently cross compile using

This is not a problem for me because I recalled seeing this PR, but how would a user or even a package maintainer know that the only correct way of writing it is

I'm not sure I have any suggestion for this problem either, other than maybe we should parse

And if that last idea holds, maybe that leads to an answer for the first problem. It could say:

This output would be much less confusing. |
Printing only The Regarding other issues, the good thing is that people experimenting should know what they are doing or avoid doing it on production systems. |
|
On second thought, forcing |
Oh, indeed it does! I must have made a typo when I tried that.
We could, but I think we should instead stop

I didn't test this yet, but I think something like this addresses what I mean:
diff --git a/src/configure.ac b/src/configure.ac
index ba480c409..b27e03d11 100644
--- a/src/configure.ac
+++ b/src/configure.ac
@@ -438,12 +438,11 @@ case "$simd" in
JTR_FLAG_CHECK_LINK([-mpower8vector], 2)
SIMD_NAME="Altivec2"
;;
- dnl Handle known cases of --enable-simd=foo --> -mfoo
- avx512|avx512bw)
- SIMD_NAME="AVX512BW"
- AC_MSG_NOTICE([Trying to force $SIMD_NAME using default method (--enable-simd=$simd).])
+ avx512)
JTR_FLAG_CHECK_LINK([-mavx512bw -mavx512vl -mavx512dq], 2)
+ SIMD_NAME="AVX512"
;;
+ dnl Handle known cases of --enable-simd=foo --> -mfoo
mmx|sse*|ssse3|avx*|xop*)
SIMD_NAME=`echo $simd | tr a-z A-Z`
AC_MSG_NOTICE([Trying to force $SIMD_NAME using default method (--enable-simd=$simd).])
I'm all for allowing people to experiment but we don't want them to struggle. |
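One detail in that diff is worth spelling out: in a shell `case`, the first matching pattern wins, so the specific `avx512` arm has to come before the generic `avx*` pattern. A self-contained sketch of the same dispatch, using a hypothetical function name `simd_flags` and simplified to print the flags rather than call `JTR_FLAG_CHECK_LINK`:

```shell
#!/bin/sh
# Hypothetical helper: map an --enable-simd=<value> argument to the
# compiler flags it would request.
simd_flags() {
    case "$1" in
        # Specific arm first: plain "avx512" expands to all three subsets.
        avx512)
            echo "-mavx512bw -mavx512vl -mavx512dq"
            ;;
        # Generic rule: --enable-simd=foo --> -mfoo
        mmx|sse*|ssse3|avx*|xop*)
            echo "-m$1"
            ;;
        *)
            echo "unknown SIMD: $1" >&2
            return 1
            ;;
    esac
}

simd_flags avx512
simd_flags avx2
```

If the two arms were swapped, `avx512` would match `avx*` and produce `-mavx512`, which is not what the specific arm in the diff is meant to request.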
Target CPU ......................................... x86_64 AVX512, 64-bit LE
Target OS .......................................... linux-gnu
Version: 1.9.0-jumbo-1+bleeding-aa93adfa59 2025-03-27 09:33:47 -0300
Build: linux-gnu 64-bit x86_64 AVX512 AC OMP
[...]
AES hardware acceleration: AES-NI
CPU tests: AVX512(BW+VL+DQ)
See also openwall/john-packages#778. |
Let's hear bots.
I'll remove `runstatedir` after testing.
OR:
Version: 1.9.0-jumbo-1+bleeding-60f3614a06 2025-03-10 07:27:48 -0300
Build: linux-gnu 64-bit x86_64 AVX512(BW+VL+DQ) AC OMP
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
AES hardware acceleration: AES-NI
CPU tests: AVX512(BW+VL+DQ)
$JOHN is ../run/