Top 10 x86 Instructions Covering Hardware-Accelerated Cryptography & Finite Field Arithmetic, Vectorised String Processing & Text Mining, Precision Scientific Computing, Digital Signal Processing, FFI, AVX-512 SIMD, Galois Fields (GF(2ⁿ)), AES-NI & Hardware Transactional Memory — Including 10 Honorable Mentions
- Introduction
- Architectural Foundations
- Repository Structure
- Deep Learning & Tensor Operations
- Cryptography & Finite Fields
- String Processing & Text Mining
- Scientific Computing & Signal Processing
- Honorable Mentions: The Unsung Heroes
- Arbitrary Precision Arithmetic (ADCX/ADOX)
- Cycle-Accurate Timing (RDTSC/RDTSCP)
- Population Count (POPCNT)
- Intentional Undefined (UD2)
- Hardware Random Numbers (RDRAND/RDSEED)
- Transactional Memory (XBEGIN/XEND)
- AES Acceleration (AESENC/AESDEC)
- Cache Control (CLFLUSH/CLFLUSHOPT/CLWB)
- Byte Swap (BSWAP)
- Cache Invalidation (INVD/WBINVD)
- Bit Manipulation (BMI1/BMI2)
- CRC32 Acceleration
- Performance Analysis & Microarchitecture
- References & Manuals
Single Instruction, Multiple Data (SIMD) represents the fundamental paradigm shift in modern processor design, enabling parallel execution of identical operations across multiple data elements simultaneously. This atlas provides comprehensive coverage of x86-64 SIMD extensions from their SSE origins through the latest AVX-512 instruction subsets.
The theoretical speedup of SIMD operations follows from Amdahl's Law applied to data parallelism:

$$S_{\text{SIMD}} = \frac{1}{(1 - P) + \frac{P}{N}}$$

Where:

- $S_{\text{SIMD}}$ = Speedup factor
- $P$ = Fraction of parallelizable code
- $N$ = Vector width (elements per register)

For AVX-512 with 512-bit registers processing 32-bit floats ($N = 16$), the ideal speedup approaches 16× as $P \to 1$.
timeline
title x86 SIMD Evolution
1997 : MMX (64-bit, integer only)
1999 : SSE (128-bit XMM, SP float)
2001 : SSE2 (DP float, integer)
2004 : SSE3 (horizontal ops)
2006 : SSSE3 (shuffle, abs)
2007 : SSE4.1 (blend, round)
2008 : SSE4.2 (strings, CRC)
2011 : AVX (256-bit YMM)
2013 : AVX2 (integer 256-bit)
2016 : AVX-512 (512-bit ZMM)
2019 : AVX-512 VNNI
2023 : AVX-512 FP16, AVX10
The x86-64 SIMD register file follows a nested structure where smaller registers alias to portions of larger ones:
┌──────────────────────────────────────────────────────────────────┐
│ ZMM0 (512 bits) │
├────────────────────────────────────┬─────────────────────────────┤
│ YMM0 (256 bits) │ (upper 256) │
├──────────────────┬─────────────────┤ │
│ XMM0 (128 bits)│ (upper 128) │ │
└──────────────────┴─────────────────┴─────────────────────────────┘
Register counts by extension:
| Extension | Registers | Width | Total Capacity |
|---|---|---|---|
| SSE-SSE4 | XMM0-15 | 128b | 256 bytes |
| AVX/AVX2 | YMM0-15 | 256b | 512 bytes |
| AVX-512 | ZMM0-31 | 512b | 2048 bytes |
AVX-512 also introduces 8 opmask registers (k0-k7) for predicated execution.
Optimal SIMD performance requires proper data alignment. Misaligned accesses on older architectures incurred significant penalties:
Cache line split behavior:
graph LR
subgraph "Cache Line 0 (64 bytes)"
A[Bytes 0-63]
end
subgraph "Cache Line 1 (64 bytes)"
B[Bytes 64-127]
end
C[Aligned 64B Load] --> A
D[Misaligned Load] --> A
D --> B
Alignment requirements by instruction set:
| Instruction | Aligned Variant | Unaligned Variant | Penalty (pre-Nehalem) |
|---|---|---|---|
| SSE Load | MOVAPS | MOVUPS | ~10 cycles |
| AVX Load | VMOVAPS | VMOVUPS | ~0-1 cycles |
| AVX-512 | VMOVAPS | VMOVUPS | ~0 cycles |
AVX-512 instructions use the EVEX prefix (4 bytes), enabling:
- 32 vector registers (5-bit encoding)
- Opmask registers
- Broadcast from memory
- Embedded rounding control
- Suppress-all-exceptions (SAE)
EVEX Prefix Structure:
┌────────┬────────┬────────┬────────┐
│ 62h │ P0 │ P1 │ P2 │
└────────┴────────┴────────┴────────┘
│ │ │
│ │ └── aaa: opmask, z: zeroing, L'L: vector length
│ └── vvvv: NDS register, W: operand size
└── RXB: register extensions, R': high-16 regs, mm: opcode map
graph TD;
Root[x86 SIMD Atlas]
Root-->AI_ML[Deep Learning & AI]
Root-->Crypto[Cryptography]
Root-->Strings[String Processing]
Root-->HPC[Scientific HPC]
Root-->Util[Utility Scripts]
AI_ML-->VPDPBUSD[VPDPBUSD.asm<br/>INT8 Dot Product]
AI_ML-->VPTERNLOGD[VPTERNLOGD.asm<br/>Ternary Logic]
Crypto-->PCLMULQDQ[PCLMULQDQ.asm<br/>GF Multiplication]
Strings-->PCMPESTRI[PCMPESTRI.asm<br/>Substring Search]
Strings-->PMOVMSKB[PMOVMSKB.asm<br/>Mask Extraction]
HPC-->VGATHERDPS[VGATHERDPS.asm<br/>Sparse Gather]
HPC-->VPERM2B[VPERM2B.asm<br/>Byte Permutation]
HPC-->RSQRTPS[RSQRTPS.asm<br/>Fast Inv Sqrt]
HPC-->MPSADBW[MPSADBW.asm<br/>Motion Estimation]
Util-->rename[rename_to_uppercase.py]
Util-->readme[readme.py]
style AI_ML fill:#e1f5fe
style Crypto fill:#fff3e0
style Strings fill:#f3e5f5
style HPC fill:#e8f5e9
Modern neural networks rely on massive matrix multiplications. The arithmetic cost of a single convolution layer is:

$$\text{FLOPs} = 2 \cdot H_{out} \cdot W_{out} \cdot C_{in} \cdot C_{out} \cdot K_h \cdot K_w$$

Where $H_{out}, W_{out}$ are the output spatial dimensions, $C_{in}, C_{out}$ the input and output channel counts, and $K_h, K_w$ the kernel dimensions.
Instruction: VPDPBUSD (Vector Packed Dot Product of Unsigned Byte and Signed Byte to Dword)
File: VPDPBUSD.asm
This instruction performs the core INT8 multiply-accumulate operation essential for quantised neural network inference:

$$\text{ACC}[i] \mathrel{+}= \sum_{j=0}^{3} a_{4i+j} \cdot b_{4i+j}$$

Where each 32-bit accumulator lane $i$ sums four unsigned-byte × signed-byte products.
Mathematical breakdown for one lane:
Input A (unsigned bytes): [a₀, a₁, a₂, a₃] (values 0-255)
Input B (signed bytes): [b₀, b₁, b₂, b₃] (values -128 to 127)
Output = ACC + (a₀×b₀) + (a₁×b₁) + (a₂×b₂) + (a₃×b₃)
Throughput comparison (ops/cycle on Cascade Lake):
xychart-beta
title "INT8 Throughput Comparison"
x-axis ["Scalar", "SSE4", "AVX2", "AVX-512", "VNNI"]
y-axis "GOP/s per Core" 0 --> 4000
bar [125, 500, 1000, 2000, 4000]
Implementation:
; filepath: VPDPBUSD.asm
global vnni_dot_product
section .text
vnni_dot_product:
; Inputs:
; ZMM0: Accumulator (32-bit integers, 16 lanes)
; ZMM1: Input A (Unsigned Bytes - 64 elements)
; ZMM2: Input B (Signed Bytes - 64 elements)
vpdpbusd zmm0, zmm1, zmm2
; Each of 16 dword lanes accumulates:
; sum of 4 products of (u8 × s8)
ret

Quantisation error analysis:
The quantisation from FP32 to INT8 introduces error bounded by:

$$|r - s \cdot q| \leq \frac{s}{2}$$

For symmetric quantisation with scale $s = \max|r| / 127$, the bound shrinks with the dynamic range of the tensor.
Instruction: VPTERNLOGD (Bitwise Ternary Logic)
File: VPTERNLOGD.asm
VPTERNLOGD implements any 3-input boolean function via an 8-bit lookup table (immediate):

$$\text{dest}_i = \text{imm8}\left[(A_i \ll 2) \mid (B_i \ll 1) \mid C_i\right]$$
The truth table maps all 8 input combinations to outputs:
| Index | A | B | C | LUT Bit Position |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | imm8[0] |
| 1 | 0 | 0 | 1 | imm8[1] |
| 2 | 0 | 1 | 0 | imm8[2] |
| 3 | 0 | 1 | 1 | imm8[3] |
| 4 | 1 | 0 | 0 | imm8[4] |
| 5 | 1 | 0 | 1 | imm8[5] |
| 6 | 1 | 1 | 0 | imm8[6] |
| 7 | 1 | 1 | 1 | imm8[7] |
Common immediate values:
| Operation | Formula | imm8 |
|---|---|---|
| 3-input AND | A AND B AND C | 0x80 |
| 3-input OR | A OR B OR C | 0xFE |
| 3-input XOR | A XOR B XOR C | 0x96 |
| AND-OR | (A AND B) OR C | 0xEA |
| OR-XOR | (A OR B) XOR C | 0x56 |
| Majority | Majority(A,B,C) | 0xE8 |
| Multiplex | A ? B : C (MUX) | 0xCA |
Instruction count reduction:
graph LR
subgraph "Traditional (3 instructions)"
A1[VPAND zmm3, zmm0, zmm1]
A2[VPOR zmm3, zmm3, zmm2]
A3[VPXOR zmm3, zmm3, zmm4]
end
subgraph "VPTERNLOGD (1 instruction)"
B1[VPTERNLOGD zmm0, zmm1, zmm2, imm8]
end
Affine Quantisation:

$$q = \text{clamp}\left(\text{round}\left(\frac{r}{s}\right) + z,\; q_{min},\; q_{max}\right)$$

Where:

- $r$ = real value
- $s$ = scale factor
- $z$ = zero point
- $q_{min}, q_{max}$ = quantised range bounds

Scale and zero-point computation:

$$s = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}, \qquad z = q_{min} - \text{round}\left(\frac{r_{min}}{s}\right)$$

Matrix multiplication in the quantised domain expands each real product via $r = s(q - z)$, so an integer accumulation plus a few per-tensor correction terms reproduces the FP32 result.
Instruction: PCLMULQDQ (Carry-Less Multiplication Quadword)
File: PCLMULQDQ.asm
Performs polynomial multiplication over GF(2), essential for:
- AES-GCM authentication tags
- CRC computation
- Error-correcting codes
Mathematical definition:

For polynomials $a(x) = \sum_i a_i x^i$ and $b(x) = \sum_j b_j x^j$ with coefficients in GF(2):

$$c(x) = a(x) \cdot b(x), \qquad c_k = \bigoplus_{i+j=k} a_i b_j$$

Note: XOR ($\oplus$) replaces addition, so no carries propagate between bit positions — hence "carry-less" multiplication.
Immediate byte selection:
| imm8 | Operation |
|---|---|
| 0x00 | xmm1[63:0] × xmm2[63:0] |
| 0x01 | xmm1[127:64] × xmm2[63:0] |
| 0x10 | xmm1[63:0] × xmm2[127:64] |
| 0x11 | xmm1[127:64] × xmm2[127:64] |
GCM GHASH computation:
GHASH(H, A, C) = X_m+n+1
Where:
X_0 = 0
X_i = (X_{i-1} ⊕ A_i) • H for i = 1..m (AAD blocks)
X_i = (X_{i-1} ⊕ C_i) • H for i = m+1..m+n (Ciphertext blocks)
X_{m+n+1} = (X_{m+n} ⊕ (len(A)||len(C))) • H
The multiplication (•) in GF(2¹²⁸) uses PCLMULQDQ followed by reduction modulo the GCM polynomial $x^{128} + x^7 + x^2 + x + 1$:
GF(2ⁿ) representation:

Elements are polynomials with binary coefficients, represented as n-bit integers:

$$a(x) = \sum_{i=0}^{n-1} a_i x^i \;\longleftrightarrow\; (a_{n-1} \ldots a_1 a_0)_2$$
Reduction algorithm for GF(2¹²⁸):
After PCLMULQDQ produces a 256-bit product, reduce modulo the irreducible polynomial:
; GF(2^128) reduction after PCLMULQDQ
; Input: xmm0:xmm1 = 256-bit product
; Output: xmm0 = 128-bit reduced result
gf128_reduce:
; Fold high 128 bits into low
movdqa xmm2, xmm1
pclmulqdq xmm2, xmm3, 0x00 ; xmm3 contains reduction constant
pxor xmm0, xmm2
; Second reduction round
movdqa xmm2, xmm0
psrldq xmm2, 8
pclmulqdq xmm2, xmm3, 0x00
pxor xmm0, xmm2
ret

SubBytes transformation (S-box):

$$S(a) = M \cdot a^{-1} \oplus c$$

Where $a^{-1}$ is the multiplicative inverse in GF(2⁸), $M$ is a fixed 8×8 bit matrix, and $c = \texttt{0x63}$.

MixColumns transformation:

Each state column is multiplied by a fixed polynomial over GF(2⁸). Multiplication by 02 and 03 in GF(2⁸):

$$02 \cdot a = (a \ll 1) \oplus (\texttt{0x1B} \text{ if } a_7 = 1), \qquad 03 \cdot a = (02 \cdot a) \oplus a$$
Instruction: PCMPESTRI (Packed Compare Explicit Length Strings, Return Index)
File: PCMPESTRI.asm
This SSE4.2 instruction performs complex string operations in a single instruction:
Control byte (imm8) encoding:
| Bits | Field | Description |
|---|---|---|
| 1:0 | Source Data Format | 00=UBytes, 01=UWords, 10=SBytes, 11=SWords |
| 3:2 | Aggregation Operation | 00=Equal Any, 01=Ranges, 10=Equal Each, 11=Equal Ordered |
| 5:4 | Polarity | 00=Positive, 01=Negative, 10=Masked+, 11=Masked- |
| 6 | Output Selection | 0=Least Significant, 1=Most Significant |
Aggregation operations:

- Equal Any (00): Character set membership
  $$\text{IntRes1}[i] = \bigvee_{j=0}^{n-1} (\text{Src1}[j] = \text{Src2}[i])$$
- Ranges (01): Character range membership
  $$\text{IntRes1}[i] = \bigvee_{j=0}^{n/2-1} (\text{Src1}[2j] \leq \text{Src2}[i] \leq \text{Src1}[2j+1])$$
- Equal Each (10): Byte-by-byte equality
  $$\text{IntRes1}[i] = (\text{Src1}[i] = \text{Src2}[i])$$
- Equal Ordered (11): Substring search
  $$\text{IntRes1}[i] = \bigwedge_{j=0}^{n-1} (\text{Src1}[j] = \text{Src2}[i+j])$$
Example: Finding "World" in "Hello World":
; filepath: PCMPESTRI.asm
; ...existing code...
pcmpestri xmm0, xmm1, 0x0C ; Equal Ordered, byte data
; RCX = 6 (index where "World" starts)
; ...existing code...

Performance vs. scalar loop:
xychart-beta
title "Substring Search Performance (cycles per search)"
x-axis ["strlen", "strstr scalar", "PCMPESTRI", "AVX2 hybrid"]
y-axis "Cycles" 0 --> 500
bar [45, 380, 85, 120]
Instruction: PMOVMSKB (Move Byte Mask)
File: PMOVMSKB.asm
Extracts the most significant bit of each byte into a general-purpose register:

$$\text{dest}[i] = \text{src}[8i + 7], \quad i = 0, \ldots, 15$$
Null terminator detection algorithm:
1. Load 16 bytes into XMM register
2. Compare each byte against zero (PCMPEQB)
3. Extract comparison mask (PMOVMSKB)
4. If mask ≠ 0, find first set bit (BSF)
5. Repeat for next 16-byte chunk
Vectorised strlen complexity:

$$T(n) = O\left(\frac{n}{16}\right) \text{ vector comparisons}$$

Compared to scalar $T(n) = O(n)$ byte comparisons — a 16× reduction in loop iterations.
vectorised_memchr:
; rdi = buffer, rsi = byte to find, rdx = length
; (assumes length is a multiple of 16; trailing bytes need a scalar pass)
movd xmm0, esi
pxor xmm1, xmm1
pshufb xmm0, xmm1 ; Broadcast byte to all lanes
.loop:
movdqu xmm1, [rdi]
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
jnz .found
add rdi, 16
sub rdx, 16
jg .loop
xor eax, eax
ret
.found:
bsf eax, eax
add rax, rdi
ret

Instruction: VGATHERDPS (Gather Packed Single-Precision Floating-Point Values with Signed Dword Indices)
File: VGATHERDPS.asm
Performs indexed loads from memory using vector indices:

$$\text{dest}[i] = \text{mem}[\text{base} + \text{index}[i] \times \text{scale}]$$
Memory access pattern visualisation:
graph TB
subgraph "Memory (Linear)"
M0[addr+0]
M1[addr+4]
M2[addr+8]
M3[addr+12]
M4[addr+16]
M5[addr+20]
M6[addr+24]
M7[addr+28]
end
subgraph "Indices ZMM1"
I0[0]
I1[4]
I2[2]
I3[6]
end
subgraph "Result ZMM0"
R0[data@0]
R1[data@16]
R2[data@8]
R3[data@24]
end
I0 --> M0
I1 --> M4
I2 --> M2
I3 --> M6
M0 --> R0
M4 --> R1
M2 --> R2
M6 --> R3
Gather performance model:

$$T_{\text{gather}} \approx \max\left(T_{\text{issue}},\; N_{\text{lines}} \cdot T_{\text{line}}\right)$$

Where $N_{\text{lines}}$ is the number of distinct cache lines the indices touch: indices falling in the same line coalesce, while scattered indices pay one access each.

Sparse matrix-vector multiplication:

For a CSR matrix, each row gathers $x$ at the positions given by its column indices:
spmv_csr:
; For each row i:
; y[i] = sum(A.values[j] * x[A.col_idx[j]]) for j in [row_ptr[i], row_ptr[i+1])
; Load column indices
vmovdqu32 zmm1, [col_idx + rcx*4]
; Gather x values at those indices
kxnorw k1, k1, k1
vgatherdps zmm2{k1}, [x + zmm1*4]
; Load A values and multiply
vmovups zmm3, [values + rcx*4]
vmulps zmm4, zmm2, zmm3
; Horizontal sum for y[i]
; ...

Instruction: MPSADBW (Multiple Packed Sums of Absolute Differences)
File: MPSADBW.asm
Computes 8 SADs simultaneously for block matching in video codecs:

$$\text{SAD}_k = \sum_{i=0}^{3} \left| \text{Src}[k+i] - \text{Ref}[i] \right|, \quad k = 0, \ldots, 7$$

For a 4-byte reference block slid across 8 positions of an 11-byte source window:
Block matching visualisation:
Reference Block (4 bytes): [R₀, R₁, R₂, R₃]
Source Window (11 bytes): [S₀, S₁, S₂, S₃, S₄, S₅, S₆, S₇, S₈, S₉, S₁₀]
├──────────┤
Position 0
├──────────┤
Position 1
├──────────┤
Position 2
... (8 positions total)
Motion estimation search diamond pattern:
graph TD
C((Center)) --> N((N))
C --> S((S))
C --> E((E))
C --> W((W))
N --> |MPSADBW| B1[SAD values]
S --> |MPSADBW| B2[SAD values]
E --> |MPSADBW| B3[SAD values]
W --> |MPSADBW| B4[SAD values]
B1 --> MIN[Minimum]
B2 --> MIN
B3 --> MIN
B4 --> MIN
MIN --> |New Center| C
Rate-Distortion optimisation:

$$J = D + \lambda R$$

Where:

- $J$ = Cost function to minimise
- $D$ = Distortion (SAD or SATD)
- $R$ = Bit rate for encoding motion vector
- $\lambda$ = Lagrangian multiplier
Instruction: RSQRTPS (Reciprocal Square Root of Packed Single-Precision Floating-Point Values)
File: RSQRTPS.asm
Computes approximate reciprocal square root:

$$y_0 \approx \frac{1}{\sqrt{x}}$$

With relative error $|\epsilon| \leq 1.5 \times 2^{-12}$.

Newton-Raphson refinement:

For $f(y) = y^{-2} - x$, one Newton step gives:

$$y_{n+1} = y_n \left( \frac{3}{2} - \frac{x \, y_n^2}{2} \right)$$

Or equivalently:

$$y_{n+1} = \frac{1}{2} \, y_n \left( 3 - x \, y_n^2 \right)$$
Implementation with one NR iteration:
fast_rsqrt_nr:
; xmm0 = input x
rsqrtps xmm1, xmm0 ; y0 ≈ 1/sqrt(x)
; Newton-Raphson: y1 = 0.5 * y0 * (3 - x * y0^2)
movaps xmm2, xmm1
mulps xmm2, xmm1 ; y0^2
mulps xmm2, xmm0 ; x * y0^2
movaps xmm3, [three] ; 3.0
subps xmm3, xmm2 ; 3 - x*y0^2
mulps xmm3, xmm1 ; y0 * (3 - x*y0^2)
mulps xmm3, [half] ; 0.5 * y0 * (3 - x*y0^2)
; xmm3 = refined result (error < 2^-23)
ret

Error analysis:

| Method | Max Relative Error | Latency (cycles) |
|---|---|---|
| RSQRTPS only | ~1.5 × 2⁻¹² | 4 |
| RSQRTPS + 1 NR | ~2⁻²³ | 14 |
| RSQRTPS + 2 NR | below FP32 precision | 24 |
| SQRTPS + DIVPS | 0.5 ulp (correctly rounded) | 25 |
Vector normalisation:

For vector $\mathbf{v}$, the unit vector is $\hat{\mathbf{v}} = \mathbf{v} \cdot \text{rsqrt}(\mathbf{v} \cdot \mathbf{v})$ — one dot product, one RSQRTPS, and one multiply, with no division or square root on the critical path.
Instruction: VPERM2B (Full Permute of Bytes from Two Tables)

File: VPERM2B.asm

Note: the official AVX-512 VBMI mnemonics are VPERMB (one table) and VPERMI2B/VPERMT2B (two tables); VPERM2B is used here as shorthand matching the filename.

Performs arbitrary byte-level permutation using an index vector selecting from two source registers:
Applications:
- AES T-table lookups
- Base64 encoding/decoding
- LUT-based S-boxes
- Data shuffling
Example: Byte reversal within 64-bit lanes:
byte_reverse_64:
; Control indices for reversing bytes within each qword
vmovdqu64 zmm2, [reverse_indices]
; reverse_indices: 7,6,5,4,3,2,1,0, 15,14,13,12,11,10,9,8, ...
vpermb zmm0, zmm2, zmm1 ; single-table VPERMB suffices here (dest, indices, source)
ret

Radix-2 Cooley-Tukey FFT butterfly:

$$X_k = E_k + W_N^k O_k, \qquad X_{k+N/2} = E_k - W_N^k O_k$$

Where $E_k$ and $O_k$ are the DFTs of the even- and odd-indexed samples, and $W_N^k = e^{-2\pi i k / N}$ are the twiddle factors.
Vectorised butterfly (4 complex numbers):

fft_butterfly_avx:
; ymm0 = [e0r, e0i, e1r, e1i, e2r, e2i, e3r, e3i] (even)
; ymm1 = [o0r, o0i, o1r, o1i, o2r, o2i, o3r, o3i] (odd)
; ymm2 = [w0r, w0i, w1r, w1i, w2r, w2i, w3r, w3i] (twiddles)
; Complex multiply: (a+bi)(c+di) = (ac-bd) + (ad+bc)i
vmovshdup ymm3, ymm2 ; [w0i, w0i, w1i, w1i, ...]
vpermilps ymm4, ymm1, 0xB1 ; [o0i, o0r, o1i, o1r, ...]
vmulps ymm3, ymm3, ymm4 ; [w0i*o0i, w0i*o0r, ...]
vmovsldup ymm5, ymm2 ; [w0r, w0r, w1r, w1r, ...]
vfmaddsub231ps ymm3, ymm5, ymm1 ; [w0r*o0r - w0i*o0i, w0r*o0i + w0i*o0r, ...]
; Butterfly
vaddps ymm5, ymm0, ymm3 ; E + W*O
vsubps ymm6, ymm0, ymm3 ; E - W*O
ret

Convolution theorem:

$$f * g = \mathcal{F}^{-1}\left[ \mathcal{F}(f) \cdot \mathcal{F}(g) \right]$$
Complexity: FFT-based convolution costs $O(N \log N)$, versus $O(N^2)$ for direct convolution.
This section provides comprehensive coverage of critical instructions that don't fit neatly into the categories above but are essential for systems programming.
Instructions: ADCX (Add with Carry Flag) and ADOX (Add with Overflow Flag)
These instructions enable dual carry chains for efficient big-integer arithmetic:
Why two carry chains matter:
In traditional multi-precision addition, each ADD depends on the previous carry:
Traditional (serial):
ADD r0, a0, b0 ; CF = carry0
ADC r1, a1, b1 ; needs carry0, CF = carry1
ADC r2, a2, b2 ; needs carry1, CF = carry2
...
With ADCX/ADOX, two independent additions can proceed in parallel:
Parallel carry chains:
ADCX r0, a0 ; CF chain
ADOX r4, c0 ; OF chain (independent!)
ADCX r1, a1 ; CF chain continues
ADOX r5, c1 ; OF chain continues
RSA Montgomery multiplication:

The Montgomery reduction algorithm computes:

$$\text{REDC}(T) = T \cdot R^{-1} \bmod N$$

Where $R = 2^{64k}$ for a $k$-limb modulus $N$, and $T < RN$ is the double-width product being reduced.
montmul_inner:
; Compute partial product and accumulate (implicit multiplicand in rdx)
mulx r11, rax, [b + rcx*8] ; r11:rax = rdx * b[j]
adcx r8, rax ; Accumulate low (CF chain)
adox r9, r11 ; Accumulate high (OF chain)
; Continue for all limbs...

Performance for 4096-bit RSA:
xychart-beta
title "RSA-4096 Performance (ops/second)"
x-axis ["Scalar ADD", "ADCX/ADOX", "AVX-512 IFMA"]
y-axis "Operations/s" 0 --> 2000
bar [450, 1200, 1850]
Instructions: RDTSC (Read Time-Stamp Counter) and RDTSCP (with Processor ID)
The Time-Stamp Counter is a 64-bit register counting reference cycles since reset:
Precise timing template:
benchmark_function:
; Serialise to prevent out-of-order execution
cpuid
rdtsc
shl rdx, 32
or rax, rdx
mov r12, rax ; Start timestamp
; === Code to benchmark ===
call target_function
; =========================
; Serialise again
rdtscp
shl rdx, 32
or rax, rdx
sub rax, r12 ; Elapsed cycles
; (a following CPUID or LFENCE fully fences RDTSCP against later code)
ret

TSC frequency calculation:
On modern processors, TSC runs at the "nominal" frequency:
To convert cycles to nanoseconds:

$$t_{\text{ns}} = \frac{\text{cycles} \times 10^9}{f_{\text{TSC}}}$$
Caveats:
- Older CPUs: TSC varies with frequency scaling
- Multi-socket: TSC may differ between sockets
- VM environments: TSC may be virtualised
Instruction: POPCNT (Population Count)
Counts the number of set bits (Hamming weight):

$$\text{POPCNT}(x) = \sum_{i=0}^{63} x_i$$
Applications:

- Chess engines (bitboards):
  Knights attack mask: popcount(knight_attacks & enemy_pieces)
- Bioinformatics (Hamming distance):
  $$d_H(x, y) = \text{POPCNT}(x \oplus y)$$
- Bloom filters:
  $$\text{Fill ratio} = \frac{\text{POPCNT}(\text{filter})}{\text{size}}$$
- Set cardinality in bitmaps:
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
      count += __builtin_popcountll(bitmap[i]);
Vectorised population count (AVX-512 VPOPCNTDQ):
; Count bits in 8 qwords simultaneously
vpopcntq zmm1, zmm0
; Each qword in zmm1 contains popcount of corresponding qword in zmm0

Instruction: UD2 (Undefined Instruction)
Intentionally raises an Invalid Opcode Exception (#UD):
Opcode: 0F 0B
Behavior: #UD exception (Interrupt 6)
Use cases:

- Compiler trap for unreachable code:
  __builtin_unreachable(); // Often compiles to UD2
- Debug assertions:
  if (!condition) __builtin_trap(); // UD2
- Kernel panic/halt:
  kernel_panic:
      cli
      ud2
      jmp kernel_panic
- Marking invalid code paths:
  ; After a noreturn call (e.g. exit)
  ud2 ; Should never execute
Instructions:

- RDRAND: Get random number from DRNG (NIST SP 800-90A compliant)
- RDSEED: Get entropy directly from hardware source
Entropy source:
Intel's on-die Digital Random Number Generator uses:
- Thermal noise from transistor metastability
- AES-CBC-MAC conditioning
- CTR_DRBG for RDRAND output
Usage pattern:
get_secure_random:
xor ecx, ecx
.retry:
rdrand rax
jc .success ; CF=1 means valid random
inc ecx
cmp ecx, 10
jl .retry
xor eax, eax ; Failed after 10 attempts
ret
.success:
; rax contains 64 bits of random data
ret

RDSEED vs RDRAND:
| Property | RDRAND | RDSEED |
|---|---|---|
| Source | DRBG output | Raw entropy |
| Rate | Higher (several GB/s) | Lower (limited by entropy) |
| Failure rate | Very low | Higher under heavy load |
| Use case | General random | Seeding other PRNGs |
Instructions:

- XBEGIN rel32: Begin transaction (branch to rel32 on abort)
- XEND: Commit transaction
- XABORT imm8: Explicitly abort with reason code
Transactional execution model:
stateDiagram-v2
[*] --> Transactional: XBEGIN
Transactional --> Committed: XEND
Transactional --> Aborted: Conflict/XABORT
Committed --> [*]
Aborted --> FallbackPath
FallbackPath --> [*]
Lock elision pattern:
transaction_update:
xacquire lock add [lock_var], 1 ; HLE prefix (deprecated)
; Or with RTM:
xbegin .fallback
; Transactional region - no locks needed!
mov rax, [shared_data]
add rax, 1
mov [shared_data], rax
xend
ret
.fallback:
; Transaction aborted - use traditional locking
lock add [shared_data], 1
ret

Abort codes (returned in EAX):
| Bit | Meaning |
|---|---|
| 0 | Abort caused by XABORT |
| 1 | Transaction may succeed on retry |
| 2 | Conflict with another processor |
| 3 | Buffer overflow |
| 4 | Debug breakpoint hit |
| 5 | Nested transaction abort |
| 31:24 | XABORT immediate value |
Current status: Intel TSX has been disabled on many processors due to security vulnerabilities (TAA, ZombieLoad).
Instructions:

- AESENC xmm, xmm/m128: One AES encryption round
- AESENCLAST xmm, xmm/m128: Final AES encryption round
- AESDEC xmm, xmm/m128: One AES decryption round
- AESDECLAST xmm, xmm/m128: Final AES decryption round
- AESKEYGENASSIST xmm, xmm/m128, imm8: Key expansion assistance
AES round structure:
Each AESENC performs:
- SubBytes (S-box substitution)
- ShiftRows (cyclic row shifts)
- MixColumns (column mixing)
- AddRoundKey (XOR with round key)
Complete AES-128 encryption:
aes128_encrypt:
; xmm0 = plaintext, xmm1-xmm11 = round keys
pxor xmm0, xmm1 ; Round 0: AddRoundKey only
aesenc xmm0, xmm2 ; Round 1
aesenc xmm0, xmm3 ; Round 2
aesenc xmm0, xmm4 ; Round 3
aesenc xmm0, xmm5 ; Round 4
aesenc xmm0, xmm6 ; Round 5
aesenc xmm0, xmm7 ; Round 6
aesenc xmm0, xmm8 ; Round 7
aesenc xmm0, xmm9 ; Round 8
aesenc xmm0, xmm10 ; Round 9
aesenclast xmm0, xmm11 ; Round 10 (no MixColumns)
; xmm0 = ciphertext
ret

Performance (cycles per byte):
| Implementation | AES-128 | AES-256 |
|---|---|---|
| Software table | ~20 cpb | ~25 cpb |
| AES-NI | ~0.7 cpb | ~0.9 cpb |
| AES-NI + pipelining | ~0.4 cpb | ~0.5 cpb |
Instructions:

- CLFLUSH m8: Flush cache line (ordered, slow)
- CLFLUSHOPT m8: Flush cache line (weakly ordered, fast)
- CLWB m8: Write back cache line (keep in cache)
Cache hierarchy:
┌─────────────┐
│ L1 Data │ 32-64 KB, ~4 cycles
├─────────────┤
│ L2 │ 256 KB - 1 MB, ~12 cycles
├─────────────┤
│ L3 │ 8-64 MB, ~40 cycles
├─────────────┤
│ DRAM │ ~200 cycles
└─────────────┘
Use cases:

- Persistent memory (PMEM) programming:
  ; Ensure data reaches persistent storage
  mov [pmem_addr], rax
  clwb [pmem_addr]
  sfence
- DMA coherency:
  ; Flush before DMA read from device
  clflush [dma_buffer]
  mfence ; Now safe for device to read
- Security (defense against cache timing attacks):
  ; Clear sensitive data from cache
  clflush [secret_key]
  clflush [secret_key + 64]
- Rowhammer research/exploitation:
  .hammer:
      mov rax, [addr1]
      clflush [addr1]
      mov rbx, [addr2]
      clflush [addr2]
      mfence
      jmp .hammer
Instruction: BSWAP reg32/64
Reverses byte order for endianness conversion:
32-bit transformation:
Before: [B₃, B₂, B₁, B₀] (little-endian)
After: [B₀, B₁, B₂, B₃] (big-endian)
Network byte order conversion:
; htonl equivalent
htonl:
bswap edi
mov eax, edi
ret
; ntohll (64-bit)
ntohll:
bswap rdi
mov rax, rdi
ret

SIMD byte reversal (for vectors):
bswap_16bytes:
; xmm0 = input
; Reverse bytes using PSHUFB
movdqa xmm1, [bswap_mask] ; 15,14,13,...,2,1,0
pshufb xmm0, xmm1
ret

Instructions:

- INVD: Invalidate caches WITHOUT writeback (Ring 0 only)
- WBINVD: Write back AND invalidate all caches (Ring 0 only)
INVD hazards:
1. All modified cache lines are DISCARDED
2. Memory state becomes inconsistent
3. System may crash or corrupt data
4. Only use in very specific firmware scenarios
Legitimate use cases:

- BIOS/UEFI initialisation:
  ; Before enabling caching for the first time
  wbinvd
  ; Configure MTRRs
  ; Enable caching
- Cache-as-RAM (CAR) teardown:
  ; After copying CAR data to real RAM
  wbinvd
  ; Reconfigure caching
- Hardware reset preparation:
  ; Ensure memory is consistent before reset
  wbinvd
  cli
  hlt
BMI1 instructions:

- ANDN r, r, r/m: $\text{dest} = (\lnot \text{src1}) \land \text{src2}$
- BEXTR r, r/m, r: Extract bit field
- BLSI r, r/m: Isolate lowest set bit: $\text{dest} = \text{src} \land (-\text{src})$
- BLSMSK r, r/m: Mask up to lowest set bit: $\text{dest} = \text{src} \oplus (\text{src} - 1)$
- BLSR r, r/m: Reset lowest set bit: $\text{dest} = \text{src} \land (\text{src} - 1)$
- TZCNT r, r/m: Count trailing zeros
BMI2 instructions:

- BZHI r, r/m, r: Zero high bits starting at position
- MULX r, r, r/m: Unsigned multiply without affecting flags
- PDEP r, r, r/m: Parallel bit deposit
- PEXT r, r, r/m: Parallel bit extract
- RORX r, r/m, imm8: Rotate right without affecting flags
- SARX/SHLX/SHRX r, r/m, r: Shifts without affecting flags
PDEP/PEXT visualisation:
PEXT (extract):
Source: 1 0 1 1 0 1 1 0
Mask: 1 0 1 0 0 1 0 0
Result: 0 0 0 0 0 1 1 1 (extracted bits: 1,1,1)
PDEP (deposit):
Source: 0 0 0 0 0 1 0 1
Mask: 1 0 1 0 0 1 0 0
Result: 1 0 0 0 0 1 0 0 (deposited at mask positions)
Chess move generation with PEXT:
generate_rook_attacks:
; rdi = square, rsi = occupied bitboard
mov rax, [rook_masks + rdi*8] ; Get relevant occupancy mask
pext rax, rsi, rax ; Extract relevant bits
mov rcx, [rook_attacks + rdi*8] ; Get attack table base
mov rax, [rcx + rax*8] ; Look up attacks
ret

Instruction: CRC32 r, r/m
Hardware-accelerated CRC-32C (Castagnoli polynomial):
Polynomial representation: 0x1EDC6F41 (iSCSI, ext4, Btrfs)
Usage:
crc32c_block:
; rdi = buffer, rsi = length
mov eax, -1 ; Initialise CRC to all-ones (standard CRC-32C)
.loop_8:
cmp rsi, 8
jl .loop_1
crc32 rax, qword [rdi]
add rdi, 8
sub rsi, 8
jmp .loop_8
.loop_1:
test rsi, rsi
jz .done
crc32 rax, byte [rdi]
inc rdi
dec rsi
jmp .loop_1
.done:
; Finalise: return ~crc for standard CRC-32C
not eax
ret

Performance comparison:
| Method | Throughput |
|---|---|
| Table-driven (8-bit) | ~500 MB/s |
| Slicing-by-8 | ~2 GB/s |
| CRC32 instruction | ~20 GB/s |
| PCLMULQDQ folding | ~40 GB/s |
Key instructions across microarchitectures (cycles):
| Instruction | Skylake | Ice Lake | Zen 3 | Zen 4 |
|---|---|---|---|---|
| VADDPS (256) | 4 | 4 | 3 | 3 |
| VMULPS (256) | 4 | 4 | 3 | 3 |
| VFMADD (256) | 4 | 4 | 4 | 4 |
| VDIVPS (256) | 11 | 11 | 10 | 10 |
| VSQRTPS (256) | 12 | 12 | 12 | 13 |
| VPERM2B (512) | 3 | 3 | - | 4 |
| VPDPBUSD (512) | 5 | 5 | - | 4 |
| PCLMULQDQ | 7 | 6 | 4 | 3 |
| AESENC | 4 | 4 | 4 | 3 |
Extended latency table (additional instructions):
| Instruction | Skylake | Ice Lake | Zen 3 | Zen 4 | Description |
|---|---|---|---|---|---|
| VADDPS (512) | 4 | 4 | - | 3 | FP32 vector add |
| VMULPS (512) | 4 | 4 | - | 3 | FP32 vector multiply |
| VFMADD (512) | 4 | 4 | - | 4 | Fused multiply-add |
| VDIVPS (512) | 18 | 11 | - | 10 | FP32 vector divide |
| VSQRTPS (512) | 20 | 19 | - | 13 | FP32 vector sqrt |
| VPAND (256) | 1 | 1 | 1 | 1 | Bitwise AND |
| VPOR (256) | 1 | 1 | 1 | 1 | Bitwise OR |
| VPXOR (256) | 1 | 1 | 1 | 1 | Bitwise XOR |
| VPSHUFB (256) | 1 | 1 | 1 | 1 | Byte shuffle |
| VPERMD (256) | 3 | 3 | 2 | 2 | Dword permute |
| VPERMQ (256) | 3 | 3 | 2 | 2 | Qword permute |
| VGATHERDPS (256) | 22 | 15 | 12 | 10 | Gather (best case) |
| VSCATTERDPS (512) | 40 | 20 | - | 15 | Scatter (best case) |
| POPCNT | 3 | 3 | 1 | 1 | Population count |
| LZCNT | 3 | 3 | 1 | 1 | Leading zero count |
| TZCNT | 3 | 3 | 1 | 1 | Trailing zero count |
| PDEP | 3 | 3 | 3 | 3 | Parallel bit deposit |
| PEXT | 3 | 3 | 3 | 3 | Parallel bit extract |
| CRC32 | 3 | 3 | 3 | 3 | CRC acceleration |
| RDTSC | 20 | 15 | 9 | 8 | Read timestamp |
| RDRAND | ~400 | ~200 | ~1200 | ~800 | Hardware random |
Note: Zen 1 and Zen 2 implement PDEP/PEXT in microcode (~18+ cycles, data-dependent); Zen 3 and later have native single-digit-latency hardware support.
Skylake/Ice Lake execution ports:
graph TD
subgraph "Front End (4-wide decode)"
IF[Instruction Fetch] --> ID[Instruction Decode]
ID --> UOP[µop Queue]
UOP --> RAT[Register Alias Table]
RAT --> ROB[Reorder Buffer<br/>224 entries]
end
subgraph "Execution Engine (8 ports)"
ROB --> RS[Reservation Station<br/>97 entries]
RS --> P0[Port 0<br/>ALU, FMA<br/>DIV, SQRT]
RS --> P1[Port 1<br/>ALU, FMA<br/>AES, CLMUL]
RS --> P2[Port 2<br/>Load<br/>AGU]
RS --> P3[Port 3<br/>Load<br/>AGU]
RS --> P4[Port 4<br/>Store Data]
RS --> P5[Port 5<br/>ALU, Shuffle<br/>Branch]
RS --> P6[Port 6<br/>ALU<br/>Branch]
RS --> P7[Port 7<br/>Store AGU]
end
subgraph "Memory Subsystem"
P2 --> L1D[L1 Data Cache<br/>32KB, 8-way]
P3 --> L1D
P4 --> SB[Store Buffer<br/>56 entries]
SB --> L1D
L1D --> L2[L2 Cache<br/>256KB-1MB]
L2 --> L3[L3 Cache<br/>Shared]
end
Port assignment for key SIMD operations:
| Operation Category | Ports | Throughput |
|---|---|---|
| FP Add (256-bit) | P0, P1 | 2/cycle |
| FP Mul (256-bit) | P0, P1 | 2/cycle |
| FP FMA (256-bit) | P0, P1 | 2/cycle |
| FP Add (512-bit) | P0, P1 | 1/cycle |
| FP Mul (512-bit) | P0, P1 | 1/cycle |
| FP FMA (512-bit) | P0, P1 | 1/cycle |
| FP Div (any) | P0 | 1/3-18 cycles |
| FP Sqrt (any) | P0 | 1/4-20 cycles |
| Integer Add/Logic | P0, P1, P5 | 3/cycle |
| Shuffle (256-bit) | P5 | 1/cycle |
| Shuffle (512-bit) | P5 | 0.5/cycle |
| AES round | P0 | 1/cycle |
| CLMUL | P0 | 1/cycle |
| Load (256-bit) | P2, P3 | 2/cycle |
| Load (512-bit) | P2+P3 | 1/cycle |
| Store (256-bit) | P4+P7 | 1/cycle |
| Store (512-bit) | P4+P7 | 1/cycle |
Reciprocal throughput vs. latency:
Understanding the difference is critical for optimisation:
- Latency: Cycles from input ready to output ready (single instruction)
- Throughput: How often an instruction can start (reciprocal = cycles between starts)
Example: FMA throughput analysis:
Latency = 4 cycles
Throughput = 2 ops/cycle (ports P0 + P1)
Reciprocal Throughput = 0.5 cycles
To saturate both FMA units:

- Need latency × throughput = 4 × 2 = 8 independent accumulation chains
- i.e. 8 FMAs in flight at any time, one per chain
Throughput comparison chart:
xychart-beta
title "Peak SIMD Throughput (GFLOPS @ 3GHz)"
x-axis ["SSE (128)", "AVX (256)", "AVX-512", "AVX-512 VNNI"]
y-axis "GFLOPS" 0 --> 400
bar [48, 96, 192, 384]
Calculating theoretical peak:

For AVX-512 at 3 GHz with 2 FMA units:

$$16 \text{ lanes} \times 2 \text{ FLOP/FMA} \times 2 \text{ units} \times 3\,\text{GHz} = 192 \text{ GFLOPS (FP32)}$$
1. Port Contention:
When multiple instructions compete for the same execution port:
; BAD: All three need Port 0
vdivps ymm0, ymm1, ymm2 ; P0
vsqrtps ymm3, ymm4 ; P0
vrcpps ymm5, ymm6 ; P0
; Serialised execution!
; BETTER: Interleave with different ports
vdivps ymm0, ymm1, ymm2 ; P0 (starts)
vaddps ymm7, ymm8, ymm9 ; P0 or P1
vpshufb ymm10, ymm11, ymm12 ; P5
vmulps ymm13, ymm14, ymm15 ; P0 or P1
; Better utilisation

2. Register Dependencies:
Read-after-write (RAW) hazards create stalls:
; Latency-bound chain (16 cycles total)
vfmadd231ps ymm0, ymm1, ymm2 ; 4 cycles
vfmadd231ps ymm0, ymm3, ymm4 ; waits for ymm0
vfmadd231ps ymm0, ymm5, ymm6 ; waits for ymm0
vfmadd231ps ymm0, ymm7, ymm8 ; waits for ymm0
; Throughput-optimised (parallel, 4 cycles + reduce)
vfmadd231ps ymm0, ymm1, ymm2 ; 4 independent chains
vfmadd231ps ymm10, ymm3, ymm4
vfmadd231ps ymm11, ymm5, ymm6
vfmadd231ps ymm12, ymm7, ymm8
vaddps ymm0, ymm0, ymm10 ; Reduce at end
vaddps ymm11, ymm11, ymm12
vaddps ymm0, ymm0, ymm11

3. Memory Bandwidth Limits:
For memory-bound code:
xychart-beta
title "Roofline Model (DDR4-3200, 50 GB/s)"
x-axis "Arithmetic Intensity (FLOP/Byte)" 0 --> 20
y-axis "GFLOPS" 0 --> 200
line [0, 50, 100, 150, 192, 192, 192, 192, 192, 192]
4. AVX-512 Frequency Throttling:
Heavy AVX-512 usage causes frequency reduction on some Intel CPUs:
| License Level | Trigger | Frequency Reduction |
|---|---|---|
| L0 (Normal) | Scalar/SSE | 0% |
| L1 (Light AVX) | AVX/AVX2 | ~3-5% |
| L2 (Heavy AVX-512) | AVX-512 (FP) | ~10-20% |
Mitigation strategies:
- Use AVX2 for short bursts
- Batch AVX-512 work to amortize transition overhead
- Profile actual workload performance, not microbenchmarks
Cache line utilisation:

$$\text{Utilisation} = \frac{\text{bytes used per line}}{64}$$

Example: Stride access patterns:
| Stride | Utilisation | Effective BW |
|---|---|---|
| 4 bytes (sequential float) | 100% | 50 GB/s |
| 8 bytes (every other) | 50% | 25 GB/s |
| 64 bytes (1 per line) | 6.25% | 3.1 GB/s |
| 4096 bytes (1 per page) | 0.1% | 50 MB/s |
Prefetch distance calculation:

For 200-cycle memory latency, 4-cycle loop, 64 bytes/iteration:

$$d = \left\lceil \frac{200}{4} \right\rceil = 50 \text{ iterations ahead} \quad (= 50 \times 64 = 3200 \text{ bytes})$$
Macro-op fusion (x86-64):
Certain instruction pairs fuse into single µops:
| Pattern | Fused? | Example |
|---|---|---|
| CMP + JCC | ✓ | cmp rax, rbx; jne label |
| TEST + JCC | ✓ | test rax, rax; jz label |
| ADD + JCC | ✓ (Intel) | add rax, 1; jnz label |
| AND + JCC | ✗ | and rax, rbx; jz label |
Micro-op counts for common SIMD:
| Instruction | µops (Skylake) | µops (Zen 3) |
|---|---|---|
| VADDPS ymm | 1 | 1 |
| VFMADD231PS ymm | 1 | 1 |
| VDIVPS ymm | 1 | 1 |
| VPSHUFB ymm | 1 | 1 |
| VGATHERDPS ymm | 5 | 4 |
| VSCATTERDPS zmm | 16+ | N/A |
| VPTERNLOGD zmm | 1 | N/A |
| VPERMI2D zmm | 3 | N/A |
Avoiding branches with SIMD:
; Scalar with branch (misprediction penalty ~15 cycles)
.loop:
mov eax, [rsi]
cmp eax, 0
jl .negative
mov [rdi], eax
jmp .next
.negative:
neg eax
mov [rdi], eax
.next:
; ...
; Branchless SIMD (absolute value)
vmovdqu ymm0, [rsi]
vpabsd ymm0, ymm0 ; No branches!
vmovdqu [rdi], ymm0

Blend-based conditional:
; Conditional max(a, b) without branches
vcmpps ymm2, ymm0, ymm1, 1 ; per lane: ymm2 = (ymm0 < ymm1) ? all-ones : 0
vblendvps ymm3, ymm0, ymm1, ymm2 ; Select the larger values
Calculate optimal unroll factor:
; 4x unrolled dot product
dot_product_unrolled:
vxorps ymm0, ymm0, ymm0 ; Accumulator 0
vxorps ymm1, ymm1, ymm1 ; Accumulator 1
vxorps ymm2, ymm2, ymm2 ; Accumulator 2
vxorps ymm3, ymm3, ymm3 ; Accumulator 3
.loop:
vmovups ymm4, [rdi]
vmovups ymm5, [rdi + 32]
vmovups ymm6, [rdi + 64]
vmovups ymm7, [rdi + 96]
vfmadd231ps ymm0, ymm4, [rsi]
vfmadd231ps ymm1, ymm5, [rsi + 32]
vfmadd231ps ymm2, ymm6, [rsi + 64]
vfmadd231ps ymm3, ymm7, [rsi + 96]
add rdi, 128
add rsi, 128
sub rcx, 32
jg .loop
; Reduce 4 accumulators
vaddps ymm0, ymm0, ymm1
vaddps ymm2, ymm2, ymm3
vaddps ymm0, ymm0, ymm2
; Horizontal sum of ymm0...
ret
Overlap loads with computation:
; Pipelined: load next while computing current
.prolog:
vmovups ymm8, [rdi] ; Load iteration 0
add rdi, 32
.loop:
vmovups ymm9, [rdi] ; Load iteration N+1
vfmadd231ps ymm0, ymm8, [rsi] ; Compute iteration N
vmovaps ymm8, ymm9 ; Prepare for next
add rdi, 32
add rsi, 32
sub rcx, 8
jg .loop
.epilog:
vfmadd231ps ymm0, ymm8, [rsi] ; Final iteration
Efficient horizontal sum (256-bit):
horizontal_sum_256:
; Input: ymm0 = [a, b, c, d, e, f, g, h]
vextractf128 xmm1, ymm0, 1 ; xmm1 = [e, f, g, h]
vaddps xmm0, xmm0, xmm1 ; xmm0 = [a+e, b+f, c+g, d+h]
vhaddps xmm0, xmm0, xmm0 ; xmm0 = [a+e+b+f, c+g+d+h, ...]
vhaddps xmm0, xmm0, xmm0 ; xmm0 = [sum, sum, sum, sum]
; Result in xmm0[0]
ret
Efficient horizontal sum (512-bit):
horizontal_sum_512:
; Input: zmm0
vextractf64x4 ymm1, zmm0, 1
vaddps ymm0, ymm0, ymm1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vhaddps xmm0, xmm0, xmm0
vhaddps xmm0, xmm0, xmm0
ret
; BAD: Zeroing upper bits unnecessarily
vmovss xmm0, [rdi] ; Zeros upper bits (SSE)
vaddss xmm0, xmm0, xmm1 ; But we just need scalar
; BETTER: Use VEX-encoded scalar ops
vaddss xmm0, xmm1, [rdi] ; Single instruction
; EXPENSIVE: Cross-lane shuffle in AVX2 (3-cycle latency)
vpermd ymm0, ymm1, ymm2
; CHEAPER: In-lane operations when possible (1-cycle latency)
vpshufb ymm0, ymm1, ymm2 ; Within 128-bit lanes
; SLOW: Gather for contiguous-ish data
vgatherdps ymm0, [rdi + ymm1*4], ymm2 ; 15+ cycles
; FASTER: Load + permute for predictable patterns
vmovups ymm0, [rdi]
vpermd ymm0, ymm3, ymm0 ; 4-5 cycles total
#include <ittnotify.h>
void simd_kernel(void) {
    static __itt_domain *domain;
    static __itt_string_handle *task;
    if (!domain) {               /* create ITT handles once, not per call */
        domain = __itt_domain_create("SIMD");
        task = __itt_string_handle_create("kernel");
    }
    __itt_task_begin(domain, __itt_null, __itt_null, task);
    // SIMD code here
    __itt_task_end(domain);
}
Key events to monitor:
| Counter | Description | Target |
|---|---|---|
| FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE | 256-bit FP ops | Maximise |
| FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE | 512-bit FP ops | Maximise |
| UOPS_DISPATCHED_PORT.PORT_0 | Port 0 usage | Balance |
| UOPS_DISPATCHED_PORT.PORT_1 | Port 1 usage | Balance |
| UOPS_DISPATCHED_PORT.PORT_5 | Port 5 usage | Balance |
| MEM_LOAD_RETIRED.L1_MISS | L1 cache misses | Minimise |
| MEM_LOAD_RETIRED.L3_MISS | L3 cache misses | Minimise |
| MACHINE_CLEARS.COUNT | Pipeline clears | Zero |
Using perf on Linux:
perf stat -e fp_arith_inst_retired.256b_packed_single,\
fp_arith_inst_retired.512b_packed_single,\
cycles,instructions ./my_simd_program
| Symptom | Likely Cause | Solution |
|---|---|---|
| Low IPC | Dependency chains | Unroll, use more accumulators |
| High L1 misses | Poor spatial locality | Prefetch, tile loops |
| Port 0 saturated | Too many DIV/SQRT | Approximate or interleave |
| Frequency drop | Heavy AVX-512 | Consider AVX2, batch work |
| Branch mispredicts | Data-dependent branches | Use SIMD blend/mask |
- Intel® 64 and IA-32 Architectures Software Developer's Manual
  - Volume 1: Basic Architecture
  - Volume 2: Instruction Set Reference
  - Volume 3: System Programming Guide
  - Intel SDM
- Intel® Intrinsics Guide
  - Interactive intrinsics reference
  - Intel Intrinsics Guide
- AMD64 Architecture Programmer's Manual
  - Volumes 1-5
  - AMD Developer Guides
- Agner Fog's Optimisation Manuals
  - Instruction Tables
  - Microarchitecture Guide
  - Optimising Assembly
  - Optimising C++
  - Agner's Site
- uops.info
  - Detailed instruction latency/throughput data
  - Port usage information
  - uops.info
- WikiChip
  - CPU microarchitecture details
  - WikiChip
- Chips and Cheese
  - Deep-dive microarchitecture analysis
  - Chips and Cheese
- Compiler Explorer (Godbolt)
  - See generated assembly
  - godbolt.org
- LLVM-MCA (Machine Code Analyzer)
  - Static analysis of assembly
  - llvm-mca -mcpu=skylake < code.s
- Intel SDE (Software Development Emulator)
  - Emulate future instruction sets
  - Intel SDE
- SIMD Vectorisation:
  - "Automatic SIMD Vectorisation of SSA-based Control Flow Graphs" (CGO 2011)
  - "Exploiting Superword Level Parallelism with Multimedia Instruction Sets" (PLDI 2000)
- Performance Modeling:
  - "Roofline: An Insightful Visual Performance Model" (CACM 2009)
  - "A Stochastic Model for Superscalar Execution" (IEEE TC 2018)
XMM0-XMM15: 128-bit (SSE, AVX lower half)
YMM0-YMM15: 256-bit (AVX, AVX2)
ZMM0-ZMM31: 512-bit (AVX-512)
K0-K7: Opmask registers (AVX-512)
| Prefix | Meaning | Example |
|---|---|---|
| V | VEX/EVEX encoded | VADDPS vs ADDPS |
| P | Packed | PMULLD |
| S | Scalar | ADDSS |
| B | Byte | PSHUFB |
| W | Word | PMULLW |
| D | Dword | PADDD |
| Q | Qword | PADDQ |
| H | Horizontal | HADDPS |
| U | Unaligned | MOVUPS |
| A | Aligned | MOVAPS |
| NT | Non-temporal | MOVNTPS |
_mm256_add_ps
│ │ │
│ │ └─ Element type (ps=packed single, pd=packed double,
│ │ epi32=packed int32, si256=256-bit int)
│ └───── Operation
└────────── Register width (_mm=128, _mm256=256, _mm512=512)
#include <cpuid.h>
#include <stdbool.h>
typedef struct {
bool sse, sse2, sse3, ssse3, sse41, sse42;
bool avx, avx2;
bool avx512f, avx512bw, avx512dq, avx512vl;
bool avx512vnni, avx512vbmi, avx512ifma;
bool bmi1, bmi2, popcnt, aesni, pclmul;
bool fma, f16c, rdrand, rdseed;
} CPUFeatures;
CPUFeatures detect_features(void) {
CPUFeatures f = {0};
unsigned int eax, ebx, ecx, edx;
__cpuid(1, eax, ebx, ecx, edx);
f.sse = (edx >> 25) & 1;
f.sse2 = (edx >> 26) & 1;
f.sse3 = (ecx >> 0) & 1;
f.ssse3 = (ecx >> 9) & 1;
f.sse41 = (ecx >> 19) & 1;
f.sse42 = (ecx >> 20) & 1;
f.aesni = (ecx >> 25) & 1;
f.pclmul = (ecx >> 1) & 1;
f.avx = (ecx >> 28) & 1; /* NB: usable AVX also requires OSXSAVE (ecx bit 27) and an XGETBV check that the OS saves YMM state */
f.fma = (ecx >> 12) & 1;
f.f16c = (ecx >> 29) & 1;
f.popcnt = (ecx >> 23) & 1;
f.rdrand = (ecx >> 30) & 1;
__cpuid_count(7, 0, eax, ebx, ecx, edx);
f.avx2 = (ebx >> 5) & 1;
f.bmi1 = (ebx >> 3) & 1;
f.bmi2 = (ebx >> 8) & 1;
f.avx512f = (ebx >> 16) & 1;
f.avx512dq = (ebx >> 17) & 1;
f.avx512ifma = (ebx >> 21) & 1;
f.avx512bw = (ebx >> 30) & 1;
f.avx512vl = (ebx >> 31) & 1;
f.avx512vbmi = (ecx >> 1) & 1;
f.avx512vnni = (ecx >> 11) & 1;
f.rdseed = (ebx >> 18) & 1;
return f;
}