WillKirkmanM/x86

x86

Top 10 x86 Instructions Covering Hardware-Accelerated Cryptography & Finite Field Arithmetic, Vectorised String Processing & Text Mining, Precision Scientific Computing, Digital Signal Processing, FFI, AVX-512 SIMD, Galois Fields (GF(2ⁿ)), AES-NI & Hardware Transactional Memory — Including 10 Honorable Mentions


Table of Contents

  1. Introduction
  2. Architectural Foundations
  3. Repository Structure
  4. Deep Learning & Tensor Operations
  5. Cryptography & Finite Fields
  6. String Processing & Text Mining
  7. Scientific Computing & Signal Processing
  8. Honorable Mentions: The Unsung Heroes
  9. Performance Analysis & Microarchitecture
  10. References & Manuals

Introduction

Single Instruction, Multiple Data (SIMD) represents the fundamental paradigm shift in modern processor design, enabling parallel execution of identical operations across multiple data elements simultaneously. This atlas provides comprehensive coverage of x86-64 SIMD extensions from their SSE origins through the latest AVX-512 instruction subsets.

The SIMD Advantage

The theoretical speedup of SIMD operations follows from Amdahl's Law applied to data parallelism:

$$S_{\text{SIMD}} = \frac{1}{(1-P) + \frac{P}{N}}$$

Where:

  • $S_{\text{SIMD}}$ = Speedup factor
  • $P$ = Fraction of parallelizable code
  • $N$ = Vector width (elements per register)

For AVX-512 with 512-bit registers processing 32-bit floats ($N=16$), and $P=0.95$:

$$S_{\text{SIMD}} = \frac{1}{0.05 + \frac{0.95}{16}} = \frac{1}{0.109375} \approx 9.1\times$$
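As a quick sanity check (not part of the repository), the speedup formula can be evaluated directly:

```python
def simd_speedup(p, n):
    """Amdahl's-law speedup for parallelisable fraction p and vector width n."""
    return 1.0 / ((1.0 - p) + p / n)
```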

Historical Evolution

timeline
    title x86 SIMD Evolution
    1997 : MMX (64-bit, integer only)
    1999 : SSE (128-bit XMM, SP float)
    2001 : SSE2 (DP float, integer)
    2004 : SSE3 (horizontal ops)
    2006 : SSSE3 (shuffle, abs)
    2007 : SSE4.1 (blend, round)
    2008 : SSE4.2 (strings, CRC)
    2011 : AVX (256-bit YMM)
    2013 : AVX2 (integer 256-bit)
    2016 : AVX-512 (512-bit ZMM)
    2019 : AVX-512 VNNI
    2023 : AVX-512 FP16, AVX10

Architectural Foundations

Register Hierarchy

The x86-64 SIMD register file follows a nested structure where smaller registers alias to portions of larger ones:

┌──────────────────────────────────────────────────────────────────┐
│                              ZMM0 (512 bits)                     │
├────────────────────────────────────┬─────────────────────────────┤
│              YMM0 (256 bits)       │         (upper 256)         │
├──────────────────┬─────────────────┤                             │
│   XMM0 (128 bits)│   (upper 128)   │                             │
└──────────────────┴─────────────────┴─────────────────────────────┘

Register counts by extension:

| Extension | Registers | Width | Total Capacity |
|-----------|-----------|-------|----------------|
| SSE-SSE4  | XMM0-15   | 128b  | 256 bytes      |
| AVX/AVX2  | YMM0-15   | 256b  | 512 bytes      |
| AVX-512   | ZMM0-31   | 512b  | 2048 bytes     |

AVX-512 also introduces 8 opmask registers (k0-k7) for predicated execution:

$$\text{Result}_i = \begin{cases} \text{Operation}(A_i, B_i) & \text{if } k_j[i] = 1 \\ \text{Merge/Zero} & \text{if } k_j[i] = 0 \end{cases}$$

Memory Alignment & Cache Considerations

Optimal SIMD performance requires proper data alignment. Misaligned accesses on older architectures incurred significant penalties:

$$T_{\text{misaligned}} = T_{\text{aligned}} + T_{\text{penalty}} \cdot \lceil \frac{\text{offset}}{\text{cache line}} \rceil$$

Cache line split behavior:

graph LR
    subgraph "Cache Line 0 (64 bytes)"
        A[Bytes 0-63]
    end
    subgraph "Cache Line 1 (64 bytes)"
        B[Bytes 64-127]
    end
    
    C[Aligned 64B Load] --> A
    D[Misaligned Load] --> A
    D --> B

Alignment requirements by instruction set:

| Instruction  | Aligned Variant | Unaligned Variant | Penalty (pre-Nehalem) |
|--------------|-----------------|-------------------|-----------------------|
| SSE Load     | MOVAPS          | MOVUPS            | ~10 cycles            |
| AVX Load     | VMOVAPS         | VMOVUPS           | ~0-1 cycles           |
| AVX-512 Load | VMOVAPS         | VMOVUPS           | ~0 cycles             |

Instruction Encoding

AVX-512 instructions use the EVEX prefix (4 bytes), enabling:

  • 32 vector registers (5-bit encoding)
  • Opmask registers
  • Broadcast from memory
  • Embedded rounding control
  • Suppress-all-exceptions (SAE)
EVEX Prefix Structure:
┌────────┬────────┬────────┬────────┐
│  62h   │ P0     │ P1     │ P2     │
└────────┴────────┴────────┴────────┘
           │         │         │
           │         │         └── aaa: opmask, z: zeroing, L'L: vector length
           │         └── vvvv: NDS register, W: operand size
           └── RXB: register extensions, R': high-16 regs, mm: opcode map

Repository Structure

graph TD;
    Root[x86 SIMD Atlas]
    Root-->AI_ML[Deep Learning & AI]
    Root-->Crypto[Cryptography]
    Root-->Strings[String Processing]
    Root-->HPC[Scientific HPC]
    Root-->Util[Utility Scripts]
    
    AI_ML-->VPDPBUSD[VPDPBUSD.asm<br/>INT8 Dot Product]
    AI_ML-->VPTERNLOGD[VPTERNLOGD.asm<br/>Ternary Logic]
    
    Crypto-->PCLMULQDQ[PCLMULQDQ.asm<br/>GF Multiplication]
    
    Strings-->PCMPESTRI[PCMPESTRI.asm<br/>Substring Search]
    Strings-->PMOVMSKB[PMOVMSKB.asm<br/>Mask Extraction]
    
    HPC-->VGATHERDPS[VGATHERDPS.asm<br/>Sparse Gather]
    HPC-->VPERM2B[VPERM2B.asm<br/>Byte Permutation]
    HPC-->RSQRTPS[RSQRTPS.asm<br/>Fast Inv Sqrt]
    HPC-->MPSADBW[MPSADBW.asm<br/>Motion Estimation]
    
    Util-->rename[rename_to_uppercase.py]
    Util-->readme[readme.py]
    
    style AI_ML fill:#e1f5fe
    style Crypto fill:#fff3e0
    style Strings fill:#f3e5f5
    style HPC fill:#e8f5e9

Deep Learning & Tensor Operations

Modern neural networks rely on massive matrix multiplications. The computational intensity of a single convolution layer is:

$$\text{FLOPs} = 2 \times K^2 \times C_{in} \times C_{out} \times H_{out} \times W_{out}$$

Where $K$ = kernel size, $C$ = channels, and $H/W$ = spatial dimensions.

VNNI Dot Product (INT8)

Instruction: VPDPBUSD (Vector Packed Dot Product of Unsigned Byte and Signed Byte to Dword)

File: VPDPBUSD.asm

This instruction performs the core INT8 multiply-accumulate operation essential for quantised neural network inference:

$$\text{ACC}_i = \text{ACC}_i + \sum_{j=0}^{3} A_{4i+j}^{u8} \times B_{4i+j}^{s8}$$

Where each 32-bit accumulator lane $i$ receives products from 4 byte pairs.

Mathematical breakdown for one lane:

Input A (unsigned bytes): [a₀, a₁, a₂, a₃]  (values 0-255)
Input B (signed bytes):   [b₀, b₁, b₂, b₃]  (values -128 to 127)

Output = ACC + (a₀×b₀) + (a₁×b₁) + (a₂×b₂) + (a₃×b₃)
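A minimal Python reference model of one accumulator lane (an illustration, not the repository's code — note that plain VPDPBUSD wraps the 32-bit accumulator rather than saturating; VPDPBUSDS is the saturating variant):

```python
def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """One 32-bit lane of VPDPBUSD: acc += sum of four u8*s8 products,
    with signed 32-bit wrap-around like the hardware accumulator."""
    s = acc + sum(a * b for a, b in zip(a_bytes, b_bytes))
    s &= 0xFFFFFFFF
    return s - (1 << 32) if s >= (1 << 31) else s
```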

Throughput comparison (ops/cycle on Cascade Lake):

xychart-beta
    title "INT8 Throughput Comparison"
    x-axis ["Scalar", "SSE4", "AVX2", "AVX-512", "VNNI"]
    y-axis "GOP/s per Core" 0 --> 4000
    bar [125, 500, 1000, 2000, 4000]

Implementation:

; filepath: VPDPBUSD.asm
global vnni_dot_product

section .text
vnni_dot_product:
    ; Inputs:
    ; ZMM0: Accumulator (32-bit integers, 16 lanes)
    ; ZMM1: Input A (Unsigned Bytes - 64 elements)
    ; ZMM2: Input B (Signed Bytes - 64 elements)
    
    vpdpbusd zmm0, zmm1, zmm2
    
    ; Each of 16 dword lanes accumulates:
    ; sum of 4 products of (u8 × s8)
    ret

Quantisation error analysis:

The quantisation from FP32 to INT8 introduces error bounded by:

$$\epsilon_q = \frac{\text{scale}}{2} = \frac{\max(|x|)}{127 \times 2}$$

For symmetric quantisation with scale $s$ and zero-point $z=0$:

$$x_q = \text{round}\left(\frac{x}{s}\right), \quad x_{dq} = x_q \times s$$

Ternary Logic

Instruction: VPTERNLOGD (Bitwise Ternary Logic)

File: VPTERNLOGD.asm

VPTERNLOGD implements any 3-input boolean function via an 8-bit lookup table (immediate):

$$\text{Result} = \text{LUT}[A \cdot 4 + B \cdot 2 + C]$$

The truth table maps all 8 input combinations to outputs:

| Index | A | B | C | LUT Bit Position |
|-------|---|---|---|------------------|
| 0     | 0 | 0 | 0 | imm8[0]          |
| 1     | 0 | 0 | 1 | imm8[1]          |
| 2     | 0 | 1 | 0 | imm8[2]          |
| 3     | 0 | 1 | 1 | imm8[3]          |
| 4     | 1 | 0 | 0 | imm8[4]          |
| 5     | 1 | 0 | 1 | imm8[5]          |
| 6     | 1 | 1 | 0 | imm8[6]          |
| 7     | 1 | 1 | 1 | imm8[7]          |

Common immediate values:

| Operation        | Formula | imm8 |
|------------------|---------|------|
| A AND B AND C    | $A \land B \land C$ | 0x80 |
| A OR B OR C      | $A \lor B \lor C$ | 0xFE |
| A XOR B XOR C    | $A \oplus B \oplus C$ | 0x96 |
| (A AND B) OR C   | $(A \land B) \lor C$ | 0xEA |
| (A OR B) XOR C   | $(A \lor B) \oplus C$ | 0x56 |
| Majority(A,B,C)  | $(A \land B) \lor (B \land C) \lor (A \land C)$ | 0xE8 |
| A ? B : C (MUX)  | $(A \land B) \lor (\lnot A \land C)$ | 0xCA |
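The LUT indexing is easy to model bit-by-bit; this small Python sketch (an illustration, not the repository's code) reproduces the instruction's per-bit behaviour:

```python
def vpternlog_bit(a, b, c, imm8):
    """One bit position of VPTERNLOGD: the three input bits form an
    index (A as the MSB) into the 8-bit immediate lookup table."""
    return (imm8 >> (a * 4 + b * 2 + c)) & 1
```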

Instruction count reduction:

graph LR
    subgraph "Traditional (3 instructions)"
        A1[VPAND zmm3, zmm0, zmm1]
        A2[VPOR zmm3, zmm3, zmm2]
        A3[VPXOR zmm3, zmm3, zmm4]
    end
    
    subgraph "VPTERNLOGD (1 instruction)"
        B1[VPTERNLOGD zmm0, zmm1, zmm2, imm8]
    end

Quantisation Mathematics

Affine Quantisation:

$$q = \text{clip}\left(\text{round}\left(\frac{r}{s}\right) + z, q_{min}, q_{max}\right)$$

Where:

  • $r$ = real value
  • $s$ = scale factor
  • $z$ = zero point
  • $q_{min}, q_{max}$ = quantised range bounds

Scale and zero-point computation:

$$s = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$$

$$z = q_{min} - \text{round}\left(\frac{r_{min}}{s}\right)$$

Matrix multiplication in quantised domain:

For $Y = XW + b$:

$$Y_q = \frac{s_x \cdot s_w}{s_y}\left(X_q - z_x\right)\left(W_q - z_w\right) + z_y + \frac{b}{s_y}$$
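The affine scheme above can be sketched in a few lines of Python (a reference model under the formulas' assumptions, not the repository's code):

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Scale and zero-point from the real and quantised ranges."""
    s = (rmax - rmin) / (qmax - qmin)
    z = qmin - round(rmin / s)
    return s, z

def quantize(r, s, z, qmin=-128, qmax=127):
    """q = clip(round(r/s) + z, qmin, qmax)"""
    return max(qmin, min(qmax, round(r / s) + z))

def dequantize(q, s, z):
    """r ≈ (q - z) * s"""
    return (q - z) * s
```

A round trip through the quantised domain should land within half a quantisation step of the original value.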


Cryptography & Finite Fields

Carry-less Multiplication

Instruction: PCLMULQDQ (Carry-Less Multiplication Quadword)

File: PCLMULQDQ.asm

Performs polynomial multiplication over GF(2), essential for:

  • AES-GCM authentication tags
  • CRC computation
  • Error-correcting codes

Mathematical definition:

For polynomials $A(x) = \sum_{i=0}^{63} a_i x^i$ and $B(x) = \sum_{j=0}^{63} b_j x^j$:

$$C(x) = A(x) \cdot B(x) = \sum_{k=0}^{126} c_k x^k$$

Where:

$$c_k = \bigoplus_{i+j=k} (a_i \land b_j)$$

Note: XOR ($\oplus$) replaces addition (no carry propagation).
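A bit-serial Python model of carry-less multiplication (an illustration of the definition above, not the repository's code) makes the XOR-instead-of-add structure explicit:

```python
def clmul(a, b):
    """Carry-less (GF(2) polynomial) multiply of two non-negative ints,
    as PCLMULQDQ computes for 64-bit operands."""
    r = 0
    while b:
        if b & 1:
            r ^= a      # XOR replaces addition: no carry propagation
        a <<= 1
        b >>= 1
    return r
```

For example, $(x + 1)^2 = x^2 + 1$ over GF(2), since the cross terms cancel.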

Immediate byte selection:

imm8 Operation
0x00 xmm1[63:0] × xmm2[63:0]
0x01 xmm1[127:64] × xmm2[63:0]
0x10 xmm1[63:0] × xmm2[127:64]
0x11 xmm1[127:64] × xmm2[127:64]

GCM GHASH computation:

GHASH(H, A, C) = X_m+n+1

Where:
X_0 = 0
X_i = (X_{i-1} ⊕ A_i) • H    for i = 1..m  (AAD blocks)
X_i = (X_{i-1} ⊕ C_i) • H    for i = m+1..m+n  (Ciphertext blocks)
X_{m+n+1} = (X_{m+n} ⊕ (len(A)||len(C))) • H

The multiplication (•) in GF(2¹²⁸) uses PCLMULQDQ followed by reduction modulo the GCM polynomial:

$$p(x) = x^{128} + x^7 + x^2 + x + 1$$

Galois Field Arithmetic

GF(2ⁿ) representation:

Elements are polynomials with binary coefficients, represented as n-bit integers:

$$a = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \cdots + a_1x + a_0 \leftrightarrow (a_{n-1}, a_{n-2}, \ldots, a_0)_2$$

Reduction algorithm for GF(2¹²⁸):

After PCLMULQDQ produces a 256-bit product, reduce modulo the irreducible polynomial:

$$\text{Result} = \text{Product} \mod (x^{128} + x^7 + x^2 + x + 1)$$

; GF(2^128) reduction after PCLMULQDQ
; Input: xmm0:xmm1 = 256-bit product
; Output: xmm0 = 128-bit reduced result

gf128_reduce:
    ; Fold high 128 bits into low
    movdqa xmm2, xmm1
    pclmulqdq xmm2, xmm3, 0x00  ; xmm3 contains reduction constant
    pxor xmm0, xmm2
    
    ; Second reduction round
    movdqa xmm2, xmm0
    psrldq xmm2, 8
    pclmulqdq xmm2, xmm3, 0x00
    pxor xmm0, xmm2
    ret
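The fold-and-reduce step can be modelled exactly with Python integers as polynomials (an illustration, not the repository's code; note real GHASH additionally uses a bit-reflected representation, which this plain-polynomial sketch ignores):

```python
GCM_P = (1 << 128) | 0x87   # x^128 + x^7 + x^2 + x + 1

def clmul(a, b):
    """Carry-less multiply (PCLMULQDQ semantics, arbitrary width)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf128_reduce(product):
    """Reduce a (up to) 256-bit carry-less product modulo the GCM polynomial."""
    for i in range(255, 127, -1):
        if product >> i & 1:
            product ^= GCM_P << (i - 128)
    return product
```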

AES Round Functions

SubBytes transformation (S-box):

$$S(a) = A \cdot a^{-1} + c$$

Where $a^{-1}$ is the multiplicative inverse in GF(2⁸) with $p(x) = x^8 + x^4 + x^3 + x + 1$, $A$ is a fixed affine matrix, and $c = 0x63$.

MixColumns transformation:

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}$$

Multiplication by 02 and 03 in GF(2⁸):

$$02 \cdot b = \left((b \ll 1) \oplus (0\text{x}1B \cdot (b \gg 7))\right) \bmod 2^8$$

$$03 \cdot b = 02 \cdot b \oplus b$$
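These two multiplications can be checked against the worked examples in FIPS-197 with a tiny Python model (the helper names `xtime`/`mul3` are ours, not the repository's):

```python
def xtime(b):
    """Multiply b by 02 in AES's GF(2^8), p(x) = x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mul3(b):
    """Multiply b by 03: 03*b = 02*b XOR b."""
    return xtime(b) ^ b
```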


String Processing & Text Mining

The String Beast (PCMPESTRI)

Instruction: PCMPESTRI (Packed Compare Explicit Length Strings, Return Index)

File: PCMPESTRI.asm

This SSE4.2 instruction performs complex string operations in a single instruction:

Control byte (imm8) encoding:

| Bits | Field                 | Description |
|------|-----------------------|-------------|
| 1:0  | Source Data Format    | 00=UBytes, 01=UWords, 10=SBytes, 11=SWords |
| 3:2  | Aggregation Operation | 00=Equal Any, 01=Ranges, 10=Equal Each, 11=Equal Ordered |
| 5:4  | Polarity              | 00=Positive, 01=Negative, 10=Masked+, 11=Masked- |
| 6    | Output Selection      | 0=Least Significant, 1=Most Significant |

Aggregation operations:

  1. Equal Any (00): Character set membership $$\text{IntRes1}[i] = \bigvee_{j=0}^{n-1} (\text{Src1}[j] = \text{Src2}[i])$$

  2. Ranges (01): Character range membership $$\text{IntRes1}[i] = \bigvee_{j=0}^{n/2-1} (\text{Src1}[2j] \leq \text{Src2}[i] \leq \text{Src1}[2j+1])$$

  3. Equal Each (10): Byte-by-byte equality $$\text{IntRes1}[i] = (\text{Src1}[i] = \text{Src2}[i])$$

  4. Equal Ordered (11): Substring search $$\text{IntRes1}[i] = \bigwedge_{j=0}^{n-1} (\text{Src1}[j] = \text{Src2}[i+j])$$

Example: Finding "World" in "Hello World":

; filepath: PCMPESTRI.asm
; ...existing code...
    ; xmm0 = needle ("World"), rax = needle length (5)
    ; xmm1 = haystack ("Hello World"), rdx = haystack length (11)
    ; (PCMPESTRI takes the explicit lengths implicitly in EAX/EDX)
    pcmpestri xmm0, xmm1, 0x0C  ; Equal Ordered, unsigned byte data
    ; RCX = 6 (index where "World" starts)
; ...existing code...
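A simplified Python model of the Equal Ordered aggregation (an illustration, not the repository's code; it assumes the needle lies fully inside the 16-byte block, whereas the real instruction also tracks partial matches at the block boundary):

```python
def equal_ordered_index(needle, haystack):
    """Lowest index at which the needle matches, else the block width
    (PCMPESTRI reports 'no match' as 16 for byte data)."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return 16
```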

Performance vs. scalar loop:

xychart-beta
    title "Substring Search Performance (cycles per search)"
    x-axis ["strlen", "strstr scalar", "PCMPESTRI", "AVX2 hybrid"]
    y-axis "Cycles" 0 --> 500
    bar [45, 380, 85, 120]

Mask Extraction (PMOVMSKB)

Instruction: PMOVMSKB (Move Byte Mask)

File: PMOVMSKB.asm

Extracts the most significant bit of each byte into a general-purpose register:

$$\text{Result} = \sum_{i=0}^{15} (\text{xmm}[8i+7] \ll i)$$

Null terminator detection algorithm:

1. Load 16 bytes into XMM register
2. Compare each byte against zero (PCMPEQB)
3. Extract comparison mask (PMOVMSKB)
4. If mask ≠ 0, find first set bit (BSF)
5. Repeat for next 16-byte chunk
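The algorithm above translates directly into a byte-level Python model (an illustration, not the repository's code):

```python
def pcmpeqb_zero_mask(chunk):
    """PCMPEQB against zero followed by PMOVMSKB: bit i set iff byte i == 0."""
    return sum(1 << i for i, b in enumerate(chunk) if b == 0)

def simd_strlen(buf):
    """16-byte-chunked strlen; BSF on the mask finds the terminator."""
    i = 0
    while i < len(buf):
        mask = pcmpeqb_zero_mask(buf[i:i + 16])
        if mask:
            return i + (mask & -mask).bit_length() - 1  # BSF
        i += 16
    return len(buf)
```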

Vectorised strlen cost:

Each iteration retires one 16-byte load, one compare, and one mask extraction, so the loop executes $\lceil n/16 \rceil$ times versus $n$ iterations for a byte-at-a-time scalar loop — the same $O(n)$ asymptotically, but with roughly a 16× smaller constant factor.

SIMD String Algorithms

Vectorised memchr:

vectorised_memchr:
    ; rdi = buffer, rsi = byte to find, rdx = length
    ; NB: sketch only — the loop below may read up to 15 bytes past the
    ; end of the buffer; production code must handle the tail safely
    movd xmm0, esi
    pxor xmm1, xmm1
    pshufb xmm0, xmm1      ; Broadcast byte to all lanes
    
.loop:
    movdqu xmm1, [rdi]
    pcmpeqb xmm1, xmm0
    pmovmskb eax, xmm1
    test eax, eax
    jnz .found
    add rdi, 16
    sub rdx, 16
    jg .loop
    xor eax, eax
    ret
    
.found:
    bsf eax, eax
    add rax, rdi
    ret

Scientific Computing & Signal Processing

Sparse Memory Gather

Instruction: VGATHERDPS (Gather Packed Single-Precision Floating-Point Values with Signed Dword Indices)

File: VGATHERDPS.asm

Performs indexed loads from memory using vector indices:

$$\text{Dest}[i] = \text{Mem}[\text{Base} + \text{Index}[i] \times \text{Scale}]$$
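The addressing formula above is straightforward to model in Python (an illustration, not the repository's code; memory is represented as a dict of byte address to value):

```python
def vgatherdps(mem, base, indices, scale=4):
    """VGATHERDPS model: Dest[i] = Mem[Base + Index[i] * Scale],
    one indexed load per lane (scale 4 for 32-bit floats)."""
    return [mem[base + idx * scale] for idx in indices]
```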

Memory access pattern visualisation:

graph TB
    subgraph "Memory (Linear)"
        M0[addr+0]
        M1[addr+4]
        M2[addr+8]
        M3[addr+12]
        M4[addr+16]
        M5[addr+20]
        M6[addr+24]
        M7[addr+28]
    end
    
    subgraph "Indices ZMM1"
        I0[0]
        I1[4]
        I2[8]
        I3[12]
    end
    
    subgraph "Result ZMM0"
        R0[data@0]
        R1[data@16]
        R2[data@32]
        R3[data@48]
    end
    
    I0 --> M0
    I1 --> M4
    I2 --> M2
    I3 --> M6
    
    M0 --> R0
    M4 --> R1
    M2 --> R2
    M6 --> R3

Gather performance model:

$$T_{\text{gather}} = T_{\text{base}} + N_{\text{cache lines}} \times T_{\text{cache miss}}$$

Where $N_{\text{cache lines}}$ depends on index spread:

$$N_{\text{cache lines}} \leq \min\left(N_{\text{elements}}, \lceil\frac{\max(\text{Index}) - \min(\text{Index})}{\text{Cache Line Size}}\rceil + 1\right)$$

Sparse matrix-vector multiplication:

For $y = Ax$ where $A$ is sparse in CSR format:

spmv_csr:
    ; For each row i:
    ;   y[i] = sum(A.values[j] * x[A.col_idx[j]]) for j in [row_ptr[i], row_ptr[i+1])
    
    ; Load column indices
    vmovdqu32 zmm1, [col_idx + rcx*4]
    
    ; Gather x values at those indices
    kxnorw k1, k1, k1
    vgatherdps zmm2{k1}, [x + zmm1*4]
    
    ; Load A values and multiply
    vmovups zmm3, [values + rcx*4]
    vmulps zmm4, zmm2, zmm3
    
    ; Horizontal sum for y[i]
    ; ...

Video Motion Estimation (MPSADBW)

Instruction: MPSADBW (Multiple Packed Sums of Absolute Differences)

File: MPSADBW.asm

Computes 8 SADs simultaneously for block matching in video codecs:

$$\text{SAD}_k = \sum_{i=0}^{3} |R_i - S_{k+i}|$$

For $k = 0, 1, 2, \ldots, 7$ simultaneously.
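One MPSADBW group of eight overlapping SADs can be modelled in a couple of lines (an illustration of the formula above, not the repository's code):

```python
def mpsadbw(ref4, src11):
    """Eight 4-byte SADs over a sliding window: SAD_k for k = 0..7,
    given a 4-byte reference block and an 11-byte source window."""
    return [sum(abs(r - src11[k + i]) for i, r in enumerate(ref4))
            for k in range(8)]
```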

Block matching visualisation:

Reference Block (4 bytes): [R₀, R₁, R₂, R₃]

Source Window (11 bytes):  [S₀, S₁, S₂, S₃, S₄, S₅, S₆, S₇, S₈, S₉, S₁₀]
                            ├──────────┤
                            Position 0
                               ├──────────┤
                               Position 1
                                  ├──────────┤
                                  Position 2
                                     ... (8 positions total)

Motion estimation search diamond pattern:

graph TD
    C((Center)) --> N((N))
    C --> S((S))
    C --> E((E))
    C --> W((W))
    N --> |MPSADBW| B1[SAD values]
    S --> |MPSADBW| B2[SAD values]
    E --> |MPSADBW| B3[SAD values]
    W --> |MPSADBW| B4[SAD values]
    B1 --> MIN[Minimum]
    B2 --> MIN
    B3 --> MIN
    B4 --> MIN
    MIN --> |New Center| C

Rate-Distortion optimisation:

$$J = D + \lambda R$$

Where:

  • $J$ = Cost function to minimise
  • $D$ = Distortion (SAD or SATD)
  • $R$ = Bit rate for encoding motion vector
  • $\lambda$ = Lagrangian multiplier

Fast Math Approximations (RSQRTPS)

Instruction: RSQRTPS (Reciprocal Square Root of Packed Single-Precision Floating-Point Values)

File: RSQRTPS.asm

Computes approximate reciprocal square root:

$$y \approx \frac{1}{\sqrt{x}}$$

With relative error $|\epsilon| < 1.5 \times 2^{-12} \approx 0.00037$.

Newton-Raphson refinement:

For $f(y) = \frac{1}{y^2} - x = 0$, Newton's method gives:

$$y_{n+1} = y_n \cdot \left(\frac{3}{2} - \frac{x}{2} \cdot y_n^2\right)$$

Or equivalently:

$$y_{n+1} = \frac{y_n}{2} \cdot \left(3 - x \cdot y_n^2\right)$$
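The refinement step is easy to verify numerically in Python (an illustration, not the repository's code — `y0` stands in for the low-precision hardware estimate):

```python
import math

def rsqrt_nr(x, y0):
    """One Newton-Raphson step refining y0 ≈ 1/sqrt(x):
    y1 = 0.5 * y0 * (3 - x * y0^2)."""
    return 0.5 * y0 * (3.0 - x * y0 * y0)
```

Starting from a rough estimate, a single step roughly squares the relative error.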

Implementation with one NR iteration:

fast_rsqrt_nr:
    ; xmm0 = input x
    rsqrtps xmm1, xmm0           ; y0 ≈ 1/sqrt(x)
    
    ; Newton-Raphson: y1 = 0.5 * y0 * (3 - x * y0^2)
    movaps xmm2, xmm1
    mulps xmm2, xmm1             ; y0^2
    mulps xmm2, xmm0             ; x * y0^2
    movaps xmm3, [three]         ; 3.0 (broadcast constant)
    subps xmm3, xmm2             ; 3 - x*y0^2
    mulps xmm3, xmm1             ; y0 * (3 - x*y0^2)
    mulps xmm3, [half]           ; 0.5 * y0 * (3 - x*y0^2)
    
    movaps xmm0, xmm3            ; return refined result in xmm0 (error < 2^-22)
    ret

Error analysis:

| Method          | Max Relative Error       | Latency (cycles) |
|-----------------|--------------------------|------------------|
| RSQRTPS only    | $1.5 \times 2^{-12}$     | 4                |
| RSQRTPS + 1 NR  | $< 2^{-22}$              | 14               |
| RSQRTPS + 2 NR  | $< 2^{-44}$              | 24               |
| SQRTPS + DIVPS  | $< 0.5$ ulp              | 25               |

Vector normalisation:

For vector $\mathbf{v} = (x, y, z)$:

$$\hat{\mathbf{v}} = \frac{\mathbf{v}}{|\mathbf{v}|} = \mathbf{v} \cdot \frac{1}{\sqrt{x^2 + y^2 + z^2}}$$

Vector Permutation (VPERM2B)

Instruction: VPERM2B (Full Permute of Bytes from Two Tables)

File: VPERM2B.asm

Performs arbitrary byte-level permutation using indices from two source registers:

$$\text{Dest}[i] = \begin{cases} \text{Src1}[\text{Idx}[i] \bmod 64] & \text{if } \text{Idx}[i] \text{ bit } 6 = 0 \\ \text{Src2}[\text{Idx}[i] \bmod 64] & \text{if } \text{Idx}[i] \text{ bit } 6 = 1 \end{cases}$$

Applications:

  • AES T-table lookups
  • Base64 encoding/decoding
  • LUT-based S-boxes
  • Data shuffling

Example: Byte reversal within 64-bit lanes:

byte_reverse_64:
    ; Control indices for reversing bytes within each qword
    vmovdqu64 zmm0, [reverse_indices]
    ; reverse_indices: 7,6,5,4,3,2,1,0, 15,14,13,12,11,10,9,8, ...
    
    ; VPERMI2B overwrites the index register with the permuted bytes;
    ; here both tables are the same source register
    vpermi2b zmm0, zmm1, zmm1
    ret

FFT & Convolution Primitives

Radix-2 Cooley-Tukey FFT butterfly:

$$\begin{aligned} X[k] &= E[k] + W_N^k \cdot O[k] \\ X[k + N/2] &= E[k] - W_N^k \cdot O[k] \end{aligned}$$

Where $W_N^k = e^{-2\pi i k/N}$ is the twiddle factor.

Vectorised butterfly (4 complex numbers):

fft_butterfly_avx:
    ; ymm0 = [e0r, e0i, e1r, e1i, e2r, e2i, e3r, e3i] (even)
    ; ymm1 = [o0r, o0i, o1r, o1i, o2r, o2i, o3r, o3i] (odd)
    ; ymm2 = [w0r, w0i, w1r, w1i, w2r, w2i, w3r, w3i] (twiddles)
    
    ; Complex multiply: (a+bi)(c+di) = (ac-bd) + (ad+bc)i
    vmovsldup ymm4, ymm2          ; [w0r, w0r, w1r, w1r, ...]
    vmovshdup ymm5, ymm2          ; [w0i, w0i, w1i, w1i, ...]
    vmulps ymm4, ymm4, ymm1       ; [or*wr, oi*wr, ...]
    vpermilps ymm3, ymm1, 0xB1    ; [o0i, o0r, ...] (swap re/im)
    vmulps ymm5, ymm5, ymm3       ; [oi*wi, or*wi, ...]
    vaddsubps ymm4, ymm4, ymm5    ; [or*wr - oi*wi, oi*wr + or*wi] = W*O
    
    ; Butterfly
    vaddps ymm5, ymm0, ymm4       ; E + W*O
    vsubps ymm6, ymm0, ymm4       ; E - W*O
    ret

Convolution theorem:

$$h * x = \mathcal{F}^{-1}\{\mathcal{F}\{h\} \cdot \mathcal{F}\{x\}\}$$

Complexity: $O(n \log n)$ vs. direct convolution $O(n^2)$


Honorable Mentions: The Unsung Heroes

This section provides comprehensive coverage of critical instructions that don't fit neatly into the categories above but are essential for systems programming.

Arbitrary Precision Arithmetic (ADCX/ADOX)

Instructions: ADCX (Add with Carry Flag) and ADOX (Add with Overflow Flag)

These instructions enable dual carry chains for efficient big-integer arithmetic:

$$\text{ADCX: } \text{dest} = \text{dest} + \text{src} + \text{CF}, \quad \text{CF} \leftarrow \text{carry}$$ $$\text{ADOX: } \text{dest} = \text{dest} + \text{src} + \text{OF}, \quad \text{OF} \leftarrow \text{carry}$$

Why two carry chains matter:

In traditional multi-precision addition, each ADD depends on the previous carry:

Traditional (serial):
   ADD r0, a0, b0  ; CF = carry0
   ADC r1, a1, b1  ; needs carry0, CF = carry1
   ADC r2, a2, b2  ; needs carry1, CF = carry2
   ...

With ADCX/ADOX, two independent additions can proceed in parallel:

Parallel carry chains:
   ADCX r0, a0     ; CF chain
   ADOX r4, c0     ; OF chain (independent!)
   ADCX r1, a1     ; CF chain continues
   ADOX r5, c1     ; OF chain continues
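The carry propagation that makes the serial version a bottleneck can be modelled limb-by-limb in Python (an illustration, not the repository's code — ADCX/ADOX let the hardware run two such chains in parallel):

```python
def add_limbs(a, b, bits=64):
    """Multi-precision addition, limb by limb with an explicit carry flag —
    the dependency chain that ADC serialises."""
    mask = (1 << bits) - 1
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & mask)
        carry = s >> bits
    return out, carry
```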

RSA Montgomery multiplication:

The Montgomery reduction algorithm computes:

$$\text{MonPro}(a, b) = a \cdot b \cdot R^{-1} \mod N$$

Where $R = 2^k$ for k-bit modulus. ADCX/ADOX accelerate the inner loop:

montmul_inner:
    ; Compute partial product and accumulate
    mulx r10, rax, [b + rcx*8]   ; r10:rax = rdx * b[j]  (rdx holds a[i])
    adcx r8, rax                 ; Accumulate low (CF chain)
    adox r9, r10                 ; Accumulate high (OF chain)
    ; Continue for all limbs...

Performance for 4096-bit RSA:

xychart-beta
    title "RSA-4096 Performance (ops/second)"
    x-axis ["Scalar ADD", "ADCX/ADOX", "AVX-512 IFMA"]
    y-axis "Operations/s" 0 --> 2000
    bar [450, 1200, 1850]

Cycle-Accurate Timing (RDTSC/RDTSCP)

Instructions: RDTSC (Read Time-Stamp Counter) and RDTSCP (with Processor ID)

The Time-Stamp Counter is a 64-bit register counting reference cycles since reset:

$$\text{TSC}_{current} = \text{TSC}_{boot} + \text{RefCycles}_{elapsed}$$

Precise timing template:

benchmark_function:
    ; Serialise to prevent out-of-order execution
    cpuid
    rdtsc
    shl rdx, 32
    or rax, rdx
    mov r12, rax          ; Start timestamp
    
    ; === Code to benchmark ===
    call target_function
    ; =========================
    
    ; Serialise again
    rdtscp
    shl rdx, 32
    or rax, rdx
    sub rax, r12          ; Elapsed cycles
    ret

TSC frequency calculation:

On modern processors, TSC runs at the "nominal" frequency:

$$f_{TSC} \approx f_{base} \text{ (invariant TSC)}$$

To convert cycles to nanoseconds:

$$t_{ns} = \frac{\Delta\text{TSC} \times 10^9}{f_{TSC}}$$

Caveats:

  • Older CPUs: TSC varies with frequency scaling
  • Multi-socket: TSC may differ between sockets
  • VM environments: TSC may be virtualised

Population Count (POPCNT)

Instruction: POPCNT (Population Count)

Counts the number of set bits (Hamming weight):

$$\text{POPCNT}(x) = \sum_{i=0}^{63} x_i$$

Applications:

  1. Chess engines (bitboards):

    Knights attack mask: popcount(knight_attacks & enemy_pieces)
    
  2. Bioinformatics (Hamming distance): $$d_H(x, y) = \text{POPCNT}(x \oplus y)$$

  3. Bloom filters: $$\text{Fill ratio} = \frac{\text{POPCNT}(\text{filter})}{\text{size}}$$

  4. Set cardinality in bitmaps:

    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += __builtin_popcountll(bitmap[i]);

Vectorised population count (AVX-512 VPOPCNTDQ):

; Count bits in 8 qwords simultaneously
vpopcntq zmm1, zmm0
; Each qword in zmm1 contains popcount of corresponding qword in zmm0

Intentional Undefined (UD2)

Instruction: UD2 (Undefined Instruction)

Intentionally raises an Invalid Opcode Exception (#UD):

Opcode: 0F 0B
Behavior: #UD exception (Interrupt 6)

Use cases:

  1. Compiler trap for unreachable code:

    __builtin_unreachable();  // Often compiles to UD2
  2. Debug assertions:

    if (!condition) __builtin_trap();  // UD2
  3. Kernel panic/halt:

    kernel_panic:
        cli
        ud2
        jmp kernel_panic
  4. Marking invalid code paths:

    ; After noreturn function
    call exit
    ud2              ; Should never execute

Hardware Random Numbers (RDRAND/RDSEED)

Instructions:

  • RDRAND: Get random number from DRNG (NIST SP 800-90A compliant)
  • RDSEED: Get entropy directly from hardware source

Entropy source:

Intel's on-die Digital Random Number Generator uses:

  • Thermal noise from transistor metastability
  • AES-CBC-MAC conditioning
  • CTR_DRBG for RDRAND output

$$\text{RDSEED} \rightarrow \text{Entropy Pool} \xrightarrow{\text{CTR_DRBG}} \text{RDRAND}$$

Usage pattern:

get_secure_random:
    xor ecx, ecx
.retry:
    rdrand rax
    jc .success         ; CF=1 means valid random
    inc ecx
    cmp ecx, 10
    jl .retry
    xor eax, eax        ; Failed after 10 attempts
    ret
.success:
    ; rax contains 64 bits of random data
    ret

RDSEED vs RDRAND:

| Property     | RDRAND                                  | RDSEED                           |
|--------------|-----------------------------------------|----------------------------------|
| Source       | DRBG output                             | Raw conditioned entropy          |
| Rate         | Higher (shared across cores)            | Lower (limited by entropy rate)  |
| Failure rate | Very low                                | Higher under heavy load          |
| Use case     | General random numbers                  | Seeding other PRNGs              |

Transactional Memory (XBEGIN/XEND)

Instructions:

  • XBEGIN rel32: Begin transaction (branch to rel32 on abort)
  • XEND: Commit transaction
  • XABORT imm8: Explicitly abort with reason code

Transactional execution model:

stateDiagram-v2
    [*] --> Transactional: XBEGIN
    Transactional --> Committed: XEND
    Transactional --> Aborted: Conflict/XABORT
    Committed --> [*]
    Aborted --> FallbackPath
    FallbackPath --> [*]

Lock elision pattern:

transaction_update:
    xacquire lock add [lock_var], 1  ; HLE prefix (deprecated)
    ; Or with RTM:
    xbegin .fallback
    
    ; Transactional region - no locks needed!
    mov rax, [shared_data]
    add rax, 1
    mov [shared_data], rax
    
    xend
    ret
    
.fallback:
    ; Transaction aborted - use traditional locking
    lock add [shared_data], 1
    ret

Abort codes (returned in EAX):

| Bit   | Meaning                          |
|-------|----------------------------------|
| 0     | Abort caused by XABORT           |
| 1     | Transaction may succeed on retry |
| 2     | Conflict with another processor  |
| 3     | Buffer overflow                  |
| 4     | Debug breakpoint hit             |
| 5     | Nested transaction abort         |
| 31:24 | XABORT immediate value           |

Current status: Intel TSX has been disabled on many processors due to security vulnerabilities (TAA, ZombieLoad).

AES Acceleration (AESENC/AESDEC)

Instructions:

  • AESENC xmm, xmm/m128: One AES encryption round
  • AESENCLAST xmm, xmm/m128: Final AES encryption round
  • AESDEC xmm, xmm/m128: One AES decryption round
  • AESDECLAST xmm, xmm/m128: Final AES decryption round
  • AESKEYGENASSIST xmm, xmm/m128, imm8: Key expansion assistance

AES round structure:

Each AESENC performs:

  1. SubBytes (S-box substitution)
  2. ShiftRows (cyclic row shifts)
  3. MixColumns (column mixing)
  4. AddRoundKey (XOR with round key)

$$\text{AESENC}(\text{state}, \text{key}) = \text{AddRoundKey}(\text{MixColumns}(\text{ShiftRows}(\text{SubBytes}(\text{state}))), \text{key})$$

Complete AES-128 encryption:

aes128_encrypt:
    ; xmm0 = plaintext, xmm1-xmm11 = round keys
    pxor xmm0, xmm1           ; Round 0: AddRoundKey only
    aesenc xmm0, xmm2         ; Round 1
    aesenc xmm0, xmm3         ; Round 2
    aesenc xmm0, xmm4         ; Round 3
    aesenc xmm0, xmm5         ; Round 4
    aesenc xmm0, xmm6         ; Round 5
    aesenc xmm0, xmm7         ; Round 6
    aesenc xmm0, xmm8         ; Round 7
    aesenc xmm0, xmm9         ; Round 8
    aesenc xmm0, xmm10        ; Round 9
    aesenclast xmm0, xmm11    ; Round 10 (no MixColumns)
    ; xmm0 = ciphertext
    ret

Performance (cycles per byte):

| Implementation      | AES-128  | AES-256  |
|---------------------|----------|----------|
| Software table      | ~20 cpb  | ~25 cpb  |
| AES-NI              | ~0.7 cpb | ~0.9 cpb |
| AES-NI + pipelining | ~0.4 cpb | ~0.5 cpb |

Cache Control (CLFLUSH/CLFLUSHOPT/CLWB)

Instructions:

  • CLFLUSH m8: Flush cache line (ordered, slow)
  • CLFLUSHOPT m8: Flush cache line (weakly ordered, fast)
  • CLWB m8: Write back cache line (keep in cache)

Cache hierarchy:

┌─────────────┐
│   L1 Data   │  32-64 KB, ~4 cycles
├─────────────┤
│     L2      │  256 KB - 1 MB, ~12 cycles
├─────────────┤
│     L3      │  8-64 MB, ~40 cycles
├─────────────┤
│    DRAM     │  ~200 cycles
└─────────────┘

Use cases:

  1. Persistent memory (PMEM) programming:

    ; Ensure data reaches persistent storage
    mov [pmem_addr], rax
    clwb [pmem_addr]
    sfence
  2. DMA coherency:

    ; Flush before DMA read from device
    clflush [dma_buffer]
    mfence
    ; Now safe for device to read
  3. Security (defense against cache timing attacks):

    ; Clear sensitive data from cache
    clflush [secret_key]
    clflush [secret_key + 64]
  4. Rowhammer research/exploitation:

    .hammer:
        mov rax, [addr1]
        clflush [addr1]
        mov rbx, [addr2]
        clflush [addr2]
        mfence
        jmp .hammer

Byte Swap (BSWAP)

Instruction: BSWAP reg32/64

Reverses byte order for endianness conversion:

$$\text{BSWAP}(0x\text{AABBCCDD}) = 0x\text{DDCCBBAA}$$

32-bit transformation:

Before: [B₃, B₂, B₁, B₀] (little-endian)
After:  [B₀, B₁, B₂, B₃] (big-endian)

Network byte order conversion:

; htonl equivalent
htonl:
    bswap edi
    mov eax, edi
    ret

; ntohll (64-bit)
ntohll:
    bswap rdi
    mov rax, rdi
    ret

SIMD byte reversal (for vectors):

bswap_16bytes:
    ; xmm0 = input
    ; Reverse bytes using PSHUFB
    movdqa xmm1, [bswap_mask]  ; 15,14,13,...,2,1,0
    pshufb xmm0, xmm1
    ret

Cache Invalidation (INVD/WBINVD)

Instructions:

  • INVD: Invalidate caches WITHOUT writeback (Ring 0 only)
  • WBINVD: Write back AND invalidate all caches (Ring 0 only)

⚠️ DANGER LEVEL: EXTREME ⚠️

INVD sequence:
1. All modified cache lines are DISCARDED
2. Memory state becomes inconsistent
3. System may crash or corrupt data
4. Only use in very specific firmware scenarios

Legitimate use cases:

  1. BIOS/UEFI initialisation:

    ; Before enabling caching for the first time
    wbinvd
    ; Configure MTRRs
    ; Enable caching
  2. Cache-as-RAM (CAR) teardown:

    ; After copying CAR data to real RAM
    wbinvd
    ; Reconfigure caching
  3. Hardware reset preparation:

    ; Ensure memory is consistent before reset
    wbinvd
    cli
    hlt

Bit Manipulation (BMI1/BMI2)

BMI1 instructions:

  • ANDN r, r, r/m: $\text{dest} = (\lnot \text{src1}) \land \text{src2}$
  • BEXTR r, r/m, r: Extract bit field
  • BLSI r, r/m: Isolate lowest set bit: $\text{dest} = \text{src} \land (-\text{src})$
  • BLSMSK r, r/m: Mask up to lowest set bit: $\text{dest} = \text{src} \oplus (\text{src} - 1)$
  • BLSR r, r/m: Reset lowest set bit: $\text{dest} = \text{src} \land (\text{src} - 1)$
  • TZCNT r, r/m: Count trailing zeros

BMI2 instructions:

  • BZHI r, r/m, r: Zero high bits starting at position
  • MULX r, r, r/m: Unsigned multiply without affecting flags
  • PDEP r, r, r/m: Parallel bit deposit
  • PEXT r, r, r/m: Parallel bit extract
  • RORX r, r/m, imm8: Rotate right without affecting flags
  • SARX/SHLX/SHRX r, r/m, r: Shifts without affecting flags

PDEP/PEXT visualisation:

PEXT (extract):
  Source:  1 0 1 1 0 1 1 0
  Mask:    1 0 1 0 0 1 0 0
  Result:  0 0 0 0 0 1 1 1  (extracted bits: 1,1,1)

PDEP (deposit):
  Source:  0 0 0 0 0 1 0 1
  Mask:    1 0 1 0 0 1 0 0
  Result:  1 0 0 0 0 1 0 0  (deposited at mask positions)
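Both operations can be emulated in a few lines of Python (an illustration, not the repository's code), which also makes the round-trip identity `pdep(pext(x, m), m) == x & m` easy to check:

```python
def pext(src, mask):
    """PEXT: gather the bits of src selected by mask into the low bits."""
    out, bit = 0, 0
    while mask:
        low = mask & -mask          # lowest set bit of the mask
        if src & low:
            out |= 1 << bit
        bit += 1
        mask &= mask - 1            # clear that mask bit
    return out

def pdep(src, mask):
    """PDEP: scatter the low bits of src to the set positions of mask."""
    out, bit = 0, 0
    while mask:
        low = mask & -mask
        if src >> bit & 1:
            out |= low
        bit += 1
        mask &= mask - 1
    return out
```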

Chess move generation with PEXT:

generate_rook_attacks:
    ; rdi = square, rsi = occupied bitboard
    mov rax, [rook_masks + rdi*8]   ; Get relevant occupancy mask
    pext rax, rsi, rax               ; Extract relevant bits
    mov rcx, [rook_attacks + rdi*8]  ; Get attack table base
    mov rax, [rcx + rax*8]           ; Look up attacks
    ret

CRC32 Acceleration

Instruction: CRC32 r, r/m

Hardware-accelerated CRC-32C (Castagnoli polynomial):

$$P(x) = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} + x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1$$

Polynomial representation: 0x1EDC6F41; the instruction operates on the bit-reflected form 0x82F63B78 (used by iSCSI, ext4, Btrfs)

Usage:

crc32c_block:
    ; rdi = buffer, rsi = length
    mov eax, 0xFFFFFFFF   ; Standard CRC-32C initialises CRC to ~0
    
.loop_8:
    cmp rsi, 8
    jl .loop_1
    crc32 rax, qword [rdi]
    add rdi, 8
    sub rsi, 8
    jmp .loop_8
    
.loop_1:
    test rsi, rsi
    jz .done
    crc32 rax, byte [rdi]
    inc rdi
    dec rsi
    jmp .loop_1
    
.done:
    ; Finalise: return ~crc for standard CRC-32C
    not eax
    ret

Performance comparison:

| Method | Throughput |
|---|---|
| Table-driven (8-bit) | ~500 MB/s |
| Slicing-by-8 | ~2 GB/s |
| CRC32 instruction | ~20 GB/s |
| PCLMULQDQ folding | ~40 GB/s |

Performance Analysis & Microarchitecture

Instruction Latency Tables

Key instructions across microarchitectures (cycles):

| Instruction | Skylake | Ice Lake | Zen 3 | Zen 4 |
|---|---|---|---|---|
| VADDPS (256) | 4 | 4 | 3 | 3 |
| VMULPS (256) | 4 | 4 | 3 | 3 |
| VFMADD (256) | 4 | 4 | 4 | 4 |
| VDIVPS (256) | 11 | 11 | 10 | 10 |
| VSQRTPS (256) | 12 | 12 | 12 | 13 |
| VPERM2B (512) | 3 | 3 | - | 4 |
| VPDPBUSD (512) | 5 | 5 | - | 4 |
| PCLMULQDQ | 7 | 6 | 4 | 3 |
| AESENC | 4 | 4 | 4 | 3 |

Extended latency table (additional instructions):

| Instruction | Skylake | Ice Lake | Zen 3 | Zen 4 | Description |
|---|---|---|---|---|---|
| VADDPS (512) | 4 | 4 | - | 3 | FP32 vector add |
| VMULPS (512) | 4 | 4 | - | 3 | FP32 vector multiply |
| VFMADD (512) | 4 | 4 | - | 4 | Fused multiply-add |
| VDIVPS (512) | 18 | 11 | - | 10 | FP32 vector divide |
| VSQRTPS (512) | 20 | 19 | - | 13 | FP32 vector sqrt |
| VPAND (256) | 1 | 1 | 1 | 1 | Bitwise AND |
| VPOR (256) | 1 | 1 | 1 | 1 | Bitwise OR |
| VPXOR (256) | 1 | 1 | 1 | 1 | Bitwise XOR |
| VPSHUFB (256) | 1 | 1 | 1 | 1 | Byte shuffle |
| VPERMD (256) | 3 | 3 | 2 | 2 | Dword permute |
| VPERMQ (256) | 3 | 3 | 2 | 2 | Qword permute |
| VGATHERDPS (256) | 22 | 15 | 12 | 10 | Gather (best case) |
| VSCATTERDPS (512) | 40 | 20 | - | 15 | Scatter (best case) |
| POPCNT | 3 | 3 | 1 | 1 | Population count |
| LZCNT | 3 | 3 | 1 | 1 | Leading zero count |
| TZCNT | 3 | 3 | 1 | 1 | Trailing zero count |
| PDEP | 3 | 3 | 3 | 3 | Parallel bit deposit |
| PEXT | 3 | 3 | 3 | 3 | Parallel bit extract |
| CRC32 | 3 | 3 | 3 | 3 | CRC acceleration |
| RDTSC | 20 | 15 | 9 | 8 | Read timestamp |
| RDRAND | ~400 | ~200 | ~1200 | ~800 | Hardware random |

Note: Zen 1 and Zen 2 implement PDEP/PEXT via microcode, with data-dependent latency reaching hundreds of cycles. Zen 3 added native hardware support (3 cycles).

Port Utilisation

Skylake/Ice Lake execution ports:

graph TD
    subgraph "Front End (4-wide decode)"
        IF[Instruction Fetch] --> ID[Instruction Decode]
        ID --> UOP[µop Queue]
        UOP --> RAT[Register Alias Table]
        RAT --> ROB[Reorder Buffer<br/>224 entries]
    end
    
    subgraph "Execution Engine (8 ports)"
        ROB --> RS[Reservation Station<br/>97 entries]
        RS --> P0[Port 0<br/>ALU, FMA<br/>DIV, SQRT]
        RS --> P1[Port 1<br/>ALU, FMA<br/>AES, CLMUL]
        RS --> P2[Port 2<br/>Load<br/>AGU]
        RS --> P3[Port 3<br/>Load<br/>AGU]
        RS --> P4[Port 4<br/>Store Data]
        RS --> P5[Port 5<br/>ALU, Shuffle<br/>Branch]
        RS --> P6[Port 6<br/>ALU<br/>Branch]
        RS --> P7[Port 7<br/>Store AGU]
    end
    
    subgraph "Memory Subsystem"
        P2 --> L1D[L1 Data Cache<br/>32KB, 8-way]
        P3 --> L1D
        P4 --> SB[Store Buffer<br/>56 entries]
        SB --> L1D
        L1D --> L2[L2 Cache<br/>256KB-1MB]
        L2 --> L3[L3 Cache<br/>Shared]
    end

Port assignment for key SIMD operations:

| Operation Category | Ports | Throughput |
|---|---|---|
| FP Add (256-bit) | P0, P1 | 2/cycle |
| FP Mul (256-bit) | P0, P1 | 2/cycle |
| FP FMA (256-bit) | P0, P1 | 2/cycle |
| FP Add (512-bit) | P0, P1 | 1/cycle |
| FP Mul (512-bit) | P0, P1 | 1/cycle |
| FP FMA (512-bit) | P0, P1 | 1/cycle |
| FP Div (any) | P0 | 1 per 3-18 cycles |
| FP Sqrt (any) | P0 | 1 per 4-20 cycles |
| Integer Add/Logic | P0, P1, P5 | 3/cycle |
| Shuffle (256-bit) | P5 | 1/cycle |
| Shuffle (512-bit) | P5 | 0.5/cycle |
| AES round | P0 | 1/cycle |
| CLMUL | P0 | 1/cycle |
| Load (256-bit) | P2, P3 | 2/cycle |
| Load (512-bit) | P2+P3 | 1/cycle |
| Store (256-bit) | P4+P7 | 1/cycle |
| Store (512-bit) | P4+P7 | 1/cycle |

Throughput Optimisation

Reciprocal throughput vs. latency:

Understanding the difference is critical for optimisation:

  • Latency: Cycles from input ready to output ready (single instruction)
  • Throughput: How often an instruction can start (reciprocal = cycles between starts)

$$\text{Effective IPC} = \min\left(\frac{1}{\text{Reciprocal Throughput}}, \frac{\text{Pipeline Depth}}{\text{Latency}}\right)$$

Example: FMA throughput analysis:

Latency = 4 cycles
Throughput = 2 ops/cycle (ports P0 + P1)
Reciprocal Throughput = 0.5 cycles

To saturate both FMA units:
- In-flight FMAs needed = latency × throughput = 4 × 2 = 8
- So use 8 independent accumulator chains, each issuing one FMA every 4 cycles

Throughput comparison chart:

xychart-beta
    title "Peak SIMD Throughput (GFLOPS @ 3GHz)"
    x-axis ["SSE (128)", "AVX (256)", "AVX-512", "AVX-512 VNNI"]
    y-axis "GFLOPS" 0 --> 400
    bar [48, 96, 192, 384]

Calculating theoretical peak:

$$\text{GFLOPS}_{\text{peak}} = \frac{\text{Vector Width (bits)}}{32} \times \text{FMA ops/cycle} \times 2 \times f_{\text{GHz}}$$

For AVX-512 at 3 GHz with 2 FMA units: $$\text{GFLOPS} = \frac{512}{32} \times 2 \times 2 \times 3 = 192 \text{ GFLOPS}$$

Microarchitectural Hazards

1. Port Contention:

When multiple instructions compete for the same execution port:

; BAD: All three need Port 0
vdivps ymm0, ymm1, ymm2    ; P0
vsqrtps ymm3, ymm4          ; P0
vrcpps ymm5, ymm6           ; P0
; Serialised execution!

; BETTER: Interleave with different ports
vdivps ymm0, ymm1, ymm2    ; P0 (starts)
vaddps ymm7, ymm8, ymm9    ; P0 or P1
vpshufb ymm10, ymm11, ymm12 ; P5
vmulps ymm13, ymm14, ymm15  ; P0 or P1
; Better utilisation

2. Register Dependencies:

Read-after-write (RAW) hazards create stalls:

; Latency-bound chain (16 cycles total)
vfmadd231ps ymm0, ymm1, ymm2   ; 4 cycles
vfmadd231ps ymm0, ymm3, ymm4   ; waits for ymm0
vfmadd231ps ymm0, ymm5, ymm6   ; waits for ymm0
vfmadd231ps ymm0, ymm7, ymm8   ; waits for ymm0

; Throughput-optimised (parallel, 4 cycles + reduce)
vfmadd231ps ymm0, ymm1, ymm2   ; 4 independent chains
vfmadd231ps ymm10, ymm3, ymm4
vfmadd231ps ymm11, ymm5, ymm6
vfmadd231ps ymm12, ymm7, ymm8
vaddps ymm0, ymm0, ymm10       ; Reduce at end
vaddps ymm11, ymm11, ymm12
vaddps ymm0, ymm0, ymm11

3. Memory Bandwidth Limits:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$$

For memory-bound code: $$\text{Achievable GFLOPS} = \text{Memory BW (GB/s)} \times \text{Arithmetic Intensity}$$

xychart-beta
    title "Roofline Model (DDR4-3200, 50 GB/s)"
    x-axis "Arithmetic Intensity (FLOP/Byte)" 0 --> 20
    y-axis "GFLOPS" 0 --> 200
    line [0, 50, 100, 150, 192, 192, 192, 192, 192, 192]

4. AVX-512 Frequency Throttling:

Heavy AVX-512 usage causes frequency reduction on some Intel CPUs:

| License Level | Trigger | Frequency Reduction |
|---|---|---|
| L0 (Normal) | Scalar/SSE | 0% |
| L1 (Light AVX) | AVX/AVX2 | ~3-5% |
| L2 (Heavy AVX-512) | AVX-512 (FP) | ~10-20% |

Mitigation strategies:

  • Use AVX2 for short bursts
  • Batch AVX-512 work to amortise transition overhead
  • Profile actual workload performance, not microbenchmarks

Cache Behaviour Analysis

Cache line utilisation:

$$\text{Utilisation} = \frac{\text{Bytes Actually Used}}{\text{Cache Lines Fetched} \times 64}$$

Example: Stride access patterns:

| Stride | Utilisation | Effective BW |
|---|---|---|
| 4 bytes (sequential float) | 100% | 50 GB/s |
| 8 bytes (every other) | 50% | 25 GB/s |
| 64 bytes (1 per line) | 6.25% | 3.1 GB/s |
| 4096 bytes (1 per page) | 0.1% | 50 MB/s |

Prefetch distance calculation:

$$\text{Prefetch Distance} = \frac{\text{Memory Latency (cycles)}}{\text{Cycles per Iteration}} \times \text{Bytes per Iteration}$$

For 200-cycle memory latency, 4-cycle loop, 64 bytes/iteration: $$\text{Distance} = \frac{200}{4} \times 64 = 3200 \text{ bytes} = 50 \text{ cache lines}$$

Instruction Fusion and Micro-op Analysis

Macro-op fusion (x86-64):

Certain instruction pairs fuse into single µops:

| Pattern | Fused? | Example |
|---|---|---|
| CMP + JCC | ✓ | cmp rax, rbx; jne label |
| TEST + JCC | ✓ | test rax, rax; jz label |
| ADD + JCC | ✓ (Intel) | add rax, 1; jnz label |
| AND + JCC | ✓ | and rax, rbx; jz label |

Micro-op counts for common SIMD:

| Instruction | µops (Skylake) | µops (Zen 3) |
|---|---|---|
| VADDPS ymm | 1 | 1 |
| VFMADD231PS ymm | 1 | 1 |
| VDIVPS ymm | 1 | 1 |
| VPSHUFB ymm | 1 | 1 |
| VGATHERDPS ymm | 5 | 4 |
| VSCATTERDPS zmm | 16+ | N/A |
| VPTERNLOGD zmm | 1 | N/A |
| VPERMI2D zmm | 3 | N/A |

Branch Prediction and SIMD

Avoiding branches with SIMD:

; Scalar with branch (misprediction penalty ~15 cycles)
.loop:
    mov eax, [rsi]
    cmp eax, 0
    jl .negative
    mov [rdi], eax
    jmp .next
.negative:
    neg eax
    mov [rdi], eax
.next:
    ; ...

; Branchless SIMD (absolute value)
    vmovdqu ymm0, [rsi]
    vpabsd ymm0, ymm0          ; No branches!
    vmovdqu [rdi], ymm0

Blend-based conditional:

$$\text{Result}_i = \text{Mask}_i \,?\, A_i : B_i$$

; Conditional max(a, b) without branches
vcmpps ymm2, ymm0, ymm1, 1        ; ymm2 = (ymm0 < ymm1) ? all-ones : all-zeros (per element)
vblendvps ymm3, ymm0, ymm1, ymm2  ; Select the larger value per element

Optimisation Patterns and Anti-Patterns

Pattern 1: Loop Unrolling

Calculate optimal unroll factor:

$$U_{\text{opt}} = \min\left(\frac{\text{Latency}}{\text{Reciprocal Throughput}},\ \frac{\text{Available Regs}}{2}\right)$$

; 4x unrolled dot product
dot_product_unrolled:
    vxorps ymm0, ymm0, ymm0    ; Accumulator 0
    vxorps ymm1, ymm1, ymm1    ; Accumulator 1
    vxorps ymm2, ymm2, ymm2    ; Accumulator 2
    vxorps ymm3, ymm3, ymm3    ; Accumulator 3
    
.loop:
    vmovups ymm4, [rdi]
    vmovups ymm5, [rdi + 32]
    vmovups ymm6, [rdi + 64]
    vmovups ymm7, [rdi + 96]
    
    vfmadd231ps ymm0, ymm4, [rsi]
    vfmadd231ps ymm1, ymm5, [rsi + 32]
    vfmadd231ps ymm2, ymm6, [rsi + 64]
    vfmadd231ps ymm3, ymm7, [rsi + 96]
    
    add rdi, 128
    add rsi, 128
    sub rcx, 32
    jg .loop
    
    ; Reduce 4 accumulators
    vaddps ymm0, ymm0, ymm1
    vaddps ymm2, ymm2, ymm3
    vaddps ymm0, ymm0, ymm2
    ; Horizontal sum of ymm0...
    ret

Pattern 2: Software Pipelining

Overlap loads with computation:

; Pipelined: load next while computing current
.prolog:
    vmovups ymm8, [rdi]        ; Load iteration 0
    add rdi, 32
    
.loop:
    vmovups ymm9, [rdi]        ; Load iteration N+1
    vfmadd231ps ymm0, ymm8, [rsi]  ; Compute iteration N
    vmovaps ymm8, ymm9         ; Prepare for next
    add rdi, 32
    add rsi, 32
    sub rcx, 8
    jg .loop
    
.epilog:
    vfmadd231ps ymm0, ymm8, [rsi]  ; Final iteration

Pattern 3: Horizontal Reduction

Efficient horizontal sum (256-bit):

horizontal_sum_256:
    ; Input: ymm0 = [a, b, c, d, e, f, g, h]
    vextractf128 xmm1, ymm0, 1     ; xmm1 = [e, f, g, h]
    vaddps xmm0, xmm0, xmm1        ; xmm0 = [a+e, b+f, c+g, d+h]
    vhaddps xmm0, xmm0, xmm0       ; xmm0 = [a+e+b+f, c+g+d+h, ...]
    vhaddps xmm0, xmm0, xmm0       ; xmm0 = [sum, sum, sum, sum]
    ; Result in xmm0[0]
    ret

Efficient horizontal sum (512-bit):

horizontal_sum_512:
    ; Input: zmm0
    vextractf64x4 ymm1, zmm0, 1
    vaddps ymm0, ymm0, ymm1
    vextractf128 xmm1, ymm0, 1
    vaddps xmm0, xmm0, xmm1
    vhaddps xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    ret

Anti-Pattern 1: Unnecessary Zeroing

; BAD: Two instructions where one suffices
vmovss xmm0, [rdi]             ; Scalar load (also zeroes upper bits)
vaddss xmm0, xmm0, xmm1

; BETTER: Fold the load into the VEX-encoded scalar op
vaddss xmm0, xmm1, [rdi]       ; Single instruction

Anti-Pattern 2: Lane Crossing Penalties

; EXPENSIVE: Cross-lane shuffle in AVX2 (3-cycle latency)
vpermd ymm0, ymm1, ymm2

; CHEAPER: In-lane operations when possible (1-cycle latency)
vpshufb ymm0, ymm1, ymm2       ; Within 128-bit lanes

Anti-Pattern 3: Gather/Scatter Overuse

; SLOW: Gather for contiguous-ish data
vgatherdps ymm0, [rdi + ymm1*4], ymm2  ; 15+ cycles

; FASTER: Load + permute for predictable patterns
vmovups ymm0, [rdi]
vpermd ymm0, ymm3, ymm0        ; 4-5 cycles total

Debugging and Profiling SIMD Code

Intel VTune Markers

#include <ittnotify.h>

void simd_kernel() {
    /* Create the domain and string handle once, not on every call */
    static __itt_domain *domain;
    static __itt_string_handle *handle;
    if (!domain) {
        domain = __itt_domain_create("SIMD");
        handle = __itt_string_handle_create("kernel");
    }

    __itt_task_begin(domain, __itt_null, __itt_null, handle);

    // SIMD code here

    __itt_task_end(domain);
}

Performance Counters

Key events to monitor:

| Counter | Description | Target |
|---|---|---|
| FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE | 256-bit FP ops | Maximise |
| FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE | 512-bit FP ops | Maximise |
| UOPS_DISPATCHED_PORT.PORT_0 | Port 0 usage | Balance |
| UOPS_DISPATCHED_PORT.PORT_1 | Port 1 usage | Balance |
| UOPS_DISPATCHED_PORT.PORT_5 | Port 5 usage | Balance |
| MEM_LOAD_RETIRED.L1_MISS | L1 cache misses | Minimise |
| MEM_LOAD_RETIRED.L3_MISS | L3 cache misses | Minimise |
| MACHINE_CLEARS.COUNT | Pipeline clears | Zero |

Using perf on Linux:

perf stat -e fp_arith_inst_retired.256b_packed_single,\
fp_arith_inst_retired.512b_packed_single,\
cycles,instructions ./my_simd_program

Common Issues Checklist

| Symptom | Likely Cause | Solution |
|---|---|---|
| Low IPC | Dependency chains | Unroll, use more accumulators |
| High L1 misses | Poor spatial locality | Prefetch, tile loops |
| Port 0 saturated | Too many DIV/SQRT | Approximate or interleave |
| Frequency drop | Heavy AVX-512 | Consider AVX2, batch work |
| Branch mispredicts | Data-dependent branches | Use SIMD blend/mask |

References & Manuals

Primary References

  1. Intel® 64 and IA-32 Architectures Software Developer's Manual

    • Volume 1: Basic Architecture
    • Volume 2: Instruction Set Reference
    • Volume 3: System Programming Guide
    • Intel SDM
  2. Intel® Intrinsics Guide

  3. AMD64 Architecture Programmer's Manual

  4. Agner Fog's Optimisation Manuals

    • Instruction Tables
    • Microarchitecture Guide
    • Optimising Assembly
    • Optimising C++
    • Agner's Site

Microarchitecture Resources

  1. uops.info

    • Detailed instruction latency/throughput data
    • Port usage information
    • uops.info
  2. WikiChip

    • CPU microarchitecture details
    • WikiChip
  3. Chips and Cheese

Tools

  1. Compiler Explorer (Godbolt)

  2. LLVM-MCA (Machine Code Analyzer)

    • Static analysis of assembly
    • llvm-mca -mcpu=skylake < code.s
  3. Intel SDE (Software Development Emulator)

Academic Papers

  1. SIMD Vectorisation:

    • "Automatic SIMD Vectorisation of SSA-based Control Flow Graphs" (CGO 2011)
    • "Exploiting Superword Level Parallelism with Multimedia Instruction Sets" (PLDI 2000)
  2. Performance Modeling:

    • "Roofline: An Insightful Visual Performance Model" (CACM 2009)
    • "A Stochastic Model for Superscalar Execution" (IEEE TC 2018)

Appendix A: Quick Reference Cards

SSE/AVX Register Naming

XMM0-XMM15:  128-bit (SSE, AVX lower half)
YMM0-YMM15:  256-bit (AVX, AVX2)
ZMM0-ZMM31:  512-bit (AVX-512)
K0-K7:       Opmask registers (AVX-512)

Common Instruction Prefixes

| Prefix | Meaning | Example |
|---|---|---|
| V | VEX/EVEX encoded | VADDPS vs ADDPS |
| P | Packed | PMULLD |
| S | Scalar | ADDSS |
| B | Byte | PSHUFB |
| W | Word | PMULLW |
| D | Dword | PADDD |
| Q | Qword | PADDQ |
| H | Horizontal | HADDPS |
| U | Unaligned | MOVUPS |
| A | Aligned | MOVAPS |
| NT | Non-temporal | MOVNTPS |

Intrinsic Naming Convention

_mm256_add_ps
  │    │   │
  │    │   └─ Element type (ps=packed single, pd=packed double, 
  │    │                     epi32=packed int32, si256=256-bit int)
  │    └───── Operation
  └────────── Register width (_mm=128, _mm256=256, _mm512=512)

Appendix B: Instruction Set Detection

#include <cpuid.h>
#include <stdbool.h>

typedef struct {
    bool sse, sse2, sse3, ssse3, sse41, sse42;
    bool avx, avx2;
    bool avx512f, avx512bw, avx512dq, avx512vl;
    bool avx512vnni, avx512vbmi, avx512ifma;
    bool bmi1, bmi2, popcnt, aesni, pclmul;
    bool fma, f16c, rdrand, rdseed;
} CPUFeatures;

CPUFeatures detect_features(void) {
    CPUFeatures f = {0};
    unsigned int eax, ebx, ecx, edx;
    
    __cpuid(1, eax, ebx, ecx, edx);
    f.sse    = (edx >> 25) & 1;
    f.sse2   = (edx >> 26) & 1;
    f.sse3   = (ecx >> 0) & 1;
    f.ssse3  = (ecx >> 9) & 1;
    f.sse41  = (ecx >> 19) & 1;
    f.sse42  = (ecx >> 20) & 1;
    f.aesni  = (ecx >> 25) & 1;
    f.pclmul = (ecx >> 1) & 1;
    f.avx    = (ecx >> 28) & 1;  /* NB: usable AVX/AVX-512 also requires an OSXSAVE + XGETBV (XCR0) check */
    f.fma    = (ecx >> 12) & 1;
    f.f16c   = (ecx >> 29) & 1;
    f.popcnt = (ecx >> 23) & 1;
    f.rdrand = (ecx >> 30) & 1;
    
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    f.avx2       = (ebx >> 5) & 1;
    f.bmi1       = (ebx >> 3) & 1;
    f.bmi2       = (ebx >> 8) & 1;
    f.avx512f    = (ebx >> 16) & 1;
    f.avx512dq   = (ebx >> 17) & 1;
    f.avx512ifma = (ebx >> 21) & 1;
    f.avx512bw   = (ebx >> 30) & 1;
    f.avx512vl   = (ebx >> 31) & 1;
    f.avx512vbmi = (ecx >> 1) & 1;
    f.avx512vnni = (ecx >> 11) & 1;
    f.rdseed     = (ebx >> 18) & 1;
    
    return f;
}
