WillKirkmanM/x86

x86

Top 10 x86 Instructions Covering Hardware-Accelerated Cryptography & Finite Field Arithmetic, Vectorised String Processing & Text Mining, Precision Scientific Computing, Digital Signal Processing, FFI, AVX-512 SIMD, Galois Fields (GF(2ⁿ)), AES-NI & Hardware Transactional Memory — Including 10 Honorable Mentions


Table of Contents

  1. Introduction
  2. Architectural Foundations
  3. Repository Structure
  4. Deep Learning & Tensor Operations
  5. Cryptography & Finite Fields
  6. String Processing & Text Mining
  7. Scientific Computing & Signal Processing
  8. Honorable Mentions: The Unsung Heroes
  9. Performance Analysis & Microarchitecture
  10. References & Manuals

Introduction

Single Instruction, Multiple Data (SIMD) represents the fundamental paradigm shift in modern processor design, enabling parallel execution of identical operations across multiple data elements simultaneously. This atlas provides comprehensive coverage of x86-64 SIMD extensions from their SSE origins through the latest AVX-512 instruction subsets.

The SIMD Advantage

The theoretical speedup of SIMD operations follows from Amdahl's Law applied to data parallelism:

$$S_{\text{SIMD}} = \frac{1}{(1-P) + \frac{P}{N}}$$

Where:

  • $S_{\text{SIMD}}$ = Speedup factor
  • $P$ = Fraction of parallelizable code
  • $N$ = Vector width (elements per register)

For AVX-512 with 512-bit registers processing 32-bit floats ($N=16$), and $P=0.95$:

$$S_{\text{SIMD}} = \frac{1}{0.05 + \frac{0.95}{16}} = \frac{1}{0.109375} \approx 9.1\times$$
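As a quick sanity check (not part of the repository), the speedup formula can be evaluated directly:

```python
def simd_speedup(p, n):
    """Amdahl's-law speedup for parallelisable fraction p and vector width n."""
    return 1.0 / ((1.0 - p) + p / n)
```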

Historical Evolution

timeline
    title x86 SIMD Evolution
    1997 : MMX (64-bit, integer only)
    1999 : SSE (128-bit XMM, SP float)
    2001 : SSE2 (DP float, integer)
    2004 : SSE3 (horizontal ops)
    2006 : SSSE3 (shuffle, abs)
    2007 : SSE4.1 (blend, round)
    2008 : SSE4.2 (strings, CRC)
    2011 : AVX (256-bit YMM)
    2013 : AVX2 (integer 256-bit)
    2016 : AVX-512 (512-bit ZMM)
    2019 : AVX-512 VNNI
    2023 : AVX-512 FP16, AVX10

Architectural Foundations

Register Hierarchy

The x86-64 SIMD register file follows a nested structure where smaller registers alias to portions of larger ones:

┌──────────────────────────────────────────────────────────────────┐
│                              ZMM0 (512 bits)                     │
├────────────────────────────────────┬─────────────────────────────┤
│              YMM0 (256 bits)       │         (upper 256)         │
├──────────────────┬─────────────────┤                             │
│   XMM0 (128 bits)│   (upper 128)   │                             │
└──────────────────┴─────────────────┴─────────────────────────────┘

Register counts by extension:

| Extension | Registers | Width | Total Capacity |
|-----------|-----------|-------|----------------|
| SSE-SSE4  | XMM0-15   | 128b  | 256 bytes      |
| AVX/AVX2  | YMM0-15   | 256b  | 512 bytes      |
| AVX-512   | ZMM0-31   | 512b  | 2048 bytes     |

AVX-512 also introduces 8 opmask registers (k0-k7) for predicated execution:

$$\text{Result}_i = \begin{cases} \text{Operation}(A_i, B_i) & \text{if } k_j[i] = 1 \\ \text{Merge/Zero} & \text{if } k_j[i] = 0 \end{cases}$$

Memory Alignment & Cache Considerations

Optimal SIMD performance requires proper data alignment. Misaligned accesses on older architectures incurred significant penalties:

$$T_{\text{misaligned}} = T_{\text{aligned}} + T_{\text{penalty}} \cdot \lceil \frac{\text{offset}}{\text{cache line}} \rceil$$

Cache line split behavior:

graph LR
    subgraph "Cache Line 0 (64 bytes)"
        A[Bytes 0-63]
    end
    subgraph "Cache Line 1 (64 bytes)"
        B[Bytes 64-127]
    end
    
    C[Aligned 64B Load] --> A
    D[Misaligned Load] --> A
    D --> B

Alignment requirements by instruction set:

| Instruction  | Aligned Variant | Unaligned Variant | Penalty (pre-Nehalem) |
|--------------|-----------------|-------------------|-----------------------|
| SSE Load     | MOVAPS          | MOVUPS            | ~10 cycles            |
| AVX Load     | VMOVAPS         | VMOVUPS           | ~0-1 cycles           |
| AVX-512 Load | VMOVAPS         | VMOVUPS           | ~0 cycles             |

Instruction Encoding

AVX-512 instructions use the EVEX prefix (4 bytes), enabling:

  • 32 vector registers (5-bit encoding)
  • Opmask registers
  • Broadcast from memory
  • Embedded rounding control
  • Suppress-all-exceptions (SAE)
EVEX Prefix Structure:
┌────────┬────────┬────────┬────────┐
│  62h   │ P0     │ P1     │ P2     │
└────────┴────────┴────────┴────────┘
           │         │         │
           │         │         └── aaa: opmask, z: zeroing, L'L: vector length
           │         └── vvvv: NDS register, W: operand size
           └── RXB: register extensions, R': high-16 regs, mm: opcode map

Repository Structure

graph TD;
    Root[x86 SIMD Atlas]
    Root-->AI_ML[Deep Learning & AI]
    Root-->Crypto[Cryptography]
    Root-->Strings[String Processing]
    Root-->HPC[Scientific HPC]
    Root-->Util[Utility Scripts]
    
    AI_ML-->VPDPBUSD[VPDPBUSD.asm<br/>INT8 Dot Product]
    AI_ML-->VPTERNLOGD[VPTERNLOGD.asm<br/>Ternary Logic]
    
    Crypto-->PCLMULQDQ[PCLMULQDQ.asm<br/>GF Multiplication]
    
    Strings-->PCMPESTRI[PCMPESTRI.asm<br/>Substring Search]
    Strings-->PMOVMSKB[PMOVMSKB.asm<br/>Mask Extraction]
    
    HPC-->VGATHERDPS[VGATHERDPS.asm<br/>Sparse Gather]
    HPC-->VPERM2B[VPERM2B.asm<br/>Byte Permutation]
    HPC-->RSQRTPS[RSQRTPS.asm<br/>Fast Inv Sqrt]
    HPC-->MPSADBW[MPSADBW.asm<br/>Motion Estimation]
    
    Util-->rename[rename_to_uppercase.py]
    Util-->readme[readme.py]
    
    style AI_ML fill:#e1f5fe
    style Crypto fill:#fff3e0
    style Strings fill:#f3e5f5
    style HPC fill:#e8f5e9

Deep Learning & Tensor Operations

Modern neural networks rely on massive matrix multiplications. The computational intensity of a single convolution layer is:

$$\text{FLOPs} = 2 \times K^2 \times C_{in} \times C_{out} \times H_{out} \times W_{out}$$

Where $K$ = kernel size, $C$ = channels, and $H/W$ = spatial dimensions.

VNNI Dot Product (INT8)

Instruction: VPDPBUSD (Vector Packed Dot Product of Unsigned Byte and Signed Byte to Dword)

File: VPDPBUSD.asm

This instruction performs the core INT8 multiply-accumulate operation essential for quantised neural network inference:

$$\text{ACC}_i = \text{ACC}_i + \sum_{j=0}^{3} A_{4i+j}^{u8} \times B_{4i+j}^{s8}$$

Where each 32-bit accumulator lane $i$ receives products from 4 byte pairs.

Mathematical breakdown for one lane:

Input A (unsigned bytes): [a₀, a₁, a₂, a₃]  (values 0-255)
Input B (signed bytes):   [b₀, b₁, b₂, b₃]  (values -128 to 127)

Output = ACC + (a₀×b₀) + (a₁×b₁) + (a₂×b₂) + (a₃×b₃)
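A minimal Python reference model of one accumulator lane (an illustration, not the repository's code — note that plain VPDPBUSD wraps the 32-bit accumulator rather than saturating; VPDPBUSDS is the saturating variant):

```python
def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """One 32-bit lane of VPDPBUSD: acc += sum of four u8*s8 products,
    with signed 32-bit wrap-around like the hardware accumulator."""
    s = acc + sum(a * b for a, b in zip(a_bytes, b_bytes))
    s &= 0xFFFFFFFF
    return s - (1 << 32) if s >= (1 << 31) else s
```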

Throughput comparison (ops/cycle on Cascade Lake):

xychart-beta
    title "INT8 Throughput Comparison"
    x-axis ["Scalar", "SSE4", "AVX2", "AVX-512", "VNNI"]
    y-axis "GOP/s per Core" 0 --> 4000
    bar [125, 500, 1000, 2000, 4000]

Implementation:

; filepath: VPDPBUSD.asm
global vnni_dot_product

section .text
vnni_dot_product:
    ; Inputs:
    ; ZMM0: Accumulator (32-bit integers, 16 lanes)
    ; ZMM1: Input A (Unsigned Bytes - 64 elements)
    ; ZMM2: Input B (Signed Bytes - 64 elements)
    
    vpdpbusd zmm0, zmm1, zmm2
    
    ; Each of 16 dword lanes accumulates:
    ; sum of 4 products of (u8 × s8)
    ret

Quantisation error analysis:

The quantisation from FP32 to INT8 introduces error bounded by:

$$\epsilon_q = \frac{\text{scale}}{2} = \frac{\max(|x|)}{127 \times 2}$$

For symmetric quantisation with scale $s$ and zero-point $z=0$:

$$x_q = \text{round}\left(\frac{x}{s}\right), \quad x_{dq} = x_q \times s$$

Ternary Logic

Instruction: VPTERNLOGD (Bitwise Ternary Logic)

File: VPTERNLOGD.asm

VPTERNLOGD implements any 3-input boolean function via an 8-bit lookup table (immediate):

$$\text{Result} = \text{LUT}[A \cdot 4 + B \cdot 2 + C]$$

The truth table maps all 8 input combinations to outputs:

| Index | A | B | C | LUT Bit Position |
|-------|---|---|---|------------------|
| 0     | 0 | 0 | 0 | imm8[0]          |
| 1     | 0 | 0 | 1 | imm8[1]          |
| 2     | 0 | 1 | 0 | imm8[2]          |
| 3     | 0 | 1 | 1 | imm8[3]          |
| 4     | 1 | 0 | 0 | imm8[4]          |
| 5     | 1 | 0 | 1 | imm8[5]          |
| 6     | 1 | 1 | 0 | imm8[6]          |
| 7     | 1 | 1 | 1 | imm8[7]          |

Common immediate values:

| Operation        | Formula | imm8 |
|------------------|---------|------|
| A AND B AND C    | $A \land B \land C$ | 0x80 |
| A OR B OR C      | $A \lor B \lor C$ | 0xFE |
| A XOR B XOR C    | $A \oplus B \oplus C$ | 0x96 |
| (A AND B) OR C   | $(A \land B) \lor C$ | 0xEA |
| (A OR B) XOR C   | $(A \lor B) \oplus C$ | 0x56 |
| Majority(A,B,C)  | $(A \land B) \lor (B \land C) \lor (A \land C)$ | 0xE8 |
| A ? B : C (MUX)  | $(A \land B) \lor (\lnot A \land C)$ | 0xCA |
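The LUT indexing is easy to model bit-by-bit; this small Python sketch (an illustration, not the repository's code) reproduces the instruction's per-bit behaviour:

```python
def vpternlog_bit(a, b, c, imm8):
    """One bit position of VPTERNLOGD: the three input bits form an
    index (A as the MSB) into the 8-bit immediate lookup table."""
    return (imm8 >> (a * 4 + b * 2 + c)) & 1
```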

Instruction count reduction:

graph LR
    subgraph "Traditional (3 instructions)"
        A1[VPAND zmm3, zmm0, zmm1]
        A2[VPOR zmm3, zmm3, zmm2]
        A3[VPXOR zmm3, zmm3, zmm4]
    end
    
    subgraph "VPTERNLOGD (1 instruction)"
        B1[VPTERNLOGD zmm0, zmm1, zmm2, imm8]
    end

Quantisation Mathematics

Affine Quantisation:

$$q = \text{clip}\left(\text{round}\left(\frac{r}{s}\right) + z, q_{min}, q_{max}\right)$$

Where:

  • $r$ = real value
  • $s$ = scale factor
  • $z$ = zero point
  • $q_{min}, q_{max}$ = quantised range bounds

Scale and zero-point computation:

$$s = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$$

$$z = q_{min} - \text{round}\left(\frac{r_{min}}{s}\right)$$

Matrix multiplication in quantised domain:

For $Y = XW + b$:

$$Y_q = \frac{s_x \cdot s_w}{s_y}\left(X_q - z_x\right)\left(W_q - z_w\right) + z_y + \frac{b}{s_y}$$
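The affine scheme above can be sketched in a few lines of Python (a reference model under the formulas' assumptions, not the repository's code):

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Scale and zero-point from the real and quantised ranges."""
    s = (rmax - rmin) / (qmax - qmin)
    z = qmin - round(rmin / s)
    return s, z

def quantize(r, s, z, qmin=-128, qmax=127):
    """q = clip(round(r/s) + z, qmin, qmax)"""
    return max(qmin, min(qmax, round(r / s) + z))

def dequantize(q, s, z):
    """r ≈ (q - z) * s"""
    return (q - z) * s
```

A round trip through the quantised domain should land within half a quantisation step of the original value.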


Cryptography & Finite Fields

Carry-less Multiplication

Instruction: PCLMULQDQ (Carry-Less Multiplication Quadword)

File: PCLMULQDQ.asm

Performs polynomial multiplication over GF(2), essential for:

  • AES-GCM authentication tags
  • CRC computation
  • Error-correcting codes

Mathematical definition:

For polynomials $A(x) = \sum_{i=0}^{63} a_i x^i$ and $B(x) = \sum_{j=0}^{63} b_j x^j$:

$$C(x) = A(x) \cdot B(x) = \sum_{k=0}^{126} c_k x^k$$

Where:

$$c_k = \bigoplus_{i+j=k} (a_i \land b_j)$$

Note: XOR ($\oplus$) replaces addition (no carry propagation).
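A bit-serial Python model of carry-less multiplication (an illustration of the definition above, not the repository's code) makes the XOR-instead-of-add structure explicit:

```python
def clmul(a, b):
    """Carry-less (GF(2) polynomial) multiply of two non-negative ints,
    as PCLMULQDQ computes for 64-bit operands."""
    r = 0
    while b:
        if b & 1:
            r ^= a      # XOR replaces addition: no carry propagation
        a <<= 1
        b >>= 1
    return r
```

For example, $(x + 1)^2 = x^2 + 1$ over GF(2), since the cross terms cancel.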

Immediate byte selection:

imm8 Operation
0x00 xmm1[63:0] × xmm2[63:0]
0x01 xmm1[127:64] × xmm2[63:0]
0x10 xmm1[63:0] × xmm2[127:64]
0x11 xmm1[127:64] × xmm2[127:64]

GCM GHASH computation:

GHASH(H, A, C) = X_m+n+1

Where:
X_0 = 0
X_i = (X_{i-1} ⊕ A_i) • H    for i = 1..m  (AAD blocks)
X_i = (X_{i-1} ⊕ C_i) • H    for i = m+1..m+n  (Ciphertext blocks)
X_{m+n+1} = (X_{m+n} ⊕ (len(A)||len(C))) • H

The multiplication (•) in GF(2¹²⁸) uses PCLMULQDQ followed by reduction modulo the GCM polynomial:

$$p(x) = x^{128} + x^7 + x^2 + x + 1$$

Galois Field Arithmetic

GF(2ⁿ) representation:

Elements are polynomials with binary coefficients, represented as n-bit integers:

$$a = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \cdots + a_1x + a_0 \leftrightarrow (a_{n-1}, a_{n-2}, \ldots, a_0)_2$$

Reduction algorithm for GF(2¹²⁸):

After PCLMULQDQ produces a 256-bit product, reduce modulo the irreducible polynomial:

$$\text{Result} = \text{Product} \mod (x^{128} + x^7 + x^2 + x + 1)$$

; GF(2^128) reduction after PCLMULQDQ
; Input: xmm0:xmm1 = 256-bit product
; Output: xmm0 = 128-bit reduced result

gf128_reduce:
    ; Fold high 128 bits into low
    movdqa xmm2, xmm1
    pclmulqdq xmm2, xmm3, 0x00  ; xmm3 contains reduction constant
    pxor xmm0, xmm2
    
    ; Second reduction round
    movdqa xmm2, xmm0
    psrldq xmm2, 8
    pclmulqdq xmm2, xmm3, 0x00
    pxor xmm0, xmm2
    ret
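The fold-and-reduce step can be modelled exactly with Python integers as polynomials (an illustration, not the repository's code; note real GHASH additionally uses a bit-reflected representation, which this plain-polynomial sketch ignores):

```python
GCM_P = (1 << 128) | 0x87   # x^128 + x^7 + x^2 + x + 1

def clmul(a, b):
    """Carry-less multiply (PCLMULQDQ semantics, arbitrary width)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf128_reduce(product):
    """Reduce a (up to) 256-bit carry-less product modulo the GCM polynomial."""
    for i in range(255, 127, -1):
        if product >> i & 1:
            product ^= GCM_P << (i - 128)
    return product
```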

AES Round Functions

SubBytes transformation (S-box):

$$S(a) = A \cdot a^{-1} + c$$

Where $a^{-1}$ is the multiplicative inverse in GF(2⁸) with $p(x) = x^8 + x^4 + x^3 + x + 1$, $A$ is a fixed affine matrix, and $c = 0x63$.

MixColumns transformation:

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}$$

Multiplication by 02 and 03 in GF(2⁸):

$$02 \cdot b = \left((b \ll 1) \oplus (0\text{x}1B \cdot (b \gg 7))\right) \bmod 2^8$$

$$03 \cdot b = 02 \cdot b \oplus b$$
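These two multiplications can be checked against the worked examples in FIPS-197 with a tiny Python model (the helper names `xtime`/`mul3` are ours, not the repository's):

```python
def xtime(b):
    """Multiply b by 02 in AES's GF(2^8), p(x) = x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mul3(b):
    """Multiply b by 03: 03*b = 02*b XOR b."""
    return xtime(b) ^ b
```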


String Processing & Text Mining

The String Beast (PCMPESTRI)

Instruction: PCMPESTRI (Packed Compare Explicit Length Strings, Return Index)

File: PCMPESTRI.asm

This SSE4.2 instruction performs complex string operations in a single instruction:

Control byte (imm8) encoding:

| Bits | Field                 | Description |
|------|-----------------------|-------------|
| 1:0  | Source Data Format    | 00=UBytes, 01=UWords, 10=SBytes, 11=SWords |
| 3:2  | Aggregation Operation | 00=Equal Any, 01=Ranges, 10=Equal Each, 11=Equal Ordered |
| 5:4  | Polarity              | 00=Positive, 01=Negative, 10=Masked+, 11=Masked- |
| 6    | Output Selection      | 0=Least Significant, 1=Most Significant |

Aggregation operations:

  1. Equal Any (00): Character set membership $$\text{IntRes1}[i] = \bigvee_{j=0}^{n-1} (\text{Src1}[j] = \text{Src2}[i])$$

  2. Ranges (01): Character range membership $$\text{IntRes1}[i] = \bigvee_{j=0}^{n/2-1} (\text{Src1}[2j] \leq \text{Src2}[i] \leq \text{Src1}[2j+1])$$

  3. Equal Each (10): Byte-by-byte equality $$\text{IntRes1}[i] = (\text{Src1}[i] = \text{Src2}[i])$$

  4. Equal Ordered (11): Substring search $$\text{IntRes1}[i] = \bigwedge_{j=0}^{n-1} (\text{Src1}[j] = \text{Src2}[i+j])$$

Example: Finding "World" in "Hello World":

; filepath: PCMPESTRI.asm
; ...existing code...
    ; xmm0 = needle ("World"), rax = needle length (5)
    ; xmm1 = haystack ("Hello World"), rdx = haystack length (11)
    ; (PCMPESTRI takes the explicit lengths implicitly in EAX/EDX)
    pcmpestri xmm0, xmm1, 0x0C  ; Equal Ordered, unsigned byte data
    ; RCX = 6 (index where "World" starts)
; ...existing code...
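A simplified Python model of the Equal Ordered aggregation (an illustration, not the repository's code; it assumes the needle lies fully inside the 16-byte block, whereas the real instruction also tracks partial matches at the block boundary):

```python
def equal_ordered_index(needle, haystack):
    """Lowest index at which the needle matches, else the block width
    (PCMPESTRI reports 'no match' as 16 for byte data)."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return 16
```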

Performance vs. scalar loop:

xychart-beta
    title "Substring Search Performance (cycles per search)"
    x-axis ["strlen", "strstr scalar", "PCMPESTRI", "AVX2 hybrid"]
    y-axis "Cycles" 0 --> 500
    bar [45, 380, 85, 120]

Mask Extraction (PMOVMSKB)

Instruction: PMOVMSKB (Move Byte Mask)

File: PMOVMSKB.asm

Extracts the most significant bit of each byte into a general-purpose register:

$$\text{Result} = \sum_{i=0}^{15} (\text{xmm}[8i+7] \ll i)$$

Null terminator detection algorithm:

1. Load 16 bytes into XMM register
2. Compare each byte against zero (PCMPEQB)
3. Extract comparison mask (PMOVMSKB)
4. If mask ≠ 0, find first set bit (BSF)
5. Repeat for next 16-byte chunk
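The algorithm above translates directly into a byte-level Python model (an illustration, not the repository's code):

```python
def pcmpeqb_zero_mask(chunk):
    """PCMPEQB against zero followed by PMOVMSKB: bit i set iff byte i == 0."""
    return sum(1 << i for i, b in enumerate(chunk) if b == 0)

def simd_strlen(buf):
    """16-byte-chunked strlen; BSF on the mask finds the terminator."""
    i = 0
    while i < len(buf):
        mask = pcmpeqb_zero_mask(buf[i:i + 16])
        if mask:
            return i + (mask & -mask).bit_length() - 1  # BSF
        i += 16
    return len(buf)
```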

Vectorised strlen cost:

Each iteration retires one 16-byte load, one compare, and one mask extraction, so the loop executes $\lceil n/16 \rceil$ times versus $n$ iterations for a byte-at-a-time scalar loop — the same $O(n)$ asymptotically, but with roughly a 16× smaller constant factor.

SIMD String Algorithms

Vectorised memchr:

vectorised_memchr:
    ; rdi = buffer, rsi = byte to find, rdx = length
    ; NB: sketch only — the loop below may read up to 15 bytes past the
    ; end of the buffer; production code must handle the tail safely
    movd xmm0, esi
    pxor xmm1, xmm1
    pshufb xmm0, xmm1      ; Broadcast byte to all lanes
    
.loop:
    movdqu xmm1, [rdi]
    pcmpeqb xmm1, xmm0
    pmovmskb eax, xmm1
    test eax, eax
    jnz .found
    add rdi, 16
    sub rdx, 16
    jg .loop
    xor eax, eax
    ret
    
.found:
    bsf eax, eax
    add rax, rdi
    ret

Scientific Computing & Signal Processing

Sparse Memory Gather

Instruction: VGATHERDPS (Gather Packed Single-Precision Floating-Point Values with Signed Dword Indices)

File: VGATHERDPS.asm

Performs indexed loads from memory using vector indices:

$$\text{Dest}[i] = \text{Mem}[\text{Base} + \text{Index}[i] \times \text{Scale}]$$
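The addressing formula above is straightforward to model in Python (an illustration, not the repository's code; memory is represented as a dict of byte address to value):

```python
def vgatherdps(mem, base, indices, scale=4):
    """VGATHERDPS model: Dest[i] = Mem[Base + Index[i] * Scale],
    one indexed load per lane (scale 4 for 32-bit floats)."""
    return [mem[base + idx * scale] for idx in indices]
```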

Memory access pattern visualisation:

graph TB
    subgraph "Memory (Linear)"
        M0[addr+0]
        M1[addr+4]
        M2[addr+8]
        M3[addr+12]
        M4[addr+16]
        M5[addr+20]
        M6[addr+24]
        M7[addr+28]
    end
    
    subgraph "Indices ZMM1"
        I0[0]
        I1[4]
        I2[8]
        I3[12]
    end
    
    subgraph "Result ZMM0"
        R0[data@0]
        R1[data@16]
        R2[data@32]
        R3[data@48]
    end
    
    I0 --> M0
    I1 --> M4
    I2 --> M2
    I3 --> M6
    
    M0 --> R0
    M4 --> R1
    M2 --> R2
    M6 --> R3

Gather performance model:

$$T_{\text{gather}} = T_{\text{base}} + N_{\text{cache lines}} \times T_{\text{cache miss}}$$

Where $N_{\text{cache lines}}$ depends on index spread:

$$N_{\text{cache lines}} \leq \min\left(N_{\text{elements}}, \lceil\frac{\max(\text{Index}) - \min(\text{Index})}{\text{Cache Line Size}}\rceil + 1\right)$$

Sparse matrix-vector multiplication:

For $y = Ax$ where $A$ is sparse in CSR format:

spmv_csr:
    ; For each row i:
    ;   y[i] = sum(A.values[j] * x[A.col_idx[j]]) for j in [row_ptr[i], row_ptr[i+1])
    
    ; Load column indices
    vmovdqu32 zmm1, [col_idx + rcx*4]
    
    ; Gather x values at those indices
    kxnorw k1, k1, k1
    vgatherdps zmm2{k1}, [x + zmm1*4]
    
    ; Load A values and multiply
    vmovups zmm3, [values + rcx*4]
    vmulps zmm4, zmm2, zmm3
    
    ; Horizontal sum for y[i]
    ; ...

Video Motion Estimation (MPSADBW)

Instruction: MPSADBW (Multiple Packed Sums of Absolute Differences)

File: MPSADBW.asm

Computes 8 SADs simultaneously for block matching in video codecs:

$$\text{SAD}_k = \sum_{i=0}^{3} |R_i - S_{k+i}|$$

For $k = 0, 1, 2, \ldots, 7$ simultaneously.
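One MPSADBW group of eight overlapping SADs can be modelled in a couple of lines (an illustration of the formula above, not the repository's code):

```python
def mpsadbw(ref4, src11):
    """Eight 4-byte SADs over a sliding window: SAD_k for k = 0..7,
    given a 4-byte reference block and an 11-byte source window."""
    return [sum(abs(r - src11[k + i]) for i, r in enumerate(ref4))
            for k in range(8)]
```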

Block matching visualisation:

Reference Block (4 bytes): [R₀, R₁, R₂, R₃]

Source Window (11 bytes):  [S₀, S₁, S₂, S₃, S₄, S₅, S₆, S₇, S₈, S₉, S₁₀]
                            ├──────────┤
                            Position 0
                               ├──────────┤
                               Position 1
                                  ├──────────┤
                                  Position 2
                                     ... (8 positions total)

Motion estimation search diamond pattern:

graph TD
    C((Center)) --> N((N))
    C --> S((S))
    C --> E((E))
    C --> W((W))
    N --> |MPSADBW| B1[SAD values]
    S --> |MPSADBW| B2[SAD values]
    E --> |MPSADBW| B3[SAD values]
    W --> |MPSADBW| B4[SAD values]
    B1 --> MIN[Minimum]
    B2 --> MIN
    B3 --> MIN
    B4 --> MIN
    MIN --> |New Center| C

Rate-Distortion optimisation:

$$J = D + \lambda R$$

Where:

  • $J$ = Cost function to minimise
  • $D$ = Distortion (SAD or SATD)
  • $R$ = Bit rate for encoding motion vector
  • $\lambda$ = Lagrangian multiplier

Fast Math Approximations (RSQRTPS)

Instruction: RSQRTPS (Reciprocal Square Root of Packed Single-Precision Floating-Point Values)

File: RSQRTPS.asm

Computes approximate reciprocal square root:

$$y \approx \frac{1}{\sqrt{x}}$$

With relative error $|\epsilon| < 1.5 \times 2^{-12} \approx 0.00037$.

Newton-Raphson refinement:

For $f(y) = \frac{1}{y^2} - x = 0$, Newton's method gives:

$$y_{n+1} = y_n \cdot \left(\frac{3}{2} - \frac{x}{2} \cdot y_n^2\right)$$

Or equivalently:

$$y_{n+1} = \frac{y_n}{2} \cdot \left(3 - x \cdot y_n^2\right)$$
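The refinement step is easy to verify numerically in Python (an illustration, not the repository's code — `y0` stands in for the low-precision hardware estimate):

```python
import math

def rsqrt_nr(x, y0):
    """One Newton-Raphson step refining y0 ≈ 1/sqrt(x):
    y1 = 0.5 * y0 * (3 - x * y0^2)."""
    return 0.5 * y0 * (3.0 - x * y0 * y0)
```

Starting from a rough estimate, a single step roughly squares the relative error.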

Implementation with one NR iteration:

fast_rsqrt_nr:
    ; xmm0 = input x
    rsqrtps xmm1, xmm0           ; y0 ≈ 1/sqrt(x)
    
    ; Newton-Raphson: y1 = 0.5 * y0 * (3 - x * y0^2)
    movaps xmm2, xmm1
    mulps xmm2, xmm1             ; y0^2
    mulps xmm2, xmm0             ; x * y0^2
    movaps xmm3, [three]         ; 3.0 (broadcast constant)
    subps xmm3, xmm2             ; 3 - x*y0^2
    mulps xmm3, xmm1             ; y0 * (3 - x*y0^2)
    mulps xmm3, [half]           ; 0.5 * y0 * (3 - x*y0^2)
    
    movaps xmm0, xmm3            ; return refined result in xmm0 (error < 2^-22)
    ret

Error analysis:

| Method          | Max Relative Error       | Latency (cycles) |
|-----------------|--------------------------|------------------|
| RSQRTPS only    | $1.5 \times 2^{-12}$     | 4                |
| RSQRTPS + 1 NR  | $< 2^{-22}$              | 14               |
| RSQRTPS + 2 NR  | $< 2^{-44}$              | 24               |
| SQRTPS + DIVPS  | $< 0.5$ ulp              | 25               |

Vector normalisation:

For vector $\mathbf{v} = (x, y, z)$:

$$\hat{\mathbf{v}} = \frac{\mathbf{v}}{|\mathbf{v}|} = \mathbf{v} \cdot \frac{1}{\sqrt{x^2 + y^2 + z^2}}$$

Vector Permutation (VPERM2B)

Instruction: VPERM2B (Full Permute of Bytes from Two Tables)

File: VPERM2B.asm

Performs arbitrary byte-level permutation using indices from two source registers:

$$\text{Dest}[i] = \begin{cases} \text{Src1}[\text{Idx}[i] \bmod 64] & \text{if } \text{Idx}[i] \text{ bit } 6 = 0 \\ \text{Src2}[\text{Idx}[i] \bmod 64] & \text{if } \text{Idx}[i] \text{ bit } 6 = 1 \end{cases}$$

Applications:

  • AES T-table lookups
  • Base64 encoding/decoding
  • LUT-based S-boxes
  • Data shuffling

Example: Byte reversal within 64-bit lanes:

byte_reverse_64:
    ; Control indices for reversing bytes within each qword
    vmovdqu64 zmm0, [reverse_indices]
    ; reverse_indices: 7,6,5,4,3,2,1,0, 15,14,13,12,11,10,9,8, ...
    
    ; VPERMI2B overwrites the index register with the permuted bytes;
    ; here both tables are the same source register
    vpermi2b zmm0, zmm1, zmm1
    ret

FFT & Convolution Primitives

Radix-2 Cooley-Tukey FFT butterfly:

$$\begin{aligned} X[k] &= E[k] + W_N^k \cdot O[k] \\ X[k + N/2] &= E[k] - W_N^k \cdot O[k] \end{aligned}$$

Where $W_N^k = e^{-2\pi i k/N}$ is the twiddle factor.

Vectorised butterfly (4 complex numbers):

fft_butterfly_avx:
    ; ymm0 = [e0r, e0i, e1r, e1i, e2r, e2i, e3r, e3i] (even)
    ; ymm1 = [o0r, o0i, o1r, o1i, o2r, o2i, o3r, o3i] (odd)
    ; ymm2 = [w0r, w0i, w1r, w1i, w2r, w2i, w3r, w3i] (twiddles)
    
    ; Complex multiply: (a+bi)(c+di) = (ac-bd) + (ad+bc)i
    vmovsldup ymm4, ymm2          ; [w0r, w0r, w1r, w1r, ...]
    vmovshdup ymm5, ymm2          ; [w0i, w0i, w1i, w1i, ...]
    vmulps ymm4, ymm4, ymm1       ; [or*wr, oi*wr, ...]
    vpermilps ymm3, ymm1, 0xB1    ; [o0i, o0r, ...] (swap re/im)
    vmulps ymm5, ymm5, ymm3       ; [oi*wi, or*wi, ...]
    vaddsubps ymm4, ymm4, ymm5    ; [or*wr - oi*wi, oi*wr + or*wi] = W*O
    
    ; Butterfly
    vaddps ymm5, ymm0, ymm4       ; E + W*O
    vsubps ymm6, ymm0, ymm4       ; E - W*O
    ret

Convolution theorem:

$$h * x = \mathcal{F}^{-1}\{\mathcal{F}\{h\} \cdot \mathcal{F}\{x\}\}$$

Complexity: $O(n \log n)$ vs. direct convolution $O(n^2)$


Honorable Mentions: The Unsung Heroes

This section provides comprehensive coverage of critical instructions that don't fit neatly into the categories above but are essential for systems programming.

Arbitrary Precision Arithmetic (ADCX/ADOX)

Instructions: ADCX (Add with Carry Flag) and ADOX (Add with Overflow Flag)

These instructions enable dual carry chains for efficient big-integer arithmetic:

$$\text{ADCX: } \text{dest} = \text{dest} + \text{src} + \text{CF}, \quad \text{CF} \leftarrow \text{carry}$$ $$\text{ADOX: } \text{dest} = \text{dest} + \text{src} + \text{OF}, \quad \text{OF} \leftarrow \text{carry}$$

Why two carry chains matter:

In traditional multi-precision addition, each ADD depends on the previous carry:

Traditional (serial):
   ADD r0, a0, b0  ; CF = carry0
   ADC r1, a1, b1  ; needs carry0, CF = carry1
   ADC r2, a2, b2  ; needs carry1, CF = carry2
   ...

With ADCX/ADOX, two independent additions can proceed in parallel:

Parallel carry chains:
   ADCX r0, a0     ; CF chain
   ADOX r4, c0     ; OF chain (independent!)
   ADCX r1, a1     ; CF chain continues
   ADOX r5, c1     ; OF chain continues
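The carry propagation that makes the serial version a bottleneck can be modelled limb-by-limb in Python (an illustration, not the repository's code — ADCX/ADOX let the hardware run two such chains in parallel):

```python
def add_limbs(a, b, bits=64):
    """Multi-precision addition, limb by limb with an explicit carry flag —
    the dependency chain that ADC serialises."""
    mask = (1 << bits) - 1
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & mask)
        carry = s >> bits
    return out, carry
```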

RSA Montgomery multiplication:

The Montgomery reduction algorithm computes:

$$\text{MonPro}(a, b) = a \cdot b \cdot R^{-1} \mod N$$

Where $R = 2^k$ for k-bit modulus. ADCX/ADOX accelerate the inner loop:

montmul_inner:
    ; Compute partial product and accumulate
    mulx r10, rax, [b + rcx*8]   ; r10:rax = rdx * b[j]  (rdx holds a[i])
    adcx r8, rax                 ; Accumulate low (CF chain)
    adox r9, r10                 ; Accumulate high (OF chain)
    ; Continue for all limbs...

Performance for 4096-bit RSA:

xychart-beta
    title "RSA-4096 Performance (ops/second)"
    x-axis ["Scalar ADD", "ADCX/ADOX", "AVX-512 IFMA"]
    y-axis "Operations/s" 0 --> 2000
    bar [450, 1200, 1850]

Cycle-Accurate Timing (RDTSC/RDTSCP)

Instructions: RDTSC (Read Time-Stamp Counter) and RDTSCP (with Processor ID)

The Time-Stamp Counter is a 64-bit register counting reference cycles since reset:

$$\text{TSC}_{current} = \text{TSC}_{boot} + \text{RefCycles}_{elapsed}$$

Precise timing template:

benchmark_function:
    ; Serialise to prevent out-of-order execution
    cpuid
    rdtsc
    shl rdx, 32
    or rax, rdx
    mov r12, rax          ; Start timestamp
    
    ; === Code to benchmark ===
    call target_function
    ; =========================
    
    ; Serialise again
    rdtscp
    shl rdx, 32
    or rax, rdx
    sub rax, r12          ; Elapsed cycles
    ret

TSC frequency calculation:

On modern processors, TSC runs at the "nominal" frequency:

$$f_{TSC} \approx f_{base} \text{ (invariant TSC)}$$

To convert cycles to nanoseconds:

$$t_{ns} = \frac{\Delta\text{TSC} \times 10^9}{f_{TSC}}$$

Caveats:

  • Older CPUs: TSC varies with frequency scaling
  • Multi-socket: TSC may differ between sockets
  • VM environments: TSC may be virtualised

Population Count (POPCNT)

Instruction: POPCNT (Population Count)

Counts the number of set bits (Hamming weight):

$$\text{POPCNT}(x) = \sum_{i=0}^{63} x_i$$

Applications:

  1. Chess engines (bitboards):

    Knights attack mask: popcount(knight_attacks & enemy_pieces)
    
  2. Bioinformatics (Hamming distance): $$d_H(x, y) = \text{POPCNT}(x \oplus y)$$

  3. Bloom filters: $$\text{Fill ratio} = \frac{\text{POPCNT}(\text{filter})}{\text{size}}$$

  4. Set cardinality in bitmaps:

    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += __builtin_popcountll(bitmap[i]);

Vectorised population count (AVX-512 VPOPCNTDQ):

; Count bits in 8 qwords simultaneously
vpopcntq zmm1, zmm0
; Each qword in zmm1 contains popcount of corresponding qword in zmm0

Intentional Undefined (UD2)

Instruction: UD2 (Undefined Instruction)

Intentionally raises an Invalid Opcode Exception (#UD):

Opcode: 0F 0B
Behavior: #UD exception (Interrupt 6)

Use cases:

  1. Compiler trap for unreachable code:

    __builtin_unreachable();  // Often compiles to UD2
  2. Debug assertions:

    if (!condition) __builtin_trap();  // UD2
  3. Kernel panic/halt:

    kernel_panic:
        cli
        ud2
        jmp kernel_panic
  4. Marking invalid code paths:

    ; After noreturn function
    call exit
    ud2              ; Should never execute

Hardware Random Numbers (RDRAND/RDSEED)

Instructions:

  • RDRAND: Get random number from DRNG (NIST SP 800-90A compliant)
  • RDSEED: Get entropy directly from hardware source

Entropy source:

Intel's on-die Digital Random Number Generator uses:

  • Thermal noise from transistor metastability
  • AES-CBC-MAC conditioning
  • CTR_DRBG for RDRAND output

$$\text{RDSEED} \rightarrow \text{Entropy Pool} \xrightarrow{\text{CTR_DRBG}} \text{RDRAND}$$

Usage pattern:

get_secure_random:
    xor ecx, ecx
.retry:
    rdrand rax
    jc .success         ; CF=1 means valid random
    inc ecx
    cmp ecx, 10
    jl .retry
    xor eax, eax        ; Failed after 10 attempts
    ret
.success:
    ; rax contains 64 bits of random data
    ret

RDSEED vs RDRAND:

| Property     | RDRAND                                  | RDSEED                           |
|--------------|-----------------------------------------|----------------------------------|
| Source       | DRBG output                             | Raw conditioned entropy          |
| Rate         | Higher (shared across cores)            | Lower (limited by entropy rate)  |
| Failure rate | Very low                                | Higher under heavy load          |
| Use case     | General random numbers                  | Seeding other PRNGs              |

Transactional Memory (XBEGIN/XEND)

Instructions:

  • XBEGIN rel32: Begin transaction (branch to rel32 on abort)
  • XEND: Commit transaction
  • XABORT imm8: Explicitly abort with reason code

Transactional execution model:

stateDiagram-v2
    [*] --> Transactional: XBEGIN
    Transactional --> Committed: XEND
    Transactional --> Aborted: Conflict/XABORT
    Committed --> [*]
    Aborted --> FallbackPath
    FallbackPath --> [*]

Lock elision pattern:

transaction_update:
    xacquire lock add [lock_var], 1  ; HLE prefix (deprecated)
    ; Or with RTM:
    xbegin .fallback
    
    ; Transactional region - no locks needed!
    mov rax, [shared_data]
    add rax, 1
    mov [shared_data], rax
    
    xend
    ret
    
.fallback:
    ; Transaction aborted - use traditional locking
    lock add [shared_data], 1
    ret

Abort codes (returned in EAX):

| Bit   | Meaning                          |
|-------|----------------------------------|
| 0     | Abort caused by XABORT           |
| 1     | Transaction may succeed on retry |
| 2     | Conflict with another processor  |
| 3     | Buffer overflow                  |
| 4     | Debug breakpoint hit             |
| 5     | Nested transaction abort         |
| 31:24 | XABORT immediate value           |

Current status: Intel TSX has been disabled on many processors due to security vulnerabilities (TAA, ZombieLoad).

AES Acceleration (AESENC/AESDEC)

Instructions:

  • AESENC xmm, xmm/m128: One AES encryption round
  • AESENCLAST xmm, xmm/m128: Final AES encryption round
  • AESDEC xmm, xmm/m128: One AES decryption round
  • AESDECLAST xmm, xmm/m128: Final AES decryption round
  • AESKEYGENASSIST xmm, xmm/m128, imm8: Key expansion assistance

AES round structure:

Each AESENC performs:

  1. SubBytes (S-box substitution)
  2. ShiftRows (cyclic row shifts)
  3. MixColumns (column mixing)
  4. AddRoundKey (XOR with round key)

$$\text{AESENC}(\text{state}, \text{key}) = \text{AddRoundKey}(\text{MixColumns}(\text{ShiftRows}(\text{SubBytes}(\text{state}))), \text{key})$$

Complete AES-128 encryption:

aes128_encrypt:
    ; xmm0 = plaintext, xmm1-xmm11 = round keys
    pxor xmm0, xmm1           ; Round 0: AddRoundKey only
    aesenc xmm0, xmm2         ; Round 1
    aesenc xmm0, xmm3         ; Round 2
    aesenc xmm0, xmm4         ; Round 3
    aesenc xmm0, xmm5         ; Round 4
    aesenc xmm0, xmm6         ; Round 5
    aesenc xmm0, xmm7         ; Round 6
    aesenc xmm0, xmm8         ; Round 7
    aesenc xmm0, xmm9         ; Round 8
    aesenc xmm0, xmm10        ; Round 9
    aesenclast xmm0, xmm11    ; Round 10 (no MixColumns)
    ; xmm0 = ciphertext
    ret

Performance (cycles per byte):

| Implementation      | AES-128  | AES-256  |
|---------------------|----------|----------|
| Software table      | ~20 cpb  | ~25 cpb  |
| AES-NI              | ~0.7 cpb | ~0.9 cpb |
| AES-NI + pipelining | ~0.4 cpb | ~0.5 cpb |

Cache Control (CLFLUSH/CLFLUSHOPT/CLWB)

Instructions:

  • CLFLUSH m8: Flush cache line (ordered, slow)
  • CLFLUSHOPT m8: Flush cache line (weakly ordered, fast)
  • CLWB m8: Write back cache line (keep in cache)

Cache hierarchy:

┌─────────────┐
│   L1 Data   │  32-64 KB, ~4 cycles
├─────────────┤
│     L2      │  256 KB - 1 MB, ~12 cycles
├─────────────┤
│     L3      │  8-64 MB, ~40 cycles
├─────────────┤
│    DRAM     │  ~200 cycles
└─────────────┘

Use cases:

  1. Persistent memory (PMEM) programming:

    ; Ensure data reaches persistent storage
    mov [pmem_addr], rax
    clwb [pmem_addr]
    sfence
  2. DMA coherency:

    ; Flush before DMA read from device
    clflush [dma_buffer]
    mfence
    ; Now safe for device to read
  3. Security (defense against cache timing attacks):

    ; Clear sensitive data from cache
    clflush [secret_key]
    clflush [secret_key + 64]
  4. Rowhammer research/exploitation:

    .hammer:
        mov rax, [addr1]
        clflush [addr1]
        mov rbx, [addr2]
        clflush [addr2]
        mfence
        jmp .hammer

Byte Swap (BSWAP)

Instruction: BSWAP reg32/64

Reverses byte order for endianness conversion:

$$\text{BSWAP}(0x\text{AABBCCDD}) = 0x\text{DDCCBBAA}$$

32-bit transformation:

Before: [B₃, B₂, B₁, B₀] (little-endian)
After:  [B₀, B₁, B₂, B₃] (big-endian)

Network byte order conversion:

; htonl equivalent
htonl:
    bswap edi
    mov eax, edi
    ret

; ntohll (64-bit)
ntohll:
    bswap rdi
    mov rax, rdi
    ret

SIMD byte reversal (for vectors):

bswap_16bytes:
    ; xmm0 = input
    ; Reverse bytes using PSHUFB
    movdqa xmm1, [bswap_mask]  ; 15,14,13,...,2,1,0
    pshufb xmm0, xmm1
    ret

Cache Invalidation (INVD/WBINVD)

Instructions:

  • INVD: Invalidate caches WITHOUT writeback (Ring 0 only)
  • WBINVD: Write back AND invalidate all caches (Ring 0 only)

⚠️ DANGER LEVEL: EXTREME ⚠️

INVD sequence:
1. All modified cache lines are DISCARDED
2. Memory state becomes inconsistent
3. System may crash or corrupt data
4. Only use in very specific firmware scenarios

Legitimate use cases:

  1. BIOS/UEFI initialisation:

    ; Before enabling caching for the first time
    wbinvd
    ; Configure MTRRs
    ; Enable caching
  2. Cache-as-RAM (CAR) teardown:

    ; After copying CAR data to real RAM
    wbinvd
    ; Reconfigure caching
  3. Hardware reset preparation:

    ; Ensure memory is consistent before reset
    wbinvd
    cli
    hlt

Bit Manipulation (BMI1/BMI2)

BMI1 instructions:

  • ANDN r, r, r/m: $\text{dest} = (\lnot \text{src1}) \land \text{src2}$
  • BEXTR r, r/m, r: Extract bit field
  • BLSI r, r/m: Isolate lowest set bit: $\text{dest} = \text{src} \land (-\text{src})$
  • BLSMSK r, r/m: Mask up to lowest set bit: $\text{dest} = \text{src} \oplus (\text{src} - 1)$
  • BLSR r, r/m: Reset lowest set bit: $\text{dest} = \text{src} \land (\text{src} - 1)$
  • TZCNT r, r/m: Count trailing zeros

BMI2 instructions:

  • BZHI r, r/m, r: Zero high bits starting at position
  • MULX r, r, r/m: Unsigned multiply without affecting flags
  • PDEP r, r, r/m: Parallel bit deposit
  • PEXT r, r, r/m: Parallel bit extract
  • RORX r, r/m, imm8: Rotate right without affecting flags
  • SARX/SHLX/SHRX r, r/m, r: Shifts without affecting flags

PDEP/PEXT visualisation:

PEXT (extract):
  Source:  1 0 1 1 0 1 1 0
  Mask:    1 0 1 0 0 1 0 0
  Result:  0 0 0 0 0 1 1 1  (extracted bits: 1,1,1)

PDEP (deposit):
  Source:  0 0 0 0 0 1 0 1
  Mask:    1 0 1 0 0 1 0 0
  Result:  1 0 0 0 0 1 0 0  (deposited at mask positions)
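Both operations can be emulated in a few lines of Python (an illustration, not the repository's code), which also makes the round-trip identity `pdep(pext(x, m), m) == x & m` easy to check:

```python
def pext(src, mask):
    """PEXT: gather the bits of src selected by mask into the low bits."""
    out, bit = 0, 0
    while mask:
        low = mask & -mask          # lowest set bit of the mask
        if src & low:
            out |= 1 << bit
        bit += 1
        mask &= mask - 1            # clear that mask bit
    return out

def pdep(src, mask):
    """PDEP: scatter the low bits of src to the set positions of mask."""
    out, bit = 0, 0
    while mask:
        low = mask & -mask
        if src >> bit & 1:
            out |= low
        bit += 1
        mask &= mask - 1
    return out
```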

Chess move generation with PEXT:

generate_rook_attacks:
    ; rdi = square, rsi = occupied bitboard
    mov rax, [rook_masks + rdi*8]   ; Get relevant occupancy mask
    pext rax, rsi, rax               ; Extract relevant bits
    mov rcx, [rook_attacks + rdi*8]  ; Get attack table base
    mov rax, [rcx + rax*8]           ; Look up attacks
    ret

CRC32 Acceleration

Instruction: CRC32 r, r/m

Hardware-accelerated CRC-32C (Castagnoli polynomial):

$$P(x) = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} + x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1$$

Polynomial representation: 0x1EDC6F41; the instruction operates on the bit-reflected form 0x82F63B78 (used by iSCSI, ext4, Btrfs)

Usage:

crc32c_block:
    ; rdi = buffer, rsi = length
    mov eax, 0xFFFFFFFF   ; Standard CRC-32C initialises CRC to ~0
    
.loop_8:
    cmp rsi, 8
    jl .loop_1
    crc32 rax, qword [rdi]
    add rdi, 8
    sub rsi, 8
    jmp .loop_8
    
.loop_1:
    test rsi, rsi
    jz .done
    crc32 rax, byte [rdi]
    inc rdi
    dec rsi
    jmp .loop_1
    
.done:
    ; Finalise: return ~crc for standard CRC-32C
    not eax
    ret

Performance comparison:

| Method | Throughput |
|---|---|
| Table-driven (8-bit) | ~500 MB/s |
| Slicing-by-8 | ~2 GB/s |
| CRC32 instruction | ~20 GB/s |
| PCLMULQDQ folding | ~40 GB/s |

Performance Analysis & Microarchitecture

Instruction Latency Tables

Key instructions across microarchitectures (cycles):

| Instruction | Skylake | Ice Lake | Zen 3 | Zen 4 |
|---|---|---|---|---|
| VADDPS (256) | 4 | 4 | 3 | 3 |
| VMULPS (256) | 4 | 4 | 3 | 3 |
| VFMADD (256) | 4 | 4 | 4 | 4 |
| VDIVPS (256) | 11 | 11 | 10 | 10 |
| VSQRTPS (256) | 12 | 12 | 12 | 13 |
| VPERM2B (512) | 3 | 3 | - | 4 |
| VPDPBUSD (512) | 5 | 5 | - | 4 |
| PCLMULQDQ | 7 | 6 | 4 | 3 |
| AESENC | 4 | 4 | 4 | 3 |

Extended latency table (additional instructions):

| Instruction | Skylake | Ice Lake | Zen 3 | Zen 4 | Description |
|---|---|---|---|---|---|
| VADDPS (512) | 4 | 4 | - | 3 | FP32 vector add |
| VMULPS (512) | 4 | 4 | - | 3 | FP32 vector multiply |
| VFMADD (512) | 4 | 4 | - | 4 | Fused multiply-add |
| VDIVPS (512) | 18 | 11 | - | 10 | FP32 vector divide |
| VSQRTPS (512) | 20 | 19 | - | 13 | FP32 vector sqrt |
| VPAND (256) | 1 | 1 | 1 | 1 | Bitwise AND |
| VPOR (256) | 1 | 1 | 1 | 1 | Bitwise OR |
| VPXOR (256) | 1 | 1 | 1 | 1 | Bitwise XOR |
| VPSHUFB (256) | 1 | 1 | 1 | 1 | Byte shuffle |
| VPERMD (256) | 3 | 3 | 2 | 2 | Dword permute |
| VPERMQ (256) | 3 | 3 | 2 | 2 | Qword permute |
| VGATHERDPS (256) | 22 | 15 | 12 | 10 | Gather (best case) |
| VSCATTERDPS (512) | 40 | 20 | - | 15 | Scatter (best case) |
| POPCNT | 3 | 3 | 1 | 1 | Population count |
| LZCNT | 3 | 3 | 1 | 1 | Leading zero count |
| TZCNT | 3 | 3 | 1 | 1 | Trailing zero count |
| PDEP | 3 | 3 | 3 | 3 | Parallel bit deposit |
| PEXT | 3 | 3 | 3 | 3 | Parallel bit extract |
| CRC32 | 3 | 3 | 3 | 3 | CRC acceleration |
| RDTSC | 20 | 15 | 9 | 8 | Read timestamp |
| RDRAND | ~400 | ~200 | ~1200 | ~800 | Hardware random |

Note: Zen 1 and Zen 2 implement PDEP/PEXT via microcode, with data-dependent latency reaching hundreds of cycles. Zen 3 added native hardware support (3 cycles).

Port Utilisation

Skylake/Ice Lake execution ports:

graph TD
    subgraph "Front End (4-wide decode)"
        IF[Instruction Fetch] --> ID[Instruction Decode]
        ID --> UOP[µop Queue]
        UOP --> RAT[Register Alias Table]
        RAT --> ROB[Reorder Buffer<br/>224 entries]
    end
    
    subgraph "Execution Engine (8 ports)"
        ROB --> RS[Reservation Station<br/>97 entries]
        RS --> P0[Port 0<br/>ALU, FMA<br/>DIV, SQRT]
        RS --> P1[Port 1<br/>ALU, FMA<br/>AES, CLMUL]
        RS --> P2[Port 2<br/>Load<br/>AGU]
        RS --> P3[Port 3<br/>Load<br/>AGU]
        RS --> P4[Port 4<br/>Store Data]
        RS --> P5[Port 5<br/>ALU, Shuffle<br/>Branch]
        RS --> P6[Port 6<br/>ALU<br/>Branch]
        RS --> P7[Port 7<br/>Store AGU]
    end
    
    subgraph "Memory Subsystem"
        P2 --> L1D[L1 Data Cache<br/>32KB, 8-way]
        P3 --> L1D
        P4 --> SB[Store Buffer<br/>56 entries]
        SB --> L1D
        L1D --> L2[L2 Cache<br/>256KB-1MB]
        L2 --> L3[L3 Cache<br/>Shared]
    end

Port assignment for key SIMD operations:

| Operation Category | Ports | Throughput |
|---|---|---|
| FP Add (256-bit) | P0, P1 | 2/cycle |
| FP Mul (256-bit) | P0, P1 | 2/cycle |
| FP FMA (256-bit) | P0, P1 | 2/cycle |
| FP Add (512-bit) | P0, P1 | 1/cycle |
| FP Mul (512-bit) | P0, P1 | 1/cycle |
| FP FMA (512-bit) | P0, P1 | 1/cycle |
| FP Div (any) | P0 | 1 per 3-18 cycles |
| FP Sqrt (any) | P0 | 1 per 4-20 cycles |
| Integer Add/Logic | P0, P1, P5 | 3/cycle |
| Shuffle (256-bit) | P5 | 1/cycle |
| Shuffle (512-bit) | P5 | 0.5/cycle |
| AES round | P0 | 1/cycle |
| CLMUL | P0 | 1/cycle |
| Load (256-bit) | P2, P3 | 2/cycle |
| Load (512-bit) | P2+P3 | 1/cycle |
| Store (256-bit) | P4+P7 | 1/cycle |
| Store (512-bit) | P4+P7 | 1/cycle |

Throughput Optimisation

Reciprocal throughput vs. latency:

Understanding the difference is critical for optimisation:

  • Latency: Cycles from input ready to output ready (single instruction)
  • Throughput: How often an instruction can start (reciprocal = cycles between starts)

$$\text{Effective IPC} = \min\left(\frac{1}{\text{Reciprocal Throughput}}, \frac{\text{Pipeline Depth}}{\text{Latency}}\right)$$

Example: FMA throughput analysis:

Latency = 4 cycles
Throughput = 2 ops/cycle (ports P0 + P1)
Reciprocal Throughput = 0.5 cycles

To saturate both FMA units:
- In-flight FMAs needed = latency × throughput = 4 × 2 = 8
- So use 8 independent accumulator chains, each issuing one FMA every 4 cycles

Throughput comparison chart:

xychart-beta
    title "Peak SIMD Throughput (GFLOPS @ 3GHz)"
    x-axis ["SSE (128)", "AVX (256)", "AVX-512", "AVX-512 VNNI"]
    y-axis "GFLOPS" 0 --> 400
    bar [48, 96, 192, 384]

Calculating theoretical peak:

$$\text{GFLOPS}_{\text{peak}} = \frac{\text{Vector Width (bits)}}{32} \times \text{FMA ops/cycle} \times 2 \times f_{\text{GHz}}$$

For AVX-512 at 3 GHz with 2 FMA units: $$\text{GFLOPS} = \frac{512}{32} \times 2 \times 2 \times 3 = 192 \text{ GFLOPS}$$

Microarchitectural Hazards

1. Port Contention:

When multiple instructions compete for the same execution port:

; BAD: All three need Port 0
vdivps ymm0, ymm1, ymm2    ; P0
vsqrtps ymm3, ymm4          ; P0
vrcpps ymm5, ymm6           ; P0
; Serialised execution!

; BETTER: Interleave with different ports
vdivps ymm0, ymm1, ymm2    ; P0 (starts)
vaddps ymm7, ymm8, ymm9    ; P0 or P1
vpshufb ymm10, ymm11, ymm12 ; P5
vmulps ymm13, ymm14, ymm15  ; P0 or P1
; Better utilisation

2. Register Dependencies:

Read-after-write (RAW) hazards create stalls:

; Latency-bound chain (16 cycles total)
vfmadd231ps ymm0, ymm1, ymm2   ; 4 cycles
vfmadd231ps ymm0, ymm3, ymm4   ; waits for ymm0
vfmadd231ps ymm0, ymm5, ymm6   ; waits for ymm0
vfmadd231ps ymm0, ymm7, ymm8   ; waits for ymm0

; Throughput-optimised (parallel, 4 cycles + reduce)
vfmadd231ps ymm0, ymm1, ymm2   ; 4 independent chains
vfmadd231ps ymm10, ymm3, ymm4
vfmadd231ps ymm11, ymm5, ymm6
vfmadd231ps ymm12, ymm7, ymm8
vaddps ymm0, ymm0, ymm10       ; Reduce at end
vaddps ymm11, ymm11, ymm12
vaddps ymm0, ymm0, ymm11

3. Memory Bandwidth Limits:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$$

For memory-bound code: $$\text{Achievable GFLOPS} = \text{Memory BW (GB/s)} \times \text{Arithmetic Intensity}$$

xychart-beta
    title "Roofline Model (DDR4-3200, 50 GB/s)"
    x-axis "Arithmetic Intensity (FLOP/Byte)" 0 --> 20
    y-axis "GFLOPS" 0 --> 200
    line [0, 50, 100, 150, 192, 192, 192, 192, 192, 192]

4. AVX-512 Frequency Throttling:

Heavy AVX-512 usage causes frequency reduction on some Intel CPUs:

| License Level | Trigger | Frequency Reduction |
|---|---|---|
| L0 (Normal) | Scalar/SSE | 0% |
| L1 (Light AVX) | AVX/AVX2 | ~3-5% |
| L2 (Heavy AVX-512) | AVX-512 (FP) | ~10-20% |

Mitigation strategies:

  • Use AVX2 for short bursts
  • Batch AVX-512 work to amortise transition overhead
  • Profile actual workload performance, not microbenchmarks

Cache Behaviour Analysis

Cache line utilisation:

$$\text{Utilisation} = \frac{\text{Bytes Actually Used}}{\text{Cache Lines Fetched} \times 64}$$

Example: Stride access patterns:

| Stride | Utilisation | Effective BW |
|---|---|---|
| 4 bytes (sequential float) | 100% | 50 GB/s |
| 8 bytes (every other) | 50% | 25 GB/s |
| 64 bytes (1 per line) | 6.25% | 3.1 GB/s |
| 4096 bytes (1 per page) | 0.1% | 50 MB/s |

Prefetch distance calculation:

$$\text{Prefetch Distance} = \frac{\text{Memory Latency (cycles)}}{\text{Cycles per Iteration}} \times \text{Bytes per Iteration}$$

For 200-cycle memory latency, 4-cycle loop, 64 bytes/iteration: $$\text{Distance} = \frac{200}{4} \times 64 = 3200 \text{ bytes} = 50 \text{ cache lines}$$

Instruction Fusion and Micro-op Analysis

Macro-op fusion (x86-64):

Certain instruction pairs fuse into single µops:

| Pattern | Fused? | Example |
|---|---|---|
| CMP + JCC | ✓ | cmp rax, rbx; jne label |
| TEST + JCC | ✓ | test rax, rax; jz label |
| ADD + JCC | ✓ (Intel) | add rax, 1; jnz label |
| AND + JCC | ✓ | and rax, rbx; jz label |

Micro-op counts for common SIMD:

| Instruction | µops (Skylake) | µops (Zen 3) |
|---|---|---|
| VADDPS ymm | 1 | 1 |
| VFMADD231PS ymm | 1 | 1 |
| VDIVPS ymm | 1 | 1 |
| VPSHUFB ymm | 1 | 1 |
| VGATHERDPS ymm | 5 | 4 |
| VSCATTERDPS zmm | 16+ | N/A |
| VPTERNLOGD zmm | 1 | N/A |
| VPERMI2D zmm | 3 | N/A |

Branch Prediction and SIMD

Avoiding branches with SIMD:

; Scalar with branch (misprediction penalty ~15 cycles)
.loop:
    mov eax, [rsi]
    cmp eax, 0
    jl .negative
    mov [rdi], eax
    jmp .next
.negative:
    neg eax
    mov [rdi], eax
.next:
    ; ...

; Branchless SIMD (absolute value)
    vmovdqu ymm0, [rsi]
    vpabsd ymm0, ymm0          ; No branches!
    vmovdqu [rdi], ymm0

Blend-based conditional:

$$\text{Result}_i = \text{Mask}_i \,?\, A_i : B_i$$

; Conditional max(a, b) without branches
vcmpps ymm2, ymm0, ymm1, 1        ; ymm2 = (ymm0 < ymm1) ? all-ones : all-zeros (per element)
vblendvps ymm3, ymm0, ymm1, ymm2  ; Select the larger value per element

Optimisation Patterns and Anti-Patterns

Pattern 1: Loop Unrolling

Calculate optimal unroll factor:

$$U_{\text{opt}} = \min\left(\frac{\text{Latency}}{\text{Reciprocal Throughput}},\ \frac{\text{Available Regs}}{2}\right)$$

; 4x unrolled dot product
dot_product_unrolled:
    vxorps ymm0, ymm0, ymm0    ; Accumulator 0
    vxorps ymm1, ymm1, ymm1    ; Accumulator 1
    vxorps ymm2, ymm2, ymm2    ; Accumulator 2
    vxorps ymm3, ymm3, ymm3    ; Accumulator 3
    
.loop:
    vmovups ymm4, [rdi]
    vmovups ymm5, [rdi + 32]
    vmovups ymm6, [rdi + 64]
    vmovups ymm7, [rdi + 96]
    
    vfmadd231ps ymm0, ymm4, [rsi]
    vfmadd231ps ymm1, ymm5, [rsi + 32]
    vfmadd231ps ymm2, ymm6, [rsi + 64]
    vfmadd231ps ymm3, ymm7, [rsi + 96]
    
    add rdi, 128
    add rsi, 128
    sub rcx, 32
    jg .loop
    
    ; Reduce 4 accumulators
    vaddps ymm0, ymm0, ymm1
    vaddps ymm2, ymm2, ymm3
    vaddps ymm0, ymm0, ymm2
    ; Horizontal sum of ymm0...
    ret

Pattern 2: Software Pipelining

Overlap loads with computation:

; Pipelined: load next while computing current
.prolog:
    vmovups ymm8, [rdi]        ; Load iteration 0
    add rdi, 32
    
.loop:
    vmovups ymm9, [rdi]        ; Load iteration N+1
    vfmadd231ps ymm0, ymm8, [rsi]  ; Compute iteration N
    vmovaps ymm8, ymm9         ; Prepare for next
    add rdi, 32
    add rsi, 32
    sub rcx, 8
    jg .loop
    
.epilog:
    vfmadd231ps ymm0, ymm8, [rsi]  ; Final iteration

Pattern 3: Horizontal Reduction

Efficient horizontal sum (256-bit):

horizontal_sum_256:
    ; Input: ymm0 = [a, b, c, d, e, f, g, h]
    vextractf128 xmm1, ymm0, 1     ; xmm1 = [e, f, g, h]
    vaddps xmm0, xmm0, xmm1        ; xmm0 = [a+e, b+f, c+g, d+h]
    vhaddps xmm0, xmm0, xmm0       ; xmm0 = [a+e+b+f, c+g+d+h, ...]
    vhaddps xmm0, xmm0, xmm0       ; xmm0 = [sum, sum, sum, sum]
    ; Result in xmm0[0]
    ret

Efficient horizontal sum (512-bit):

horizontal_sum_512:
    ; Input: zmm0
    vextractf64x4 ymm1, zmm0, 1
    vaddps ymm0, ymm0, ymm1
    vextractf128 xmm1, ymm0, 1
    vaddps xmm0, xmm0, xmm1
    vhaddps xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    ret

Anti-Pattern 1: Unnecessary Zeroing

; BAD: Two instructions where one suffices
vmovss xmm0, [rdi]             ; Scalar load (also zeroes upper bits)
vaddss xmm0, xmm0, xmm1

; BETTER: Fold the load into the VEX-encoded scalar op
vaddss xmm0, xmm1, [rdi]       ; Single instruction

Anti-Pattern 2: Lane Crossing Penalties

; EXPENSIVE: Cross-lane shuffle in AVX2 (3-cycle latency)
vpermd ymm0, ymm1, ymm2

; CHEAPER: In-lane operations when possible (1-cycle latency)
vpshufb ymm0, ymm1, ymm2       ; Within 128-bit lanes

Anti-Pattern 3: Gather/Scatter Overuse

; SLOW: Gather for contiguous-ish data
vgatherdps ymm0, [rdi + ymm1*4], ymm2  ; 15+ cycles

; FASTER: Load + permute for predictable patterns
vmovups ymm0, [rdi]
vpermd ymm0, ymm3, ymm0        ; 4-5 cycles total

Debugging and Profiling SIMD Code

Intel VTune Markers

#include <ittnotify.h>

void simd_kernel() {
    /* Create the domain and string handle once, not on every call */
    static __itt_domain *domain;
    static __itt_string_handle *handle;
    if (!domain) {
        domain = __itt_domain_create("SIMD");
        handle = __itt_string_handle_create("kernel");
    }

    __itt_task_begin(domain, __itt_null, __itt_null, handle);

    // SIMD code here

    __itt_task_end(domain);
}

Performance Counters

Key events to monitor:

| Counter | Description | Target |
|---|---|---|
| FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE | 256-bit FP ops | Maximise |
| FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE | 512-bit FP ops | Maximise |
| UOPS_DISPATCHED_PORT.PORT_0 | Port 0 usage | Balance |
| UOPS_DISPATCHED_PORT.PORT_1 | Port 1 usage | Balance |
| UOPS_DISPATCHED_PORT.PORT_5 | Port 5 usage | Balance |
| MEM_LOAD_RETIRED.L1_MISS | L1 cache misses | Minimise |
| MEM_LOAD_RETIRED.L3_MISS | L3 cache misses | Minimise |
| MACHINE_CLEARS.COUNT | Pipeline clears | Zero |

Using perf on Linux:

perf stat -e fp_arith_inst_retired.256b_packed_single,\
fp_arith_inst_retired.512b_packed_single,\
cycles,instructions ./my_simd_program

Common Issues Checklist

| Symptom | Likely Cause | Solution |
|---|---|---|
| Low IPC | Dependency chains | Unroll, use more accumulators |
| High L1 misses | Poor spatial locality | Prefetch, tile loops |
| Port 0 saturated | Too many DIV/SQRT | Approximate or interleave |
| Frequency drop | Heavy AVX-512 | Consider AVX2, batch work |
| Branch mispredicts | Data-dependent branches | Use SIMD blend/mask |

References & Manuals

Primary References

  1. Intel® 64 and IA-32 Architectures Software Developer's Manual

    • Volume 1: Basic Architecture
    • Volume 2: Instruction Set Reference
    • Volume 3: System Programming Guide
    • Intel SDM
  2. Intel® Intrinsics Guide

  3. AMD64 Architecture Programmer's Manual

  4. Agner Fog's Optimisation Manuals

    • Instruction Tables
    • Microarchitecture Guide
    • Optimising Assembly
    • Optimising C++
    • Agner's Site

Microarchitecture Resources

  1. uops.info

    • Detailed instruction latency/throughput data
    • Port usage information
    • uops.info
  2. WikiChip

    • CPU microarchitecture details
    • WikiChip
  3. Chips and Cheese

Tools

  1. Compiler Explorer (Godbolt)

  2. LLVM-MCA (Machine Code Analyzer)

    • Static analysis of assembly
    • llvm-mca -mcpu=skylake < code.s
  3. Intel SDE (Software Development Emulator)

Academic Papers

  1. SIMD Vectorisation:

    • "Automatic SIMD Vectorisation of SSA-based Control Flow Graphs" (CGO 2011)
    • "Exploiting Superword Level Parallelism with Multimedia Instruction Sets" (PLDI 2000)
  2. Performance Modeling:

    • "Roofline: An Insightful Visual Performance Model" (CACM 2009)
    • "A Stochastic Model for Superscalar Execution" (IEEE TC 2018)

Appendix A: Quick Reference Cards

SSE/AVX Register Naming

XMM0-XMM15:  128-bit (SSE, AVX lower half)
YMM0-YMM15:  256-bit (AVX, AVX2)
ZMM0-ZMM31:  512-bit (AVX-512)
K0-K7:       Opmask registers (AVX-512)

Common Instruction Prefixes

| Prefix | Meaning | Example |
|---|---|---|
| V | VEX/EVEX encoded | VADDPS vs ADDPS |
| P | Packed | PMULLD |
| S | Scalar | ADDSS |
| B | Byte | PSHUFB |
| W | Word | PMULLW |
| D | Dword | PADDD |
| Q | Qword | PADDQ |
| H | Horizontal | HADDPS |
| U | Unaligned | MOVUPS |
| A | Aligned | MOVAPS |
| NT | Non-temporal | MOVNTPS |

Intrinsic Naming Convention

_mm256_add_ps
  │    │   │
  │    │   └─ Element type (ps=packed single, pd=packed double, 
  │    │                     epi32=packed int32, si256=256-bit int)
  │    └───── Operation
  └────────── Register width (_mm=128, _mm256=256, _mm512=512)

Appendix B: Instruction Set Detection

#include <cpuid.h>
#include <stdbool.h>

typedef struct {
    bool sse, sse2, sse3, ssse3, sse41, sse42;
    bool avx, avx2;
    bool avx512f, avx512bw, avx512dq, avx512vl;
    bool avx512vnni, avx512vbmi, avx512ifma;
    bool bmi1, bmi2, popcnt, aesni, pclmul;
    bool fma, f16c, rdrand, rdseed;
} CPUFeatures;

CPUFeatures detect_features(void) {
    CPUFeatures f = {0};
    unsigned int eax, ebx, ecx, edx;
    
    __cpuid(1, eax, ebx, ecx, edx);
    f.sse    = (edx >> 25) & 1;
    f.sse2   = (edx >> 26) & 1;
    f.sse3   = (ecx >> 0) & 1;
    f.ssse3  = (ecx >> 9) & 1;
    f.sse41  = (ecx >> 19) & 1;
    f.sse42  = (ecx >> 20) & 1;
    f.aesni  = (ecx >> 25) & 1;
    f.pclmul = (ecx >> 1) & 1;
    f.avx    = (ecx >> 28) & 1;  /* NB: usable AVX/AVX-512 also requires an OSXSAVE + XGETBV (XCR0) check */
    f.fma    = (ecx >> 12) & 1;
    f.f16c   = (ecx >> 29) & 1;
    f.popcnt = (ecx >> 23) & 1;
    f.rdrand = (ecx >> 30) & 1;
    
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    f.avx2       = (ebx >> 5) & 1;
    f.bmi1       = (ebx >> 3) & 1;
    f.bmi2       = (ebx >> 8) & 1;
    f.avx512f    = (ebx >> 16) & 1;
    f.avx512dq   = (ebx >> 17) & 1;
    f.avx512ifma = (ebx >> 21) & 1;
    f.avx512bw   = (ebx >> 30) & 1;
    f.avx512vl   = (ebx >> 31) & 1;
    f.avx512vbmi = (ecx >> 1) & 1;
    f.avx512vnni = (ecx >> 11) & 1;
    f.rdseed     = (ebx >> 18) & 1;
    
    return f;
}
