
[Issue]: ROCm 7+ Performance regression on llama.cpp #2865

@kyuz0

Description

Problem Description

There is a considerable performance regression in llama.cpp when moving from ROCm 6.4.4 to 7.2 or to the ROCm nightly builds from TheRock.

Source: kyuz0/amd-strix-halo-toolboxes#45 (comment)

llama.cpp with ROCm 6.4.4 is faster than with 7.2, which shows the worst regression (3x slower prompt processing!), and faster than the 7 nightlies from TheRock, which are almost 2x slower than 6.4.4:

Examples:


rocm-7 nightly (TheRock)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | pp512 | 815.27 ± 7.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | tg128 | 72.97 ± 0.29 |

build: 8f91ca54e (7822)

rocm-7.2

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | pp512 | 545.11 ± 6.65 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | tg128 | 73.21 ± 0.06 |

build: 8f91ca54e (7822)

rocm-6.4.4

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | pp512 | 1648.22 ± 20.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | tg128 | 72.96 ± 0.05 |

build: 8f91ca54e (7822)

The full table, covering many model architectures and quantizations, is here: https://kyuz0.github.io/amd-strix-halo-toolboxes/
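For reference, the numbers above look like llama-bench output; a command along these lines should reproduce them. The flags are inferred from the table columns (ngl, n_ubatch, fa) and the pp512/tg128 tests, not taken from the report, and the model path is a placeholder:

```shell
# Build llama.cpp with the ROCm/HIP backend, then run llama-bench against
# the same gpt-oss 20B MXFP4 GGUF under each ROCm toolbox.
# -ngl 999 : offload all layers to the GPU
# -ub 2048 : micro-batch size (n_ubatch column)
# -fa 1    : flash attention enabled (fa column)
# -p 512   : prompt-processing test (pp512)
# -n 128   : token-generation test (tg128)
./llama-bench \
  -m /path/to/gpt-oss-20b-mxfp4.gguf \
  -ngl 999 -ub 2048 -fa 1 \
  -p 512 -n 128
```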

Operating System

Fedora 43 (6.18.3-200)

CPU

AMD Ryzen AI MAX 395+

GPU

Strix Halo gfx1151

ROCm Version

ROCm 7+

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
