
[Issue]: ROCm 7+ Performance regression on llama.cpp #2865

@kyuz0

Description

Problem Description

There is a considerable performance regression in llama.cpp when moving from ROCm 6.4.4 to 7.2 or to the ROCm nightly builds from TheRock.

Source: kyuz0/amd-strix-halo-toolboxes#45 (comment)

llama.cpp with ROCm 6.4.4 is faster than with 7.2, which shows the worst regression (3x slower prompt processing!), and faster than the 7 nightlies from TheRock, which are almost 2x slower than 6.4.4:

Examples:


rocm-7 nightly (TheRock)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | pp512 | 815.27 ± 7.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | tg128 | 72.97 ± 0.29 |

build: 8f91ca54e (7822)

rocm-7.2

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | pp512 | 545.11 ± 6.65 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | tg128 | 73.21 ± 0.06 |

build: 8f91ca54e (7822)

rocm-6.4.4

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | pp512 | 1648.22 ± 20.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 999 | 2048 | 1 | tg128 | 72.96 ± 0.05 |

build: 8f91ca54e (7822)

The full table, covering many model architectures and quantizations, is here: https://kyuz0.github.io/amd-strix-halo-toolboxes/
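For reference, the numbers above look like llama-bench output; a command along these lines should reproduce them. The flags are inferred from the table columns (ngl, n_ubatch, fa) and the pp512/tg128 tests, not taken from the report, and the model path is a placeholder:

```shell
# Build llama.cpp with the ROCm/HIP backend, then run llama-bench against
# the same gpt-oss 20B MXFP4 GGUF under each ROCm toolbox.
# -ngl 999 : offload all layers to the GPU
# -ub 2048 : micro-batch size (n_ubatch column)
# -fa 1    : flash attention enabled (fa column)
# -p 512   : prompt-processing test (pp512)
# -n 128   : token-generation test (tg128)
./llama-bench \
  -m /path/to/gpt-oss-20b-mxfp4.gguf \
  -ngl 999 -ub 2048 -fa 1 \
  -p 512 -n 128
```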

Operating System

Fedora 43 (6.18.3-200)

CPU

AMD Ryzen AI MAX 395+

GPU

Strix Halo gfx1151

ROCm Version

ROCm 7+

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
