# llama.cpp with ROCm

> [!WARNING]
> This is a technical guide and assumes a certain level of technical knowledge. If there are confusing parts or you run into issues, I recommend using a strong LLM with research/grounding and reasoning abilities (e.g., Claude Sonnet 4) to assist.

While Vulkan can sometimes have faster `tg` (token generation) speeds, it can run into "GGGG" (garbled output) issues in many situations, and if you want the fastest `pp` (prompt processing) speeds, you will probably want to try the ROCm backend.

As of August 2025, the generally fastest/most stable llama.cpp ROCm combination is to:
- build llama.cpp with rocWMMA: `-DGGML_HIP_ROCWMMA_FATTN=ON`
- run llama.cpp with the environment variable `ROCBLAS_USE_HIPBLASLT=1` so rocBLAS uses hipBLASLt (see the example below)
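Putting the two together, here is a minimal sketch of what a run might look like - the model path is only a placeholder (use any local GGUF), and it assumes you already have a rocWMMA-enabled build in `build/bin`:

```
# assumes a rocWMMA-enabled build in build/bin (see the build section below);
# the model path is only a placeholder - use any local GGUF
ROCBLAS_USE_HIPBLASLT=1 build/bin/llama-bench --mmap 0 -fa 1 -m /models/gguf/llama-2-7b.Q4_K_M.gguf
```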

There are still some GPU hangs; see:
- https://github.com/ROCm/ROCm/issues/5151

If you are looking for pre-built llama.cpp ROCm binaries, first check out:
- Lemonade's [llamacpp-rocm](https://github.com/lemonade-sdk/llamacpp-rocm) - automated [builds](https://github.com/lemonade-sdk/llamacpp-rocm/releases) against the latest ROCm pre-release for gfx1151,gfx120X,gfx110X ([rocWMMA in progress](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7))
- kyuz0's [AMD Strix Halo Llama.cpp Toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) container builds
- [nix-strix-halo](https://github.com/hellas-ai/nix-strix-halo) - Nix flake

## Building llama.cpp with ROCm
If you want or need to build it yourself, you can basically just follow the [llama.cpp build guide](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hipblas):
```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# build w/o rocWMMA
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j$(nproc)

# really, you want to build w/ rocWMMA
cmake -B build -S . -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DGGML_HIP_ROCWMMA_FATTN=ON && time cmake --build build --config Release -j$(nproc)

# after about 2 minutes you should have a freshly baked llama.cpp in build/bin:
build/bin/llama-bench --mmap 0 -fa 1 -m /models/gguf/llama-2-7b.Q4_K_M.gguf
```

Of course, to build, you need some dependencies sorted.

First, you should be running the latest Linux kernel (6.16+) and linux-firmware (from git).
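A quick sanity check of what you are actually running - the kernel check is distro-agnostic, while the firmware query below assumes an Arch-style `pacman` setup (adjust for your distro):

```
# kernel version - should report 6.16 or newer
uname -r

# linux-firmware version (assumes pacman/Arch - adjust for your distro)
pacman -Qi linux-firmware | grep Version
```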
## ROCm
You'll need ROCm installed before you can build. For best performance, you'll want to use the latest ROCm/TheRock nightlies. See: [[Guides/AI-Capabilities#rocm]]

To build, you may need to make sure your environment variables are properly set. If so, take a look at [https://github.com/lhl/strix-halo-testing/blob/main/rocm-therock-env.sh](https://github.com/lhl/strix-halo-testing/blob/main/rocm-therock-env.sh) for an example of what this might look like. Change `ROCM_PATH` to whatever your ROCm path is.
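For reference, a minimal sketch of what such an env file typically contains - the `/opt/rocm` location below is an assumption; point `ROCM_PATH` at wherever your ROCm/TheRock build actually lives:

```
# example ROCm environment setup - /opt/rocm is an assumption,
# substitute your actual ROCm/TheRock install path
export ROCM_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
```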
## rocWMMA
> [!WARNING]
> As of ROCm 7.0.2+, the rocWMMA flag/path *SHOULD NOT BE USED* for Strix Halo with upstream llama.cpp - it is slower than the regular ROCm/HIP path as context depth increases, and it is not receiving any updates until a rewrite happens.

Your ROCm install probably has the rocWMMA libraries already. If not, you'll want them in your ROCm folder. This is relatively straightforward (we only need the library installed); you can refer to [https://github.com/lhl/strix-halo-testing/blob/main/arch-torch/02-build-rocwwma.sh](https://github.com/lhl/strix-halo-testing/blob/main/arch-torch/02-build-rocwwma.sh) for an example of building it.
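A quick way to check whether rocWMMA is already present - both `ROCM_PATH` and the header location shown are assumptions (the usual layout), so adjust if your install differs:

```
# look for the rocWMMA headers under your ROCm install
# (ROCM_PATH and this header path are assumptions - adjust if your layout differs)
ls "$ROCM_PATH"/include/rocwmma/rocwmma.hpp
```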
## 2025-10-31 rocWMMA
If you are building your own rocWMMA-enabled llama.cpp, be sure to take a look at [llama-cpp-fix-wmma](https://github.com/lhl/strix-halo-testing/tree/main/llama-cpp-fix-wmma) - there is a [rocm-wmma-tune branch](https://github.com/lhl/llama.cpp/tree/rocm-wmma-tune) that performs significantly better at longer context depths (a build sketch follows the writeup link below).
- Fullest writeup with all relevant links is here: https://www.reddit.com/r/LocalLLaMA/comments/1ok7hd4/faster_llamacpp_rocm_performance_for_amd_rdna3/
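A sketch of what building from that branch might look like - the checkout directory name is arbitrary, and the cmake flags simply mirror the rocWMMA build shown above:

```
# sketch: build llama.cpp from the rocm-wmma-tune branch
# (directory name is arbitrary; flags mirror the rocWMMA build above)
git clone -b rocm-wmma-tune https://github.com/lhl/llama.cpp llama.cpp-wmma-tune
cd llama.cpp-wmma-tune
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```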