# Clustering
The up to 128 GB of memory provided by Strix Halo systems is a lot, but sometimes you want more 😈. Luckily, **Llama.cpp** has an option to pool the memory of multiple systems and run inference with even larger models.
## Networking
All Strix Halo systems have two USB4 v1 ports (40 GBit/s) that are also Thunderbolt 3 compatible.
If you connect two Strix Halo systems with a Thunderbolt 3 cable, you should see a thunderbolt-net connection in NetworkManager right away. This connection provides around 9 GBit/s of bandwidth.
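
To check that the link came up, you can list the network devices (a quick sketch; the interface name varies, `thunderbolt0` matches the example further below):

```bash
# The Thunderbolt link typically shows up as an ethernet-type
# device; the name (assumed here to be thunderbolt0) may differ.
nmcli device status
ip link show thunderbolt0
```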
thunderbolt-net doesn't seem to support routing, so with two USB4 ports per system you can directly connect at most three Strix Halo systems for now. You can of course try other means to cluster more systems, such as the Ethernet ports or InfiniBand adapters connected via M.2 slots.
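
Unless something on the link hands out addresses, you will want to assign static IPs to the Thunderbolt interfaces. A minimal `nmcli` sketch, assuming the device is named `thunderbolt0` and using the address range from the example further below:

```bash
# Create a static-IP profile for the Thunderbolt interface
# (thunderbolt-net devices are managed like ethernet devices;
# use 192.168.230.2/24 on the second PC):
sudo nmcli connection add type ethernet ifname thunderbolt0 \
  con-name tb-cluster ipv4.method manual \
  ipv4.addresses 192.168.230.1/24

# Optionally verify the ~9 GBit/s figure with iperf3:
#   iperf3 -s                  # on this PC
#   iperf3 -c 192.168.230.1    # on the other PC
```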
### Cabling
You need cables that can handle 40 GBit/s. The cheapest cables known to work are "_UGOURD Thunderbolt 40gbps_" cables, usually available for less than $5 each in 0.3 m length on AliExpress. Good luck!
Note that the popular Sixunited AXB35 board has one USB4 port on the front and one on the back.
### Bonding
Bonding means combining multiple network interfaces for higher throughput, kind of like RAID for hard disks. There is a pending [kernel patch](https://lore.kernel.org/netdev/20251215121109.4042218-1-mika.westerberg@linux.intel.com/T/#t) that will allow bonding with thunderbolt-net; until it lands, bonding won't work.
## Llama.cpp with RPC
The Llama.cpp RPC architecture is [explained in the documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md). The builds of Llama.cpp provided by [kyuz0's toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) are compiled with this option _enabled_.
Run `rpc-server` (part of Llama.cpp) on all but one PC. It makes the local iGPU available to the master PC. Use the `-c` option to cache the LLM data on the local disk (in the directory `~/.cache/llama.cpp/rpc/`); this speeds up subsequent invocations of the same LLM considerably.
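
For example (a sketch; `-c` is described above, the host/port flags are per the RPC README linked above, so check `rpc-server --help` for your build):

```bash
# On each worker PC: expose the local iGPU to the master.
# -c caches tensor data under ~/.cache/llama.cpp/rpc/ so repeat
# runs of the same model start much faster.
# -H/-p set the listen address and port (50052 is the default);
# bind to the Thunderbolt address so the master can reach it.
rpc-server -c -H 192.168.230.2 -p 50052
```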
On the **master** PC, start `llama-server` with the `--rpc` option and provide it with the addresses of the other PCs. If you use Thunderbolt networking, make sure to use the addresses of the Thunderbolt interfaces.
### Example
1. Master PC with thunderbolt0 interface and IPv4 address 192.168.230.1
2. Second PC with thunderbolt0 interface and IPv4 address 192.168.230.2
This second PC is running `rpc-server -c`.
On the master PC, start `llama-server` as usual, but add the parameter `--rpc 192.168.230.2:50052`. If you have three PCs, add the third one with a comma as separator: `--rpc 192.168.230.2:50052,192.168.230.3:50052`.
Voilà!
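
Putting it together for the two-PC setup above (the model path and `-ngl` value are placeholders):

```bash
# PC 2 (192.168.230.2): offer its iGPU over RPC
rpc-server -c

# Master PC: offload layers to the local iGPU and the remote one
llama-server -m ~/models/some-huge-model.gguf -ngl 99 \
  --rpc 192.168.230.2:50052
```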
## vLLM
vLLM is also capable of utilizing GPUs across multiple PCs. You have to set up a Ray cluster. If you get it to work, **please** document it here.
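
For the adventurous, a rough, untested sketch of the usual Ray workflow (the model name is a placeholder; the addresses are the Thunderbolt ones from the example above):

```bash
# Head node (master PC):
ray start --head --port=6379

# Each worker PC, joining over the Thunderbolt link:
ray start --address=192.168.230.1:6379

# Back on the head node: shard the model across both GPUs
vllm serve some-org/some-huge-model \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2
```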
## Community
For further discussion, join the [#beyond128g](https://discord.com/channels/1384139280020148365/1455307501472976979) Discord channel.