# Clustering

The up to 128 GB of memory that Strix Halo systems provide is a lot, but sometimes you want more 😈. Luckily, **Llama.cpp** has an option to use multiple systems and combine their memory to run inference with even larger models.

## Networking

All Strix Halo systems have two USB4 ports that are also Thunderbolt 3 compatible (40 GBit/s).
If you connect Strix Halo systems using a Thunderbolt 3 cable, you should see a thunderbolt-net connection in NetworkManager right away. This will provide around 9 GBit/s of bandwidth.

There doesn't seem to be routing with thunderbolt-net, so with two ports per system you can connect up to three Strix Halo systems this way for now. You can of course try other means to cluster more systems, such as the Ethernet ports.
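To put the roughly 9 GBit/s quoted above into perspective, here is a back-of-the-envelope estimate of how long it takes to push a given amount of data over the link. The 60 GB figure is a made-up example value, and the calculation ignores protocol overhead, so treat the result as a lower bound rather than a measurement.

```shell
LINK_GBIT=9   # approximate thunderbolt-net throughput quoted above
DATA_GB=60    # hypothetical amount of data to move, in gigabytes

# Gigabytes to gigabits (x8), divided by the link speed.
# Shell arithmetic is integer-only, so the result is rounded down.
XFER_SECONDS=$(( DATA_GB * 8 / LINK_GBIT ))

echo "about ${XFER_SECONDS} s to move ${DATA_GB} GB at ${LINK_GBIT} GBit/s"
```

To see what your own link actually delivers, measure it with a tool such as `iperf3` and plug that number in as `LINK_GBIT`.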

### Cabling

You need a cable that can handle 40 GBit/s. The cheapest cables known to work are "_UGOURD Thunderbolt 40gbps_", which you can usually get from AliExpress for less than $5 each at 0.3 m length. Good luck!

Note that the [[Bosgame M5|Hardware/PCs/Bosgame_M5]] has one USB4 port on the front and one on the back.

### Bonding

Bonding means combining multiple network interfaces to get higher throughput, kind of like RAID for hard disks. There is a pending kernel patch that will allow bonding with thunderbolt-net. Until it lands, bonding won't work.

## Llama.cpp with RPC

The Llama.cpp RPC architecture is [explained in the documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md). The versions of Llama.cpp provided by **kyuz0**'s toolboxes are compiled with this option enabled.

Run `rpc-server` (part of Llama.cpp) on all but one PC. It makes that PC's iGPU available to the master system. Use the `-c` option to make it cache the LLM data on the local disk (in the directory `~/.cache/llama.cpp/rpc/`). This speeds things up considerably on subsequent invocations of the same LLM.

On the **master** PC, start `llama-server` with the `--rpc` option and provide it with the addresses of the other PCs. If you use Thunderbolt networking, make sure to give the addresses of the thunderbolt interfaces.
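The worker side described above could look roughly like the following sketch. The address and port are the ones used in the example on this page. The `-H` and `-p` options are an assumption about your setup: upstream `rpc-server` listens only on `127.0.0.1` unless told otherwise, so binding it to the thunderbolt address may be necessary. If plain `rpc-server -c` already accepts remote connections in your build, you can leave them out.

```shell
# Address of this worker's thunderbolt interface (value from the example on this page).
WORKER_ADDR="192.168.230.2"
RPC_PORT=50052   # port used in the example on this page

# -c caches the LLM data on the local disk,
# -H and -p select the address and port to listen on (assumed to be needed, see above).
WORKER_CMD="rpc-server -c -H ${WORKER_ADDR} -p ${RPC_PORT}"

# Printed rather than executed, so the command can be checked before running it.
echo "$WORKER_CMD"
```

Run the printed command on every worker, each with its own address. The RPC protocol is unauthenticated, so only bind it to a link you trust, such as the direct thunderbolt connection.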

### For example:

- Main PC with thunderbolt0 interface and IPv4 address 192.168.230.1

- Second PC with thunderbolt0 interface and IPv4 address 192.168.230.2

This second PC is running `rpc-server -c`.

On the main PC start `llama-server` as usual, but add the parameter `--rpc 192.168.230.2:50052`. If you have three PCs, add the third PC with a comma as separator: `--rpc 192.168.230.2:50052,192.168.230.3:50052`.
Voilà!
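With more than one worker, the comma-separated `--rpc` value is easy to get wrong by hand. A minimal sketch that builds it from a list of worker addresses, assuming the addresses and port from the example above; the model path is a hypothetical placeholder, so substitute whatever you normally pass to `llama-server`.

```shell
# Thunderbolt addresses of the workers running rpc-server (from the example above).
WORKERS="192.168.230.2 192.168.230.3"
RPC_PORT=50052

# Hypothetical model path, replace with your own.
MODEL="$HOME/models/your-model.gguf"

# Join the "address:port" entries with commas to form the --rpc argument.
RPC_ARG=""
for host in $WORKERS; do
  RPC_ARG="${RPC_ARG:+$RPC_ARG,}${host}:${RPC_PORT}"
done

# Printed rather than executed, so the command can be checked before running it.
echo "llama-server -m $MODEL --rpc $RPC_ARG"
```

Add any other `llama-server` options you usually use to the printed command and run it on the main PC.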

## vLLM

vLLM is also capable of utilizing GPUs across multiple PCs. You have to set up a Ray cluster. If you get it to work, **please** document it here.

## Community

For further discussion, join the [[#beyond128g|https://discord.com/channels/1384139280020148365/1455307501472976979]] Discord channel.