Clustering
The up to 128 GB of memory provided by Strix Halo systems is a lot, but sometimes you want more 😈. Luckily, Llama.cpp has an option to utilize multiple systems and pool their memory, letting you run inference with even larger models.
Networking
All Strix Halo systems have two USB4 v1 ports (40 GBit/s) that are also Thunderbolt 3 compatible. If you connect two Strix Halo systems with a Thunderbolt 3 cable, a thunderbolt-net connection should appear in NetworkManager right away. This provides around 9 GBit/s of bandwidth.
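For llama.cpp RPC you will want fixed IP addresses on the Thunderbolt interfaces. A minimal sketch using NetworkManager's nmcli, assuming the device shows up as an ordinary ethernet interface named thunderbolt0 (the 192.168.230.x addresses match the example further down; adjust to your setup):

    # on the master PC: give the Thunderbolt interface a static address
    nmcli connection add type ethernet ifname thunderbolt0 con-name tb0 \
        ipv4.method manual ipv4.addresses 192.168.230.1/24
    nmcli connection up tb0

    # on the second PC, the same with .2
    nmcli connection add type ethernet ifname thunderbolt0 con-name tb0 \
        ipv4.method manual ipv4.addresses 192.168.230.2/24
    nmcli connection up tb0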
There doesn't seem to be routing with thunderbolt-net, so with two USB4 ports per machine you can directly connect at most three Strix Halo systems for now. You can of course try other means to cluster more systems, like the Ethernet ports or InfiniBand adapters connected via the M.2 slots.
Cabling
You need cables that can handle 40 GBit/s. The cheapest cables known to work are the "UGOURD Thunderbolt 40gbps" cables, which you can usually get for less than $5 each in 0.3 m length on AliExpress. Good luck!
Note that the popular Sixunited AXB35 board has one USB4 port on the front and one on the back.
Bonding
Bonding means using multiple network interfaces together to get higher throughput, kind of like RAID for hard disks. There is a pending kernel patch that will allow bonding with thunderbolt-net; until it lands, bonding won't work.
Llama.cpp with RPC
The Llama.cpp RPC architecture is explained in the documentation. The Llama.cpp builds provided by kyuz0's toolboxes are compiled with this option enabled.
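If you compile Llama.cpp yourself, RPC support has to be enabled at build time. A sketch following the upstream RPC backend documentation (GGML_RPC is the CMake flag documented there at the time of writing; check the docs for your version):

    # build llama.cpp with the RPC backend enabled
    # add your usual GPU backend flag (e.g. -DGGML_VULKAN=ON) as well
    cmake -B build -DGGML_RPC=ON
    cmake --build build --config Release
    # llama-server and rpc-server end up in build/bin/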
Run rpc-server (part of llama.cpp) on all but one PC; it makes the local iGPU available to the master PC. Use the -c option to cache the LLM data on the local disk (in the directory ~/.cache/llama.cpp/rpc/). This speeds up subsequent invocations of the same LLM considerably.
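A sketch of starting it on a worker PC. The -c flag and port 50052 come from this page; the -H/--host and -p/--port flags are assumptions based on recent rpc-server builds, so check rpc-server --help on your version:

    # on the second PC: expose the local iGPU over the Thunderbolt network
    # binding to the Thunderbolt address keeps the RPC port off other networks
    rpc-server -c -H 192.168.230.2 -p 50052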
On the master PC, you start llama-server with the --rpc option and provide it with the addresses of the other PCs. If you use Thunderbolt networking, make sure to give the addresses of the Thunderbolt interfaces.
For example:
Master PC with thunderbolt0 interface and IPv4 address 192.168.230.1
Second PC with thunderbolt0 interface and IPv4 address 192.168.230.2
This second PC is running "rpc-server -c"
On the master PC start llama-server as usual, but add the parameter --rpc 192.168.230.2:50052. If you have three PCs, add the third PC with a comma as separator: --rpc 192.168.230.2:50052,192.168.230.3:50052.
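Putting it together, a sketch of the master-side invocation (the model path and -ngl value are placeholders; the --rpc argument follows the example above):

    # on the master PC: use the local iGPU plus the remote rpc-server
    llama-server -m /path/to/model.gguf -ngl 99 --rpc 192.168.230.2:50052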
Voilà!
vLLM
vLLM is also capable of utilizing GPUs across multiple PCs. You have to set up a Ray cluster. If you get it to work, please document it here.
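For reference, the generic multi-node recipe from the vLLM documentation looks roughly like the sketch below. It is untested on Strix Halo, and the model name and parallelism sizes are placeholders:

    # on the head node
    ray start --head --port=6379

    # on each worker node, pointing at the head node's address
    ray start --address=<head-node-ip>:6379

    # then launch vLLM on the head node, sharding the model across two machines
    vllm serve <model> --tensor-parallel-size 1 --pipeline-parallel-size 2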
Community
For further discussion join the #beyond128g Discord channel.
Created 2026-01-09 23:25:44 by Lorphos.