# Clustering with RDMA

With RDMA and latencies as low as about 1 µs, tensor parallelism can provide a speedup.
Unfortunately, RDMA is not yet possible over the USB4/Thunderbolt 3 ports of the Strix Halo.
So we need some extra hardware: network adapters connected via PCIe that can offload this work from the CPU.

## Clustering with Oculink and PCIe 3.0 InfiniBand cards

The two Bosgame M5 PCs used for this setup have neither an Oculink port nor a PCIe slot, so we use M.2 to Oculink adapters to get a PCIe 4.0 x4 link for the NICs. Here is the hardware used for a setup with cheap used Mellanox cards; the more recent PCIe 4.0 cards are quite a bit more expensive than the older ones. The PCIe 3.0 x4 connection limits the cards to around 26 GBit/s. Not too shabby.
* 2x Strix Halo with a spare M.2 slot (tested using Bosgame M5)
* 1x ATX PC PSU (any will do; it only needs to supply about 20 Watts). I'm using a PicoPSU (20€).
* 2x Mellanox ConnectX-3 CX354A PCIe 3.0 x8 InfiniBand cards, used, 23€ each.
* 1x Mellanox 56G QSFP+ FDR InfiniBand passive copper Twinax DAC cable, 0.5m, MC2207130-00A, used, 18€ [example link](https://www.ebay.de/itm/126922287689)
* 1x ATX PSU 24-pin splitter cable [example link](https://a.aliexpress.com/_Ezm7My8) ($6 with coins)
* 2x Oculink M.2 adapter with cable and PCIe 4.0 x16 slot board [example link](https://a.aliexpress.com/_Ez9CgPK) (~$25 each with coins and coupons)

Total cost: 20€ + 46€ + 18€ + 49€ = 133€. Not bad!

What else is needed:
* a little 3D-printed custom case for the two network cards
* 2x 3D-printed lids for the SSD compartment with a hole for the Oculink cable. Or you drill a hole in the original metal lids.
* a little fan to keep the Mellanox cards cool inside the case (they use up to 10W each)
### Quick howto:
1. Connect the Oculink M.2 adapters to the empty M.2 NVMe slots (1 per PC).
2. Plug the Oculink cables into the M.2 adapters and into the PCIe 4.0 x16 slot adapters.
3. Plug the 24-pin PSU splitter cable into both PCIe 4.0 x16 slot adapters and into the PSU.
4. Plug the two Mellanox cards into the PCIe slots.
5. Connect the two Mellanox cards with the DAC cable.
6. Using the switch on the PCIe 4.0 x16 slot adapter, turn on the PSU.
7. Finally, turn on the PCs.

Check if you can see the Mellanox cards in `lspci`:
```
$ lspci
c3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
```
Make sure the NIC is connected via PCIe 3.0 x4:
```
$ sudo lspci -vv -s c3:00.0 |grep -E "LnkCap:|LnkSta:"
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
LnkSta: Speed 8GT/s, Width x4 (downgraded)
```
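The "(downgraded)" note is expected: the ConnectX-3 is a PCIe x8 card, but the M.2/Oculink path only provides four lanes, so the link trains at x4.
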
It should also appear in your dmesg, like this:
```
$ sudo dmesg |grep mlx4
[ 2.762576] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 2.762587] mlx4_core: Initializing 0000:c3:00.0
[ 2.762633] mlx4_core 0000:c3:00.0: enabling device (0000 -> 0002)
[ 9.162204] mlx4_core 0000:c3:00.0: DMFS high rate steer mode is: disabled performance optimized steering
[ 9.162913] mlx4_core 0000:c3:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:02.5 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[ 9.402996] <mlx4_ib> mlx4_ib_probe: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 9.404284] <mlx4_ib> mlx4_ib_probe: counter index 0 for port 1 allocated 0
[ 9.404286] <mlx4_ib> mlx4_ib_probe: counter index 1 for port 2 allocated 0
[ 10.781441] mlx4_core 0000:c3:00.0 ibp195s0: renamed from ib0
[ 10.781830] mlx4_core 0000:c3:00.0 ibp195s0d1: renamed from ib1
[ 12.486493] mlx4_core 0000:c3:00.0 ibp195s0d1: "NetworkManager" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.
[ 1943.886040] mlx4_core 0000:c3:00.0 ibp195s0: Port: 1 Link INIT
[ 1943.941515] mlx4_core 0000:c3:00.0 ibp195s0: Port: 1 Link ACTIVE
```
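The 31.504 Gb/s figure is simply the raw PCIe 3.0 x4 limit: 8 GT/s per lane with 128b/130b encoding is about 7.88 Gb/s per lane, times four lanes ≈ 31.5 Gb/s. After PCIe and RDMA protocol overhead, roughly 26-28 GBit/s of payload throughput remains, which is where the ~26 GBit/s estimate above comes from.
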
To enable performance optimized steering (and surrender VLAN support), edit
`/etc/modprobe.d/mlx4.conf` and add this line:
```
options mlx4_core log_num_mgm_entry_size=-7
```
as mentioned in the [driver documentation](https://doc.dpdk.org/guides/nics/mlx4.html).
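
The new option only takes effect when `mlx4_core` is (re)loaded, so after editing the file either reboot or reload the modules. A minimal sketch (module names as in the dmesg output above); if the unload fails because the driver is still in use, just reboot:

```
$ sudo modprobe -r mlx4_ib mlx4_core   # unload the IB driver and the core driver (takes the ib interfaces down)
$ sudo modprobe mlx4_ib                # load again; mlx4_core is pulled back in with the new option
```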
Install needed packages on both PCs running Fedora 43:
```
$ sudo dnf install rdma-core libibverbs-utils mstflint infiniband-diags perftest
$ ibv_devinfo
```
Look for `link_layer`; it should show InfiniBand.
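
If the full output is noisy, you can filter for the relevant fields; a quick sketch using the usual `ibv_devinfo` field names:

```
$ ibv_devinfo | grep -E "hca_id|state|link_layer"
```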
On PC1 we install and start **opensm**, the InfiniBand subnet manager:
```
$ sudo dnf install opensm
$ sudo systemctl enable --now opensm
$ sudo restorecon -v /var/log/opensm.log

$ ibstat
```
`ibstat` now shows "State: Active" on both PCs.
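
If it stays at "State: Initializing" instead, the port has a physical link but no subnet manager has configured it yet, so check that opensm is actually running on PC1.
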
PC1:
```
$ ip a|grep -B 1 infini
4: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```
PC2:
```
$ ip a|grep -B 1 infini
3: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
4: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```
So the interface name is **ibp195s0** on both PCs.
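
The name is derived from the PCIe address (bus 0xc3 = 195), so it can differ on other machines. To list the InfiniBand interfaces on your system, for example:

```
$ ls /sys/class/net/ | grep ^ib
ibp195s0
ibp195s0d1
```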
Configure IPv4 on PC1:
```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.1/24
Connection 'ib-conn' (e6655fba-ebd6-4ee5-a31b-9c25faacfe37) successfully added.
```
Configure IPv4 on PC2:
```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.2/24
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
```
PC1: (I also have a connection via Thunderbolt)
```
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  1a44c330-8d06-34d6-9773-df0a34882a4b  ethernet    eno1
ib-conn                      e6655fba-ebd6-4ee5-a31b-9c25faacfe37  infiniband  ibp195s0
thunderbolt0                 7beaa789-b367-4810-ba22-3e946edab0fd  ethernet    thunderbolt0
```
PC2:
```
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  dea9361f-0f51-3acf-9b85-04a35c116b67  ethernet    eno1
ib-conn                      5eaa86fe-99e7-48c9-b460-740d31adc936  infiniband  ibp195s0
thunderbolt0                 bd7e1a3c-f05d-3a43-bfc0-880fb874dba4  ethernet    thunderbolt0
```
Check with `ip a` whether the InfiniBand interfaces are up. If not, check on PC1 whether opensm is reporting errors.
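
A few ways to check that on PC1 (the log file is the one relabeled with `restorecon` above):

```
$ systemctl status opensm
$ sudo journalctl -u opensm
$ sudo tail /var/log/opensm.log
```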
OK, if the connection is up, we can check the bandwidth:

On PC1:
```
$ ib_write_bw
```
On PC2:
```
$ ib_write_bw 192.168.100.1
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000           3293.63             3293.56               0.052697
```
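3293 MiB/s works out to roughly 27.6 GBit/s, so we are getting most of what the PCIe 3.0 x4 link can deliver.
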
We can also check the latency:

On PC1:
```
$ ib_write_lat
```
On PC2:
```
$ ib_write_lat 192.168.100.1
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
 2       1000         1.10         2.05         1.11             1.12         0.00           1.19                  2.05
```
So around 1.12 µs, which is an expected value. Great!

Next, follow the [AMD Strix Halo RDMA Cluster Setup Guide](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md).

To be continued, it's still a work in progress.