39a6ca Lorphos 2026-03-08 09:47:19
additional explanations
# Clustering with RDMA
With RDMA and low latencies of around 1 µs, tensor parallelism can provide a speedup.
Unfortunately, this is not yet possible using the USB4/Thunderbolt 3 ports of the Strix Halo.
So we need some extra hardware: network adapters, connected via PCIe, that can offload this task from the CPU.

## Clustering with OCuLink and PCIe 3.0 InfiniBand cards
The two Bosgame M5 PCs used for this setup have neither an OCuLink port nor a PCIe slot, so we use M.2 to OCuLink adapters to route the PCIe 4.0 x4 link of the spare M.2 slot to the NICs. The hardware listed here is for a setup with cheap used Mellanox cards; the more recent PCIe 4.0 cards are quite a bit more expensive. Since the older cards only support PCIe 3.0, the link runs at PCIe 3.0 x4, which limits them to around 26 Gbit/s. Not too shabby.

* 2x Strix Halo with a spare M.2 slot (tested using Bosgame M5)
* 1x ATX PC PSU (any will do, needs just 20 W). I'm using a PicoPSU (20€).
* 2x Mellanox ConnectX-3 CX354A PCIe 3.0 x8 InfiniBand cards, used, 23€ each.
* 1x DAC cable: Mellanox 56G QSFP+ FDR InfiniBand passive copper twinax DAC, 0.5 m, MC2207130-00A, used, 18€ [example link](https://www.ebay.de/itm/126922287689)
* 1x ATX PSU 24-pin splitter cable [example link](https://a.aliexpress.com/_Ezm7My8) ($6 with coins)
* 2x OCuLink M.2 adapter with cable and PCIe 4.0 x16 slot adapter [example link](https://a.aliexpress.com/_Ez9CgPK) (~$25 each with coins and coupons)

Total cost: 20€ + 46€ + 18€ + 49€ = 133€. Not bad!

What else is needed:

* a little 3D-printed custom case for the two network cards
* 2x 3D-printed lids for the SSD compartment with a hole for the OCuLink cable. Alternatively, drill a hole in the original metal lids.
* a little fan to keep the Mellanox cards cool inside the case (they use up to 10 W each)

### Quick howto:
1. Connect the OCuLink M.2 adapters to the empty M.2 NVMe slots (one per PC).
2. Plug the OCuLink cables into the M.2 adapters and into the PCIe 4.0 x16 slot adapters.
3. Plug the 24-pin PSU splitter cable into both PCIe 4.0 x16 slot adapters and into the PSU.
4. Plug the two Mellanox cards into the PCIe slots.
5. Connect the two Mellanox cards with the DAC cable.
6. Using the switch on the PCIe 4.0 x16 slot adapter, turn on the PSU.
7. Finally, turn on the PCs.

Check if you can see the Mellanox cards in `lspci`:
```
$ lspci
c3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
```
Make sure the NIC is connected via PCIe 3.0 x4:
```
$ sudo lspci -vv -s c3:00.0 | grep -E "LnkCap:|LnkSta:"
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
LnkSta: Speed 8GT/s, Width x4 (downgraded)
```
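The negotiated link can also be extracted with a small filter, which is handy when checking both PCs. The following is a minimal sketch and not part of the original setup: `link_status` is a hypothetical helper name, and it assumes the `LnkSta:` line has the shape shown above. Newer `lspci` versions may add annotations such as `(ok)` after the speed, which the pattern tolerates.

```shell
# link_status: read `lspci -vv` output on stdin and print the negotiated
# PCIe speed and width, e.g. "8GT/s x4".
# Hypothetical helper; assumes a line of the form
#   LnkSta: Speed 8GT/s, Width x4 (downgraded)
link_status() {
    sed -n 's/.*LnkSta:.*Speed \([0-9.]*GT\/s\).*Width \(x[0-9]*\).*/\1 \2/p'
}

# Usage on the real hardware (the PCI address is the one from this setup):
#   sudo lspci -vv -s c3:00.0 | link_status
```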
It should also appear in your dmesg, like this:
```
$ sudo dmesg | grep mlx4
[ 2.762576] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 2.762587] mlx4_core: Initializing 0000:c3:00.0
[ 2.762633] mlx4_core 0000:c3:00.0: enabling device (0000 -> 0002)
[ 9.162204] mlx4_core 0000:c3:00.0: DMFS high rate steer mode is: disabled performance optimized steering
[ 9.162913] mlx4_core 0000:c3:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:02.5 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[ 9.402996] <mlx4_ib> mlx4_ib_probe: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 9.404284] <mlx4_ib> mlx4_ib_probe: counter index 0 for port 1 allocated 0
[ 9.404286] <mlx4_ib> mlx4_ib_probe: counter index 1 for port 2 allocated 0
[ 10.781441] mlx4_core 0000:c3:00.0 ibp195s0: renamed from ib0
[ 10.781830] mlx4_core 0000:c3:00.0 ibp195s0d1: renamed from ib1
[ 12.486493] mlx4_core 0000:c3:00.0 ibp195s0d1: "NetworkManager" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.
[ 1943.886040] mlx4_core 0000:c3:00.0 ibp195s0: Port: 1 Link INIT
[ 1943.941515] mlx4_core 0000:c3:00.0 ibp195s0: Port: 1 Link ACTIVE
```
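The 31.504 Gb/s that the kernel reports can be reproduced by hand: PCIe 3.0 runs at 8 GT/s per lane and uses 128b/130b encoding, so each lane carries 8 × 128/130 Gbit/s of data. Below is a small sketch of that arithmetic; the function name `pcie_gbps` is made up for illustration, and the tiny difference from the kernel's figures is presumably due to rounding in the kernel. Protocol overhead such as packet headers comes on top of this, so measured RDMA throughput ends up somewhat lower.

```shell
# pcie_gbps: usable PCIe data rate in Gbit/s for links with 128b/130b
# encoding (PCIe 3.0 and later).
# Arguments: per-lane rate in GT/s, number of lanes.
# Hypothetical helper name, for illustration only.
pcie_gbps() {
    awk -v rate="$1" -v lanes="$2" \
        'BEGIN { printf "%.2f\n", rate * lanes * 128 / 130 }'
}

pcie_gbps 8 4    # PCIe 3.0 x4, the link used here: prints 31.51
pcie_gbps 8 8    # PCIe 3.0 x8, what the card could do: prints 63.02
```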
<!-- this is problematic
To enable performance optimized steering (and surrender VLAN support), edit
`/etc/modprobe.d/mlx4.conf` and add this line:
```
options mlx4_core log_num_mgm_entry_size=-7
```
as mentioned in the [driver documentation](https://doc.dpdk.org/guides/nics/mlx4.html). -->
Install needed packages on both PCs running Fedora 43:
```
$ sudo dnf install rdma-core libibverbs-utils mstflint infiniband-diags perftest
$ ibv_devinfo
```
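The field to check in this output is the link layer of each port. If the output is long, a small filter can reduce it to just that. This is a sketch under the assumption that the output contains `port:` and `link_layer:` lines, as current rdma-core versions print them; `link_layers` is a hypothetical helper name.

```shell
# link_layers: read `ibv_devinfo` output on stdin and print the link layer
# of every port, e.g. "port 1: InfiniBand".
# Hypothetical helper; assumes "port: N" and "link_layer: X" lines.
link_layers() {
    awk '$1 == "port:"       { port = $2 }
         $1 == "link_layer:" { print "port " port ": " $2 }'
}

# Usage:
#   ibv_devinfo | link_layers
```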
Look for the link layer (`link_layer` in the `ibv_devinfo` output); it should show InfiniBand.

On PC1 we start **opensm**, the InfiniBand subnet manager and administrator:
```
$ sudo dnf install opensm
$ sudo systemctl enable --now opensm
$ sudo restorecon -v /var/log/opensm.log

$ ibstat
```
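To compare both machines quickly, the per-port state can be filtered out of the `ibstat` output. This is a minimal sketch, assuming the usual `Port N:` and `State:` lines printed by infiniband-diags; `port_states` is a hypothetical helper name.

```shell
# port_states: read `ibstat` output on stdin and print the logical state
# of every port, e.g. "port 1: Active".
# Hypothetical helper; assumes "Port N:" headers followed by "State: X".
port_states() {
    awk '$1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
         $1 == "State:"                   { print "port " port ": " $2 }'
}

# Usage:
#   ibstat | port_states
```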
`ibstat` should now show "State: Active" on both PCs.

PC1:
```
$ ip a | grep -B 1 infini
4: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```
PC2:
```
$ ip a | grep -B 1 infini
3: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
4: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```
So the interface name is **ibp195s0** on both PCs.

Configure IPv4 on PC1:
```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.1/24
Connection 'ib-conn' (e6655fba-ebd6-4ee5-a31b-9c25faacfe37) successfully added.
```
Configure IPv4 on PC2:
```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.2/24
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
```
PC1 (I also have a connection via Thunderbolt):
```
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  1a44c330-8d06-34d6-9773-df0a34882a4b  ethernet    eno1
ib-conn                      e6655fba-ebd6-4ee5-a31b-9c25faacfe37  infiniband  ibp195s0
thunderbolt0                 7beaa789-b367-4810-ba22-3e946edab0fd  ethernet    thunderbolt0
```
PC2:
```
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  dea9361f-0f51-3acf-9b85-04a35c116b67  ethernet    eno1
ib-conn                      5eaa86fe-99e7-48c9-b460-740d31adc936  infiniband  ibp195s0
thunderbolt0                 bd7e1a3c-f05d-3a43-bfc0-880fb874dba4  ethernet    thunderbolt0
```
Check with `ip a` whether the InfiniBand interfaces are up. If not, check on PC1 whether opensm is reporting errors, for example in `/var/log/opensm.log`.

If the connection is up, we can check the bandwidth.

On PC1:
```
$ ib_write_bw
```
On PC2:
```
$ ib_write_bw 192.168.100.1
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             3293.63            3293.56              0.052697
```
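To compare this result with link speeds quoted in Gbit/s, the value needs a unit conversion. The sketch below assumes the column really is in MiB (2^20 bytes) per second, which gives about 27.6 Gbit/s; read as decimal megabytes, the same number would be about 26.3 Gbit/s. `mib_to_gbit` is a hypothetical helper name. perftest also has a `--report_gbits` option that should print Gbit/s directly.

```shell
# mib_to_gbit: convert a throughput figure to Gbit/s (10^9 bits per second).
# Arguments: value, optional bytes per unit (default 1048576, i.e. MiB).
# Hypothetical helper name, for illustration only.
mib_to_gbit() {
    awk -v val="$1" -v unit="${2:-1048576}" \
        'BEGIN { printf "%.2f\n", val * unit * 8 / 1000000000 }'
}

mib_to_gbit 3293.56            # binary MiB/s: prints 27.63
mib_to_gbit 3293.56 1000000    # decimal MB/s: prints 26.35
```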
And we can check the latency.

On PC1:
```
$ ib_write_lat
```
On PC2:
```
$ ib_write_lat 192.168.100.1
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
2       1000         1.10         2.05         1.11             1.12         0.00           1.19                  2.05
```
So the latency is around 1.12 µs, which is an expected value. Great!

Next, follow the [AMD Strix Halo RDMA Cluster Setup Guide](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md).

To be continued; this is still a work in progress.