# Hardware for clustering with RDMA

## Clustering with Oculink and PCIe 3.0 Infiniband cards

The more recent PCIe 4.0 cards are quite a bit more expensive than the older ones. The PCIe 3.0 x4 connection limits the cards to around 26 Gbit/s. Not too shabby.
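
As a rough sanity check of that figure: PCIe 3.0 signals at 8 GT/s per lane with 128b/130b line encoding, so an x4 link carries about 31.5 Gbit/s raw. The ~85% payload efficiency below is an assumed ballpark for TLP and flow-control overhead, not a measured value:

```
# PCIe 3.0 x4: 8 GT/s per lane, 128b/130b line encoding, 4 lanes.
# The ~85% payload efficiency is an assumed ballpark for protocol overhead.
awk 'BEGIN {
    raw = 8 * (128 / 130) * 4
    printf "raw: %.1f Gbit/s, ~85%% usable: %.1f Gbit/s\n", raw, raw * 0.85
}'
```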

Here's some hardware used for a setup with cheap used Mellanox cards:

* 2x Strix Halo with a spare M.2 slot (tested using a Bosgame M5)
* 1x ATX PC PSU (any will do; it only needs to supply about 20 W)
* 2x Mellanox ConnectX-3 CX354A PCIe 3.0 x8 InfiniBand cards, used, 23€ each [example link](https://www.ebay.de/itm/177760210929?_skw=cx354a&epid=7043214331&itmmeta=01KK55RMHKERWWE085D2FZ4C5G&hash=item2963558ff1:g:u1MAAeSw7MRpYSrx)
* 1x Mellanox 56G QSFP+ FDR InfiniBand passive copper twinax DAC cable, 0.5 m, MC2207130-00A, used, 18€ [example link](https://www.ebay.de/itm/126922287689)
* 1x ATX PSU 24-pin splitter cable [example link](https://a.aliexpress.com/_Ezm7My8) ($6 with coins)
* 2x Oculink M.2 adapter with cable and PCIe 4.0 x16 slot [example link](https://a.aliexpress.com/_Ez9CgPK) (~$25 each with coins and coupons)

Total cost: 46€ + 18€ + 49€ = 113€. Not bad!

What else is needed:

* a small 3D-printed custom case for the two network cards
* 2x 3D-printed covers for the SSD compartment with a hole for the Oculink cable, or drill a hole in the original metal covers
* a small fan to keep the Mellanox cards cool inside the case (they draw up to 10 W each)

### Quick howto

1. Connect the Oculink M.2 adapters to the empty M.2 NVMe slots (one per PC).
2. Plug the Oculink cables into the M.2 adapters and into the PCIe 4.0 x16 slot adapters.
3. Plug the 24-pin PSU splitter cable into both PCIe 4.0 x16 slot adapters and into the PSU.
4. Plug the two Mellanox cards into the PCIe slots.
5. Connect the two Mellanox cards with the DAC cable.
6. Using the switch on the PCIe 4.0 x16 slot adapter, turn on the PSU.
7. Finally, turn on the PCs.

Check that you can see the Mellanox cards in `lspci`:

```
$ lspci
c3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
```

Make sure the NIC is connected via PCIe 3.0 x4 (the x8 card is downgraded to x4 by the M.2 slot):

```
$ sudo lspci -vv -s c3:00.0 | grep -E "LnkCap:|LnkSta:"
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
LnkSta: Speed 8GT/s, Width x4 (downgraded)
```

Install the needed packages on both PCs (here running Fedora 43):

```
$ sudo dnf install rdma-core libibverbs-utils mstflint infiniband-diags perftest
$ ibv_devinfo
```

In the `ibv_devinfo` output, look for the "link layer" field; it should show InfiniBand.
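
The same information is exposed in sysfs, so you can check it without parsing `ibv_devinfo` output. The `/sys/class/infiniband/<device>/ports/<n>/link_layer` layout is the standard kernel location; the `IB_SYSFS` override here is only a convenience for testing:

```
# Print the link layer of every InfiniBand HCA port found in sysfs.
IB_SYSFS="${IB_SYSFS:-/sys/class/infiniband}"
for f in "$IB_SYSFS"/*/ports/*/link_layer; do
    [ -e "$f" ] && echo "$f: $(cat "$f")"
done
```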

On PC1 we start **opensm**, the InfiniBand subnet manager:

```
$ sudo dnf install opensm
$ sudo systemctl enable --now opensm
$ sudo restorecon -v /var/log/opensm.log

$ ibstat
```

`ibstat` should now show "State: Active" on both PCs.

PC1:

```
$ ip a | grep -B 1 infini
4: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```

PC2:

```
$ ip a | grep -B 1 infini
3: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
4: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```

So the interface name is **ibp195s0** on both PCs.

Configure IPv4 on PC1:

```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.1/24
Connection 'ib-conn' (e6655fba-ebd6-4ee5-a31b-9c25faacfe37) successfully added.
```

Configure IPv4 on PC2:

```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.2/24
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
```

PC1 (it also has a connection via Thunderbolt):

```
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  1a44c330-8d06-34d6-9773-df0a34882a4b  ethernet    eno1
ib-conn                      e6655fba-ebd6-4ee5-a31b-9c25faacfe37  infiniband  ibp195s0
thunderbolt0                 7beaa789-b367-4810-ba22-3e946edab0fd  ethernet    thunderbolt0
```

PC2:
95
```$ sudo nmcli conn show
96
NAME UUID TYPE DEVICE
97
Kabelgebundene Verbindung 1 dea9361f-0f51-3acf-9b85-04a35c116b67 ethernet eno1
98
ib-conn 5eaa86fe-99e7-48c9-b460-740d31adc936 infiniband ibp195s0
99
thunderbolt0 bd7e1a3c-f05d-3a43-bfc0-880fb874dba4 ethernet thunderbolt0
100
```
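
A note on the MTU: in datagram mode, IPoIB is limited to a 2044-byte MTU. ConnectX-3 also supports connected mode, which allows an MTU up to 65520 and usually improves plain IP throughput; RDMA traffic (`ib_write_bw` etc.) is unaffected. A sketch of switching the existing connection over, using NetworkManager's `infiniband.transport-mode` and `infiniband.mtu` properties:

```
# Run on both PCs; assumes the ib-conn connection created above.
sudo nmcli conn modify ib-conn infiniband.transport-mode connected infiniband.mtu 65520
sudo nmcli conn up ib-conn
```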

Check with `ip a` that the InfiniBand interfaces are up. If they are not, check on PC1 whether opensm is reporting errors.

OK, if the connection is up, we can check the bandwidth.

On PC1 (server side):

```
$ ib_write_bw
```

On PC2 (client side):

```
$ ib_write_bw 192.168.100.1
#bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]    MsgRate[Mpps]
65536      5000           3293.63             3293.56                0.052697
```
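
To compare that result with the PCIe limit, convert perftest's MiB/s figure to Gbit/s (1 MiB = 2^20 bytes):

```
# 3293.56 MiB/s (the BW average above) expressed in Gbit/s.
awk 'BEGIN { printf "%.1f Gbit/s\n", 3293.56 * 2^20 * 8 / 1e9 }'
```

That works out to about 27.6 Gbit/s, close to the usable ceiling of the PCIe 3.0 x4 link.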

And we can check the latency.

On PC1 (server side):

```
$ ib_write_lat
```

On PC2 (client side):

```
$ ib_write_lat 192.168.100.1
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
2       1000         1.10         2.05         1.11             1.12         0.00           1.19                  2.05
```

So around 1.12 µs average latency, which is in the expected range for these cards.

Next, follow the [AMD Strix Halo RDMA Cluster Setup Guide](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md).