Commit be9f18

2026-03-07 23:06:58 Lorphos: initial version, WIP
/dev/null .. AI/Clustering with RDMA.md
@@ 0,0 1,122 @@
+ # Clustering with RDMA
+
+ ## Clustering with Oculink and PCIe 3.0 InfiniBand cards
+
+ The more recent PCIe 4.0 cards are quite a bit more expensive than the older ones. The PCIe 3.0 x4 connection limits the cards to around 26 Gbit/s (a little over 3 GByte/s). Not too shabby.
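The limit quoted above can be sanity-checked against the raw PCIe 3.0 x4 wire rate (8 GT/s per lane with 128b/130b encoding); this is generic PCIe arithmetic, not a measurement:

```shell
# Theoretical PCIe 3.0 x4 throughput: 8 GT/s per lane, 128b/130b encoding.
awk 'BEGIN {
  gbit = 8 * 128/130 * 4                      # ~31.5 Gbit/s on the wire
  printf "%.1f Gbit/s = %.2f GByte/s\n", gbit, gbit/8
}'
```

Protocol overhead (TLP headers, flow control) and the NIC's DMA engine reduce what is actually usable, which is why measured numbers come in below this ceiling.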
+
+ Here's some hardware used for a setup with cheap used Mellanox cards:
+
+ * 2x Strix Halo with a spare M.2 slot (tested using Bosgame M5)
+ * 1x ATX PC PSU (any will do, needs just 20 Watts)
+ * 2x Mellanox ConnectX-3 CX354A PCIe 3.0 x8 infiniband cards, used, 23€ each [example link](https://www.ebay.de/itm/177760210929?_skw=cx354a&epid=7043214331&itmmeta=01KK55RMHKERWWE085D2FZ4C5G&hash=item2963558ff1:g:u1MAAeSw7MRpYSrx)
+ * 1x DAC cable Mellanox 56G QSFP+ FDR InfiniBand DAC Copper Twinax Passiv 0.5m MC2207130-00A, used, 18€ [example link](https://www.ebay.de/itm/126922287689)
+ * 1x ATX PSU 24pin splitter cable [example link](https://a.aliexpress.com/_Ezm7My8) ($6 with coins)
+ * 2x Oculink M.2 adapter, cable, and PCIe 4.0 x16 slot adapter [example link](https://a.aliexpress.com/_Ez9CgPK) (~$25 each with coins and coupons)
+
+ Total cost: 46€ + 18€ + 49€ = 113€. Not bad!
+
+ What else is needed:
+
+ * a little 3d printed custom case for the two network cards
+ * 2x 3d printed covers for the SSD compartment with a hole for the Oculink cable. Or you can drill a hole in the original metal covers.
+ * a little fan to keep the Mellanox cards cool inside the case (they use up to 10W each)
+
+ ### Quick how-to
+
+ 1. Connect Oculink M.2 adapters to the empty M.2 NVMe slots (1 per PC).
+ 2. Plug Oculink cables into M.2 adapters and into PCIe 4.0 x16 slot adapters.
+ 3. Plug 24pin PSU split cable into both PCIe 4.0 x16 slot adapters and into PSU.
+ 4. Plug the two Mellanox cards into the PCIe slots.
+ 5. Connect the two Mellanox cards with the DAC cable.
+ 6. Using the switch on the PCIe 4.0 x16 slot adapter, turn on the PSU.
+ 7. Finally, turn on the PCs.
+
+ Check if you can see the Mellanox cards in `lspci`:
+
+ ```
+ $ lspci
+ c3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
+ ```
+ Make sure the NIC is connected via PCIe 3.0 x4:
+ ```
+ $ sudo lspci -vv -s c3:00.0 | grep -E "LnkCap:|LnkSta:"
+ LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
+ LnkSta: Speed 8GT/s, Width x4 (downgraded)
+ ```
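The same check can be scripted. This is a hypothetical helper, run here against the LnkSta line captured above so it works without the hardware; on the live system, feed it the real `lspci` output instead:

```shell
# Extract the negotiated link width from an lspci LnkSta line and make
# sure the link did not train below x4. The sample line is the output
# captured above; pipe in `sudo lspci -vv -s c3:00.0` on the real machine.
lnksta='LnkSta: Speed 8GT/s, Width x4 (downgraded)'
width=$(printf '%s\n' "$lnksta" | grep -oE 'Width x[0-9]+' | tr -dc '0-9')
if [ "$width" -ge 4 ]; then
  echo "negotiated width: x$width (OK)"
else
  echo "WARNING: link trained to x$width, check the Oculink cabling"
fi
```

"(downgraded)" by itself is expected here: the card is x8, the M.2 slot only provides x4.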
+ Install needed packages on both PCs running Fedora 43:
+ ```
+ $ sudo dnf install rdma-core libibverbs-utils mstflint infiniband-diags perftest
+ $ ibv_devinfo
+ ```
+ Look for "Link Layer"; it should show InfiniBand.
+ On PC1, start opensm, the InfiniBand subnet manager:
+ ```
+ $ sudo dnf install opensm
+ $ sudo systemctl enable --now opensm
+ $ sudo restorecon -v /var/log/opensm.log
+
+ $ ibstat
+ ```
+ `ibstat` now shows "State: Active" on both PCs.
+
+ PC1:
+ ```
+ $ ip a | grep -B 1 infini
+ 4: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
+ link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
+ 5: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
+ link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
+ ```
+ PC2:
+ ```
+ $ ip a | grep -B 1 infini
+ 3: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
+ link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
+ 4: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
+ link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
+ ```
+ So the interface name is **ibp195s0** on both PCs.
+
+ Configure IPv4 on PC1:
+ ```
+ $ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.1/24
+ Connection 'ib-conn' (e6655fba-ebd6-4ee5-a31b-9c25faacfe37) successfully added.
+ ```
+ Configure IPv4 on PC2:
+ ```
+ $ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.2/24
+ $ sudo nmcli conn up ib-conn
+ $ sudo nmcli conn show
+ ```
+ PC1: (I also have a connection via Thunderbolt)
+ ```
+ $ sudo nmcli conn up ib-conn
+ $ sudo nmcli conn show
+ NAME UUID TYPE DEVICE
+ Kabelgebundene Verbindung 1 1a44c330-8d06-34d6-9773-df0a34882a4b ethernet eno1
+ ib-conn e6655fba-ebd6-4ee5-a31b-9c25faacfe37 infiniband ibp195s0
+ thunderbolt0 7beaa789-b367-4810-ba22-3e946edab0fd ethernet thunderbolt0
+ ```
+ PC2:
+ ```
+ $ sudo nmcli conn show
+ NAME UUID TYPE DEVICE
+ Kabelgebundene Verbindung 1 dea9361f-0f51-3acf-9b85-04a35c116b67 ethernet eno1
+ ib-conn 5eaa86fe-99e7-48c9-b460-740d31adc936 infiniband ibp195s0
+ thunderbolt0 bd7e1a3c-f05d-3a43-bfc0-880fb874dba4 ethernet thunderbolt0
+ ```
+ Check with `ip a` whether the InfiniBand interfaces are up. If not, check on PC1 whether opensm is reporting errors.
+ Once the connection is up, we can check the bandwidth:
+ on PC1:
+ ```
+ $ ib_write_bw
+ ```
+ on PC2:
+ ```
+ $ ib_write_bw 192.168.100.1
+ #bytes  #iterations  BW peak[MiB/sec]  BW average[MiB/sec]  MsgRate[Mpps]
+ 65536   5000         3293.63           3293.56              0.052697
+ ```
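For perspective, the measured average converts to roughly 27.6 Gbit/s, i.e. most of the PCIe 3.0 x4 wire rate; a quick conversion (pure arithmetic on the numbers above):

```shell
# Convert the measured ib_write_bw average (MiB/s) to Gbit/s and compare
# it against the theoretical PCIe 3.0 x4 wire rate (~31.5 Gbit/s).
awk 'BEGIN {
  measured = 3293.56 * 1048576 * 8 / 1e9      # MiB/s -> Gbit/s
  ceiling  = 8 * 128/130 * 4                  # PCIe 3.0 x4 wire rate
  printf "%.1f Gbit/s measured, %.0f%% of the x4 wire rate\n",
         measured, 100 * measured / ceiling
}'
```

The remainder is expected protocol and DMA overhead, so the link is essentially saturated.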
+ We can also check the latency:
+ on PC1:
+ ```
+ $ ib_write_lat
+ ```
+ on PC2:
+ ```
+ $ ib_write_lat 192.168.100.1
+ #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
+ 2       1000         1.10         2.05         1.11             1.12         0.00           1.19                  2.05
+ ```
+ So around 1.12 µs, which is within the expected range.
+
+ Next, follow the [AMD Strix Halo RDMA Cluster Setup Guide](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md)
+
+ To be continued; it's still a work in progress.