# Hardware for clustering with RDMA

## Clustering with Oculink and PCIe 3.0 Infiniband cards

The more recent PCIe 4.0 cards are quite a bit more expensive than the older cards. The PCIe 3.0 x4 connection limits the cards to speeds of around 26 GBit/s. Not too shabby.
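As a back-of-the-envelope check on that figure (the protocol-overhead factor below is an assumption, not a measurement):

```python
# PCIe 3.0 link-rate estimate for an x4 connection.
GT_PER_S = 8e9        # PCIe 3.0 signals 8 GT/s per lane
LANES = 4             # the M.2 slot exposes x4
ENCODING = 128 / 130  # PCIe 3.0 uses 128b/130b line encoding

raw_gbit = GT_PER_S * LANES * ENCODING / 1e9
print(f"raw link rate: {raw_gbit:.1f} GBit/s")  # 31.5 GBit/s

# TLP headers, flow control and DMA overhead eat a further chunk;
# assuming roughly 17 % overhead lands right at the observed figure.
print(f"usable estimate: {raw_gbit * 0.83:.1f} GBit/s")  # ~26.2 GBit/s
```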

Here's some hardware used for a setup with cheap used Mellanox cards:

* 2x Strix Halo with a spare M.2 slot (tested using a Bosgame M5)
* 1x ATX PC PSU (any will do, it only needs to supply about 20 watts)
* 2x Mellanox ConnectX-3 CX354A PCIe 3.0 x8 InfiniBand cards, used, 23€ each [example link](https://www.ebay.de/itm/177760210929?_skw=cx354a&epid=7043214331&itmmeta=01KK55RMHKERWWE085D2FZ4C5G&hash=item2963558ff1:g:u1MAAeSw7MRpYSrx)
* 1x DAC cable, Mellanox 56G QSFP+ FDR InfiniBand passive copper twinax, 0.5 m, MC2207130-00A, used, 18€ [example link](https://www.ebay.de/itm/126922287689)
* 1x ATX PSU 24-pin splitter cable [example link](https://a.aliexpress.com/_Ezm7My8) ($6 with coins)
* 2x Oculink M.2 adapter with cable and PCIe 4.0 x16 slot [example link](https://a.aliexpress.com/_Ez9CgPK) (~$25 each with coins and coupons)

Total cost: 46€ + 18€ + 49€ = 113€. Not bad!

What else is needed:

* a little 3D-printed custom case for the two network cards
* 2x 3D-printed covers for the SSD compartment with a hole for the Oculink cable, or drill a hole in the original metal covers
* a little fan to keep the Mellanox cards cool inside the case (they draw up to 10 W each)

### Quick howto

1. Connect the Oculink M.2 adapters to the empty M.2 NVMe slots (one per PC).
2. Plug the Oculink cables into the M.2 adapters and into the PCIe 4.0 x16 slot adapters.
3. Plug the 24-pin PSU splitter cable into both PCIe 4.0 x16 slot adapters and into the PSU.
4. Plug the two Mellanox cards into the PCIe slots.
5. Connect the two Mellanox cards with the DAC cable.
6. Using the switch on the PCIe 4.0 x16 slot adapter, turn on the PSU.
7. Finally, turn on the PCs.

Check if you can see the Mellanox cards in `lspci`:

```
$ lspci
…
c3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
…
```

Make sure the NIC is connected via PCIe 3.0 x4:

```
$ sudo lspci -vv -s c3:00.0 | grep -E "LnkCap:|LnkSta:"
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
LnkSta: Speed 8GT/s, Width x4 (downgraded)
```

Install the needed packages on both PCs (running Fedora 43):

```
$ sudo dnf install rdma-core libibverbs-utils mstflint infiniband-diags perftest
$ ibv_devinfo
```
Look for "Link Layer" in the output; it should show InfiniBand.
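To check this without eyeballing the output, a small sketch like the following could grep the `link_layer` fields (the sample text is abridged and the helper name `link_layers` is made up):

```python
import re

def link_layers(devinfo_output: str) -> list[str]:
    """Return the link_layer value of every port in ibv_devinfo output."""
    return re.findall(r"link_layer:\s*(\S+)", devinfo_output)

# Abridged, illustrative ibv_devinfo output:
sample = """\
hca_id: mlx4_0
        port:   1
                state:          PORT_ACTIVE (4)
                link_layer:     InfiniBand
"""
print(link_layers(sample))  # ['InfiniBand']
```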

On PC1 we start **opensm**, the InfiniBand subnet manager:

```
$ sudo dnf install opensm
$ sudo systemctl enable --now opensm
$ sudo restorecon -v /var/log/opensm.log

$ ibstat
```

`ibstat` now shows "State: Active" on both PCs.

PC1:

```
$ ip a | grep -B 1 infini
4: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```

PC2:

```
$ ip a | grep -B 1 infini
3: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
4: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```
So the interface name is **ibp195s0** on both PCs.
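That name is no accident: with systemd's predictable interface naming, `ibp195s0` decodes as InfiniBand, PCI bus 195, slot 0, and bus 195 in decimal is exactly the `c3` (hex) we saw in `lspci`:

```python
# systemd names the device ib + p<PCI bus> + s<slot>, with the bus
# number printed in decimal rather than lspci's hex.
pci_bus_hex = "c3"  # from lspci: c3:00.0
print(f"ibp{int(pci_bus_hex, 16)}s0")  # ibp195s0
```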

Configure IPv4 on PC1:

```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.1/24
Connection 'ib-conn' (e6655fba-ebd6-4ee5-a31b-9c25faacfe37) successfully added.
```

Configure IPv4 on PC2:

```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.2/24
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
```

PC1 (this machine also has a connection via Thunderbolt):

```
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  1a44c330-8d06-34d6-9773-df0a34882a4b  ethernet    eno1
ib-conn                      e6655fba-ebd6-4ee5-a31b-9c25faacfe37  infiniband  ibp195s0
thunderbolt0                 7beaa789-b367-4810-ba22-3e946edab0fd  ethernet    thunderbolt0
```

PC2:

```
$ sudo nmcli conn show
NAME                         UUID                                  TYPE        DEVICE
Kabelgebundene Verbindung 1  dea9361f-0f51-3acf-9b85-04a35c116b67  ethernet    eno1
ib-conn                      5eaa86fe-99e7-48c9-b460-740d31adc936  infiniband  ibp195s0
thunderbolt0                 bd7e1a3c-f05d-3a43-bfc0-880fb874dba4  ethernet    thunderbolt0
```

Check with `ip a` whether the InfiniBand interfaces are up. If not, check on PC1 whether opensm is logging errors.

OK, if the connection is up, we can check the bandwidth.

On PC1:

```
$ ib_write_bw
```

On PC2:

```
$ ib_write_bw 192.168.100.1
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]    MsgRate[Mpps]
 65536      5000           3293.63             3293.56                0.052697
```
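Converted to line rate, that average is about 27.6 GBit/s, nicely consistent with what the PCIe 3.0 x4 link can deliver; the MsgRate column is simply the bandwidth divided by the 64 KiB message size:

```python
MIB = 1 << 20        # ib_write_bw reports MiB/s (2**20 bytes)
bw_mib_s = 3293.56   # BW average from the run above
msg_bytes = 65536

print(f"{bw_mib_s * MIB * 8 / 1e9:.1f} GBit/s")       # 27.6 GBit/s
print(f"{bw_mib_s * MIB / msg_bytes / 1e6:.6f} Mpps")  # 0.052697
```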

And we can check the latency.

On PC1:

```
$ ib_write_lat
```

On PC2:

```
$ ib_write_lat 192.168.100.1
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
 2       1000         1.10         2.05         1.11             1.12         0.00           1.19                  2.05
```

So around 1.12 µs, which is an expected value.

Next, follow the [AMD Strix Halo RDMA Cluster Setup Guide](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md).