# Clustering with RDMA

With RDMA and latencies as low as ~1µs, tensor parallelism can provide a speedup.
Unfortunately, RDMA is not yet possible over the USB4/Thunderbolt 3 ports of the Strix Halo.
So we need some extra hardware: network adapters connected via PCIe that offload this task from the CPU.

## Clustering with Oculink and PCIe 3.0 InfiniBand cards

The two Bosgame M5 PCs used for this setup have neither an Oculink port nor a PCIe slot, so we use M.2 to Oculink adapters to get PCIe 4.0 x4 for the NICs. Here's the hardware used for a setup with cheap used Mellanox cards; the more recent PCIe 4.0 cards are quite a bit more expensive than the older ones. The PCIe 3.0 x4 connection limits the cards to speeds of around 26GBit/s. Not too shabby.
10 | |||||||
| 11 | * 2x Strix Halo with a spare M.2 slot (tested using Bosgame M5) |
|||||||
|
12 | * 1x ATX PC PSU (any will do, needs just 20 Watts). I'm using a PicoPSU (20€). |
||||||
| 13 | * 2x Mellanox ConnectX-3 CX354A PCIe 3.0 x8 infiniband cards, used, 23€ each. |
|||||||
|
14 | * 1x DAC cable Mellanox 56G QSFP+ FDR InfiniBand DAC Copper Twinax Passiv 0.5m MC2207130-00A, used, 18€ [example link](https://www.ebay.de/itm/126922287689) |
||||||
| 15 | * 1x ATX PSU 24pin splitter cable [example link](https://a.aliexpress.com/_Ezm7My8) ($6 with coins) |
|||||||
|
16 | * 2x Oculink M.2 adapter, cable, PCIe 4.0 x16 slot [example link](https://a.aliexpress.com/_Ez9CgPK) (~$25 each with coins and coupons) |
||||||
|
17 | |||||||
|
18 | Total cost: 20€+46€+18€+49€ = 133€ Not bad! |
||||||
|
19 | |||||||
| 20 | What else is needed: |
|||||||
| 21 | ||||||||
| 22 | * a little 3d printed custom case for the two network cards |
|||||||
|
23 | * 2x 3d printed lids for the SSD compartment with a hole for the Oculink cable. Or you drill a hole in the original metal lids. |
||||||
|
24 | * a little fan to keep the Mellanox cards cool inside the case (they use up to 10W each) |
||||||

### Quick howto

1. Connect the Oculink M.2 adapters to the empty M.2 NVMe slots (one per PC).
2. Plug the Oculink cables into the M.2 adapters and into the PCIe 4.0 x16 slot adapters.
3. Plug the 24-pin PSU splitter cable into both PCIe 4.0 x16 slot adapters and into the PSU.
4. Plug the two Mellanox cards into the PCIe slots.
5. Connect the two Mellanox cards with the DAC cable.
6. Using the switch on the PCIe 4.0 x16 slot adapter, turn on the PSU.
7. Finally, turn on the PCs.

Check if you can see the Mellanox cards in `lspci`:

```
$ lspci
…
c3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
…
```
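
The bus address will differ from one system to the next; if you don't want to scan the full list, you can also filter on Mellanox's PCI vendor ID (`15b3`):

```
$ lspci -d 15b3:
```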

Make sure the NIC is connected via PCIe 3.0 x4:

```
$ sudo lspci -vv -s c3:00.0 |grep -E "LnkCap:|LnkSta:"
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
LnkSta: Speed 8GT/s, Width x4 (downgraded)
```
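
That x4 link is what caps the throughput. As a back-of-the-envelope check (my arithmetic, not from any spec sheet): PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so four lanes give

```
$ awk 'BEGIN { printf "%.1f Gbit/s\n", 8 * 4 * 128 / 130 }'
31.5 Gbit/s
```

which matches the 31.504 Gb/s the kernel reports in dmesg below; protocol overhead then brings the usable rate down to the ~26GBit/s mentioned above.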

It should also appear in your dmesg, like this:

```
$ sudo dmesg |grep mlx4
[ 2.762576] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 2.762587] mlx4_core: Initializing 0000:c3:00.0
[ 2.762633] mlx4_core 0000:c3:00.0: enabling device (0000 -> 0002)
[ 9.162204] mlx4_core 0000:c3:00.0: DMFS high rate steer mode is: disabled performance optimized steering
[ 9.162913] mlx4_core 0000:c3:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:02.5 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[ 9.402996] <mlx4_ib> mlx4_ib_probe: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 9.404284] <mlx4_ib> mlx4_ib_probe: counter index 0 for port 1 allocated 0
[ 9.404286] <mlx4_ib> mlx4_ib_probe: counter index 1 for port 2 allocated 0
[ 10.781441] mlx4_core 0000:c3:00.0 ibp195s0: renamed from ib0
[ 10.781830] mlx4_core 0000:c3:00.0 ibp195s0d1: renamed from ib1
[ 12.486493] mlx4_core 0000:c3:00.0 ibp195s0d1: "NetworkManager" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.
[ 1943.886040] mlx4_core 0000:c3:00.0 ibp195s0: Port: 1 Link INIT
[ 1943.941515] mlx4_core 0000:c3:00.0 ibp195s0: Port: 1 Link ACTIVE
```

To enable performance optimized steering (and surrender VLAN support), edit `/etc/modprobe.d/mlx4.conf` and add this line:

```
options mlx4_core log_num_mgm_entry_size=-7
```

as mentioned in the [driver documentation](https://doc.dpdk.org/guides/nics/mlx4.html).
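
The option only takes effect when the driver loads, so reload the modules or simply reboot. A sketch, assuming only `mlx4_ib` and `mlx4_core` are loaded, as in the dmesg above:

```
# mlx4_ib depends on mlx4_core, so remove it first
$ sudo modprobe -r mlx4_ib mlx4_core
$ sudo modprobe mlx4_core
```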

Install the needed packages on both PCs running Fedora 43:

```
$ sudo dnf install rdma-core libibverbs-utils mstflint infiniband-diags perftest
$ ibv_devinfo
```

In the `ibv_devinfo` output, look for "Link Layer": it should show InfiniBand.
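
To pull out just that field (the attribute is printed as `link_layer` by rdma-core), a case-insensitive grep does the job:

```
$ ibv_devinfo | grep -i link_layer
```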

On PC1 we start **opensm**, the InfiniBand subnet manager:

```
$ sudo dnf install opensm
$ sudo systemctl enable --now opensm
$ sudo restorecon -v /var/log/opensm.log

$ ibstat
```

`ibstat` now shows "State: Active" on both PCs.
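
For a quick look at just the port state and link rate (field names as `ibstat` prints them):

```
$ ibstat | grep -E "State|Rate"
```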

PC1:

```
$ ip a|grep -B 1 infini
4: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:xx:xx:xx brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```

PC2:

```
$ ip a|grep -B 1 infini
3: ibp195s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 1000
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
4: ibp195s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN group default qlen 1000
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:yy:yy:yy brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```

So the interface name is **ibp195s0** on both PCs.
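
To double-check which RDMA device backs which network interface, the `rdma` tool from iproute2 shows the mapping:

```
$ rdma link
```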

Configure IPv4 on PC1:

```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.1/24
Connection 'ib-conn' (e6655fba-ebd6-4ee5-a31b-9c25faacfe37) successfully added.
```

Configure IPv4 on PC2:

```
$ sudo nmcli conn add type infiniband con-name ib-conn ifname ibp195s0 transport-mode datagram ipv4.method manual ipv4.addresses 192.168.100.2/24
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
```

PC1 (I also have a connection via Thunderbolt):

```
$ sudo nmcli conn up ib-conn
$ sudo nmcli conn show
NAME                UUID                                  TYPE        DEVICE
Wired connection 1  1a44c330-8d06-34d6-9773-df0a34882a4b  ethernet    eno1
ib-conn             e6655fba-ebd6-4ee5-a31b-9c25faacfe37  infiniband  ibp195s0
thunderbolt0        7beaa789-b367-4810-ba22-3e946edab0fd  ethernet    thunderbolt0
```

PC2:

```
$ sudo nmcli conn show
NAME                UUID                                  TYPE        DEVICE
Wired connection 1  dea9361f-0f51-3acf-9b85-04a35c116b67  ethernet    eno1
ib-conn             5eaa86fe-99e7-48c9-b460-740d31adc936  infiniband  ibp195s0
thunderbolt0        bd7e1a3c-f05d-3a43-bfc0-880fb874dba4  ethernet    thunderbolt0
```

Check with `ip a` that the InfiniBand interfaces are up. If not, check on PC1 whether opensm is logging errors.
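
Once both interfaces are up, a plain ping over the IPoIB addresses configured above makes a quick end-to-end test (run on PC1):

```
$ ping -c 3 192.168.100.2
```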

OK, if the connection is up, we can check the bandwidth.

On PC1:

```
$ ib_write_bw
```

On PC2:

```
$ ib_write_bw 192.168.100.1
 #bytes  #iterations  BW peak[MiB/sec]  BW average[MiB/sec]  MsgRate[Mpps]
 65536   5000         3293.63           3293.56              0.052697
```
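
As a sanity check, converting that average back to line rate (assuming perftest's MiB is 2^20 bytes) lands right at the PCIe 3.0 x4 limit discussed above:

```
$ awk 'BEGIN { printf "%.1f Gbit/s\n", 3293.56 * 2^20 * 8 / 1e9 }'
27.6 Gbit/s
```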

And we can check the latency.

On PC1:

```
$ ib_write_lat
```

On PC2:

```
$ ib_write_lat 192.168.100.1
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
 2       1000         1.10         2.05         1.11             1.12         0.00           1.19                  2.05
```

So around 1.12µs, which is an expected value. Great!

Next, follow the [AMD Strix Halo RDMA Cluster Setup Guide](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md).

To be continued; this is still a work in progress.