# Kubernetes

Running Kubernetes on a Strix Halo machine allows you to build a powerful node for local AI inference. Because standard Kubernetes GPU operators are designed for discrete PCIe cards, Strix Halo requires a few platform-specific tweaks to enable reliable GPU scheduling and monitoring.

This guide shows how to configure a Strix Halo system with Talos Linux, the ROCm GPU Operator, and custom Node Feature Discovery rules so Kubernetes workloads can access the iGPU cleanly.

By the end of this guide, you will have configured:

- **Optimized Node Provisioning**: A Talos Linux installation with the required AMD drivers and memory allocation kernel arguments.
- **Native GPU Scheduling**: A fully functional AMD ROCm GPU Operator that allows pods to request GPU resources (e.g., `amd.com/gpu: 1`).
- **Reliable Hardware Discovery**: Custom NFD rules tailored to detect the Strix Halo iGPU via kernel modules rather than traditional device IDs.
- **Verified Acceleration**: A successful PyTorch benchmark running on the GPU, complete with Prometheus metrics for monitoring VRAM and power usage.

## Talos

If you plan to use your Strix Halo machine for AI workloads, I recommend building a Talos image with this schematic:

```yaml
customization:
  extraKernelArgs:
    - amd_iommu=off
    - amdgpu.gttsize=126976
    - amdgpu.vm_fragment_size=8
    - ttm.pages_limit=32505856
    - ttm.page_pool_size=25165824
  systemExtensions:
    officialExtensions:
      - siderolabs/amd-ucode
      - siderolabs/amdgpu
      - siderolabs/thunderbolt # Optional: for Framework USB-C/Thunderbolt support
```

The AMD extensions are necessary to ensure you have proper drivers; the kernel args are optional and depend on your workload type (AI vs. general compute).

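As a sanity check on the memory-related arguments above (assuming `amdgpu.gttsize` is specified in MiB and the `ttm` parameters count 4 KiB pages, which is how the upstream kernel modules define them), the values are mutually consistent at roughly 124 GiB of GTT-backed memory:

```shell
# amdgpu.gttsize is in MiB: 126976 MiB == 124 GiB
echo $((126976 / 1024))                 # → 124

# ttm.pages_limit counts 4 KiB pages: 32505856 pages == 124 GiB
echo $((32505856 * 4 / 1024 / 1024))    # → 124

# ttm.page_pool_size: 25165824 pages == 96 GiB kept in the page pool
echo $((25165824 * 4 / 1024 / 1024))    # → 96
```

Tune these to your machine's RAM; the figures above target a 128 GiB Strix Halo system dedicating most of its memory to GPU workloads.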
More info on how to build your container image with a custom schematic can be found [here](https://docs.siderolabs.com/talos/v1.10/learn-more/image-factory).

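As a sketch of that flow against the public Image Factory (endpoint paths per the Image Factory docs; the version and filename in the comment are illustrative):

```shell
# Submit the schematic; the JSON response contains a schematic ID
curl -s -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

# Use the returned ID to download an installer image for your Talos version, e.g.:
#   https://factory.talos.dev/image/<schematic-id>/v1.10.0/metal-amd64.iso
```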
Once you boot the system with these arguments, you should be able to reach the node over the Talos API. Here is a command to ensure the GPU is accessible:

```bash
# Check if the AMD GPU kernel module is loaded
talosctl -n <node-ip> get kernelmodulestatus amdgpu
```

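You can also confirm that the system extensions from the schematic were baked into the running image (output columns may vary by Talos version):

```shell
# List installed system extensions; amd-ucode and amdgpu should appear
talosctl -n <node-ip> get extensions
```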
## ROCm GPU Operator

The AMD GPU Operator gives you access to the underlying Strix Halo hardware with resource requests/limits like so:

```yaml
resources:
  limits:
    amd.com/gpu: 1
```

To enable this, we first install the GPU Operator. Then we ensure nodes are properly labeled so that the operator recognizes them. Finally, we can test a GPU workload on the Strix Halo node and confirm the GPU is accessible.

### Helm Installation

First, we need to install the ROCm GPU Operator:

```bash
# Add Helm repo
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install operator
helm install rocm-operator rocm/gpu-operator-charts \
  -f values.yaml --namespace amd --create-namespace
```

#### Helm values

Here are the key configuration options for the Helm chart:

```yaml
# Enable Kernel Module Management (KMM) for driver management
kmm:
  enabled: true

# Install default NFD rules
installdefaultNFDRule: true

# Device configuration
deviceConfig:
  spec:
    # Driver settings - disable in-cluster driver since Talos manages it.
    # Talos already includes the AMDGPU kernel module, so we don't need
    # the operator to install drivers.
    driver:
      enable: false
      blacklist: false
```

> **Note**: Disabling `driver.enable` is important for Talos deployments since Talos already includes the AMDGPU kernel module via the `siderolabs/amdgpu` extension. Setting this to `false` means the operator will use the pre-installed kernel module rather than trying to install drivers in-cluster.

### Node Feature Discovery

Node Feature Discovery (NFD) automatically detects hardware features and labels nodes. The ROCm GPU Operator uses these labels to identify GPU nodes.

For Strix Halo machines, use this rule to detect the `amdgpu` kernel module:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: amdgpu-rules
spec:
  rules:
    - name: "amdgpu kernel module"
      labels:
        feature.node.kubernetes.io/amd-gpu: "true"
      matchFeatures:
        - feature: "kernel.loadedmodule"
          matchExpressions:
            amdgpu: {op: Exists}
```

```bash
# Apply rules
kubectl apply -f nfd-rule.yaml
```

This rule will automatically label any node with the `amdgpu` kernel module loaded with the label `feature.node.kubernetes.io/amd-gpu: "true"`.

> **Strix Halo Note**: This NFD rule checks for the `amdgpu` kernel module rather than device IDs, which works reliably with the Strix Halo iGPU where traditional device-based detection may fail.

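Once the rule has been applied, a quick way to confirm it matched is to list nodes by the label it sets:

```shell
# Nodes carrying the amd-gpu label set by the NFD rule above
kubectl get nodes -l feature.node.kubernetes.io/amd-gpu=true
```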
### Post-installation Verification

Verify the operator is running and GPUs are detected:

```bash
# Check operator pods
kubectl get pods -n amd

# Verify GPU resources are allocatable
kubectl describe nodes | grep -E "(amd\.com/gpu|amd-gpu)"

# Confirm GPU labels are applied
kubectl get nodes --show-labels | grep amd-gpu
```

### PyTorch GPU Test

Once the operator is deployed, you can test GPU access with a PyTorch workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-demo
  namespace: amd
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: pytorch-gpu-container
          image: rocm/pytorch:latest
          command: ["/bin/bash", "-c", "--"]
          args:
            - |
              rocm-smi
              git clone https://github.com/ROCm/pytorch-micro-benchmarking.git
              cd pytorch-micro-benchmarking
              python micro_benchmarking_pytorch.py --network resnet50 --compile
          resources:
            limits:
              amd.com/gpu: 1
```

Save this as `pytorch-demo.yaml` and apply:

```bash
kubectl apply -f pytorch-demo.yaml
```

Monitor the job:

```bash
kubectl logs -n amd job/pytorch-gpu-demo -f
```

**Expected output**: You should see `rocm-smi` displaying your GPU information (GPU name, temperature, power usage, etc.), followed by benchmark results showing the ResNet50 training progressing with GPU acceleration. If successful, you'll see lines indicating "GPU is available" and training throughput metrics.

### Monitoring

The ROCm GPU Operator includes a metrics exporter that automatically exposes GPU metrics for Prometheus scraping. Add the following to your Helm values to enable monitoring:

```yaml
metricsExporter:
  enable: true
  prometheus:
    serviceMonitor:
      enable: true
      interval: 30s
```

When enabled, the exporter provides standard AMD GPU metrics including:

- GPU health status (`gpu_health`)
- VRAM usage (`gpu_used_vram`, `gpu_total_vram`)
- GTT (Graphics Translation Table) memory usage (`gpu_used_gtt`, `gpu_total_gtt`)
- GPU temperature and power consumption
- Per-container GPU memory allocation

These metrics can be visualized in Grafana using standard ROCm GPU dashboards. The Strix Halo iGPU exposes these metrics through the standard ROCm interface without requiring additional configuration.

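As an illustration, here are a couple of PromQL expressions over the metric names listed above; the exact label sets attached to these series depend on your exporter version, so treat them as a starting point:

```promql
# Percentage of VRAM in use per GPU
100 * gpu_used_vram / gpu_total_vram

# Percentage of GTT memory in use per GPU (relevant on Strix Halo,
# where much of the unified memory is GTT-backed)
100 * gpu_used_gtt / gpu_total_gtt
```
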
### Additional Resources

- [ROCm GPU Operator Documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/index.html)
- [ROCm GPU Operator Helm Chart](https://github.com/ROCm/gpu-operator/tree/main/helm-charts-k8s)
- [Node Feature Discovery Documentation](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html)
- [Talos Linux Documentation](https://www.talos.dev/)