98d373 Blake Hamm 2026-03-17 19:06:49
feat(k8s) initial version of kubernetes docs
# Kubernetes
Running Kubernetes on a Strix Halo machine allows you to build a powerful, cloud-native node for local AI inference. However, because standard Kubernetes GPU operators are designed for discrete PCIe cards, getting the Strix Halo iGPU properly recognized by your cluster requires specific kernel arguments and custom hardware discovery rules.

This guide walks you through configuring a Strix Halo machine for a Kubernetes environment using Talos Linux.

## What You Will Achieve

By the end of this guide, you will have configured:

- Optimized Node Provisioning: A Talos Linux installation with the required AMD drivers and memory allocation kernel arguments.
- Native GPU Scheduling: A fully functional AMD ROCm GPU Operator that allows pods to request GPU resources (e.g., `amd.com/gpu: 1`).
- Reliable Hardware Discovery: Custom NFD rules tailored to detect the Strix Halo iGPU via kernel modules rather than traditional device IDs.
- Verified Acceleration: A successful PyTorch benchmark running on the GPU, complete with Prometheus metrics for monitoring VRAM and power usage.

## Talos

If you plan to run AI workloads on your Strix Halo machine, I recommend building a Talos image with this schematic:

```yaml
customization:
  extraKernelArgs:
    - amd_iommu=off
    - amdgpu.gttsize=126976
    - amdgpu.vm_fragment_size=8
    - ttm.pages_limit=32505856
    - ttm.page_pool_size=25165824
  systemExtensions:
    officialExtensions:
      - siderolabs/amd-ucode
      - siderolabs/amdgpu
      - siderolabs/thunderbolt # Optional: for Framework USB-C/Thunderbolt support
```
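
The memory-related arguments in the schematic use different units, which makes them hard to read at a glance. As a rough sketch, assuming `amdgpu.gttsize` is expressed in MiB and the `ttm.*` values count 4 KiB pages (the usual x86-64 page size), the helpers below convert a size in GiB into each unit. The function names are hypothetical and not part of Talos. Under those assumptions, the values above correspond to 124 GiB for the GTT size and page limit, and 96 GiB for the page pool.

```bash
# Hypothetical helpers for sizing the kernel arguments; not part of Talos.
# Assumes amdgpu.gttsize is in MiB and ttm.* values are counts of 4 KiB pages.

gib_to_mib() {
  echo $(( $1 * 1024 ))
}

gib_to_pages() {
  # 1 GiB = 1024 * 1024 KiB, and one page is 4 KiB
  echo $(( $1 * 1024 * 1024 / 4 ))
}

echo "amdgpu.gttsize=$(gib_to_mib 124)"
echo "ttm.pages_limit=$(gib_to_pages 124)"
echo "ttm.page_pool_size=$(gib_to_pages 96)"
```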

The AMD extensions are required so that the node has the proper drivers. The kernel arguments are optional and depend on your workload type (AI vs. general compute).

For more information on building your image with a custom schematic, see the [Image Factory documentation](https://docs.siderolabs.com/talos/v1.10/learn-more/image-factory).

Once the system boots with these initial arguments, you should be able to access the Talos live interface. Use this command to confirm that the `amdgpu` kernel module is loaded:

```bash
# Check if AMD GPU kernel module is loaded
talosctl -n <node-ip> get kernelmodulestatus amdgpu
```
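
As a cross-check, the loaded modules can also be read straight from `/proc/modules` on the node. The sketch below assumes `talosctl read` can access that file on your node; `module_loaded` is a hypothetical helper, not a Talos command. It only parses text in the `/proc/modules` format, where the module name is the first field on each line.

```bash
# Hypothetical helper: succeeds if the named module appears in
# /proc/modules-formatted text on stdin (module name is the first field).
module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Usage (assumes the node is reachable at <node-ip>):
#   talosctl -n <node-ip> read /proc/modules | module_loaded amdgpu \
#     && echo "amdgpu loaded"
```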

## ROCm GPU Operator

The AMD GPU operator gives you access to the underlying Strix Halo hardware through resource requests/limits like so:

```yaml
resources:
  limits:
    amd.com/gpu: 1
```

To enable this, we first install the GPU operator. Then, we ensure nodes are properly labeled so that the GPU operator recognizes them. Finally, we test a GPU workload on the Strix Halo node and confirm the GPU is accessible.

### Helm Installation

First, we need to install the ROCm GPU Operator:

```bash
# Add Helm repo
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install operator
helm install rocm-operator rocm/gpu-operator-charts \
  -f values.yaml --namespace amd --create-namespace
```

#### Helm values

Here are the key configuration options for the Helm chart:

```yaml
# Enable Kernel Module Management (KMM) for driver management
kmm:
  enabled: true

# Install default NFD rules
installdefaultNFDRule: true

# Device configuration
deviceConfig:
  spec:
    # Driver settings - disable in-cluster driver since Talos manages it
    # Talos already includes the AMDGPU kernel module, so we don't need
    # the operator to install drivers
    driver:
      enable: false
      blacklist: false
```

> **Note**: Disabling `driver.enable` is important for Talos deployments since Talos already includes the AMDGPU kernel module via the `siderolabs/amdgpu` extension. Setting this to `false` means the operator will use the pre-installed kernel module rather than trying to install drivers in-cluster.

### Node Feature Discovery

Node Feature Discovery (NFD) automatically detects hardware features and labels nodes. The ROCm GPU Operator uses these labels to identify GPU nodes.

For Strix Halo machines, use this rule to detect the `amdgpu` kernel module:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: amdgpu-rules
spec:
  rules:
    - name: "amdgpu kernel module"
      labels:
        feature.node.kubernetes.io/amd-gpu: "true"
      matchFeatures:
        - feature: "kernel.loadedmodule"
          matchExpressions:
            amdgpu: {op: Exists}
```

Save the rule as `nfd-rule.yaml` and apply it:

```bash
# Apply rules
kubectl apply -f nfd-rule.yaml
```

This rule automatically applies the label `feature.node.kubernetes.io/amd-gpu: "true"` to any node that has the `amdgpu` kernel module loaded.

> **Strix Halo Note**: This NFD rule checks for the `amdgpu` kernel module rather than device IDs, which works reliably with the Strix Halo iGPU where traditional device-based detection may fail.

### Post-installation Verification

Verify the operator is running and GPUs are detected:

```bash
# Check operator pods
kubectl get pods -n amd

# Verify GPU resources are allocatable
kubectl describe nodes | grep -E "(amd\.com/gpu|amd-gpu)"

# Confirm GPU labels are applied
kubectl get nodes --show-labels | grep amd-gpu
```
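
The device plugin can take a little while to register the resource after the operator pods start, so a one-off `grep` may come back empty at first. The sketch below polls a node until it advertises at least one `amd.com/gpu`. Both function names are hypothetical, and the retry count and delay are arbitrary defaults rather than recommended values.

```bash
# Hypothetical helper: print the number of allocatable amd.com/gpu
# resources on a node (prints nothing if the resource is not registered).
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.amd\.com/gpu}'
}

# Hypothetical helper: poll until the node advertises at least one GPU.
# Arguments: node name, attempts (default 30), delay in seconds (default 10).
wait_for_gpu() {
  local node="$1" attempts="${2:-30}" delay="${3:-10}" count i
  for (( i = 0; i < attempts; i++ )); do
    count="$(gpu_allocatable "$node")"
    if [ "${count:-0}" -ge 1 ] 2>/dev/null; then
      echo "node $node advertises $count amd.com/gpu"
      return 0
    fi
    sleep "$delay"
  done
  echo "node $node never advertised amd.com/gpu" >&2
  return 1
}
```

For example, `wait_for_gpu <node-name>` returns as soon as the resource shows up, and returns a non-zero status if it never does.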

### PyTorch GPU Test

Once the operator is deployed, you can test GPU access with a PyTorch workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-demo
  namespace: amd
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: pytorch-gpu-container
          image: rocm/pytorch:latest
          command: ["/bin/bash", "-c", "--"]
          args:
            - |
              rocm-smi
              git clone https://github.com/ROCm/pytorch-micro-benchmarking.git
              cd pytorch-micro-benchmarking
              python micro_benchmarking_pytorch.py --network resnet50 --compile
          resources:
            limits:
              amd.com/gpu: 1
```

Save this as `pytorch-demo.yaml` and apply:

```bash
kubectl apply -f pytorch-demo.yaml
```

Monitor the job:

```bash
kubectl logs -n amd job/pytorch-gpu-demo -f
```
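
For scripted runs, it can be more convenient to block until the Job finishes and then print the logs in one go. The wrapper below is a small sketch built on `kubectl wait`; the function name is hypothetical and the 30-minute default timeout is an arbitrary assumption, since the image pull and benchmark may take longer on your setup. Note that waiting on `condition=complete` does not return early if the Job fails; in that case it runs until the timeout expires.

```bash
# Hypothetical wrapper: block until the demo Job completes, then print its logs.
# The default timeout is an arbitrary assumption; adjust it for your setup.
wait_for_demo_job() {
  local timeout="${1:-30m}"
  kubectl wait -n amd --for=condition=complete job/pytorch-gpu-demo \
    --timeout="$timeout" \
    && kubectl logs -n amd job/pytorch-gpu-demo
}

# Usage:
#   wait_for_demo_job        # wait up to 30 minutes
#   wait_for_demo_job 2h     # wait up to 2 hours
```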

**Expected output**: You should see `rocm-smi` display your GPU information (GPU name, temperature, power usage, etc.), followed by benchmark results showing the ResNet50 training progressing with GPU acceleration. If successful, you'll see lines indicating "GPU is available" along with training throughput metrics.

### Monitoring

The ROCm GPU Operator includes a metrics exporter that exposes GPU metrics for Prometheus scraping. Add the following to your Helm values to enable monitoring:

```yaml
metricsExporter:
  enable: true
  prometheus:
    serviceMonitor:
      enable: true
      interval: 30s
```

When enabled, the exporter provides standard AMD GPU metrics including:

- GPU health status (`gpu_health`)
- VRAM usage (`gpu_used_vram`, `gpu_total_vram`)
- GTT (Graphics Translation Table) memory usage (`gpu_used_gtt`, `gpu_total_gtt`)
- GPU temperature and power consumption
- Per-container GPU memory allocation

These metrics can be visualized in Grafana using standard ROCm GPU dashboards. The Strix Halo iGPU exposes these metrics through the standard ROCm interface without requiring additional configuration.
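
To spot-check the exporter without Grafana, you can compute VRAM utilization from its Prometheus text output. The sketch below uses the `gpu_used_vram` and `gpu_total_vram` metric names listed above; the helper name is hypothetical, and it assumes each sample line ends with the value (no trailing timestamp). How you fetch the metrics text depends on your deployment, so the block only parses whatever is piped into it. It sums across all series, so on a multi-GPU node it reports the aggregate.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print VRAM utilization as a percentage, summed across all series.
vram_percent() {
  awk '
    /^#/ { next }                                  # skip HELP/TYPE comments
    { name = $1; sub(/\{.*/, "", name) }           # drop any {label="..."} set
    name == "gpu_used_vram"  { used  += $NF }
    name == "gpu_total_vram" { total += $NF }
    END {
      if (total > 0) {
        printf "%.1f\n", 100 * used / total
      } else {
        print "no gpu_total_vram samples found" > "/dev/stderr"
        exit 1
      }
    }
  '
}
```

Because the result is a ratio, it does not depend on the unit the exporter reports the two metrics in, as long as both use the same one.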

### Additional Resources

- [ROCm GPU Operator Documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/index.html)
- [ROCm GPU Operator Helm Chart](https://github.com/ROCm/gpu-operator/tree/main/helm-charts-k8s)
- [Node Feature Discovery Documentation](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html)
- [Talos Linux Documentation](https://www.talos.dev/)