# Kubernetes

Running Kubernetes on a Strix Halo machine allows you to build a powerful node for local AI inference. Because standard Kubernetes GPU operators are designed for discrete PCIe cards, Strix Halo requires a few platform-specific tweaks to enable reliable GPU scheduling and monitoring.

This guide shows how to configure a Strix Halo system with Talos Linux, the ROCm GPU Operator, and custom Node Feature Discovery rules so Kubernetes workloads can access the iGPU cleanly.

By the end of this guide, you will have configured:

- **Optimized Node Provisioning**: A Talos Linux installation with the required AMD drivers and memory allocation kernel arguments.
- **Native GPU Scheduling**: A fully functional AMD ROCm GPU Operator that allows pods to request GPU resources (e.g., `amd.com/gpu: 1`).
- **Reliable Hardware Discovery**: Custom NFD rules tailored to detect the Strix Halo iGPU via kernel modules rather than traditional device IDs.
- **Verified Acceleration**: A successful PyTorch benchmark running on the GPU, complete with Prometheus metrics for monitoring VRAM and power usage.

## Talos

If you plan to use your Strix Halo machine for AI workloads, I recommend building a Talos image with this schematic:

```yaml
customization:
  extraKernelArgs:
    - amd_iommu=off
    - amdgpu.gttsize=126976
    - amdgpu.vm_fragment_size=8
    - ttm.pages_limit=32505856
    - ttm.page_pool_size=25165824
  systemExtensions:
    officialExtensions:
      - siderolabs/amd-ucode
      - siderolabs/amdgpu
      - siderolabs/thunderbolt # Optional: for Framework USB-C/Thunderbolt support
```

The AMD extensions are required to get proper drivers, but the kernel args are optional and depend on your workload type (AI vs. general compute).
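
The memory-related kernel args all express the same budget in different units: `amdgpu.gttsize` is in MiB, while the `ttm.*` values count 4 KiB pages. As a sanity check, the values in the schematic above (which appear sized for a 128 GiB machine) work out like this:

```shell
# amdgpu.gttsize is in MiB: 126976 MiB = 124 GiB of GTT
echo $((126976 / 1024))                # prints 124
# ttm.pages_limit counts 4 KiB pages: also 124 GiB
echo $((32505856 * 4 / 1024 / 1024))   # prints 124
# ttm.page_pool_size: a 96 GiB page pool
echo $((25165824 * 4 / 1024 / 1024))   # prints 96
```

Note that `gttsize` and `pages_limit` agree at 124 GiB; if you resize one for your own memory configuration, it seems prudent to keep the others consistent.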

More info on how to build your container image with a custom schematic can be found [here](https://docs.siderolabs.com/talos/v1.10/learn-more/image-factory).

Once you boot the system with these arguments, you should be able to interact with the node via `talosctl`. Here is a command to verify the GPU is accessible:

```bash
# Check if the AMD GPU kernel module is loaded
talosctl -n <node-ip> get kernelmodulestatus amdgpu
```

## ROCm GPU Operator

The AMD GPU Operator gives you access to the underlying Strix Halo hardware with resource requests/limits like so:

```yaml
resources:
  limits:
    amd.com/gpu: 1
```

To enable this, we first install the GPU Operator. Then, we ensure nodes are properly labeled so that the operator recognizes them. Finally, we test a GPU workload on the Strix Halo node and confirm the GPU is accessible.

### Helm Installation

First, we need to install the ROCm GPU Operator:

```bash
# Add Helm repo
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install operator
helm install rocm-operator rocm/gpu-operator-charts \
  -f values.yaml --namespace amd --create-namespace
```

#### Helm values

Here are the key configuration options for the Helm chart:

```yaml
# Enable Kernel Module Management (KMM) for driver management
kmm:
  enabled: true

# Install default NFD rules
installdefaultNFDRule: true

# Device configuration
deviceConfig:
  spec:
    # Driver settings - disable in-cluster driver since Talos manages it.
    # Talos already includes the AMDGPU kernel module, so we don't need
    # the operator to install drivers.
    driver:
      enable: false
      blacklist: false
```

> **Note**: Disabling `driver.enable` is important for Talos deployments since Talos already includes the AMDGPU kernel module via the `siderolabs/amdgpu` extension. Setting this to `false` means the operator will use the pre-installed kernel module rather than trying to install drivers in-cluster.

### Node Feature Discovery

Node Feature Discovery (NFD) automatically detects hardware features and labels nodes. The ROCm GPU Operator uses these labels to identify GPU nodes.

For Strix Halo machines, use this rule to detect the `amdgpu` kernel module:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: amdgpu-rules
spec:
  rules:
    - name: "amdgpu kernel module"
      labels:
        feature.node.kubernetes.io/amd-gpu: "true"
      matchFeatures:
        - feature: "kernel.loadedmodule"
          matchExpressions:
            amdgpu: {op: Exists}
```

```bash
# Apply the rule
kubectl apply -f nfd-rule.yaml
```

This rule automatically labels any node that has the `amdgpu` kernel module loaded with `feature.node.kubernetes.io/amd-gpu: "true"`.

> **Strix Halo Note**: This NFD rule checks for the `amdgpu` kernel module rather than device IDs, which works reliably with the Strix Halo iGPU where traditional device-based detection may fail.

### Post-installation Verification

Verify the operator is running and GPUs are detected:

```bash
# Check operator pods
kubectl get pods -n amd

# Verify GPU resources are allocatable
kubectl describe nodes | grep -E "(amd\.com/gpu|amd-gpu)"

# Confirm GPU labels are applied
kubectl get nodes --show-labels | grep amd-gpu
```

### PyTorch GPU Test

Once the operator is deployed, you can test GPU access with a PyTorch workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-demo
  namespace: amd
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: pytorch-gpu-container
          image: rocm/pytorch:latest
          command: ["/bin/bash", "-c", "--"]
          args:
            - |
              rocm-smi
              git clone https://github.com/ROCm/pytorch-micro-benchmarking.git
              cd pytorch-micro-benchmarking
              python micro_benchmarking_pytorch.py --network resnet50 --compile
          resources:
            limits:
              amd.com/gpu: 1
```

Save this as `pytorch-demo.yaml` and apply it:

```bash
kubectl apply -f pytorch-demo.yaml
```

Monitor the job:

```bash
kubectl logs -n amd job/pytorch-gpu-demo -f
```

**Expected output**: You should see `rocm-smi` display your GPU information (GPU name, temperature, power usage, etc.), followed by benchmark results showing the ResNet50 training progressing with GPU acceleration, including training throughput metrics.
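
If the full benchmark is more than you need, a minimal smoke test can confirm scheduling works. This one-shot pod sketch (the name `rocm-smi-check` is illustrative; the image and resource request match the Job above) just runs `rocm-smi` and exits:

```yaml
# Minimal GPU smoke test: schedules onto a GPU node and prints rocm-smi output
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smi-check   # illustrative name
  namespace: amd
spec:
  restartPolicy: Never
  containers:
    - name: rocm-smi
      image: rocm/pytorch:latest
      command: ["rocm-smi"]
      resources:
        limits:
          amd.com/gpu: 1
```

Check the result with `kubectl logs -n amd rocm-smi-check` once the pod completes.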

### Monitoring

The ROCm GPU Operator includes a metrics exporter that automatically exposes GPU metrics for Prometheus scraping. Add the following to your Helm values to enable monitoring:

```yaml
metricsExporter:
  enable: true
  prometheus:
    serviceMonitor:
      enable: true
      interval: 30s
```

When enabled, the exporter provides standard AMD GPU metrics including:

- GPU health status (`gpu_health`)
- VRAM usage (`gpu_used_vram`, `gpu_total_vram`)
- GTT (Graphics Translation Table) memory usage (`gpu_used_gtt`, `gpu_total_gtt`)
- GPU temperature and power consumption
- Per-container GPU memory allocation

These metrics can be visualized in Grafana using standard ROCm GPU dashboards. The Strix Halo iGPU exposes these metrics through the standard ROCm interface without requiring additional configuration.
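
Beyond dashboards, these metrics can drive alerts. Here is a hypothetical `PrometheusRule` sketch (it assumes the Prometheus Operator is installed and uses the metric names listed above; the rule name and threshold are illustrative):

```yaml
# Hypothetical alert: fire when a GPU's VRAM stays over 90% full for 10 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: amd-gpu-alerts   # illustrative name
  namespace: amd
spec:
  groups:
    - name: amd-gpu
      rules:
        - alert: GPUVRAMNearlyFull
          expr: gpu_used_vram / gpu_total_vram > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU VRAM usage above 90%"
```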

### Additional Resources

- [ROCm GPU Operator Documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/index.html)
- [ROCm GPU Operator Helm Chart](https://github.com/ROCm/gpu-operator/tree/main/helm-charts-k8s)
- [Node Feature Discovery Documentation](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html)
- [Talos Linux Documentation](https://www.talos.dev/)