
98d373 Blake Hamm 2026-03-17 19:06:49
feat(k8s) initial version of kubernetes docs
# Kubernetes

The Strix Halo machine works great on Kubernetes and allows containers access to the underlying hardware. This has been tested with Talos Linux using the ROCm GPU Operator.

## Talos

If you plan to use your Strix Halo machine for AI workloads, I recommend building a Talos image with this schematic:

```yaml
customization:
  extraKernelArgs:
    - amd_iommu=off
    - amdgpu.gttsize=126976
    - amdgpu.vm_fragment_size=8
    - ttm.pages_limit=32505856
    - ttm.page_pool_size=25165824
  systemExtensions:
    officialExtensions:
      - siderolabs/amd-ucode
      - siderolabs/amdgpu
      - siderolabs/thunderbolt # Optional: for Framework USB-C/Thunderbolt support
```

The AMD system extensions are required for proper drivers; the kernel args are optional and depend on your workload type (AI vs. general compute).
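To sanity-check the memory-related kernel args: `amdgpu.gttsize` is specified in MiB, while the `ttm` limits count 4 KiB pages. The GiB figures below are my own arithmetic, not from the upstream docs:

```shell
# amdgpu.gttsize is in MiB: convert to GiB
echo $(( 126976 / 1024 ))                           # 124 GiB of GTT
# ttm.pages_limit and ttm.page_pool_size count 4 KiB pages
echo $(( 32505856 * 4096 / 1024 / 1024 / 1024 ))    # 124 GiB pages limit
echo $(( 25165824 * 4096 / 1024 / 1024 / 1024 ))    # 96 GiB page pool
```

These values suggest the schematic was sized for a 128 GiB Strix Halo machine, letting the iGPU address roughly 124 GiB of system memory via GTT; scale them down accordingly for smaller memory configurations.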

More info on how to build your container image with a custom schematic can be found [here](https://docs.siderolabs.com/talos/v1.10/learn-more/image-factory).
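Once the Image Factory gives you a schematic ID, point the Talos installer at the resulting image. A sketch of the machine config fragment, assuming a placeholder `<schematic-id>` and Talos v1.10.0 (substitute your actual ID and version):

```yaml
machine:
  install:
    # Installer image produced by the Image Factory for your schematic
    image: factory.talos.dev/installer/<schematic-id>:v1.10.0
```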

Once the system boots with these arguments, you should be able to interact with the node using `talosctl`. Here is a command to verify the GPU kernel module is loaded:

```bash
# Check whether the AMD GPU kernel module is loaded
talosctl -n <node-ip> get kernelmodulestatus amdgpu
```

## ROCm GPU Operator

The AMD GPU operator gives containers access to the underlying Strix Halo hardware via resource requests/limits like so:

```yaml
resources:
  limits:
    amd.com/gpu: 1
```

To enable this, we first install the GPU operator. Then we ensure nodes are properly labeled so that the operator recognizes them. Finally, we run a test GPU workload on the Strix Halo node to confirm the GPU is accessible.

### Helm Installation

First, install the ROCm GPU Operator:

```bash
# Add Helm repo
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install operator
helm install rocm-operator rocm/gpu-operator-charts \
  -f values.yaml --namespace amd --create-namespace
```

#### Helm values

Here are the key configuration options for the Helm chart:

```yaml
# Enable Kernel Module Management (KMM) for driver management
kmm:
  enabled: true

# Install default NFD rules
installdefaultNFDRule: true

# Device configuration
deviceConfig:
  spec:
    # Driver settings - disable in-cluster driver since Talos manages it.
    # Talos already includes the AMDGPU kernel module, so we don't need
    # the operator to install drivers.
    driver:
      enable: false
      blacklist: false
```

> **Note**: Disabling `driver.enable` is important for Talos deployments, since Talos already ships the AMDGPU kernel module via the `siderolabs/amdgpu` extension. With `enable: false`, the operator uses the pre-installed kernel module instead of trying to install drivers in-cluster.

### Node Feature Discovery

Node Feature Discovery (NFD) automatically detects hardware features and labels nodes. The ROCm GPU Operator uses these labels to identify GPU nodes.

For Strix Halo machines, use this rule to detect the `amdgpu` kernel module:
```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: amdgpu-rules
spec:
  rules:
    - name: "amdgpu kernel module"
      labels:
        feature.node.kubernetes.io/amd-gpu: "true"
      matchFeatures:
        - feature: "kernel.loadedmodule"
          matchExpressions:
            amdgpu: {op: Exists}
```

```bash
# Apply the rule
kubectl apply -f nfd-rule.yaml
```

This rule automatically applies the label `feature.node.kubernetes.io/amd-gpu: "true"` to any node with the `amdgpu` kernel module loaded.

> **Strix Halo Note**: This NFD rule checks for the `amdgpu` kernel module rather than device IDs, which works reliably with the Strix Halo iGPU where traditional device-based detection may fail.

### Post-installation Verification

Verify the operator is running and GPUs are detected:

```bash
# Check operator pods
kubectl get pods -n amd

# Verify GPU resources are allocatable
kubectl describe nodes | grep -E "(amd\.com/gpu|amd-gpu)"

# Confirm GPU labels are applied
kubectl get nodes --show-labels | grep amd-gpu
```

### PyTorch GPU Test

Once the operator is deployed, you can test GPU access with a PyTorch workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-demo
  namespace: amd
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: pytorch-gpu-container
          image: rocm/pytorch:latest
          command: ["/bin/bash", "-c", "--"]
          args:
            - |
              rocm-smi
              git clone https://github.com/ROCm/pytorch-micro-benchmarking.git
              cd pytorch-micro-benchmarking
              python micro_benchmarking_pytorch.py --network resnet50 --compile
          resources:
            limits:
              amd.com/gpu: 1
```

Save this as `pytorch-demo.yaml` and apply:

```bash
kubectl apply -f pytorch-demo.yaml
```

Monitor the job:

```bash
kubectl logs -n amd job/pytorch-gpu-demo -f
```

**Expected output**: `rocm-smi` should print your GPU information (GPU name, temperature, power usage, etc.), followed by ResNet50 benchmark output reporting training throughput, which confirms the workload is running with GPU acceleration.

### Monitoring

The ROCm GPU Operator includes a metrics exporter that automatically exposes GPU metrics for Prometheus scraping. Add the following to your Helm values to enable monitoring:

```yaml
metricsExporter:
  enable: true
  prometheus:
    serviceMonitor:
      enable: true
      interval: 30s
```

When enabled, the exporter provides standard AMD GPU metrics including:

- GPU health status (`gpu_health`)
- VRAM usage (`gpu_used_vram`, `gpu_total_vram`)
- GTT (Graphics Translation Table) memory usage (`gpu_used_gtt`, `gpu_total_gtt`)
- GPU temperature and power consumption
- Per-container GPU memory allocation

These metrics can be visualized in Grafana using standard ROCm GPU dashboards. The Strix Halo iGPU exposes these metrics through the standard ROCm interface without requiring additional configuration.
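As a starting point for dashboards or alerts, PromQL expressions over the metrics above might look like this (a sketch; verify the exact metric and label names your exporter version emits):

```promql
# Fraction of VRAM in use per node
gpu_used_vram / gpu_total_vram

# Fraction of GTT memory in use - the interesting number on a
# unified-memory iGPU like Strix Halo
gpu_used_gtt / gpu_total_gtt
```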

### Additional Resources

- [ROCm GPU Operator Documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/index.html)
- [ROCm GPU Operator Helm Chart](https://github.com/ROCm/gpu-operator/tree/main/helm-charts-k8s)
- [Node Feature Discovery Documentation](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html)
- [Talos Linux Documentation](https://www.talos.dev/)