Kubernetes - blame (e4a70f) – Strix Halo Wiki

98d373	Blake Hamm	2026-03-17 19:06:49
feat(k8s) initial version of kubernetes docs

# Kubernetes

e4a70f	Blake Hamm	2026-03-18 01:27:50
feat(k8s) enhance intro paragraph

Running Kubernetes on a Strix Halo machine allows you to build a powerful, cloud-native node for local AI inference. However, because standard Kubernetes GPU operators are designed for discrete PCIe cards, getting the Strix Halo iGPU properly recognized by your cluster requires specific kernel arguments and custom hardware discovery rules.

This guide walks you through configuring a Strix Halo machine for a Kubernetes environment using Talos Linux.

## What You Will Achieve

By the end of this guide, you will have configured:

- Optimized Node Provisioning: A Talos Linux installation with the required AMD drivers and memory allocation kernel arguments.

- Native GPU Scheduling: A fully functional AMD ROCm GPU Operator that allows pods to request GPU resources (e.g., amd.com/gpu: 1).

- Reliable Hardware Discovery: Custom NFD rules tailored to detect the Strix Halo iGPU via kernel modules rather than traditional device IDs.

- Verified Acceleration: A successful PyTorch benchmark running on the GPU, complete with Prometheus metrics for monitoring VRAM and power usage.

## Talos

If you plan to use your Strix Halo machine for AI workloads, I recommend building a Talos image with this schematic:

```yaml
customization:
  extraKernelArgs:
    - amd_iommu=off
    - amdgpu.gttsize=126976
    - amdgpu.vm_fragment_size=8
    - ttm.pages_limit=32505856
    - ttm.page_pool_size=25165824
  systemExtensions:
    officialExtensions:
      - siderolabs/amd-ucode
      - siderolabs/amdgpu
      - siderolabs/thunderbolt  # Optional: for Framework USB-C/Thunderbolt support
```

The AMD extensions are required so that the node has the proper drivers. The kernel arguments are optional and depend on your workload type (AI vs. general compute).
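
The memory-related values in the schematic follow from simple unit conversions. The sketch below assumes that `amdgpu.gttsize` is expressed in MiB and that the `ttm.*` values are counts of 4 KiB pages (the usual x86-64 page size); the helper names `gtt_mib` and `ttm_pages` are hypothetical. Under those assumptions, the schematic's values correspond to a 124 GiB GTT limit and a 96 GiB page pool:

```bash
# Hypothetical helpers: convert a size in GiB to the units the kernel args use.
# Assumes amdgpu.gttsize is in MiB and ttm.* values are counts of 4 KiB pages.

gtt_mib() {
  # GiB -> MiB
  echo $(( $1 * 1024 ))
}

ttm_pages() {
  # GiB -> number of 4 KiB pages (1 GiB = 262144 pages)
  echo $(( $1 * 1024 * 1024 / 4 ))
}

echo "amdgpu.gttsize=$(gtt_mib 124)"        # amdgpu.gttsize=126976
echo "ttm.pages_limit=$(ttm_pages 124)"     # ttm.pages_limit=32505856
echo "ttm.page_pool_size=$(ttm_pages 96)"   # ttm.page_pool_size=25165824
```

To dedicate a different amount of memory to the GPU, rerun the helpers with other GiB figures and leave enough headroom for the operating system and your non-GPU workloads.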

More information on building a Talos image with a custom schematic can be found [here](https://docs.siderolabs.com/talos/v1.10/learn-more/image-factory).

Once you boot up the system with these initial arguments, you should be able to access the Talos live interface. Here is a command to ensure the GPU is accessible:

```bash
# Check if AMD GPU kernel module is loaded
talosctl -n <node-ip> get kernelmodulestatus amdgpu
```

## ROCm GPU Operator

The AMD GPU operator gives you access to the underlying Strix Halo hardware through resource requests/limits like so:

```yaml
resources:
  limits:
    amd.com/gpu: 1
```

To enable this, we first install the GPU operator. Then, we ensure nodes are properly labeled so that the operator recognizes them. Finally, we run a test GPU workload on the Strix Halo node to confirm the GPU is accessible.

### Helm Installation

First, we need to install the ROCm GPU Operator:

```bash
# Add Helm repo
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install operator
helm install rocm-operator rocm/gpu-operator-charts \
  -f values.yaml --namespace amd --create-namespace
```

#### Helm values

Here are the key configuration options for the Helm chart:

```yaml
# Enable Kernel Module Management (KMM) for driver management
kmm:
  enabled: true

# Install default NFD rules
installdefaultNFDRule: true

# Device configuration
deviceConfig:
  spec:
    # Driver settings - disable in-cluster driver since Talos manages it
    # Talos already includes the AMDGPU kernel module, so we don't need
    # the operator to install drivers
    driver:
      enable: false
      blacklist: false
```

> **Note**: Disabling `driver.enable` is important for Talos deployments since Talos already includes the AMDGPU kernel module via the `siderolabs/amdgpu` extension. Setting this to `false` means the operator will use the pre-installed kernel module rather than trying to install drivers in-cluster.

### Node Feature Discovery

Node Feature Discovery (NFD) automatically detects hardware features and labels nodes. The ROCm GPU Operator uses these labels to identify GPU nodes.

For Strix Halo machines, use this rule to detect the `amdgpu` kernel module:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: amdgpu-rules
spec:
  rules:
    - name: "amdgpu kernel module"
      labels:
        feature.node.kubernetes.io/amd-gpu: "true"
      matchFeatures:
        - feature: "kernel.loadedmodule"
          matchExpressions:
            amdgpu: {op: Exists}
```

```bash
# Apply rules
kubectl apply -f nfd-rule.yaml
```

This rule automatically applies the label `feature.node.kubernetes.io/amd-gpu: "true"` to any node that has the `amdgpu` kernel module loaded.

> **Strix Halo Note**: This NFD rule checks for the `amdgpu` kernel module rather than device IDs, which works reliably with the Strix Halo iGPU where traditional device-based detection may fail.
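
One way to confirm that the rule took effect is to list nodes by the label it applies. This is a minimal sketch: the function names `gpu_nodes` and `has_gpu_node` are hypothetical, and it assumes `kubectl` is pointed at the cluster where the rule was applied.

```bash
# Hypothetical helper: print the nodes that carry the label set by the NFD rule.
gpu_nodes() {
  kubectl get nodes -l 'feature.node.kubernetes.io/amd-gpu=true' -o name
}

# Hypothetical helper: succeed only if at least one node is labeled.
has_gpu_node() {
  [ -n "$(gpu_nodes)" ]
}
```

Calling `gpu_nodes` should print one `node/<name>` line per labeled node. An empty result suggests that NFD has not processed the rule yet or that the `amdgpu` module is not loaded on the node.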

### Post-installation Verification

Verify the operator is running and GPUs are detected:

```bash
# Check operator pods
kubectl get pods -n amd

# Verify GPU resources are allocatable
kubectl describe nodes | grep -E "(amd\.com/gpu|amd-gpu)"

# Confirm GPU labels are applied
kubectl get nodes --show-labels | grep amd-gpu
```
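
For a scripted check, the allocatable GPU count can be read directly from the node object. This is a minimal sketch with hypothetical helper names (`gpu_allocatable`, `gpu_ready`) that take a node name as the first argument; the backslash escapes the dot in the `amd.com/gpu` resource name for kubectl's JSONPath parser.

```bash
# Hypothetical helper: print the number of allocatable amd.com/gpu resources
# on the node given as $1 (prints nothing if the resource is absent).
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.amd\.com/gpu}'
}

# Hypothetical helper: succeed only if the node exposes at least one GPU.
gpu_ready() {
  count="$(gpu_allocatable "$1")"
  [ "${count:-0}" -ge 1 ]
}
```

For example, `gpu_ready my-node && echo "GPU is schedulable"` (with `my-node` standing in for your node name) prints the message only once the device plugin has registered the GPU.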

### PyTorch GPU Test

Once the operator is deployed, you can test GPU access with a PyTorch workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-demo
  namespace: amd
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: pytorch-gpu-container
          image: rocm/pytorch:latest
          command: ["/bin/bash", "-c", "--"]
          args:
            - |
              rocm-smi
              git clone https://github.com/ROCm/pytorch-micro-benchmarking.git
              cd pytorch-micro-benchmarking
              python micro_benchmarking_pytorch.py --network resnet50 --compile
          resources:
            limits:
              amd.com/gpu: 1
```

Save this as `pytorch-demo.yaml` and apply:

```bash
kubectl apply -f pytorch-demo.yaml
```

Monitor the job:

```bash
kubectl logs -n amd job/pytorch-gpu-demo -f
```

**Expected output**: You should see `rocm-smi` displaying your GPU information (GPU name, temperature, power usage, etc.), followed by benchmark results showing the ResNet50 training progressing with GPU acceleration. If successful, you'll see lines indicating "GPU is available" and training throughput metrics.
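
In automation it can be more convenient to block until the Job finishes than to follow the logs. This is a minimal sketch using `kubectl wait`; the wrapper name `wait_for_benchmark` and the 30-minute default timeout are assumptions, so adjust the timeout to your image pull time and benchmark duration.

```bash
# Hypothetical wrapper: wait for the demo Job to complete, then print the
# last lines of its log. $1 is an optional timeout (assumed default: 30m).
wait_for_benchmark() {
  wait_timeout="${1:-30m}"
  kubectl wait --for=condition=complete job/pytorch-gpu-demo \
    -n amd --timeout="$wait_timeout" \
    && kubectl logs -n amd job/pytorch-gpu-demo --tail=20
}
```

The function returns a non-zero status if the Job does not reach the `complete` condition within the timeout, which makes it usable as a gate in a CI script.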

### Monitoring

The ROCm GPU Operator includes a metrics exporter that automatically exposes GPU metrics for Prometheus scraping. Add the following to your Helm values to enable monitoring:

```yaml
metricsExporter:
  enable: true
  prometheus:
    serviceMonitor:
      enable: true
      interval: 30s
```

When enabled, the exporter provides standard AMD GPU metrics including:

- GPU health status (`gpu_health`)
- VRAM usage (`gpu_used_vram`, `gpu_total_vram`)
- GTT (Graphics Translation Table) memory usage (`gpu_used_gtt`, `gpu_total_gtt`)
- GPU temperature and power consumption
- Per-container GPU memory allocation

These metrics can be visualized in Grafana using standard ROCm GPU dashboards. The Strix Halo iGPU exposes these metrics through the standard ROCm interface without requiring additional configuration.
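
As an illustration of how these metrics might be consumed outside Grafana, the sketch below computes VRAM utilization from Prometheus text-format output read on standard input. It assumes that `gpu_used_vram` and `gpu_total_vram` appear as plain series in the same unit, that samples carry no trailing timestamp, and that a single GPU is exposed; the function name `vram_percent` is hypothetical.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# used VRAM as a percentage of total VRAM. Exits non-zero if no total is found.
vram_percent() {
  awk '
    /^#/ { next }                                   # skip HELP/TYPE comment lines
    $1 ~ /^gpu_used_vram([{]|$)/  { used  = $NF }   # last field is the sample value
    $1 ~ /^gpu_total_vram([{]|$)/ { total = $NF }
    END {
      if (total > 0) printf "%.1f\n", used / total * 100
      else exit 1
    }
  '
}
```

To use it, pipe the exporter's metrics output (for example, fetched with `curl`) into `vram_percent`. The same approach applies to `gpu_used_gtt` and `gpu_total_gtt` by swapping the metric names.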

### Additional Resources

- [ROCm GPU Operator Documentation](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/index.html)
- [ROCm GPU Operator Helm Chart](https://github.com/ROCm/gpu-operator/tree/main/helm-charts-k8s)
- [Node Feature Discovery Documentation](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html)
- [Talos Linux Documentation](https://www.talos.dev/)