# Fine-tuning eval

If you're fine-tuning an LLM on Strix Halo with the Hugging Face Trainer and you've left `evaluation_strategy="steps"` enabled (`eval_strategy` in newer Transformers releases), every eval will peg all 32 logical cores at 100% for 5–10 minutes while the GPU sits at ~30–40% utilization. The training itself is fine; the eval is what breaks.

This is a kernel-level issue, not a Trainer bug. The practical fix is to skip eval or move it out-of-process.

## Symptom

During `Trainer.evaluate()`:

- All CPU cores pegged at 100%, sustained
- GPU at modest utilization (30–40%); eval is CPU-bound, not GPU-bound
- `perf top` shows `flush_tlb_func` + `call_function_single_interrupt` dominating
- Training steps don't trigger it; only eval

A 5–10 minute eval after every checkpoint makes long training runs infeasible: you spend more wall-clock time in eval than in training.

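If `perf` is not at hand, the `TLB` row of `/proc/interrupts` (the per-CPU TLB shootdown counters on x86 Linux) gives a rough second opinion. The sketch below is a minimal, unofficial helper: the function names are invented for illustration, and it assumes the usual row layout of a label, one counter per CPU, then a text description.

```python
import time


def parse_tlb_shootdowns(interrupts_text: str) -> list[int]:
    """Return the per-CPU counters from the 'TLB:' row of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TLB:":
            counts = []
            for token in fields[1:]:
                # Counters come first; the trailing words are the description.
                if not token.isdigit():
                    break
                counts.append(int(token))
            return counts
    return []  # row not present (non-x86 kernel or unexpected format)


def shootdown_rates(before: list[int], after: list[int], seconds: float) -> list[float]:
    """Per-CPU shootdowns per second between two samples."""
    return [(b - a) / seconds for a, b in zip(before, after)]


def sample_shootdown_rates(interval: float = 1.0,
                           path: str = "/proc/interrupts") -> list[float]:
    """Read the counters twice, `interval` seconds apart, and return the rates."""
    with open(path) as f:
        before = parse_tlb_shootdowns(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_tlb_shootdowns(f.read())
    return shootdown_rates(before, after, interval)
```

Calling `sample_shootdown_rates(1.0)` once during training steps and once during `Trainer.evaluate()` should make the difference visible if shootdowns are the bottleneck.
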
## Why

Strix Halo's `/proc/cpuinfo` does not list the `invlpgb` flag, despite Zen 5 architecturally supporting it. Without `INVLPGB`, every TLB shootdown the kernel issues goes out as per-CPU IPIs. The kernel's `mm/vmscan.c` reclaim path makes this worse: it batches dirty folios into pagevecs of size `PAGEVEC_SIZE = 31` and fires `try_to_unmap_flush_dirty` on every batch, so the page churn during eval-mode forward passes generates thousands of IPIs/sec/core. All cores spend their cycles servicing each other's shootdowns instead of doing useful work.

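Checking for the flag on your own machine takes a few lines. This is a sketch only: `cpu_flags` and `has_invlpgb` are hypothetical helper names, and it assumes the standard `flags : ...` line in `/proc/cpuinfo`. A missing flag can also just mean the running kernel does not know about the feature, so a negative result is not proof about the silicon.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found


def has_invlpgb(cpuinfo_text: str) -> bool:
    """True if the kernel reports the invlpgb feature flag."""
    return "invlpgb" in cpu_flags(cpuinfo_text)


def check_this_machine(path: str = "/proc/cpuinfo") -> bool:
    """Convenience wrapper that reads the live file."""
    with open(path) as f:
        return has_invlpgb(f.read())
```

`check_this_machine()` returns `False` on a box that matches the description above.
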
Whether `INVLPGB` is intentionally not exposed on Strix Halo silicon, or whether it's a kernel/microcode detection gap, is still open. It is filed as a question to AMD's kernel team at [ROCm/ROCm#6297](https://github.com/ROCm/ROCm/issues/6297); if you have an AMD contact, point them at it.

> [!NOTE]
> The same TLB-shootdown pattern hits any heavy-page-churn workload on Strix Halo, just less obviously. Eval is the most reliable trigger because the Trainer process churns through a lot of dirty pages in a short window.

## Workaround — out-of-process eval via `llama-perplexity`

`llama-perplexity` (part of `llama.cpp`) loads the model fresh per invocation rather than holding it in a long-lived Python process that accumulates dirty pages. No accumulated pressure, no storm. The pattern:

1. In your training script, disable in-process eval entirely: `evaluation_strategy="no"`
2. After every `save_steps` checkpoint, convert the LoRA adapter to GGUF (or merge into base and convert the merged model)
3. Run `llama-perplexity -m <base.gguf> --lora <adapter.gguf> -f <eval.txt>` against your held-out eval set
4. Parse the perplexity number out of the output and append it to your own `eval_history.jsonl`

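Steps 3 and 4 can be sketched roughly as follows. This is not the reference implementation; the helper names and the record fields are assumptions made for illustration. It also assumes that `llama-perplexity` is on `PATH` and that its summary line looks like `Final estimate: PPL = 6.1234 +/- 0.04321`, which is worth checking against the build you actually run.

```python
import json
import re
import subprocess
import time

# Assumed summary format: "Final estimate: PPL = 6.1234 +/- 0.04321"
PPL_RE = re.compile(r"Final estimate: PPL = ([0-9]+(?:\.[0-9]+)?)")


def build_command(base_gguf: str, adapter_gguf: str, eval_txt: str,
                  binary: str = "llama-perplexity") -> list[str]:
    """Build the argv for step 3, using the flags shown in the list above."""
    return [binary, "-m", base_gguf, "--lora", adapter_gguf, "-f", eval_txt]


def parse_perplexity(output: str) -> float:
    """Extract the final perplexity estimate from the tool's text output."""
    match = PPL_RE.search(output)
    if match is None:
        raise ValueError("no perplexity estimate found in llama-perplexity output")
    return float(match.group(1))


def append_eval_record(history_path: str, step: int, perplexity: float) -> dict:
    """Append one JSON object per line (step 4)."""
    record = {"step": step, "perplexity": perplexity, "time": time.time()}
    with open(history_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def eval_checkpoint(step: int, base_gguf: str, adapter_gguf: str, eval_txt: str,
                    history_path: str = "eval_history.jsonl") -> dict:
    """Run llama-perplexity in a separate process and log the result."""
    proc = subprocess.run(
        build_command(base_gguf, adapter_gguf, eval_txt),
        capture_output=True, text=True, check=True,
    )
    # Search both streams, since which one carries the summary may vary by build.
    ppl = parse_perplexity(proc.stdout + "\n" + proc.stderr)
    return append_eval_record(history_path, step, ppl)
```

Keeping the command builder, the parser and the writer separate means each can be tested without a model on disk; only `eval_checkpoint` needs the real binary.
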
A working reference implementation of this pattern (orchestrator wiring, GGUF conversion, perplexity extraction, the lot) is documented at:

- [strix-halo-llm-finetune-guide — Step 7b "Storm-free eval via llama-perplexity"](https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide#step-7b--storm-free-eval-via-llama-perplexity)
- Implementation: `scripts/eval_via_llama_perplexity.py` in the same repo

## What's still open

- Is `INVLPGB` exposure on Strix Halo a kernel detection bug or a silicon-side decision? Tracking via [ROCm/ROCm#6297](https://github.com/ROCm/ROCm/issues/6297). If you have data on whether Strix Point (gfx1150) or Krackan Point (gfx1152) also lack the `invlpgb` flag in `/proc/cpuinfo`, that would help narrow it down.
- Raising `PAGEVEC_SIZE` (a compile-time constant, currently 31, in `include/linux/pagevec.h`) would proportionally cut the IPI rate on no-`INVLPGB` boxes. That is real upstream `mm/` work; it is not exposed as a sysctl knob today.

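To put rough numbers on the "proportionally" claim, here is a back-of-envelope model, not a description of what the kernel actually does. It assumes one shootdown per batch and, as a worst case, one IPI to every other online CPU per shootdown; `estimated_ipis` and its parameters are hypothetical names.

```python
import math


def estimated_ipis(pages_reclaimed: int, batch_size: int, online_cpus: int) -> int:
    """Worst-case IPI count under the simplified model described above."""
    flushes = math.ceil(pages_reclaimed / batch_size)  # one flush per batch
    return flushes * (online_cpus - 1)                 # one IPI per remote CPU


# Same page churn on 32 logical CPUs, with the batch size doubled.
current = estimated_ipis(31_000, batch_size=31, online_cpus=32)
doubled = estimated_ipis(31_000, batch_size=62, online_cpus=32)
```

Under these assumptions, doubling the batch size halves the estimate; real reclaim behavior will be messier than this.
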
## Related

- [[llamacpp-with-ROCm|AI/llamacpp-with-ROCm]]: the ROCm build llama-perplexity comes from
- [[AI_Capabilities_Overview|AI/AI_Capabilities_Overview]]: broader hardware/software context