Commit 509067

2025-11-15 06:50:23 lhl: AI section massaging
Guides/Buyers-Guide.md
@@ -7,18 +7,18 @@
#### AI Stuff
- Strix Halo APUs are mainly considered to be a cheap way to run big LLMs and they indeed can do it fairly well, but not without some caveats. Most limitations come from memory bandwidth constraints. MoE (Mixture of Experts) models work much faster than dense ones, and while there are more and more MoE models coming out, you might still end up needing to use a dense model with severely lacking performance.
+ One of Strix Halo's main selling points is its 128GB of unified memory. This lets it run larger AI models, and it can be one of the best price/performance ways to do so, with some important caveats. While it has plenty of memory accessible to a relatively capable GPU, the memory bandwidth is relatively low (256 GB/s), which means that while large MoE (Mixture of Experts) LLMs like `gpt-oss-120b` can run well, large dense models will run extremely slowly. From a compute and memory bandwidth perspective, if a model fits entirely in VRAM, then even relatively low-end GPUs (e.g., an RTX 3060) will outperform Strix Halo.
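+ To build intuition for why bandwidth dominates, here's a rough back-of-envelope sketch (the weight sizes and the ~70% achievable-bandwidth figure are illustrative assumptions, not measurements): each generated token has to stream all *active* weights through memory, so bandwidth sets a hard ceiling on generation speed:
+ ```python
+ # Token generation is usually memory-bandwidth bound: every generated token
+ # streams (roughly) all *active* weights through memory once.
+ BANDWIDTH_GBS = 256 * 0.70  # peak 256 GB/s; ~70% achievable is an assumption
+
+ def tg_ceiling(active_weights_gb: float) -> float:
+     """Rough upper bound on text-generation tokens/s."""
+     return BANDWIDTH_GBS / active_weights_gb
+
+ # Dense 70B model at ~4.5 bits/weight -> ~40 GB read per token (assumption)
+ print(f"dense 70B (Q4): ~{tg_ceiling(40):.1f} tok/s ceiling")
+ # gpt-oss-120b (MoE): only ~5B params active per token, ~3 GB at MXFP4 (assumption)
+ print(f"gpt-oss-120b:   ~{tg_ceiling(3):.0f} tok/s ceiling")
+ ```
+ These ceilings line up with why MoE models feel usable while large dense models crawl: what matters is bytes moved per token, not total parameter count.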
- The same goes for context size - **prompt processing and text generation speeds take a major hit as context grows**. Depending on model size and architecture, going over 64k, 32k, and in some cases even 16k of context becomes painful. Big context sizes could still work fairly well if the context grows gradually (like when you're using the model in chat mode), but don't expect miracles with massive document processing.
+ Context size is also something to keep in mind: **prompt processing and text generation speeds take a major hit as context grows**. This is true on all hardware, but Strix Halo takes a worse relative hit than NVIDIA GPUs. Depending on model size and architecture, going over 64k, 32k, or in some cases even 16k of context can mean very long prompt-processing waits and much slower generation. If you are having chat conversations and the prior context is cached, you may not be as heavily impacted, but for agentic or document-processing use cases you will need to run these non-interactively (or expect long waits).
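+ To make "very long waits" concrete, here's a hypothetical sketch (300 tok/s prompt processing is a made-up round number; real speeds depend heavily on model and context depth):
+ ```python
+ # Time to first token is roughly: prompt tokens / prompt-processing speed.
+ PP_SPEED = 300  # tok/s; hypothetical round number, check real benchmarks
+
+ for ctx in (4_000, 16_000, 32_000, 64_000):
+     print(f"{ctx:>6} prompt tokens -> ~{ctx / PP_SPEED:>4.0f} s before first token")
+ ```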
- As an example, here are some charts illustrating the drop of speed with the growth of context size:
+ Here is an example of the performance drop that happens as context size grows:
![](./strix-halo-pp-tg-ctx.png)
- **General rule: look at benchmarks with the specific model and context size you're interested in.** If you can't find them, ask in [our Discord](https://discord.gg/pnPRyucNrG) for someone to test it for you. This is important - don't give in to hype only to be disappointed later.
+ **General rule: look at benchmarks with the specific model and context size you're interested in.** The [[AI section|AI/AI-Capabilities-Overview]] of our wiki has links to some community-generated benchmarks, but if you can't find your specific model, ask in [our Discord](https://discord.gg/pnPRyucNrG) for someone to test it for you. This is important - don't give in to hype only to be disappointed later.
- More on the AI topic - **image and video generation is just slow**. Again, depending on your use cases this might or might not be a problem, but be aware that the typical experience would probably be "set some tasks in the evening, come back to check results in the morning".
+ So far, we've just been talking about text models, which Strix Halo handles relatively well, but it's important to note that **image and video generation is just slow** (and sometimes entirely broken!). Most pipelines are optimized for CUDA, and while AMD (ROCm, Vulkan) support has been getting better, software support and optimization are still big issues.
- But the real elephant in the room is that AMD software support is lacking as always. It's been nearly a year since the platform was introduced, yet there are still problems with stability (especially on ROCm) and performance. The NPUs are still mostly unused, the SDKs are filled with bugs, and AMD's level of involvement is mediocre at best. The situation has been improving slightly over the last several months, but the whole ecosystem is mostly driven by community effort. The overall experience is indeed worse than what you could theoretically get with NVIDIA products.
+ This last point is true more generally as well. Despite having launched in early 2025, and despite being advertised as an "AI" system, AMD's software support is still relatively lacking. ROCm (AMD's CUDA equivalent) is still relatively slow and unoptimized on Strix Halo, and basic AI software like PyTorch, vLLM, and Flash Attention has sharp edges or may not work at all! The Strix Halo NPU is also largely unused (and on Linux, basically unusable), and while there have been some great community efforts, the community has often been left to "fend for itself" when it comes to both first-party (AMD) and third-party support. While things are improving, if you want something that "just works" for AI you may be better off paying the NVIDIA tax.
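+ If you do set up PyTorch, a quick sanity check is worthwhile (a minimal sketch assuming a ROCm build of PyTorch is installed; ROCm builds reuse the `torch.cuda` namespace):
+ ```python
+ import torch
+
+ # ROCm builds of PyTorch expose the GPU through the torch.cuda API;
+ # torch.version.hip is set instead of torch.version.cuda.
+ print("HIP version:", torch.version.hip)          # None on a CUDA/CPU-only build
+ print("GPU visible:", torch.cuda.is_available())
+ if torch.cuda.is_available():
+     print("Device:", torch.cuda.get_device_name(0))
+ ```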
#### Non-AI Stuff
@@ -30,10 +30,9 @@
With that said, my personal opinion about the best use case is basically a kind of "jack of all trades" system. You get full x86 compatibility and something that can do anything and everything, albeit not at the highest possible levels and while being kinda pricey (especially with prices rising since November 2025).
-
## Choosing the Right Configuration
- There are four main types of Strix Halo-based devices: mini PCs, AIOs, laptops, and handhelds. Laptops and handhelds are typically power and cooling limited, so their performance is lacking as well. However, the difference between 55W and 120W power limits isn't that massive ([[5-35% depending on the task|Guides/Power-Modes-and-Performance]]). This guide is mostly centered on mini PCs since this seems to be the most popular option anyway.
+ There are four main types of Strix Halo-based devices: mini PCs, AIOs, laptops, and handhelds. Laptops and handhelds are typically power and cooling limited, so their performance is generally lower. However, the difference between 55W and 120W power limits isn't that massive ([[5-35% depending on the task|Guides/Power-Modes-and-Performance]]). This guide is mostly centered on mini PCs since this seems to be the most popular option.
There are several different models of Strix Halo APUs available:
![Strix Halo APU Specs](./strix-halo-specs.png)