What Happened
A community benchmark published on r/LocalLLaMA compares Vulkan and ROCm backends for running Qwen 3.5 35B MoE models via LocalAI on AMD Ryzen AI MAX+ 395 hardware (Strix Halo). Two GGUF variants were tested: mudler/Qwen3.5-35B-A3B-APEX-I-Quality.gguf and unsloth/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf. Both are MoE models with 3B active parameters despite the 35B total parameter count.
The test harness was llama-benchy (run via uvx llama-benchy), with prefix caching enabled, generation latency mode, and adaptive prompts. Context depths covered were 0, 4K, 8K, 16K, 32K, 65K, 100K, and 200K tokens. The system runs Ubuntu 25.10 (kernel 6.19.10) with 118.1 GB shared GPU memory at 85W TDP — a configuration unique to the Strix Halo APU, which pools CPU and GPU memory.
Backend build identifiers: Vulkan at b8681, ROCm at b1232, CPU at b8681 — all via Lemonade 10.1.0. The divergent build numbers between ROCm and the others may partially explain performance differences beyond pure backend architecture.
Technical Deep Dive
The results split cleanly by workload type:
- Token generation (decode phase): Vulkan leads consistently. On APEX-I-Quality, Vulkan starts at ~57.5 t/s versus ROCm's ~50.0 t/s at zero context, a ~15% edge. At 100K tokens, Vulkan still holds ~38.6 t/s to ROCm's ~35.7 t/s, so the lead narrows to roughly 8% but persists across context lengths.
- Prompt processing (prefill phase): ROCm dominates. On the unsloth variant, ROCm hits ~1052 t/s at 2K context against Vulkan's ~798 t/s, a ~32% gap. For APEX-I-Quality at 4K context, ROCm reaches ~885 t/s versus Vulkan's ~759 t/s, a ~17% gap. (The quick calculation after this list reproduces these ratios.)
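As a sanity check, the percentages above follow directly from the reported throughputs; this small Python snippet (numbers copied from the bullets) reproduces them.

# Relative gaps derived from the throughputs reported above (t/s).
decode  = {"vulkan": (57.5, 38.6), "rocm": (50.0, 35.7)}       # (0 ctx, 100K ctx), APEX-I-Quality
prefill = {"vulkan": (798.0, 759.0), "rocm": (1052.0, 885.0)}  # (unsloth @ 2K, APEX @ 4K)

print(f"decode edge, Vulkan over ROCm @ 0 ctx:    {decode['vulkan'][0] / decode['rocm'][0] - 1:+.1%}")   # +15.0%
print(f"decode edge, Vulkan over ROCm @ 100K:     {decode['vulkan'][1] / decode['rocm'][1] - 1:+.1%}")   # +8.1%
print(f"prefill edge, ROCm over Vulkan (unsloth): {prefill['rocm'][0] / prefill['vulkan'][0] - 1:+.1%}")  # +31.8%
print(f"prefill edge, ROCm over Vulkan (APEX):    {prefill['rocm'][1] / prefill['vulkan'][1] - 1:+.1%}")  # +16.6%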
This split is architecturally plausible. Decode generates one token at a time, launching many small, memory-bandwidth-bound kernels, so Vulkan's lower driver and dispatch overhead on AMD iGPUs pays off. Prefill, by contrast, processes large token batches as dense matrix multiplications, exactly the workload ROCm's tuned HIP kernels are optimized for.
The MoE architecture (3B active out of 35B) is important context: sparse activation means only a fraction of weights are loaded per token. On shared-memory APUs like Strix Halo, this reduces effective memory bandwidth pressure during decode — which may amplify Vulkan's dispatch efficiency advantage relative to discrete GPU scenarios where ROCm typically dominates end-to-end.
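To put a rough number on that, here is a back-of-envelope sketch; the ~4.8 bits per weight for a Q4_K-class quant and the ~256 GB/s LPDDR5X ceiling for Strix Halo are assumptions, not figures reported in the benchmark.

# Back-of-envelope decode bandwidth, assuming only the ~3B active MoE
# parameters are read per generated token and ~4.8 bits/weight for a
# Q4_K-class quant (assumptions, not benchmark-reported figures).
active_params = 3e9
bytes_per_param = 4.8 / 8                          # assumed quant density
bytes_per_token = active_params * bytes_per_param  # ~1.8 GB of weights per token

for backend, tps in {"vulkan": 57.5, "rocm": 50.0}.items():
    gb_per_s = bytes_per_token * tps / 1e9
    print(f"{backend}: ~{gb_per_s:.0f} GB/s of weight traffic at {tps} t/s")

# Both land around 90-105 GB/s, well under an assumed ~256 GB/s LPDDR5X
# ceiling, leaving headroom for dispatch efficiency to decide the race.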
For comparison, on discrete AMD GPUs (e.g., RX 7900 XTX), ROCm generally wins both phases due to dedicated HBM bandwidth. Vulkan's advantage here appears specific to the unified-memory, lower-TDP APU context.
Although a CPU backend build (b8681) is listed, no CPU-only throughput numbers were reported; a CPU baseline would have contextualized how much speedup either GPU backend actually delivers.
Who Should Care
This benchmark is directly relevant to anyone running local inference on AMD Ryzen AI MAX or similar APU-class hardware with unified memory pools. The 118 GB shared memory enables full 35B model loading without offloading — a use case that doesn't apply to most consumer GPUs.
ML engineers evaluating LocalAI deployment backends on AMD hardware should note the workload split: if your application is latency-sensitive for end users (chat, streaming responses), Vulkan gives better decode throughput. If you're batch-processing large prompts or documents (RAG ingestion, summarization pipelines), ROCm's prefill speed is the priority.
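As a rough illustration of that trade-off, the sketch below estimates end-to-end time per backend from this post's APEX-I-Quality figures (prefill at 4K context, decode near zero context). Real decode rates fall as context grows, so treat it as a heuristic rather than a planner.

# Back-of-envelope backend chooser using throughputs reported in this post
# (APEX-I-Quality: prefill t/s at 4K context, decode t/s near zero context).
PREFILL_TPS = {"vulkan": 759.0, "rocm": 885.0}
DECODE_TPS  = {"vulkan": 57.5,  "rocm": 50.0}

def estimated_seconds(backend, prompt_tokens, output_tokens):
    # Prefill the prompt, then decode the output, ignoring context-length decay.
    return prompt_tokens / PREFILL_TPS[backend] + output_tokens / DECODE_TPS[backend]

for name, p, o in [("chat (500 in / 800 out)", 500, 800),
                   ("RAG ingest (30K in / 100 out)", 30_000, 100)]:
    print(f"{name}: vulkan {estimated_seconds('vulkan', p, o):.1f}s, "
          f"rocm {estimated_seconds('rocm', p, o):.1f}s")

The chat case favors Vulkan; the ingestion case flips to ROCm, matching the workload split above.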
Developers building on Lemonade or llama.cpp-based stacks on Windows/Linux AMD APUs will find this directly applicable. Teams running inference at the edge on power-constrained hardware (85W TDP here) should also note that these numbers represent a competitive alternative to CPU-only inference at significantly higher throughput.
What To Do This Week
To replicate or extend these benchmarks on your own AMD hardware:
1. Install llama-benchy via uvx:
uvx llama-benchy --help
2. Pull models via LocalAI with Lemonade 10.1.0+:
localai pull mudler/Qwen3.5-35B-A3B-APEX-GGUF:Qwen3.5-35B-A3B-APEX-I-Quality.gguf
3. Run backend comparison:
uvx llama-benchy --backend vulkan --model qwen3.5-35b --prefix-cache --context 0,4096,8192,32768
Switch --backend rocm for ROCm runs. Check LocalAI releases at github.com/mudler/LocalAI. If your workload is mixed, consider running Vulkan as default and benchmarking ROCm specifically for batch prefill tasks.
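To sweep both backends in one go, a small wrapper is enough. A minimal Python sketch follows; it simply shells out to the same llama-benchy invocation as step 3, and the flag names are copied from that command rather than verified against llama-benchy's documentation.

# Run the step-3 benchmark for each backend in sequence.
# Flag names mirror the command above; confirm with: uvx llama-benchy --help
import subprocess

for backend in ("vulkan", "rocm"):
    subprocess.run(
        ["uvx", "llama-benchy",
         "--backend", backend,
         "--model", "qwen3.5-35b",
         "--prefix-cache",
         "--context", "0,4096,8192,32768"],
        check=True,
    )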