What Happened

A Reddit user ran systematic benchmarks of 37 LLMs across 10 model families on a MacBook Air M5 (32GB, 10-core CPU/GPU) using llama-bench with Q4_K_M quantization. The tests measured two key metrics: token generation speed (tg128: tokens/sec while generating 128 tokens) and prompt processing speed (pp256: tokens/sec while ingesting a 256-token prompt). The goal is to build a community-sourced database covering all Apple Silicon chips from M1 through M5.
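
For readers who want to see what one of these runs looks like, here is a minimal Python sketch that shells out to llama-bench with matching settings (a 256-token prompt test and a 128-token generation test). The model path is a placeholder, and the JSON output mode and its avg_ts field are as found in recent llama.cpp builds; treat this as a sketch, not the poster's exact harness.

    import json
    import subprocess

    # Placeholder path: any Q4_K_M GGUF file you have downloaded.
    MODEL = "models/qwen3-0.6b-q4_k_m.gguf"

    # -p 256 reproduces the pp256 metric, -n 128 the tg128 metric.
    # -o json asks llama-bench (recent llama.cpp builds) for JSON output.
    out = subprocess.run(
        ["llama-bench", "-m", MODEL, "-p", "256", "-n", "128", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout

    for result in json.loads(out):
        # avg_ts is the mean tokens/sec per test; the field name may
        # vary across llama.cpp versions. The prompt test reports
        # n_prompt=256, the generation test n_prompt=0.
        label = "pp256" if result.get("n_prompt") else "tg128"
        print(label, result.get("avg_ts"), "tok/s")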

Why It Matters

For indie developers and SMEs running local inference, hardware selection and model choice directly affect cost and user experience. Key findings from the data (a short model-selection sketch follows the list):

  • Qwen 3 0.6B hits 91.9 tok/s generation — fast enough for real-time UI applications with minimal RAM (0.6 GB).
  • Qwen 3.5 35B-A3B MoE runs at 31.3 tok/s using only 20.7 GB RAM — 12x faster than dense 32B models (2.5 tok/s) at comparable memory footprint.
  • Dense 32B models (QwQ, Qwen 3 32B, DeepSeek R1 Distill 32B) all plateau around 2.5–2.6 tok/s on this hardware, making them impractical for interactive use.
  • For coding tasks, Qwen 2.5 Coder 7B delivers 11 tok/s, a comfortable interactive speed, while the 14B variant drops to 6 tok/s, trading responsiveness for higher output quality.
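
To make the trade-off concrete, here is an illustrative Python sketch that encodes the figures above and filters them by a speed floor and a RAM budget. The numbers are copied from the list; RAM was not reported for the Coder and dense 32B entries, so those are left as None, and the pick() helper itself is hypothetical.

    # Figures from the benchmark list above (tok/s generation, GB RAM).
    # RAM was not reported for the Coder or dense 32B models (None).
    RESULTS = [
        ("Qwen 3 0.6B",            91.9, 0.6),
        ("Qwen 3.5 35B-A3B MoE",   31.3, 20.7),
        ("Qwen 2.5 Coder 7B",      11.0, None),
        ("Qwen 2.5 Coder 14B",      6.0, None),
        ("Dense 32B (QwQ et al.)",  2.5, None),  # "comparable" RAM to the MoE
    ]

    def pick(min_tok_s: float, max_ram_gb: float) -> list[str]:
        """Hypothetical helper: models meeting a speed floor and RAM budget."""
        return [
            name for name, tok_s, ram in RESULTS
            if tok_s >= min_tok_s and (ram is None or ram <= max_ram_gb)
        ]

    # E.g. interactive use on a 32 GB machine: want >= 10 tok/s, <= 24 GB.
    print(pick(10.0, 24.0))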

The benchmark tool, llama-bench, is part of the open-source llama.cpp project, so any developer can replicate these tests on their own hardware and contribute results to the community database.

Asia-Pacific Angle

Chinese-origin models dominate the top performance tiers in this benchmark. Qwen 3 (Alibaba) and Qwen 2.5 Coder claim the fastest generation speeds and best MoE efficiency. DeepSeek R1 Distill variants appear in both the fast and capable tiers. For developers in China and Southeast Asia building products that require on-device inference (whether for data privacy compliance, offline capability, or avoiding API costs), this data provides a direct hardware-to-model selection guide. The Qwen family's strong Q4_K_M quantization performance is particularly relevant for teams already familiar with Alibaba's model ecosystem. The MoE architecture advantages shown here also validate the direction taken by domestic Chinese AI labs competing on efficiency rather than raw parameter count.

Action Item This Week

If you run local inference on Apple Silicon, install llama-bench via llama.cpp, run the Q4_K_M benchmark on Qwen 3.5 35B-A3B MoE against your current model choice, and measure whether the MoE speed advantage (31 tok/s vs ~2.5 tok/s for dense 32B) justifies switching, especially if your use case involves interactive response generation. A minimal comparison sketch follows.
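
Extending the earlier sketch, one way to run that head-to-head is to benchmark both GGUF files back to back and compare their tg128 numbers. The paths are placeholders, and the -p 0 flag (skip the prompt test) and the avg_ts JSON field again assume a recent llama.cpp build.

    import json
    import subprocess

    # Placeholder paths: the MoE candidate vs. whatever you run today.
    CANDIDATES = {
        "moe":     "models/qwen3.5-35b-a3b-q4_k_m.gguf",
        "current": "models/my-current-model-q4_k_m.gguf",
    }

    def tg128(model_path: str) -> float:
        """Run only the tg128 test (-p 0, -n 128) and return tokens/sec."""
        out = subprocess.run(
            ["llama-bench", "-m", model_path,
             "-p", "0", "-n", "128", "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        # avg_ts is the mean tokens/sec in recent llama.cpp JSON output.
        return json.loads(out)[0]["avg_ts"]

    speeds = {name: tg128(path) for name, path in CANDIDATES.items()}
    print(speeds, "speedup: %.1fx" % (speeds["moe"] / speeds["current"]))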