What Happened

A developer posting as /u/No_Shift_4543 on r/LocalLLaMA open-sourced DFlash, a speculative decoding implementation for Apple Silicon running on MLX. Benchmarked on an M5 Max with 64GB unified memory and MLX 0.31.1, the system delivers a 4.13x throughput increase on Qwen3.5-9B (from 30.96 tok/s baseline to 127.07 tok/s) with an 89.4% token acceptance rate at 2048 tokens. The project requires no MLX fork and runs on stock mlx_lm.

The release follows a revised benchmark methodology that replaced a custom Python loop with the stock mlx_lm.stream_generate baseline; the author acknowledges the earlier loop had inflated speedup figures. Results are the median of 3 runs with a 10-second cooldown between tests.
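
For readers who want to reproduce the baseline side of that methodology, a minimal sketch using mlx_lm's public stream_generate API looks roughly like the following (the model id and prompt are placeholders, not taken from the repository):

    import statistics
    import time

    from mlx_lm import load, stream_generate

    # Placeholder model id; substitute whichever checkpoint you are testing.
    model, tokenizer = load("mlx-community/placeholder-model")

    def run_once(prompt: str, max_tokens: int = 2048) -> float:
        """Return decode throughput in tok/s for one stock generation."""
        n_tokens = 0
        start = time.perf_counter()
        for _ in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens):
            n_tokens += 1  # each yielded response corresponds to one token
        return n_tokens / (time.perf_counter() - start)

    runs = []
    for i in range(3):
        runs.append(run_once("Explain speculative decoding in one paragraph."))
        if i < 2:
            time.sleep(10)  # cooldown between runs, matching the post
    print(f"median baseline: {statistics.median(runs):.2f} tok/s")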

Why It Matters

Speculative decoding has been a server-side optimization story: NVIDIA hardware, vLLM, large data centers. This implementation targets the growing segment of engineers running inference locally on Apple Silicon, where the constraints are fundamentally different: unified memory, bandwidth-bound execution, and no discrete VRAM. The 4.1x figure on sub-10B models is meaningful for interactive coding assistants and local agent loops where latency is the primary bottleneck, not throughput at scale.

The 89% acceptance rate, up from roughly 82% in the prior version, is the operative number for production viability. Below 80%, speculative decoding overhead erodes gains on longer generations. The improvement came from numerical precision fixes in bf16 paths, not architectural changes, which suggests the prior benchmark was understating both the baseline and the overhead.
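
For intuition on why acceptance rate is the lever: under the standard speculative-decoding analysis, which assumes independent per-token acceptance (an approximation for a block-diffusion draft), the expected number of tokens committed per verification cycle is (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k. A quick back-of-envelope check at k = 16, the draft block length described in the technical section below:

    # Back-of-envelope only: assumes i.i.d. per-token acceptance, which a
    # block-diffusion draft only approximates. k = 16 matches the draft
    # block length described in the technical section.
    def expected_accepted(a: float, k: int = 16) -> float:
        """Expected tokens committed per target verification pass."""
        return (1 - a ** (k + 1)) / (1 - a)

    print(round(expected_accepted(0.80), 1))   # ~4.9 tokens per pass
    print(round(expected_accepted(0.894), 1))  # ~8.0 tokens per pass

The gap between roughly 4.9 and 8.0 committed tokens per target pass is where the practical difference between 80% and 89% acceptance shows up.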

The differential between model sizes is a critical caveat for anyone evaluating this for deployment. The 27B-4bit model achieves only a 1.90x speedup (32.35 → 62.78 tok/s), and the 35B-A3B-4bit model reaches 1.69x (142.12 → 240.21 tok/s). The author attributes this directly to a structural limitation: when the quantized target model is already fast, the bf16 draft model becomes the bottleneck. Engineers targeting large quantized models should temper expectations accordingly.

The Technical Detail

DFlash uses a small draft model to generate 16 tokens in parallel via block diffusion. The target model verifies all 16 in a single forward pass. Only tokens verified against the target are committed — the process is described as lossless by the author.
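
The post doesn't include the verification code, but under greedy decoding the commit rule reduces to accepting the longest prefix of the drafted block where the target's argmax agrees with the draft. A minimal sketch (function name and shapes are illustrative, not from the repository):

    import mlx.core as mx

    def accepted_prefix(target_logits: mx.array, draft_tokens: mx.array) -> int:
        """Count leading draft tokens the target model agrees with.

        target_logits: (block_len, vocab) logits from one target forward
        pass over the drafted block; draft_tokens: (block_len,) token ids.
        Greedy rule: commit the longest prefix where argmax == draft.
        """
        matches = mx.argmax(target_logits, axis=-1) == draft_tokens
        # cumprod zeros out everything after the first mismatch, so the
        # sum is exactly the length of the accepted prefix
        return int(mx.sum(mx.cumprod(matches.astype(mx.int32))).item())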

The architecture is specifically optimized for Qwen3.5's hybrid GatedDeltaNet + attention design. The key enabling mechanism is a tape-replay rollback: a custom Metal kernel that replays only accepted steps through the GatedDeltaNet recurrent state, avoiding a full checkpoint save-and-restore cycle. This is what the author credits for maintaining acceptance rates over long generations. Pure attention models (Qwen3, Gemma) are supported but do not benefit from tape-replay.
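
The kernel itself isn't shown in the post, but the control flow it describes (one saved state at the block boundary plus a replay of only the accepted steps) can be sketched in Python. Class and method names here are illustrative, not from the repository:

    # Illustrative pseudocode for tape-replay rollback over a recurrent
    # state. Assumes the recurrent update is exposed as step(state, token);
    # DFlash fuses this into a Metal kernel rather than a Python loop.
    class RecurrentTape:
        def __init__(self, initial_state):
            self.base_state = initial_state  # state at the block boundary
            self.tape = []                   # tokens drafted since then

        def speculative_advance(self, step, draft_tokens):
            """Tentatively run the recurrent update over a drafted block."""
            state = self.base_state
            for tok in draft_tokens:
                state = step(state, tok)
                self.tape.append(tok)
            return state

        def commit(self, step, n_accepted):
            """Replay only the accepted prefix from the block boundary,
            instead of saving/restoring a per-token state checkpoint."""
            state = self.base_state
            for tok in self.tape[:n_accepted]:
                state = step(state, tok)
            self.base_state = state
            self.tape = []
            return state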

Notable engineering finding: on unified memory hardware, custom Metal kernels for batched-GEMV, fused gated SiLU, and custom SDPA all performed slower than stock MLX equivalents. The author concludes that on bandwidth-bound silicon, precision and algorithmic correctness matter more than compute-level optimization, a counterintuitive result for engineers accustomed to GPU kernel tuning.
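
A finding like this comes from straightforward wall-clock comparison. A minimal MLX timing harness for the stock SDPA path, of the kind such a comparison needs (the custom kernels from the post are not reproduced here, and the shapes are illustrative), might look like:

    import time
    import mlx.core as mx

    # Illustrative shapes for a long-context verification pass.
    B, H, L, D = 1, 16, 2048, 128
    q = mx.random.normal((B, H, L, D)).astype(mx.bfloat16)
    k = mx.random.normal((B, H, L, D)).astype(mx.bfloat16)
    v = mx.random.normal((B, H, L, D)).astype(mx.bfloat16)

    def bench(fn, iters=50):
        mx.eval(fn())  # warmup; MLX is lazy, eval forces computation
        start = time.perf_counter()
        for _ in range(iters):
            mx.eval(fn())
        return (time.perf_counter() - start) / iters

    stock = lambda: mx.fast.scaled_dot_product_attention(q, k, v, scale=D ** -0.5)
    print(f"stock SDPA: {bench(stock) * 1e3:.3f} ms/iter")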

Additional implementation details:

  • JIT 2-pass SDPA kernel for long-context verification at sequence lengths ≥ 1024 tokens
  • Numerically stable bf16 paths across speculative cycles (sketched after this list)
  • Full results at 1024, 2048, and 4096 token lengths available in the repository
  • No custom MLX fork required; runs on stock MLX 0.31.1
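
On the bf16 point above: a common way to keep bf16 paths numerically stable is to upcast the reduction-heavy steps to float32 and cast back afterward. Whether DFlash does exactly this is not stated, so the following is a generic illustration of the pattern, not its code path:

    import mlx.core as mx

    def log_softmax_bf16(logits: mx.array) -> mx.array:
        """Log-softmax over bf16 logits with the reduction in float32.

        bf16 carries ~8 bits of mantissa, so the max-shift and log-sum-exp
        drift visibly if done natively; upcasting the reduction is the
        standard fix. Generic illustration, not DFlash's actual code.
        """
        x = logits.astype(mx.float32)
        x = x - mx.max(x, axis=-1, keepdims=True)  # shift for stability
        lse = mx.log(mx.sum(mx.exp(x), axis=-1, keepdims=True))
        return (x - lse).astype(mx.bfloat16)       # cast back for storage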

Benchmark Summary (2048 tokens, M5 Max 64GB)

  • Qwen3.5-4B: 53.74 → 219.83 tok/s (4.10x, 89.3% acceptance)
  • Qwen3.5-9B: 30.96 → 127.07 tok/s (4.13x, 89.4% acceptance)
  • Qwen3.5-27B-4bit: 32.35 → 62.78 tok/s (1.90x, 89.1% acceptance)
  • Qwen3.5-35B-A3B-4bit: 142.12 → 240.21 tok/s (1.69x, 88.7% acceptance)

What To Watch

The author's stated roadmap includes full-attention model optimization and draft model compression, both of which would directly address the two main limitations identified: poor gains on quantized targets and the lack of tape-replay benefits for non-GatedDeltaNet architectures.

Watch for MLX upstream response. Apple's MLX team has historically been responsive to community contributions that demonstrate significant throughput gains. If DFlash's approach proves reproducible, elements of the tape-replay mechanism or the block-diffusion draft strategy could surface in official MLX releases within 30-60 days.

The Qwen3.5 model family is the critical dependency here. Alibaba's continued development of GatedDeltaNet hybrid architectures will determine whether this optimization remains relevant or requires architectural rework. Any changes to Qwen's recurrent state design in future model versions would require corresponding updates to the tape-replay kernel.

Community reproduction on M3 and M4 hardware is the immediate validation signal. The M5 Max represents the high end of current Apple Silicon; results on M3 Pro or M4 with smaller unified memory would clarify whether bandwidth constraints scale linearly or whether the optimization has a memory-size floor.