What Happened
Developer Subhadip Mitra published a fused Mixture-of-Experts (MoE) dispatch kernel written entirely in Triton that reduces the forward pass from 24+ kernel launches to 5. On an A100 with Mixtral-8x7B, it achieves a 5.8x speedup over the PyTorch baseline and runs 24% faster than Stanford's Megablocks at batch size 128. The kernel fuses the gate and up projections so that both GEMMs read the same input tile out of L2 cache, with the SiLU activation computed in registers, saving approximately 470MB of memory traffic per forward pass. At batch size 512 and above, Megablocks' hand-tuned block-sparse matmul regains the lead. The same code runs on AMD MI300X with zero modifications, passing all 162 tests.
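The fusion is easier to see in code. Below is a minimal, self-contained sketch of the core idea (not the published kernel): one expert's gate and up GEMMs share a single load of each input tile, and SiLU is applied in registers before the single store. Names, block sizes, and the weight layout are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_gate_up_kernel(
    x_ptr, w_gate_ptr, w_up_ptr, out_ptr,
    M, N, K,
    stride_xm, stride_xk,
    stride_wk, stride_wn,   # both weights assumed contiguous (K, N)
    stride_om, stride_on,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    acc_gate = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    acc_up = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for k0 in range(0, K, BLOCK_K):
        k = k0 + offs_k
        # One load of the input tile feeds BOTH GEMMs; the unfused path
        # reads x from memory twice, once per projection.
        x = tl.load(
            x_ptr + offs_m[:, None] * stride_xm + k[None, :] * stride_xk,
            mask=(offs_m[:, None] < M) & (k[None, :] < K), other=0.0,
        )
        w_g = tl.load(
            w_gate_ptr + k[:, None] * stride_wk + offs_n[None, :] * stride_wn,
            mask=(k[:, None] < K) & (offs_n[None, :] < N), other=0.0,
        )
        w_u = tl.load(
            w_up_ptr + k[:, None] * stride_wk + offs_n[None, :] * stride_wn,
            mask=(k[:, None] < K) & (offs_n[None, :] < N), other=0.0,
        )
        acc_gate += tl.dot(x, w_g)
        acc_up += tl.dot(x, w_u)

    # SiLU(gate) * up stays in registers: neither intermediate hits HBM.
    fused = acc_gate * tl.sigmoid(acc_gate) * acc_up
    tl.store(
        out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on,
        fused.to(tl.float16),
        mask=(offs_m[:, None] < M) & (offs_n[None, :] < N),
    )

def fused_gate_up(x: torch.Tensor, w_gate: torch.Tensor, w_up: torch.Tensor) -> torch.Tensor:
    """SiLU(x @ w_gate) * (x @ w_up) in one kernel. Expects fp16, (K, N) weights."""
    M, K = x.shape
    _, N = w_gate.shape
    out = torch.empty((M, N), device=x.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    fused_gate_up_kernel[grid](
        x, w_gate, w_up, out, M, N, K,
        x.stride(0), x.stride(1),
        w_gate.stride(0), w_gate.stride(1),
        out.stride(0), out.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return out
```

One plausible back-of-envelope for the traffic figure (our assumptions, not the author's): with fp16 activations, Mixtral-8x7B's intermediate size of 14,336, top-2 routing, and 32 layers, the two intermediates the unfused path would round-trip through HBM at batch 128 come to 128 tokens × 2 experts × 14,336 × 2 bytes × 32 layers × 2 tensors ≈ 470MB, consistent with the figure above.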
Why It Matters
Most production inference serving operates at batch sizes of 32–128 tokens, exactly where this kernel wins. Indie developers and SMEs deploying Mixtral-8x7B, DeepSeek-V3, or Qwen2-MoE can drop this kernel in and immediately reduce GPU cost per token without waiting for upstream framework updates (a sketch of what such a swap might look like follows the list below). The pure Triton implementation carries no CUDA-specific dependencies, lowering the integration barrier compared to hand-written CUDA kernels like those in Megablocks.
- 5.8x over PyTorch at batch 128 on A100
- 24% faster than Megablocks at the same batch size
- Supports Mixtral-8x7B, DeepSeek-V3 (256 experts), and Qwen2-MoE
- AMD MI300X compatible with no code changes
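Here is the shape such a drop-in swap might take for a Hugging Face Mixtral checkpoint. Note that `fused_moe_forward` is a placeholder: the repo's actual entry point and signature aren't documented here, so treat this as an outline of the integration, not code to paste.

```python
import torch
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
# Placeholder import -- match this to the repo's real API before use:
# from triton_kernels import fused_moe_forward

def fused_block_forward(self, hidden_states: torch.Tensor):
    batch, seq_len, hidden_dim = hidden_states.shape
    flat = hidden_states.view(-1, hidden_dim)
    router_logits = self.gate(flat)  # stock HF router, unchanged
    # Token dispatch + expert GEMMs handed to the fused Triton kernel
    # (hypothetical call signature):
    out = fused_moe_forward(flat, router_logits, self.experts)
    return out.view(batch, seq_len, hidden_dim), router_logits

# Monkey-patch so existing Mixtral models pick up the fused path:
MixtralSparseMoeBlock.forward = fused_block_forward
```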
Asia-Pacific Angle
Chinese and Southeast Asian developers building on DeepSeek-V3 or Qwen2-MoE have a direct path to lower inference costs. DeepSeek-V3 uses 256 experts, a configuration explicitly tested and confirmed working here. Teams running self-hosted inference on domestic cloud providers using MI300X-equivalent hardware (common in Chinese data centers due to export restrictions on A100s) benefit from the AMD compatibility. The roofline analysis in the full writeup provides a framework for tuning the kernel to specific hardware memory bandwidth ratios, which matters when deploying on non-A100 GPU clusters prevalent across Southeast Asian cloud providers.
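For a taste of that kind of analysis (a generic sketch, not the writeup's script): a kernel stops being memory-bound once its arithmetic intensity exceeds the hardware's compute-to-bandwidth ratio, the "ridge point", so computing that ratio for your GPU tells you how much headroom fusion buys. The peaks below are published vendor specs.

```python
# Back-of-envelope roofline check -- a generic sketch, not the writeup's script.
def ridge_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) above which a kernel is compute-bound."""
    return peak_flops / mem_bw_bytes_per_s

# Vendor peak fp16 tensor throughput and HBM bandwidth:
print(f"A100 80GB: {ridge_point(312e12, 2.0e12):.0f} FLOPs/byte")   # ~156
print(f"MI300X:    {ridge_point(1307e12, 5.3e12):.0f} FLOPs/byte")  # ~247
```

The higher MI300X ridge point means a kernel that was already compute-bound on an A100 can fall back into memory-bound territory there, which is exactly when traffic-saving fusion pays off most.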
Action Item This Week
Clone github.com/bassrehab/triton-kernels, run the provided benchmark script against your current Mixtral or Qwen2-MoE serving setup at your actual production batch size, and compare latency numbers before committing to integration.
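To make that comparison apples-to-apples, time both paths the same way. A minimal CUDA-event timer (generic sketch; the repo's benchmark script presumably does something similar, but its interface isn't documented here):

```python
import torch

def latency_ms(fn, *args, warmup: int = 10, iters: int = 50) -> float:
    """Mean wall time per call in ms, measured with CUDA events after warmup."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Use your real production batch size: the 5.8x headline was at batch 128,
# and the crossover vs Megablocks sits around batch 512.
x = torch.randn(128, 4096, device="cuda", dtype=torch.float16)
# latency_ms(current_moe_forward, x) vs latency_ms(fused_moe_forward, x)
```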