LLM Inference

3 articles tagged with this topic

Microsoft 4x LLM Inference: AI's Second Half Is Cutting Infra Costs

At NSDI 2026, Microsoft unveils AI infra breakthroughs like 4x LLM inference via cache sharing. AI competition shifts from scaling parameters to infra

May 5 · 2 min read

DeepSeek · KV Cache

80M Tokens for 4 RMB: DeepSeek Disk Cache Rewrites LLM Inference Costs

DeepSeek's novel architecture enables disk-level caching, slashing API costs 10x. This signals LLM inference shifting from raw compute to engineering

May 4 · 2 min read

Nvidia · DGX Spark

16 Nvidia DGX Spark Units Clustered for LLMs — Enterprise Compute Focus Shifts to VRAM

Reddit user clusters 16 Nvidia DGX Spark units, runs 434GB LLM. Unified memory validated. Inference bottlenecks shift from compute to VRAM — new path

May 1 · 2 min read