In a recent complex coding test, Xiaomi MiMo 2.5 Pro consumed more than six times as many tokens (a token is the smallest unit of text a model processes) as its peers, yet produced unrunnable junk code. This suggests the benchmark era for large language models is passing; real delivery capability is the true watershed.
What This Is
A developer designed a comprehensive coding task, a "martial arts sect management simulator," and had mainstream domestic LLMs generate the complete code directly, without external tools. DeepSeek V4 produced a fully functional program with a normal UI in about 22,000 tokens; Kimi finished in under 10,000. Xiaomi MiMo 2.5 Pro, by contrast, failed to finish at a 32,000-token limit, failed again at 64,000, and produced barely any usable output even when the limit was raised to 128,000, ultimately consuming over 60,000 tokens. Worse, MiMo's code contained basic string-concatenation errors and would not run in a browser at all.
Industry View
We note that standard benchmark scores are diverging from engineering delivery capability. Models that score high on competition problems do not necessarily handle coherent, complex business logic. Compute consumption maps directly onto enterprise operating costs, so inefficient models lack competitiveness in commercial deployment. That said, the risks of over-reading this result are real: a single-shot test in a single scenario carries randomness. Some developers point out that models vary greatly in their sensitivity to prompts, and MiMo's collapse might stem from misreading a specific instruction format rather than an absolute defect in coding ability. Condemning a model outright on the basis of one non-standard test would be too arbitrary.
Impact on Regular People
For enterprise IT: When selecting models, you can no longer blindly trust benchmark scores; you must stress-test with your actual business scenarios, or you will easily fall into the trap of "high benchmarks, expensive API calls, and unusable results."
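A stress test of this kind can be largely automated. The sketch below is a minimal, hypothetical harness (the `generate` callable and the whitespace-based token count are stand-ins, not any vendor's API; swap in your provider's client and tokenizer): it runs a generation function against a task, measures token spend against a budget, and checks whether the output at least compiles, which is exactly where the MiMo output in this test failed.

```python
import time

def evaluate_candidate(generate, prompt, token_limit):
    """Score one model on a delivery-style task.

    `generate` is a hypothetical callable standing in for a model API:
    it takes a prompt string and returns generated Python source.
    Token count here is a crude whitespace proxy; replace it with your
    provider's tokenizer for real billing numbers.
    """
    start = time.monotonic()
    code = generate(prompt)
    elapsed = time.monotonic() - start

    tokens = len(code.split())  # rough proxy for billed tokens
    runnable = True
    try:
        # Syntax-level runnability check: does the output even parse?
        compile(code, "<candidate>", "exec")
    except SyntaxError:
        runnable = False

    return {
        "tokens": tokens,
        "within_budget": tokens <= token_limit,
        "runnable": runnable,
        "seconds": round(elapsed, 3),
    }

# Stub "models" for illustration: one emits valid code, one emits junk.
good_model = lambda p: "def sect_report():\n    return 'ok'\n"
bad_model = lambda p: "def sect_report(:\n    return 'ok'\n"  # broken syntax

print(evaluate_candidate(good_model, "build the simulator", 32000))
print(evaluate_candidate(bad_model, "build the simulator", 32000))
```

In practice you would run each candidate model several times per scenario, since a single run carries the same randomness the industry commentary above warns about.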
For individual professionals: When using AI for coding assistance, don't judge output by how much the model wrote; spend the time on code review (checking the logic and structure of the code). Blindly trusting LLM output will leave you with severe technical debt.
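Part of that review can be mechanized as a first pass before a human reads the logic. The following is a minimal sketch using Python's standard `ast` module; the checks chosen (parses at all, no bare `except`, no 50-plus-line functions) are illustrative thresholds, not an established standard.

```python
import ast

def quick_review(source):
    """First-pass mechanical review of AI-generated Python (a sketch).

    Catches only what a parser can see; it is not a substitute for
    reading the logic. Checks: (1) the code parses at all, the failure
    mode in the MiMo test; (2) bare `except:` blocks that silently
    swallow errors; (3) functions longer than 50 lines.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"does not parse: {exc.msg} (line {exc.lineno})"]

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"line {node.lineno}: bare except swallows errors")
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > 50:
                findings.append(
                    f"line {node.lineno}: function '{node.name}' "
                    f"is {length} lines; consider splitting"
                )
    return findings

sample = "try:\n    risky()\nexcept:\n    pass\n"
print(quick_review(sample))  # flags the bare except
```

An empty findings list means only that nothing mechanical was caught; the logic review is still on you.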
For the consumer market: There remains a significant gap in the on-device AI capabilities touted by smartphone manufacturers. Stable delivery of complex tasks still requires a longer maturation cycle.