LLM Evaluation
4 articles tagged with this topic
Xiaomi, MiMo
Xiaomi MiMo Wastes 6x Compute on Junk Code; LLMs Shift to Delivery Efficiency
Xiaomi MiMo burned 6x compute for junk code while DeepSeek excelled. Benchmarks no longer reflect true dev capability; focus on delivery and costs.
May 6 · 2 min read
RAGAS, RAG
Stop Guessing RAG Quality: RAGAS Uses AI to Grade AI
RAG quality often relies on guesswork. RAGAS uses 4 metrics and LLM-as-Judge to turn gut feelings into engineering KPIs—vital for enterprise knowledge
May 6 · 2 min read
Xiaomi, MiMo-V2.5-Pro
Xiaomi MiMo Tops Reasoning Test: Cost-Efficiency Beats Parameter Count
Xiaomi MiMo-V2.5-Pro wins complex social reasoning tests under $1, shifting AI focus from raw compute to cost-efficiency for enterprise deployment.
May 2 · 2 min read
LLM Evaluation, Agent
Two Demo Cases Can't Prove an AI System Is Good: A Quantitative Method for Scoring Complex AI Systems Is Spreading Through the Industry
A reproducible A/B evaluation framework for LLM-enhanced systems is gaining traction, replacing cherry-picked demos with controlled experiments.
Apr 19 · 3 min read