LLM Evaluation

4 articles tagged with this topic

Xiaomi MiMo Wastes 6x Compute on Junk Code; LLMs Shift to Delivery Efficiency

Xiaomi MiMo burned 6x compute for junk code while DeepSeek excelled. Benchmarks no longer reflect true dev capability; focus on delivery and costs.

May 6 · 2 min read

RAGAS, RAG

Stop Guessing RAG Quality: RAGAS Uses AI to Grade AI

RAG quality often relies on guesswork. RAGAS uses 4 metrics and LLM-as-Judge to turn gut feelings into engineering KPIs—vital for enterprise knowledge

May 6 · 2 min read

Xiaomi, MiMo-V2.5-Pro

Xiaomi MiMo Tops Reasoning Test: Cost-Efficiency Beats Parameter Count

Xiaomi MiMo-V2.5-Pro wins complex social reasoning tests under $1, shifting AI focus from raw compute to cost-efficiency for enterprise deployment.

May 2 · 2 min read

LLM Evaluation, Agent

You Can't Judge an AI System by Demoing Two Cases: A Quantitative Method for Scoring Complex AI Systems Is Spreading Across the Industry

A reproducible A/B evaluation framework for LLM-enhanced systems is gaining traction, replacing cherry-picked demos with controlled experiments.

Apr 19 · 3 min read
