SWE-bench
4 articles tagged with this topic
Meta, ProgramBench
Meta ProgramBench: AI Still Can't Build Large Programs from Scratch
Meta ProgramBench tests whether AI can build programs from scratch. Top models failed, cooling the 'AI builds software' hype and exposing benchmark score inflation.
May 6 · 2 min read
OpenHands, Devin
OpenHands Hits 40K Stars — Open Source Catches Up to Closed-Source AI Coders
OpenHands is an open-source AI coding agent that runs in a Docker sandbox and has 40K+ GitHub stars. Open source is rapidly closing the gap with closed-source coding agents.
May 5 · 2 min read
Claude Opus 4.7, Anthropic
Opus 4.7 Is Here, and I Don't Recommend Upgrading
Anthropic's Opus 4.7 removes temperature/top_p/top_k controls and inflates token counts by up to 1.35x.
Apr 17 · 3 min read
LangSmith, DeepEval
Stop Chasing Leaderboards: How Berkeley Exposed Flawed AI Agent Benchmarks
Berkeley researchers reveal critical data contamination in top AI benchmarks. Learn how to validate your own agent tools, avoid overfitting, and build
Apr 12 · 5 min read