SWE-bench

4 articles tagged with this topic

Meta ProgramBench: AI Still Can't Build Large Programs from Scratch

Meta's ProgramBench tests whether AI can build programs from scratch. Top models failed, cooling the 'AI builds software' hype and exposing benchmark score inflation.

May 6 · 2 min read

OpenHands, Devin

OpenHands Hits 40K Stars — Open Source Catches Up to Closed-Source AI Coders

OpenHands is an open-source AI coding agent that runs in a Docker sandbox and has 40K+ GitHub stars. Open source is rapidly closing the gap with closed-source coding agents.

May 5 · 2 min read

Claude Opus 4.7, Anthropic

Opus 4.7 Is Here, and I Don't Recommend Upgrading

Anthropic's Opus 4.7 removes the temperature/top_p/top_k controls and inflates token counts by up to 1.35x.

Apr 17 · 3 min read

LangSmith, DeepEval

Stop Chasing Leaderboards: How Berkeley Exposed Flawed AI Agent Benchmarks

Berkeley researchers reveal critical data contamination in top AI benchmarks. Learn how to validate your own agent tools, avoid overfitting, and build

Apr 12 · 5 min read