The Anubis-OSS leaderboard updated its data this week: 371 benchmark submissions, 218 competing models, and 10 Apple chips on the board. The ecosystem for running open-source models locally has grown large enough to need a serious leaderboard to measure it.

What this is

Anubis-OSS is a community leaderboard that measures how well open-source large language models run on local hardware (think smartphone benchmarking software, but for AI models' real performance on your own machine). Its core question: without cloud computing power, which models can a local machine actually run, how fast, and how well?
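
To make "how fast" concrete, here is a minimal throughput sketch using the llama-cpp-python library, one common way to run GGUF models locally. The model file path is hypothetical, and this is an illustration of the kind of tokens/s number such leaderboards report, not the leaderboard's own harness:

```python
# Minimal tokens/s measurement with llama-cpp-python
# (pip install llama-cpp-python). The GGUF path below is a
# placeholder; substitute any locally downloaded model file.
import time
from llama_cpp import Llama

llm = Llama(model_path="./qwen2-7b-instruct-q4_k_m.gguf",
            n_ctx=2048, verbose=False)

prompt = "Explain in two sentences why local inference speed matters."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# The completion response includes a token count in its usage field.
generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s "
      f"-> {generated / elapsed:.1f} tokens/s")
```

Real benchmark suites control prompt length, context size, and quantization far more carefully, but the headline number comes down to this ratio.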

These numbers are worth unpacking. 218 models means the open-source community, which two years ago was still debating whether LLaMA was usable at all, is now a crowded field that demands side-by-side comparison. 10 Apple chips means the M-series (Apple's custom Mac processors, whose unified memory architecture gives them a natural advantage for running large models) is no longer a geek experiment but a hardware option formally integrated into the benchmarking system. And 371 submissions means the community isn't just putting a name on the board and leaving; people are repeatedly tweaking parameters, swapping hardware, and pushing for higher scores.
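
The unified-memory advantage is easy to quantify: model weights must fit in memory the GPU can reach, and on M-series chips the CPU and GPU share a single pool. A back-of-the-envelope sizing sketch in plain Python (standard arithmetic only; real runtimes also need headroom for the KV cache and activations, so treat these figures as a floor):

```python
# Approximate memory footprint of quantized model weights:
# parameters * bits-per-weight / 8 bytes, expressed in GiB.
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for params in (7, 14, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit: "
              f"~{weight_gib(params, bits):6.1f} GiB")
```

A 7B model at 4-bit quantization needs roughly 3.3 GiB of weights, which is why it fits comfortably on consumer Macs, while a 70B model at 16-bit is out of reach for almost all of them.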

Industry view

Optimists believe the emergence of such leaderboards is a sign of open-source models maturing. When users can compare "how many tokens/s my M2 Max gets running Qwen2-7B" in a single table, the decision-making cost of local deployment drops sharply. This matters most for enterprise intranet deployments and scenarios where data cannot be sent to the cloud.
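
That table lookup is, in practice, a simple filter. A sketch assuming a hypothetical leaderboard.csv export with columns model, chip, quant, and tokens_per_s (the real Anubis-OSS schema may differ):

```python
# Filter a hypothetical leaderboard export for usable configurations
# on a given chip. Column names and the CSV file are assumptions.
import csv

MIN_TPS = 20.0  # a subjective "comfortable for chat" threshold

with open("leaderboard.csv", newline="") as f:
    rows = list(csv.DictReader(f))

candidates = [r for r in rows
              if r["chip"] == "M2 Max"
              and float(r["tokens_per_s"]) >= MIN_TPS]

for r in sorted(candidates, key=lambda r: -float(r["tokens_per_s"])):
    print(r["model"], r["quant"], r["tokens_per_s"])
```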

But we also see two risks. First, leaderboards inherently encourage optimizing for the benchmark; a model that scores well isn't necessarily the most useful in real-world work. Second, the mainstream models deployed locally today are still concentrated in the 7B-14B parameter range, and no leaderboard can close the gap with GPT-4-level capability. As one community member bluntly put it: "The benchmarking ecosystem is inflating, but most people's actual need is still API calls, not self-hosted setups."

Impact on regular people

For enterprise IT: Local deployment of open-source models now has quantifiable selection criteria, allowing data-sensitive industries (finance, healthcare) to evaluate "no-cloud" solutions with greater confidence.

For individual professionals: Apple chips being formally included in the benchmarks means the MacBook Pro in your hands is shifting from "office tool" to "AI workstation." Professionals who understand local deployment will have more tools to choose from.

For the consumer market: The livelier the leaderboard, the easier it is for local AI tools to break into the mainstream. But consumers should stay wary: benchmark scores are not user experience; don't let the numbers lead you astray.