What This Is

"Tool calling" — the ability for an AI model to go beyond answering questions and actually manipulate files, invoke programs, and execute code — has been one of the hottest concepts in AI over the past year. In theory, an AI assistant with tool- calling capability can create folders, write and run code , and organize data: a genuinely hands-on digital worker .

But a post on Reddit's r/LocalLLaMA this week stripped away that veneer. The poster, Mayion, wasn't using any obscure setup: Open WebUI (a mainstream local AI interface) paired with LM Studio (a mainstream local model runtime), testing against some of the most celebrated open-source models available today, Qwen3 27B/35B and Gemma4 26B. The results: the model confidently declared it had created a folder, but nothing was there; it announced a modern website was production-ready, but the file that opened was an empty .html shell; or it fell into an infinite loop, executing the same action repeatedly with no exit. The post collected 103 upvotes and 148 user replies, which suggests this is not an isolated operator error but a widely shared experience.

How the Industry Sees It

Defenders offer this explanation: tool calling places extreme demands on a model's reasoning capacity, and local models in the 27B–35B parameter range (parameter count being a rough proxy for a model's cognitive capacity) are simply not stable enough for this workload yet. Cloud-based GPT-4o and Claude 3.5 Sonnet perform considerably better on equivalent tasks, but that requires sending your data to servers outside your jurisdiction.

The counterarguments are equally sharp. The problem runs through the entire toolchain, not just the models themselves. The protocol handshake between Open WebUI and LM Studio, the way context is passed between components, and the error-handling mechanisms are all still in a "good enough to run" early state. Several commenters made the point explicitly: the community has a chronic tendency to oversell usability, because admitting "this doesn't work yet" discourages newcomers, so the collective response is silence or reflexive optimism. For enterprises seriously evaluating procurement decisions, this information asymmetry is the real risk: what you see in a demo video and what you experience after actual deployment can be two entirely different things.
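
To make that concrete, here is a minimal sketch of one tool-calling round trip, assuming the OpenAI-compatible endpoint that LM Studio serves by default at http://localhost:1234/v1; the tool schema, model identifier, and prompt are illustrative, not details from the post. The final branch is where the reported failures live: a model that narrates success in prose instead of returning a structured tool call never actually executes anything, and a client that doesn't check for that will happily relay the claim.

```python
# A minimal sketch of one tool-calling round trip against the OpenAI-compatible
# endpoint LM Studio serves by default. The model identifier, tool schema, and
# prompt are illustrative assumptions, not details taken from the Reddit post.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

tools = [{
    "type": "function",
    "function": {
        "name": "create_folder",
        "description": "Create a directory at the given path.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local-model",  # whatever identifier the locally loaded model reports
    messages=[{"role": "user", "content": "Create a folder named 'reports'."}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        # Arguments arrive as a JSON string; malformed JSON is itself a
        # common small-model failure and is worth catching explicitly.
        args = json.loads(call.function.arguments)
        print("model requested:", call.function.name, args)
else:
    # The failure mode the post describes: the model narrates success in
    # prose instead of emitting a structured tool call, so nothing executes.
    print("no tool call issued; model said:", msg.content)
```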

There is also a structural trade-off worth naming clearly: the core appeal of local deployment is that data never leaves your premises, but the price of that guarantee may be accepting real-world capability that is one to two generations behind cloud alternatives. There is no standard answer to that trade-off right now.

Impact on Regular People

For enterprise IT: If your team is evaluating a "private AI deployment + automated operation of internal systems" architecture, this belongs on your risk register. The tool-calling reliability of current open-source models is nowhere near the stability required for unsupervised operation — budget for significantly more human review checkpoints than your initial plan assumes.
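
One way to budget those checkpoints into the architecture itself is to gate every requested action behind explicit operator approval and cap the number of actions per session, which also contains the infinite-loop failure mode from the post. The sketch below assumes tool requests arrive as (name, args) pairs; the executors mapping is a hypothetical stand-in for real internal integrations.

```python
# A hedged sketch of a human-review checkpoint. Tool requests are assumed to
# arrive as (name, args) pairs; the executors dict is a hypothetical stand-in
# for whatever internal tooling a real deployment would expose.
import os
from typing import Any, Callable

MAX_ACTIONS = 5  # hard cap per session, containing runaway repetition

def approve_and_run(
    requested: list[tuple[str, dict[str, Any]]],
    executors: dict[str, Callable[..., Any]],
) -> list[Any]:
    results: list[Any] = []
    for count, (name, args) in enumerate(requested):
        if count >= MAX_ACTIONS:
            print("action cap reached; stopping instead of looping")
            break
        print(f"model requests: {name}({args})")
        if input("execute? [y/N] ").strip().lower() != "y":
            results.append(None)  # operator vetoed this step
            continue
        results.append(executors[name](**args))
    return results

# Example: only directory creation is exposed, and each call needs sign-off.
approve_and_run(
    [("create_folder", {"path": "reports"})],
    {"create_folder": lambda path: os.makedirs(path, exist_ok=True)},
)
```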

For individual professionals: If you are using local AI to handle sensitive documents, the pragmatic short-term posture is to treat "AI auto-execution" as a draft aid, not a final deliverable. Verifying that the model actually did what it claims to have done remains a necessary step.
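
In practice that verification can be as mundane as checking the filesystem against the model's claims. A small sketch, with the paths and the size threshold as illustrative assumptions:

```python
# A minimal sketch of verifying a model's claimed side effects on disk.
# The paths and the size threshold are illustrative assumptions.
from pathlib import Path

def verify_claims(folder: str, html_file: str) -> list[str]:
    problems = []
    if not Path(folder).is_dir():
        problems.append(f"claimed folder {folder!r} does not exist")
    page = Path(html_file)
    if not page.is_file():
        problems.append(f"claimed file {html_file!r} does not exist")
    elif page.stat().st_size < 200:
        # An "empty .html shell" passes an existence check but not this one.
        problems.append(f"{html_file!r} is only {page.stat().st_size} bytes")
    return problems

issues = verify_claims("reports", "reports/index.html")
print("claims verified" if not issues else "\n".join(issues))
```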

For the consumer market: The marketing tempo of this open-source model arms race is running materially ahead of real-world usability. When choosing a local AI tool, "strong community reception" deserves a significant discount as a signal; independent, hands-on evaluations are worth considerably more.