What Happened
A LocalLLaMA user running Qwen 32B, Gemma 9B, and Command R 32B on an M4 Mac Mini identified a consistent failure pattern in multi-step agentic workloads: all three models begin degrading tool call accuracy around call 8 or 9 in a chain. The failure mode is specific — the model returns the correct arguments from an earlier tool but attaches them to a later tool's name, and by call 10 it starts hallucinating tool names entirely.
The counterintuitive detail: the 16K context window is nowhere near full when this happens. A 15-step tool chain with moderate schemas and results typically uses 3,000–6,000 tokens, well within limits for all three models. The failure is not about running out of space — it's about how attention is distributed across accumulated tool call history.
Standard benchmarks don't expose this because they test 1–3 tool calls. Real agent workflows commonly require 10–20 sequential tool calls for tasks like multi-step code analysis, document retrieval pipelines, or automated triage systems. The gap between benchmark conditions and production conditions is wide.
Technical Deep Dive
The proposed explanation is attention dilution across tool schemas. Each tool definition (name, description, parameter schema) occupies tokens in the context. As the tool call history accumulates, earlier tool schemas compete with current ones for attention weight. On sub-70B models, the effective attention budget per tool drops below a threshold where the model can reliably match names to argument signatures.
This is structurally different from context overflow. PagedAttention in vLLM and the KV cache in llama.cpp both handle long contexts well — but neither addresses the semantic crowding problem where many similar-looking JSON schemas blur together under attention. The model has all the information; it just can't weight it correctly.
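To make the dilution concrete, here is a toy back-of-envelope calculation. The per-schema and per-call token counts below are invented round numbers, not measurements from the post:

```python
# Toy illustration of attention dilution, not a measurement.
# Assumed (hypothetical) numbers: 8 tools at ~120 tokens per schema,
# ~200 tokens per tool call + result pair, 400-token system prompt.
SYSTEM_TOKENS = 400
SCHEMA_TOKENS = 120
NUM_TOOLS = 8
PAIR_TOKENS = 200

for calls in (1, 5, 10, 15):
    context = SYSTEM_TOKENS + NUM_TOOLS * SCHEMA_TOKENS + calls * PAIR_TOKENS
    share = SCHEMA_TOKENS / context  # one schema's fraction of the context
    print(f"call {calls:2d}: {context:5d} tokens, one schema = {share:.1%}")
```

Under these assumptions a 15-call chain sits around 4,400 tokens, inside the 3,000–6,000 range above, yet each schema's share of the context has dropped by more than half since call 1. The numbers are made up; the shrinking ratio is the point.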
Mitigations That Work
- Context pruning: Keep the system prompt plus only the last 4 tool calls. Drop intermediate results. This sacrifices episodic memory of earlier steps but stabilizes the attention distribution over active schemas (see the sketch after this list).
- Phase-based schema rotation: Remove tools from the schema that are no longer relevant to the current phase. If a search phase is complete and an action phase begins, strip all search tool definitions from the context before continuing. Fewer tools in the menu means higher attention weight per remaining tool; the sketch after this list shows both mitigations together.
- Phonetically distinct tool names: `search_docs` and `search_code` share a prefix and similar token patterns. Renaming them `find_in_docs` and `grep_repo` uses distinct verb roots and reduces the chance of argument-name cross-contamination. Tokenizers split these differently, which likely matters for attention (a quick tokenizer check follows the sketch below).
- Model scale: 70B+ models show significantly more robustness. The attention capacity scales with model size, and the dilution threshold appears to be much higher at 70B. On 32B and below, manual pruning is necessary for reliable 10+ call chains.
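Below is a minimal sketch combining the first two mitigations, assuming OpenAI-style message dicts and tool schemas; the phase names and tool groupings in PHASE_TOOLS are hypothetical:

```python
# Sketch: context pruning + phase-based schema rotation.
# Assumes OpenAI-style messages and tool schemas of the form
# {"type": "function", "function": {"name": ..., ...}}.
PHASE_TOOLS = {  # hypothetical phase -> tool-name mapping
    "search": ["find_in_docs", "grep_repo"],
    "action": ["apply_patch", "run_tests"],
}

def prune_history(messages, keep_pairs=4):
    """Keep the system prompt plus only the last N call/result pairs."""
    system, rest = messages[0], messages[1:]
    return [system] + rest[-2 * keep_pairs:]  # 2 messages per pair

def tools_for_phase(all_tools, phase):
    """Expose only the schemas relevant to the current phase."""
    active = set(PHASE_TOOLS[phase])
    return [t for t in all_tools if t["function"]["name"] in active]

# In the agent loop, apply both before every model call:
#   messages = prune_history(messages)
#   response = client.chat(messages=messages,
#                          tools=tools_for_phase(ALL_TOOLS, current_phase))
```

This is also the shape of a LangGraph phase-transition hook: on entering a new phase, swap the tools argument rather than appending to it.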
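For the naming point, a quick way to see how candidate names tokenize, using tiktoken's cl100k_base as a stand-in (Qwen, Gemma, and Command R each ship their own tokenizers, so run this against the actual model's tokenizer):

```python
# Inspect token splits for candidate tool names. cl100k_base is a
# stand-in; substitute the target model's tokenizer for real audits.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for name in ("search_docs", "search_code", "find_in_docs", "grep_repo"):
    pieces = [enc.decode([t]) for t in enc.encode(name)]
    print(f"{name:13s} -> {pieces}")
```

Names that share a long prefix tend to share their leading tokens, which is exactly the overlap the renaming removes.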
No specific vLLM or llama.cpp flag directly addresses sustained tool call reliability; the practical handles available today are reducing `--ctx-size` to force earlier pruning and building pruning logic into the agent orchestration layer (LangGraph, a custom loop).
Comparison With Hosted Models
GPT-4o and Claude 3.5 Sonnet handle 15+ tool call chains reliably in practice, likely because they are larger and fine-tuned extensively on tool-use data at scale. Local 32B models have not had equivalent fine-tuning exposure for long-horizon tool chains.
Who Should Care
This directly affects developers building local agentic systems — autonomous coding assistants, document processing pipelines, automated research tools — anywhere an LLM calls tools sequentially over multiple steps. Teams using Ollama, llama.cpp, or vLLM to run Qwen, Gemma, or Command R models locally for cost or privacy reasons will hit this ceiling at production scale.
ML engineers evaluating local models against benchmarks should add long-horizon tool call chains (10–20 steps) to their evaluation suites. Standard function-calling benchmarks with 1–3 calls give a misleading picture of real-world agent reliability.
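A cheap way to add that coverage is a scripted harness: run a deterministic 10–20 step chain where the expected tool name and arguments at every step are known in advance, and record the step at which the model first diverges. A sketch, with call_model standing in for whatever inference client you use and the message format illustrative:

```python
# Scripted long-horizon eval. Each expected step is a known
# (tool_name, args) pair, so any divergence is a scoring event.
# `call_model` is a placeholder; it should return the model's next
# tool call as {"name": str, "arguments": dict}.
def run_chain_eval(call_model, expected_steps):
    messages = []
    first_failure = None
    for step, (name, args) in enumerate(expected_steps, start=1):
        call = call_model(messages)
        if first_failure is None and (call["name"], call["arguments"]) != (name, args):
            first_failure = step  # onset of degradation, e.g. call 8 or 9
        # Feed a canned result back so the chain keeps moving.
        messages.append({"role": "assistant", "content": str(call)})
        messages.append({"role": "tool", "content": f"ok: {name} completed"})
    return first_failure  # None means the full chain was clean
```

Sweeping chain length from 3 to 20 and plotting the first-failure step exposes the degradation curve that 1–3 call benchmarks never show.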
DevOps and platform teams running inference on Apple Silicon (M-series Mac Minis are common for local inference) are especially affected since 70B models are impractical on this hardware, making mitigation strategies at the orchestration layer critical.
What To Do This Week
If you're running a local agent with more than 6 sequential tool calls, test this immediately:
- Add logging to capture the full message history at each tool call step and check for argument/name mismatches starting around call 8 (a detector sketch follows this list).
- Implement a sliding window in your agent loop — keep system prompt + last 4 tool call pairs, evict older ones:
  ```python
  messages = [system_msg] + messages[-8:]  # last 4 pairs (call + result)
  ```
- Audit your tool names for shared prefixes or verb patterns. Rename any that differ only in a suffix.
- If using LangGraph or a similar framework, add phase transition hooks that swap the active tool schema rather than accumulating all tools throughout the run.
- Run llama.cpp with logging enabled (i.e., without `--log-disable`) and check the logged token counts per call to confirm actual context usage; this verifies the failure is attention dilution rather than simple overflow.
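For the logging bullet above, here is a minimal detector for the two failure signatures described in this post: arguments repeated verbatim under the wrong tool name, and tool names that were never registered. The log format (step/name/args dicts) is illustrative, and argument values are assumed flat and hashable:

```python
# history: list of {"step": int, "name": str, "args": dict} log entries.
def find_cross_contamination(history):
    """Flag args that match an earlier call but arrive under a new name."""
    seen = {}        # frozen args -> (step, name) of first emission
    suspects = []
    for entry in history:
        key = tuple(sorted(entry["args"].items()))  # assumes flat args
        if key in seen and seen[key][1] != entry["name"]:
            suspects.append((entry["step"], entry["name"], seen[key]))
        seen.setdefault(key, (entry["step"], entry["name"]))
    return suspects  # (step, name used, (earlier step, earlier name))

def find_hallucinated_names(history, known_tools):
    """Flag tool names that don't exist in the registered schema set."""
    return [(e["step"], e["name"]) for e in history
            if e["name"] not in known_tools]
```

If the suspects list starts filling up around step 8 and the hallucinated-names list around step 10, you are reproducing the pattern described here.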