What Happened

Users running Gemma 4 through llama.cpp for tool calling have been hitting persistent failures: HTTP 400/500 errors after a tool result is returned, or the model ignoring the result and issuing a second tool call instead of continuing. A Reddit user traced the failures to four specific bugs in common/chat.cpp in the llama.cpp codebase, using ChatGPT as a debugging assistant to read and patch the C++ source.

The core issue is that Gemma 4 uses a different tool-call message format than OpenAI-style APIs. llama.cpp was not converting OpenAI-style assistant/tool message history into Gemma's native tool_responses format at the correct stage of its pipeline, and a separate JSON parsing bug caused outright crashes on any non-JSON tool output.
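
To make the mismatch concrete, here is a rough sketch of the conversion the pipeline is supposed to perform, written against llama.cpp's nlohmann-style json type. The helper's name and the exact shape of the Gemma-side message are assumptions for illustration; only the tool_responses key comes from the bug report.

    #include <nlohmann/json.hpp>
    using json = nlohmann::ordered_json;

    // Hypothetical sketch: rewrite OpenAI-style "tool" role messages into
    // Gemma-style turns carrying a tool_responses field. The exact Gemma
    // field layout here is an assumption for illustration.
    json convert_to_gemma_tool_responses(json messages) {
        json out = json::array();
        for (auto & msg : messages) {
            if (msg.value("role", "") == "tool") {
                out.push_back({
                    {"role",           "user"},
                    {"tool_responses", json::array({msg.at("content")})},
                });
            } else {
                out.push_back(msg);
            }
        }
        return out;
    }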

The affected code paths are in common_chat_templates_apply_jinja(), common_chat_try_specialized_template(), and workaround::gemma4_model_turn_builder — the specialized Gemma 4 handling layer added to llama.cpp to work around Jinja template limitations.

Technical Deep Dive

Four distinct bugs were identified:

Bug 1: Conversion Happening Too Late

In common_chat_templates_apply_jinja(), the conversion from OpenAI-style tool role messages to Gemma-style tool_responses was running only after the prompt-diff and generation-prompt derivation logic had already executed, so the rendered prompt never contained the format Gemma expects. The fix moves the conversion earlier in the function, before any prompt-diffing logic runs.
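
A minimal sketch of the corrected ordering follows. The helper names are hypothetical stand-ins for llama.cpp internals the report does not spell out; only common_chat_templates_apply_jinja() and the conversion step itself come from the bug description.

    #include <nlohmann/json.hpp>
    #include <string>
    using json = nlohmann::ordered_json;

    // Hypothetical stand-ins for llama.cpp internals.
    json convert_to_gemma_tool_responses(json messages);
    std::string derive_generation_prompt(const json & messages);

    std::string apply_jinja_fixed(json messages, bool is_gemma4) {
        // FIX (Bug 1): convert before any prompt diffing, so the diff and
        // the generation prompt are derived from the converted history.
        if (is_gemma4) {
            messages = convert_to_gemma_tool_responses(messages);
        }
        // Previously the conversion sat after this step and had no effect.
        return derive_generation_prompt(messages);
    }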

Bug 2: Double Conversion

In common_chat_try_specialized_template(), the same Gemma tool-response conversion was running a second time. Since the messages had already been converted in the first pass, this corrupted the message structure. The fix removes the redundant conversion from this code path.
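
One way to express the fix is an idempotence guard rather than a bare deletion; the detection heuristic below is an assumption, sketched against the same json type as above.

    #include <nlohmann/json.hpp>
    using json = nlohmann::ordered_json;

    json convert_to_gemma_tool_responses(json messages);  // hypothetical, as above

    // Skip the conversion if a prior pass already produced tool_responses
    // entries; running it twice corrupts the message structure.
    json convert_once(json messages) {
        for (const auto & msg : messages) {
            if (msg.contains("tool_responses")) {
                return messages;  // already converted upstream
            }
        }
        return convert_to_gemma_tool_responses(messages);
    }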

Bug 3: Missing Empty Content Field

In workaround::gemma4_model_turn_builder::build(), synthesized assistant messages were being emitted with no content field at all. Gemma 4's chat template requires an explicit empty string for content ("content": "") rather than an omitted field; without it, the template renderer produced malformed turn boundaries.
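
A sketch of the corrected builder output, assuming the OpenAI-style assistant message shape; the surrounding builder code is elided.

    #include <nlohmann/json.hpp>
    using json = nlohmann::ordered_json;

    // Synthesized assistant turns must carry an explicit empty string,
    // not a missing key, or the template renders malformed turn boundaries.
    json build_assistant_turn(const json & tool_calls) {
        return json{
            {"role",       "assistant"},
            {"content",    ""},  // FIX (Bug 3): "" instead of omitting the field
            {"tool_calls", tool_calls},
        };
    }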

Bug 4: Forced JSON Parsing of Tool Output (Critical Crash)

The most severe bug was in workaround::gemma4_model_turn_builder::collect_result(). The code called json::parse() on every tool result string unconditionally. That works when a tool returns JSON, but crashes immediately on any plain-text output, such as this directory listing:

[DIR] Components
[FILE] README.md
[FILE] package.json

The fix is straightforward: check whether the result is already a JSON value; if not, keep it as a plain string. Unlike OpenAI's function calling spec, which normalizes all tool outputs to strings, Gemma 4's pipeline internally handles mixed types — but only if the code doesn't forcibly parse them first.
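
A minimal sketch of that fallback, assuming llama.cpp's nlohmann json type: json::accept() validates a string without throwing, so plain-text tool output survives as a string.

    #include <nlohmann/json.hpp>
    #include <string>
    using json = nlohmann::ordered_json;

    // FIX (Bug 4): only parse results that are actually JSON; anything
    // else (directory listings, logs, shell output) stays a plain string.
    json collect_result_fixed(const std::string & raw) {
        if (json::accept(raw)) {
            return json::parse(raw);  // genuine JSON: keep its structure
        }
        return json(raw);             // plain text: wrap as a JSON string
    }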

For comparison, frameworks like Ollama and LM Studio implement their own Gemma 4 chat template wrappers and do not share this llama.cpp-specific pipeline, so users of those frontends may not encounter the same crashes.

Who Should Care

This affects any developer or researcher running Gemma 4 locally through llama.cpp directly or through frontends that use llama.cpp as a backend (including some configurations of Open WebUI and llama-cpp-python). The bugs are specifically triggered by the tool-use flow: if you are only doing standard chat completions without tool calls, you will not hit these issues.

Backend engineers building agentic pipelines, coding assistants, or file-system tools on top of Gemma 4 are the primary audience. Any tool that returns non-JSON output — shell commands, directory listings, plain text logs — will hit the JSON parse crash. Teams evaluating Gemma 4 against GPT-4o or Claude for local tool-use tasks should be aware that benchmark results from llama.cpp may be artificially degraded by these bugs rather than reflecting model capability.

What To Do This Week

Until an official patch lands in llama.cpp, apply the fixes manually:

  • Clone llama.cpp: git clone https://github.com/ggerganov/llama.cpp
  • Open common/chat.cpp and locate common_chat_templates_apply_jinja() — move the Gemma tool_responses conversion block above the prompt diff logic.
  • In common_chat_try_specialized_template(), remove or guard the duplicate Gemma conversion so it only runs once.
  • In gemma4_model_turn_builder::build(), ensure synthesized assistant messages include content: "".
  • In gemma4_model_turn_builder::collect_result(), wrap the JSON parse in a try/catch or type check and fall back to keeping the result as a raw string.
  • Rebuild: cmake -B build && cmake --build build --config Release -j$(nproc), then verify the tool-use flow end to end (see the sketch after this list).
  • Track the upstream issue at the llama.cpp GitHub repo and watch for a PR addressing Gemma 4 tool call handling.
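
To confirm the rebuilt binary end to end, one rough check is to replay a tool-use turn with a plain-text tool result against llama-server's OpenAI-compatible endpoint. The model path, port, and tool schema below are placeholders; before the patch, a request like this fails with a 400/500, and afterwards the model should continue from the result.

    ./build/bin/llama-server -m models/gemma-4.gguf --jinja --port 8080

    curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
      "messages": [
        {"role": "user", "content": "List the files in the repo."},
        {"role": "assistant", "content": "",
         "tool_calls": [{"id": "call_1", "type": "function",
                         "function": {"name": "list_files", "arguments": "{}"}}]},
        {"role": "tool", "tool_call_id": "call_1", "content": "[FILE] README.md"}
      ],
      "tools": [{"type": "function", "function": {"name": "list_files",
        "description": "List repository files",
        "parameters": {"type": "object", "properties": {}}}}]
    }'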