What Happened

A LocalLLaMA community member running Gemma 4 27B locally has identified a practical gap in controlling the model's thinking token behavior. Unlike Qwen3-30B-A3B, which responds reliably to system-level instructions to enable or disable chain-of-thought reasoning, Gemma 4's 27B variant behaves inconsistently when given equivalent system prompt directives.

The user found one partial workaround: injecting <thought off> directly into the user message before the actual prompt content. This suppresses reasoning tokens in practice, but it breaks clean API integration — any client or application calling the model would need to prepend this tag to every request, making it impractical for production use or multi-turn chat APIs.

The core issue is that Gemma 4's thinking behavior appears to be driven partly by the chat template and tokenizer-level special tokens rather than purely by natural language system prompt instructions. This makes it harder to control at the application layer compared to models where thinking is toggled via explicit prompt instructions alone.

Technical Deep Dive

Gemma 4's thinking mode, like similar implementations in Qwen3 and DeepSeek-R1, uses special delimiter tokens to wrap internal reasoning. In Gemma 4, the relevant token is <thought>, which signals the start of a reasoning block. The model was trained to conditionally generate these blocks based on context, but the triggering mechanism is not purely instruction-following.

Qwen3-30B-A3B (and Qwen3 generally) explicitly supports a /no_think soft switch in system or user prompts, plus an enable_thinking flag handled at the chat template level: the template conditionally includes or pre-closes the thinking delimiters based on that flag. For example, a rendered system turn with the soft switch looks like:

<|im_start|>system
/no_think {system_message}<|im_end|>
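
The template-level flag is driven from the caller rather than from prompt text. As a minimal sketch in Transformers (message content is illustrative), extra keyword arguments to apply_chat_template() are forwarded to Qwen3's Jinja2 template, which emits or pre-closes the thinking delimiters accordingly:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")

messages = [{"role": "user", "content": "Summarize this ticket in one line."}]

# enable_thinking is read by Qwen3's chat template; with False, the template
# pre-closes the thinking block so the model answers directly.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)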

Gemma 4 does not appear to have an equivalent first-class mechanism documented in its chat template. The model's thinking behavior is instead emergent from training, meaning the model learned when to think based on input complexity signals rather than an explicit toggle.

Potential approaches the community is exploring include:

  • Modifying the Jinja2 chat template in the tokenizer config to inject <thought off> conditionally based on a custom generation parameter
  • Using a system prompt such as "Do not use internal reasoning. Respond directly without thinking steps.", which works inconsistently
  • Setting max_new_tokens constraints or sampling parameters to discourage long reasoning preambles, though this affects output quality
  • Patching the model's tokenizer_config.json chat template to prepend <thought off> to user turns when a custom flag is passed

The chat template approach is the most robust. In llama.cpp and Ollama, a custom Modelfile or GGUF metadata override can replace the default template. In Hugging Face Transformers, you can pass a modified chat_template string directly to tokenizer.apply_chat_template(); vLLM accepts a custom template through its --chat-template server argument.
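
As a rough sketch of the Transformers route, the snippet below swaps in a stripped-down Gemma-style template that prepends the tag to every user turn. The template is deliberately minimal (no system-message or tool handling), and whether the model actually honors <thought off> in this position is exactly the open question, so treat it as a starting point rather than a drop-in fix:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-27b-it")

# Minimal Gemma-style turn structure: every user turn gets the suppression
# tag prepended, and assistant turns use the "model" role marker.
nothink_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "<start_of_turn>user\n<thought off> {{ message['content'] }}<end_of_turn>\n"
    "{% elif message['role'] == 'assistant' %}"
    "<start_of_turn>model\n{{ message['content'] }}<end_of_turn>\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
)

messages = [{"role": "user", "content": "Extract the order ID from: 'order #4821 is delayed'"}]
prompt = tok.apply_chat_template(
    messages,
    chat_template=nothink_template,  # per-call override of the stock template
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)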

Comparison with Alternatives

Qwen3-30B-A3B handles this cleanly at the template level. DeepSeek-R1 uses <think> tags and can be partially controlled by prefilling the assistant turn with an empty <think></think> block. Gemma 4 currently lacks documented parity with either approach for system-level toggling.
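
For reference, the R1-style prefill looks roughly like this in Transformers: an empty think block is supplied as the start of the assistant turn, and continue_final_message keeps that turn open so generation resumes with the answer (the checkpoint name is one of several distills, and how reliably this suppresses reasoning varies):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

messages = [
    {"role": "user", "content": "What is 17 * 24?"},
    # Prefill an empty reasoning block so generation continues with the answer.
    {"role": "assistant", "content": "<think>\n\n</think>\n\n"},
]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    continue_final_message=True,  # leave the assistant turn open instead of closing it
)
print(prompt)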

Who Should Care

This matters primarily to developers building applications on top of locally-hosted Gemma 4 models where reasoning token generation has real cost implications — either in latency or token budget. If you're running Gemma 4 27B via Ollama, llama.cpp, or vLLM for tasks like classification, short-form Q&A, or structured data extraction, unnecessary thinking tokens add 200-800 tokens of overhead per request with no benefit.

Teams using OpenAI-compatible API wrappers (LiteLLM, LocalAI, Ollama's REST API) are most affected since they can't easily inject per-message template modifications without middleware. ML engineers fine-tuning Gemma 4 for specific tasks may want to address this during the fine-tuning phase by controlling whether thinking tokens appear in training data.
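
For the API-wrapper case, a thin shim in front of the client is often enough. A sketch assuming Ollama's OpenAI-compatible endpoint, with a hypothetical local model tag and helper name:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint; the api_key value is ignored locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def chat_without_thinking(messages, model="gemma4-27b"):  # model tag is hypothetical
    # Prepend the suppression tag to every user turn before forwarding.
    patched = [
        {**m, "content": "<thought off> " + m["content"]} if m["role"] == "user" else m
        for m in messages
    ]
    return client.chat.completions.create(model=model, messages=patched)

resp = chat_without_thinking(
    [{"role": "user", "content": "Is this review positive or negative? 'Battery died in a day.'"}]
)
print(resp.choices[0].message.content)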

What To Do This Week

If you need to suppress thinking tokens in Gemma 4 27B today, try modifying the chat template directly. First, extract the current template:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-27b-it")
print(tok.chat_template)

Then locate where user content is inserted and prepend <thought off> conditionally. For Ollama users, edit your Modelfile:

TEMPLATE """{{ if .System }}<start_of_turn>user <thought off> {{ .System }}<end_of_turn> {{ end }}..."""

Rebuild the local model from the edited Modelfile (ollama create), then test with the /api/chat endpoint and verify that reasoning tokens are absent from the response; a quick check is sketched below. Track the upstream issue at the Gemma community forums or the google/gemma-4-27b-it Hugging Face repo for an official toggle mechanism.
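
A quick check against Ollama's native /api/chat endpoint; the gemma4-nothink tag is hypothetical and assumes you rebuilt the model from the edited Modelfile with ollama create:

import requests

# Non-streaming request to Ollama's native chat endpoint.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4-nothink",  # hypothetical tag built from the edited Modelfile
        "messages": [{"role": "user", "content": "Classify: 'package arrived broken'"}],
        "stream": False,
    },
    timeout=120,
)
reply = resp.json()["message"]["content"]
assert "<thought>" not in reply, "reasoning block still present"
print(reply)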