What Happened

A LocalLLaMA community member running Gemma 4 27B locally has identified a practical gap in controlling the model's thinking token behavior. Unlike Qwen3-30B-A3B, which responds reliably to system-level instructions to enable or disable chain-of-thought reasoning, Gemma 4's 27B variant behaves inconsistently when given equivalent system prompt directives.

The user found one partial workaround: injecting <thought off> directly into the user message before the actual prompt content. This suppresses reasoning tokens in practice, but it breaks clean API integration — any client or application calling the model would need to prepend this tag to every request, making it impractical for production use or multi-turn chat APIs.

The core issue is that Gemma 4's thinking behavior appears to be driven partly by the chat template and tokenizer-level special tokens rather than purely by natural language system prompt instructions. This makes it harder to control at the application layer compared to models where thinking is toggled via explicit prompt instructions alone.

Technical Deep Dive

Gemma 4's thinking mode, like similar implementations in Qwen3 and DeepSeek-R1, uses special delimiter tokens to wrap internal reasoning. In Gemma 4, the relevant token is <thought>, which signals the start of a reasoning block. The model was trained to conditionally generate these blocks based on context, but the triggering mechanism is not purely instruction-following.

Qwen3-30B-A3B (and Qwen3 generally) explicitly supports a /no_think soft switch in system or user prompts, plus an enable_thinking flag handled at the chat template level: the template conditionally includes or pre-closes the thinking delimiters based on that flag. For example, a rendered system turn with the soft switch looks like:

<|im_start|>system
/no_think {system_message}<|im_end|>
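
The template-level flag is driven from the caller rather than from prompt text. As a minimal sketch in Transformers (message content is illustrative), extra keyword arguments to apply_chat_template() are forwarded to Qwen3's Jinja2 template, which emits or pre-closes the thinking delimiters accordingly:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")

messages = [{"role": "user", "content": "Summarize this ticket in one line."}]

# enable_thinking is read by Qwen3's chat template; with False, the template
# pre-closes the thinking block so the model answers directly.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)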

Gemma 4 does not appear to have an equivalent first-class mechanism documented in its chat template. The model's thinking behavior is instead emergent from training, meaning the model learned when to think based on input complexity signals rather than an explicit toggle.

Potential approaches the community is exploring include:

  • Modifying the Jinja2 chat template in the tokenizer config to inject <thought off> conditionally based on a custom generation parameter
  • Using a system prompt such as "Do not use internal reasoning. Respond directly without thinking steps.", which works inconsistently
  • Setting max_new_tokens constraints or sampling parameters to discourage long reasoning preambles, though this affects output quality
  • Patching the model's tokenizer_config.json chat template to prepend <thought off> to user turns when a custom flag is passed

The chat template approach is the most robust. In llama.cpp and Ollama, a custom Modelfile or GGUF metadata override can replace the default template. In Hugging Face Transformers, you can pass a modified chat_template string directly to tokenizer.apply_chat_template(); vLLM accepts a custom template through its --chat-template server argument.
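
As a rough sketch of the Transformers route, the snippet below swaps in a stripped-down Gemma-style template that prepends the tag to every user turn. The template is deliberately minimal (no system-message or tool handling), and whether the model actually honors <thought off> in this position is exactly the open question, so treat it as a starting point rather than a drop-in fix:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-27b-it")

# Minimal Gemma-style turn structure: every user turn gets the suppression
# tag prepended, and assistant turns use the "model" role marker.
nothink_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "<start_of_turn>user\n<thought off> {{ message['content'] }}<end_of_turn>\n"
    "{% elif message['role'] == 'assistant' %}"
    "<start_of_turn>model\n{{ message['content'] }}<end_of_turn>\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
)

messages = [{"role": "user", "content": "Extract the order ID from: 'order #4821 is delayed'"}]
prompt = tok.apply_chat_template(
    messages,
    chat_template=nothink_template,  # per-call override of the stock template
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)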

Comparison with Alternatives

Qwen3-30B-A3B handles this cleanly at the template level. DeepSeek-R1 uses <think> tags and can be partially controlled by prefilling the assistant turn with an empty <think></think> block. Gemma 4 currently lacks documented parity with either approach for system-level toggling.
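
For reference, the R1-style prefill looks roughly like this in Transformers: an empty think block is supplied as the start of the assistant turn, and continue_final_message keeps that turn open so generation resumes with the answer (the checkpoint name is one of several distills, and how reliably this suppresses reasoning varies):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

messages = [
    {"role": "user", "content": "What is 17 * 24?"},
    # Prefill an empty reasoning block so generation continues with the answer.
    {"role": "assistant", "content": "<think>\n\n</think>\n\n"},
]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    continue_final_message=True,  # leave the assistant turn open instead of closing it
)
print(prompt)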

Who Should Care

This matters primarily to developers building applications on top of locally-hosted Gemma 4 models where reasoning token generation has real cost implications — either in latency or token budget. If you're running Gemma 4 27B via Ollama, llama.cpp, or vLLM for tasks like classification, short-form Q&A, or structured data extraction, unnecessary thinking tokens add 200-800 tokens of overhead per request with no benefit.

Teams using OpenAI-compatible API wrappers (LiteLLM, LocalAI, Ollama's REST API) are most affected since they can't easily inject per-message template modifications without middleware. ML engineers fine-tuning Gemma 4 for specific tasks may want to address this during the fine-tuning phase by controlling whether thinking tokens appear in training data.
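
For the API-wrapper case, a thin shim in front of the client is often enough. A sketch assuming Ollama's OpenAI-compatible endpoint, with a hypothetical local model tag and helper name:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint; the api_key value is ignored locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def chat_without_thinking(messages, model="gemma4-27b"):  # model tag is hypothetical
    # Prepend the suppression tag to every user turn before forwarding.
    patched = [
        {**m, "content": "<thought off> " + m["content"]} if m["role"] == "user" else m
        for m in messages
    ]
    return client.chat.completions.create(model=model, messages=patched)

resp = chat_without_thinking(
    [{"role": "user", "content": "Is this review positive or negative? 'Battery died in a day.'"}]
)
print(resp.choices[0].message.content)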

What To Do This Week

If you need to suppress thinking tokens in Gemma 4 27B today, try modifying the chat template directly. First, extract the current template:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-27b-it")
print(tok.chat_template)

Then locate where user content is inserted and prepend <thought off> conditionally. For Ollama users, edit your Modelfile:

TEMPLATE """{{ if .System }}<start_of_turn>user <thought off> {{ .System }}<end_of_turn> {{ end }}..."""

Rebuild the local model from the edited Modelfile (ollama create), then test with the /api/chat endpoint and verify that reasoning tokens are absent from the response; a quick check is sketched below. Track the upstream issue at the Gemma community forums or the google/gemma-4-27b-it Hugging Face repo for an official toggle mechanism.
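
A quick check against Ollama's native /api/chat endpoint; the gemma4-nothink tag is hypothetical and assumes you rebuilt the model from the edited Modelfile with ollama create:

import requests

# Non-streaming request to Ollama's native chat endpoint.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4-nothink",  # hypothetical tag built from the edited Modelfile
        "messages": [{"role": "user", "content": "Classify: 'package arrived broken'"}],
        "stream": False,
    },
    timeout=120,
)
reply = resp.json()["message"]["content"]
assert "<thought>" not in reply, "reasoning block still present"
print(reply)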