What Happened
A developer building an Android app with Google's LiteRT API discovered that Gemma 4 contains multi-token prediction (MTP) weights that ship in the released model files but are not enabled for use. The issue surfaced when loading the model on a Google Pixel 9 test device, which threw errors citing "mtp weights being an incompatible tensor shape." That error pointed directly to additional prediction heads embedded in the LiteRT files, heads designed for speculative decoding to accelerate token generation.
After posting the finding to the Gemma 4 model discussion on Hugging Face (google/gemma-4-E4B-it), the developer received confirmation from a Google employee that Gemma 4 does indeed include MTP architecture internally. The reason given for disabling it publicly was "ensuring compatibility and broad usability." No timeline was provided for when or whether MTP support would be officially enabled.
The affected model is Gemma 4 E4B, Google's on-device variant, a mixture-of-experts model with 4 billion effective parameters. The discovery also revives questions about the earlier omission of the larger Gemma 124B model, which was briefly referenced in a tweet by Google researcher Jeff Dean before being pulled.
Technical Deep Dive
Multi-token prediction is an inference-time technique where a model uses auxiliary "draft" heads to predict multiple future tokens simultaneously, rather than generating one token per forward pass. This is closely related to speculative decoding, where a smaller draft model or internal draft head proposes several tokens that the main model then verifies in a single pass — significantly increasing throughput without changing output quality.
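To make the draft-and-verify loop concrete, here is a minimal, self-contained Python sketch. The two functions standing in for the main model and the draft heads are arbitrary toy stand-ins, not anything derived from Gemma; the point is only the accept/reject logic that lets several tokens land per verification step.

```python
# Toy illustration of draft-and-verify decoding. The "models" here are
# arbitrary deterministic functions, not Gemma 4's actual MTP heads.

def main_model_next(prefix):
    """Stand-in for the full model: the 'correct' next token for a prefix."""
    return (sum(prefix) * 31 + 7) % 100


def draft_propose(prefix, k=3):
    """Stand-in for cheap draft/MTP heads: guess k future tokens at once.
    The guesses agree with the main model most of the time."""
    proposed, ctx = [], list(prefix)
    for i in range(k):
        guess = main_model_next(ctx)
        if (len(ctx) + i) % 5 == 0:   # inject occasional disagreement
            guess = (guess + 1) % 100
        proposed.append(guess)
        ctx.append(guess)
    return proposed


def speculative_step(prefix, k=3):
    """Verify the k drafted tokens and keep the longest agreeing prefix.
    In a real system all k verifications happen in one batched forward pass
    of the main model, which is where the speedup comes from."""
    draft = draft_propose(prefix, k)
    accepted, ctx = [], list(prefix)
    for guess in draft:
        target = main_model_next(ctx)
        if guess != target:
            accepted.append(target)   # first mismatch: keep the correction, stop
            break
        accepted.append(guess)
        ctx.append(guess)
    else:
        accepted.append(main_model_next(ctx))  # all accepted: emit one extra token
    return accepted


if __name__ == "__main__":
    tokens = [1, 2, 3]
    for _ in range(5):
        tokens += speculative_step(tokens)
    print(tokens)
```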
In Gemma 4's case, the MTP heads appear to be baked into the LiteRT flatbuffer files as additional tensor weights. LiteRT (formerly TensorFlow Lite) uses a static compute graph format, which means the graph structure and weight shapes are fixed at export time. The incompatible tensor shape error suggests the runtime was not configured to route activations through those heads, causing a shape mismatch rather than a clean skip.
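If you have one of the LiteRT files locally, a quick first check is to enumerate the graph's tensors and their shapes with the standard TFLite interpreter. A minimal sketch follows; the file path and the "mtp" name filter are assumptions, since Gemma 4's tensor naming has not been published.

```python
# Minimal sketch: list tensor names and shapes in a LiteRT/TFLite flatbuffer.
# The model path and the "mtp" filter are placeholders, not confirmed names.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="gemma-4-E4B-it.tflite")  # placeholder path

for t in interpreter.get_tensor_details():
    name = t["name"]
    if "mtp" in name.lower():
        print(f"candidate MTP tensor {t['index']}: {name} shape={list(t['shape'])}")
```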
Gemma 4's MTP architecture details are not yet public, so it is unclear whether it resembles approaches such as Medusa (which adds multiple independent draft heads trained separately) or DeepSeek's MTP implementation (which uses shared transformer layers feeding into sequential prediction heads). The LiteRT compute graph could theoretically be reverse-engineered: flatc can decode the flatbuffer into JSON using the TFLite schema, and Netron can visualize the graph to help identify the orphaned MTP subgraph.
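For intuition, a Medusa-style arrangement can be sketched in a few lines: several independent linear heads all read the same final hidden state, each predicting a different future position. Everything below (shapes, head count, names) is illustrative only and is not a claim about Gemma 4's internals.

```python
# Generic illustration of Medusa-style draft heads; dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab, k = 64, 1000, 3

lm_head = rng.normal(size=(hidden_dim, vocab))                          # normal next-token head
draft_heads = [rng.normal(size=(hidden_dim, vocab)) for _ in range(k)]  # extra draft heads

h_t = rng.normal(size=(hidden_dim,))   # final hidden state at the current position

next_token = int(np.argmax(h_t @ lm_head))                     # token t+1
draft_tokens = [int(np.argmax(h_t @ w)) for w in draft_heads]  # guesses for t+2 .. t+k+1

print("verified next token:", next_token)
print("draft guesses for later positions:", draft_tokens)
```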
For comparison, Meta's Llama 3 series does not include native MTP heads, relying instead on external speculative decoding with a separate draft model. Google's own Gemini Nano uses a similar on-device optimization path via LiteRT, making Gemma 4's hidden MTP heads a natural fit for the same pipeline — had they been enabled.
A rough extraction approach would involve:
- Decompiling the `.tflite` flatbuffer with `flatc --json` using the TFLite schema
- Identifying tensor indices associated with MTP head weight names
- Reconstructing the subgraph connections and output shapes
- Patching the model to route the final hidden states through the draft heads
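A rough sketch of the first two steps, with placeholder file names, an assumed schema path, and a guessed name pattern, might look like this (it requires the flatc binary on your PATH):

```python
# Hypothetical sketch: decompile a .tflite flatbuffer to JSON with flatc,
# then scan tensor names for MTP-related strings. Note that the decoded JSON
# embeds the raw weight buffers, so it can be enormous for a model this size.
import json
import re
import subprocess

MODEL = "gemma-4-E4B-it.tflite"   # placeholder filename
SCHEMA = "schema.fbs"             # tensorflow/lite/schema/schema.fbs from the TF repo

# flatc writes gemma-4-E4B-it.json into the "decoded" directory.
subprocess.run(
    ["flatc", "--json", "--strict-json", "-o", "decoded", SCHEMA, "--", MODEL],
    check=True,
)

with open("decoded/gemma-4-E4B-it.json") as f:
    model = json.load(f)

pattern = re.compile(r"mtp|draft|speculat", re.IGNORECASE)
for sg_i, subgraph in enumerate(model.get("subgraphs", [])):
    for t_i, tensor in enumerate(subgraph.get("tensors", [])):
        name = tensor.get("name", "")
        if pattern.search(name):
            print(f"subgraph {sg_i}, tensor {t_i}: {name} shape={tensor.get('shape')}")
```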
Whether the disabled heads are fully trained or partially initialized remains unknown without access to the training checkpoints.
Who Should Care
On-device AI developers targeting Android hardware, particularly Pixel 8 and Pixel 9 devices with their dedicated NPU capabilities, have the most direct stake here. Enabling MTP on an already fast MoE architecture like Gemma 4 E4B could meaningfully reduce latency for real-time applications such as keyboard suggestions, voice assistants, and document summarization running entirely on-device.
Researchers working on speculative decoding and efficient inference should also pay attention. If the MTP heads are recoverable from the LiteRT files, it would provide a rare look at how Google structures draft heads within a production MoE model — architectural details that are not covered in any public Gemma 4 technical report.
ML engineers deploying Gemma 4 via the Hugging Face Transformers or llama.cpp backends on server hardware are less immediately affected, since those paths use the SafeTensors weights rather than LiteRT flatbuffers, and the MTP heads do not appear to be present there.
What To Do This Week
If you want to investigate the hidden MTP weights yourself, start with the LiteRT model files from the official Gemma 4 E4B release:
- Download the model: `huggingface-cli download google/gemma-4-E4B-it`
- Install Netron (netron.app) and open the `.tflite` file to visualize the compute graph and spot disconnected MTP subgraphs
- Alternatively, install flatbuffers (`pip install flatbuffers`) and use the TFLite schema from the TensorFlow repo to decompile the binary
- Search tensor names for strings containing "mtp", "draft", or "speculative" to locate the relevant weights
- Follow the Hugging Face discussion at huggingface.co/google/gemma-4-E4B-it/discussions/5 for community updates
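For the download step, a programmatic equivalent using the huggingface_hub client looks roughly like this; the `*.tflite` file pattern is an assumption about how the LiteRT artifacts are named in the repo.

```python
# Sketch of fetching only the LiteRT files from the release. Gemma repos on
# Hugging Face are gated, so you may need `huggingface-cli login` or an
# HF token set in your environment before this succeeds.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="google/gemma-4-E4B-it",
    allow_patterns=["*.tflite"],   # assumed naming; skips the SafeTensors shards
)
print("LiteRT files downloaded to:", local_dir)
```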
Avoid modifying production app builds until the subgraph structure is fully understood — routing activations through untested heads without knowing their expected input normalization could produce garbage outputs.