What Happened
A developer building an Android app with Google's LiteRT API discovered that Gemma 4 contains multi-token prediction (MTP) weights that ship in the released model files but are not enabled for use. The issue surfaced when loading the model on a Google Pixel 9 test device, which threw errors citing "mtp weights being an incompatible tensor shape." That error pointed directly to additional prediction heads embedded in the LiteRT files, heads designed for speculative decoding to accelerate token generation.
After posting the finding to the Gemma 4 model discussion on Hugging Face (google/gemma-4-E4B-it), the developer received confirmation from a Google employee that Gemma 4 does indeed include MTP architecture internally. The reason given for disabling it publicly was "ensuring compatibility and broad usability." No timeline was provided for when or whether MTP support would be officially enabled.
The affected model is Gemma 4 E4B, Google's on-device variant, a mixture-of-experts model with 4 billion effective parameters. The discovery also revives questions about the earlier omission of the larger Gemma 124B model, which was briefly referenced in a tweet by Google researcher Jeff Dean before being pulled.
Technical Deep Dive
Multi-token prediction is an inference-time technique where a model uses auxiliary "draft" heads to predict multiple future tokens simultaneously, rather than generating one token per forward pass. This is closely related to speculative decoding, where a smaller draft model or internal draft head proposes several tokens that the main model then verifies in a single pass — significantly increasing throughput without changing output quality.
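To make the draft-and-verify loop concrete, here is a minimal, self-contained Python sketch. The two functions standing in for the main model and the draft heads are arbitrary toy stand-ins, not anything derived from Gemma; the point is only the accept/reject logic that lets several tokens land per verification step.

```python
# Toy illustration of draft-and-verify decoding. The "models" here are
# arbitrary deterministic functions, not Gemma 4's actual MTP heads.

def main_model_next(prefix):
    """Stand-in for the full model: the 'correct' next token for a prefix."""
    return (sum(prefix) * 31 + 7) % 100


def draft_propose(prefix, k=3):
    """Stand-in for cheap draft/MTP heads: guess k future tokens at once.
    The guesses agree with the main model most of the time."""
    proposed, ctx = [], list(prefix)
    for i in range(k):
        guess = main_model_next(ctx)
        if (len(ctx) + i) % 5 == 0:   # inject occasional disagreement
            guess = (guess + 1) % 100
        proposed.append(guess)
        ctx.append(guess)
    return proposed


def speculative_step(prefix, k=3):
    """Verify the k drafted tokens and keep the longest agreeing prefix.
    In a real system all k verifications happen in one batched forward pass
    of the main model, which is where the speedup comes from."""
    draft = draft_propose(prefix, k)
    accepted, ctx = [], list(prefix)
    for guess in draft:
        target = main_model_next(ctx)
        if guess != target:
            accepted.append(target)   # first mismatch: keep the correction, stop
            break
        accepted.append(guess)
        ctx.append(guess)
    else:
        accepted.append(main_model_next(ctx))  # all accepted: emit one extra token
    return accepted


if __name__ == "__main__":
    tokens = [1, 2, 3]
    for _ in range(5):
        tokens += speculative_step(tokens)
    print(tokens)
```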
In Gemma 4's case, the MTP heads appear to be baked into the LiteRT flatbuffer files as additional tensor weights. LiteRT (formerly TensorFlow Lite) uses a static compute graph format, which means the graph structure and weight shapes are fixed at export time. The incompatible tensor shape error suggests the runtime was not configured to route activations through those heads, causing a shape mismatch rather than a clean skip.
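If you have one of the LiteRT files locally, a quick first check is to enumerate the graph's tensors and their shapes with the standard TFLite interpreter. A minimal sketch follows; the file path and the "mtp" name filter are assumptions, since Gemma 4's tensor naming has not been published.

```python
# Minimal sketch: list tensor names and shapes in a LiteRT/TFLite flatbuffer.
# The model path and the "mtp" filter are placeholders, not confirmed names.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="gemma-4-E4B-it.tflite")  # placeholder path

for t in interpreter.get_tensor_details():
    name = t["name"]
    if "mtp" in name.lower():
        print(f"candidate MTP tensor {t['index']}: {name} shape={list(t['shape'])}")
```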
Gemma 4's MTP architecture details are not yet public, so it is unclear whether it resembles approaches such as Medusa (which adds multiple independent draft heads trained separately) or DeepSeek's MTP implementation (which uses shared transformer layers feeding into sequential prediction heads). The LiteRT compute graph could theoretically be reverse-engineered: flatc can decode the flatbuffer into JSON using the TFLite schema, and Netron can visualize the graph to help identify the orphaned MTP subgraph.
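For intuition, a Medusa-style arrangement can be sketched in a few lines: several independent linear heads all read the same final hidden state, each predicting a different future position. Everything below (shapes, head count, names) is illustrative only and is not a claim about Gemma 4's internals.

```python
# Generic illustration of Medusa-style draft heads; dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab, k = 64, 1000, 3

lm_head = rng.normal(size=(hidden_dim, vocab))                          # normal next-token head
draft_heads = [rng.normal(size=(hidden_dim, vocab)) for _ in range(k)]  # extra draft heads

h_t = rng.normal(size=(hidden_dim,))   # final hidden state at the current position

next_token = int(np.argmax(h_t @ lm_head))                     # token t+1
draft_tokens = [int(np.argmax(h_t @ w)) for w in draft_heads]  # guesses for t+2 .. t+k+1

print("verified next token:", next_token)
print("draft guesses for later positions:", draft_tokens)
```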
For comparison, Meta's Llama 3 series does not include native MTP heads, relying instead on external speculative decoding with a separate draft model. Google's own Gemini Nano uses a similar on-device optimization path via LiteRT, making Gemma 4's hidden MTP heads a natural fit for the same pipeline — had they been enabled.
A rough extraction approach would involve:
- Decompiling the `.tflite` flatbuffer with `flatc --json` using the TFLite schema
- Identifying tensor indices associated with MTP head weight names
- Reconstructing the subgraph connections and output shapes
- Patching the model to route the final hidden states through the draft heads
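A rough sketch of the first two steps, with placeholder file names, an assumed schema path, and a guessed name pattern, might look like this (it requires the flatc binary on your PATH):

```python
# Hypothetical sketch: decompile a .tflite flatbuffer to JSON with flatc,
# then scan tensor names for MTP-related strings. Note that the decoded JSON
# embeds the raw weight buffers, so it can be enormous for a model this size.
import json
import re
import subprocess

MODEL = "gemma-4-E4B-it.tflite"   # placeholder filename
SCHEMA = "schema.fbs"             # tensorflow/lite/schema/schema.fbs from the TF repo

# flatc writes gemma-4-E4B-it.json into the "decoded" directory.
subprocess.run(
    ["flatc", "--json", "--strict-json", "-o", "decoded", SCHEMA, "--", MODEL],
    check=True,
)

with open("decoded/gemma-4-E4B-it.json") as f:
    model = json.load(f)

pattern = re.compile(r"mtp|draft|speculat", re.IGNORECASE)
for sg_i, subgraph in enumerate(model.get("subgraphs", [])):
    for t_i, tensor in enumerate(subgraph.get("tensors", [])):
        name = tensor.get("name", "")
        if pattern.search(name):
            print(f"subgraph {sg_i}, tensor {t_i}: {name} shape={tensor.get('shape')}")
```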
Whether the disabled heads are fully trained or partially initialized remains unknown without access to the training checkpoints.
Who Should Care
On-device AI developers targeting Android hardware, particularly Pixel 8 and Pixel 9 devices with their dedicated NPU capabilities, have the most direct stake here. Enabling MTP on an already fast MoE architecture like Gemma 4 E4B could meaningfully reduce latency for real-time applications such as keyboard suggestions, voice assistants, and document summarization running entirely on-device.
Researchers working on speculative decoding and efficient inference should also pay attention. If the MTP heads are recoverable from the LiteRT files, it would provide a rare look at how Google structures draft heads within a production MoE model — architectural details that are not covered in any public Gemma 4 technical report.
ML engineers deploying Gemma 4 via the Hugging Face Transformers or llama.cpp backends on server hardware are less immediately affected, since those paths use the SafeTensors weights rather than LiteRT flatbuffers, and the MTP heads do not appear to be present there.
What To Do This Week
If you want to investigate the hidden MTP weights yourself, start with the LiteRT model files from the official Gemma 4 E4B release:
- Download the model: `huggingface-cli download google/gemma-4-E4B-it`
- Install Netron (netron.app) and open the `.tflite` file to visualize the compute graph and spot disconnected MTP subgraphs
- Alternatively, install flatbuffers (`pip install flatbuffers`) and use the TFLite schema from the TensorFlow repo to decompile the binary
- Search tensor names for strings containing "mtp", "draft", or "speculative" to locate the relevant weights
- Follow the Hugging Face discussion at huggingface.co/google/gemma-4-E4B-it/discussions/5 for community updates
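For the download step, a programmatic equivalent using the huggingface_hub client looks roughly like this; the `*.tflite` file pattern is an assumption about how the LiteRT artifacts are named in the repo.

```python
# Sketch of fetching only the LiteRT files from the release. Gemma repos on
# Hugging Face are gated, so you may need `huggingface-cli login` or an
# HF token set in your environment before this succeeds.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="google/gemma-4-E4B-it",
    allow_patterns=["*.tflite"],   # assumed naming; skips the SafeTensors shards
)
print("LiteRT files downloaded to:", local_dir)
```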
Avoid modifying production app builds until the subgraph structure is fully understood — routing activations through untested heads without knowing their expected input normalization could produce garbage outputs.