Google Multi-Token Prediction Draft Model for Gemma 4

Current

Google has released an open-source multi-token prediction (MTP) draft model for the Gemma 4 series. It enables speculative decoding with a reported inference speedup of up to 3x, while the primary model still performs final verification of every token.

Currency ID google-multi-token-prediction-draft-model-for-gemma-4

Date May 06, 2026

Language English

Signal

Google Multi-Token Prediction Draft Model for Gemma 4 · twitter · 2026-05-06

Google has open-sourced the multi-token prediction (MTP) draft model designed for the Gemma 4 series. This lightweight auxiliary model acts as the drafter in speculative decoding, accelerating inference by up to 3x, while the primary Gemma 4 model retains responsibility for final token verification.

Context

Multi-token prediction (MTP) models serve as draft engines in speculative decoding workflows: they propose several tokens ahead of the target model so that the target can verify them in parallel. Because drafting and verification are separate steps, the primary model can score a run of proposed tokens in a single forward pass instead of running one pass per generated token, which reduces the latency of autoregressive generation. Google's release of the MTP draft model for Gemma 4 standardizes this pattern within the family, providing a pre-optimized auxiliary model that operators can integrate into inference pipelines to improve throughput without retraining the base model. This component is part of the broader inference optimization infrastructure, in which a cheap auxiliary model supplies speed and the primary model's verification step decides which tokens are actually emitted.
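The draft-verify loop described above can be sketched with toy stand-ins for the two models. This is a minimal illustration, not Google's implementation: `toy_target`, `toy_draft`, `speculative_step`, `generate`, and `greedy` are hypothetical names, both models are assumed to decode greedily, and the verification loop calls the target once per position for readability, whereas a real runtime would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a non-empty token prefix to the next
# token under greedy decoding. Real models return logits; this is a toy stand-in.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1. Draft: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify: compare each proposal with what the target would emit.
    committed: List[int] = []
    for tok in proposed:
        expected = target(prefix + committed)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            committed.append(expected)
            return committed
        committed.append(tok)

    # 3. Every proposal was accepted, so the target contributes one extra token.
    committed.append(target(prefix + committed))
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate with repeated draft-verify rounds, truncated to the budget."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


def greedy(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: plain token-by-token greedy decoding with the target only."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def toy_target(prefix: List[int]) -> int:
    # Hypothetical target: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical draft: agrees with the target except after token 4.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12, k=4))
    print(greedy(toy_target, [0], 12))
```

With greedy verification as sketched here, the speculative output is identical to what the target would produce alone; the quality of the draft model only changes how many tokens each verification round commits.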

Relevance

The availability of an open-source MTP draft model lowers the implementation barrier for speculative decoding in local and edge deployments of Gemma 4. Operators can use this component to reduce generation latency on consumer hardware, supporting the inference-optimization-infrastructure circuit. The release also establishes a reference implementation for the draft-verify pattern, facilitating interoperability with inference engines that support speculative decoding, such as vLLM and SGLang. Because the main Gemma 4 model still verifies every token before it is emitted, developers can optimize inference performance without handing final output decisions to the smaller draft model.

Current State

Google has published the MTP draft model weights and configuration for the Gemma 4 series as an open-source artifact. The model is designed to operate alongside the primary Gemma 4 models, requiring no modifications to the base weights. It is intended for integration into inference runtimes that support speculative decoding, enabling up to 3x speedup in supported configurations.
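How close a deployment gets to the reported 3x depends on how often drafted tokens are accepted and how cheap the draft model is relative to the target. The sketch below uses the standard speculative decoding estimate under a simplifying assumption: each drafted token is accepted independently with probability `alpha`, and one draft step costs `cost_ratio` times one target forward pass. The function names and parameters are illustrative, and none of the values are measurements of Gemma 4.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha; the round always commits at least one token from the target.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    One round costs k draft steps (each cost_ratio of a target pass) plus
    one target pass; cost_ratio is an assumed, hardware-dependent value.
    """
    return expected_tokens_per_round(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the shape of the trade-off.
    for k in (2, 4, 8):
        print(k, round(estimated_speedup(alpha=0.8, k=k, cost_ratio=0.05), 2))
```

The estimate makes the trade-off visible: a longer draft window helps only while the acceptance rate stays high, because every additional drafted token adds draft cost whether or not it is accepted.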

Open Questions

How does the MTP model's performance degrade under aggressive quantization compared to the base Gemma 4 model?
What is the optimal draft budget and verification ratio for the MTP model across different hardware backends?
Are there standardized configuration files for integrating the MTP model with popular inference frameworks like vLLM or SGLang?
Does the MTP architecture support dynamic token prediction lengths, or is it constrained to a fixed draft window?

Connections

Google Gemma 4 Open Model Family — draft model component of the Gemma 4 model family
Qwen3-4B DFlash Speculative Decoding Drafter — speculative decoding draft model pattern comparison
Qwen3-8B-DFlash-b16 — speculative decoding draft model pattern comparison
Qwen3-Coder-30B-A3B-DFlash — speculative decoding draft model pattern comparison