Google releases Multi-Token Prediction drafters for its Gemma 4 models, which use a form of speculative decoding to guess future tokens for faster inference

Google launched its Gemma 4 open models this spring, promising a new level of power and performance for local AI.

Ars Technica 2026-05-06 Ryan Whitwam

Context & Ripple Effects

Google’s Gemma line has expanded from smaller and specialized variants such as CodeGemma and RecurrentGemma to Gemma 4, which Google positioned for reasoning, agentic workflows, and local use under an Apache 2.0 license.

The subsequent Gemma 4 12B release extends that local-model arc into unified multimodal capability. The new drafters address a different constraint: making generation from the Gemma 4 family faster at inference time rather than enlarging the base model.

First-order effects

Gemma 4 users and deployers can pair the released Multi-Token Prediction drafters with the main models to use speculative decoding. Any latency reduction comes from drafted tokens that the main model accepts; rejected drafts fall back to the main model's own output.
Google makes inference optimization a separately usable part of the Gemma 4 stack, alongside the model weights and the family’s local-deployment positioning.
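
For readers unfamiliar with the mechanism, here is a minimal sketch of greedy speculative decoding in Python. It is not Google's Multi-Token Prediction implementation: the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_drafter` are hypothetical, the "models" are toy functions over integer tokens, and verification uses simple exact-match acceptance rather than the probabilistic accept/reject rule used when sampling.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap drafter proposes k tokens ahead.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target checks the proposals. A real system scores all k
        #    positions in one batched forward pass; this sketch calls the
        #    target per position and counts the whole round as one pass.
        rounds += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft was accepted, so the same pass yields a bonus token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:len(prompt) + max_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Toy 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    """Toy 'drafter': agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_drafter, [2], max_new=12)
    print(tokens == greedy_decode(toy_target, [2], 12), rounds)
```

In this toy setup the output is identical to plain greedy decoding with the target, but it takes three verification rounds instead of twelve target calls. A drafter that is always wrong degrades to one token per round, which is why the acceptance rate determines the benefit.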

Second-order effects

Faster generation can make Gemma 4 more practical for interactive local and agentic workloads, where response time is a deployment constraint in addition to model quality and memory requirements.
Other open-model providers and serving-tool builders face added pressure to offer comparable decoding or serving optimizations, not just competitive base-model benchmarks.

Third-order effects

If model families increasingly ship specialized drafters and related inference components, competition among open-model providers will shift further from releasing weights alone toward delivering an optimized deployment stack.
The pattern could broaden the range of workloads run locally, but its practical effect will depend on how reliably the drafters accelerate real applications across hardware and model configurations.
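
How much a drafter helps can be estimated with a rough back-of-the-envelope calculation. The sketch below assumes, as a simplification, that each drafted token is accepted independently with the same probability `alpha`, and it ignores the cost of running the drafter. The function name `expected_tokens_per_pass` and both parameters are hypothetical and are not taken from Google's release.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per main-model verification pass.

    alpha: assumed per-token acceptance probability (0.0 to 1.0).
    k: number of tokens the drafter proposes per pass.

    Under the independence assumption this is the geometric sum
    1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

Under these assumptions, a drafter that is never accepted yields one token per pass, which is no better than ordinary decoding, while a perfectly accurate one yields `k + 1`. Real acceptance rates vary with the prompt, hardware, and model configuration, so figures like these are an upper-level guide rather than a prediction.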

The trend: Open-model development is shifting from scaling model capability alone toward optimizing the full inference path, with the goal of lower-latency local systems that are easier to deploy.
