Google unveils DiffusionGemma, an experimental diffusion model for faster local text generation

By ChatGPT (AI-generated)

Google has released DiffusionGemma, an experimental open text-generation model that the company says can produce text up to four times faster on GPUs than its autoregressive Gemma 4 models. In a blog post published Tuesday, Google positioned the model for researchers and developers building speed-critical local applications, while also stressing that it is not its highest-quality option.

“Today, we’re introducing DiffusionGemma, an experimental open model that explores text diffusion, an exceptionally fast approach to text generation,” wrote Brendan O’Donoghue and Sebastian Flennerhag, both research scientists at Google, in the June 10 post, titled “DiffusionGemma: 4x faster text generation.”

The release matters because most mainstream large language models generate text one token at a time. Diffusion-style text models instead aim to generate blocks of text in parallel and iteratively refine them, an approach companies and researchers are exploring as a way to raise throughput on a single GPU, especially for local use.

Google described DiffusionGemma as a 26-billion-parameter Mixture-of-Experts model, a design that activates only part of the network for each token. The company said about 3.8 billion parameters are active at inference, and that the model weights are being released under the Apache 2.0 license.
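To make the Mixture-of-Experts idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is illustrative only: the eight experts, the router scores and `top_k=2` are hypothetical values chosen for the example, not details Google has published about DiffusionGemma. Only the 26 billion and 3.8 billion parameter counts come from the blog post as reported above.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical router scores for a single token over 8 experts (assumption).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
picked = route_token(logits, top_k=2)
picked_experts = [index for index, _ in picked]

# Share of parameters active per token, using the figures quoted in the article.
active_fraction = 3.8 / 26

print("experts used for this token:", picked_experts)
print(f"active share of parameters: {active_fraction:.1%}")
```

In this toy setup only two of the eight experts run for the token, which is the general mechanism behind an "active parameters" figure that is much smaller than the total.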

According to Google, DiffusionGemma generates text in parallel blocks, using a working block size of 256 tokens per forward pass with bi-directional attention across the block. Google said that design helps it reach “up to 4x faster” text generation on GPUs than autoregressive models in the Gemma 4 family.
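A rough way to see why block-parallel generation can help is to count forward passes. The sketch below compares one pass per token (autoregressive) with a fixed number of refinement passes per 256-token block. The block size is the figure Google cites; the `refine_steps=16` default is a hypothetical value picked for illustration, since the article does not say how many refinement passes DiffusionGemma uses.

```python
import math

def autoregressive_passes(num_tokens):
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens

def block_diffusion_passes(num_tokens, block_size=256, refine_steps=16):
    """Block decoding: each block is refined over refine_steps passes.

    refine_steps is an assumption for this sketch, not a published value.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refine_steps

tokens = 1024
ar_passes = autoregressive_passes(tokens)
bd_passes = block_diffusion_passes(tokens)

print("autoregressive passes:", ar_passes)
print("block diffusion passes:", bd_passes)
```

Pass counts are not the same as wall-clock speed: a pass over a 256-token block with bidirectional attention does more work than a single-token step, so the real speedup depends on hardware utilization and is smaller than the ratio of pass counts.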

The company provided example throughput figures of more than 1,000 tokens per second on a single Nvidia H100 and more than 700 tokens per second on an Nvidia GeForce RTX 5090. Those figures come from Google’s blog post and have not been independently verified.

Google also said that because only about 3.8 billion parameters are active at inference, and with quantization, DiffusionGemma can fit within roughly 18 GB of VRAM, potentially putting it within reach of high-end consumer graphics cards. The company said it worked with Nvidia on optimizations, including support for NVFP4, a 4-bit floating-point format, and cited RTX 5090 and 4090 cards as well as Hopper and Blackwell hardware.
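As a back-of-envelope check on the memory figure, the sketch below estimates weight storage alone for a 26-billion-parameter model at different precisions. It is a simplification under stated assumptions: it counts every parameter at the same bit width, treats 1 GB as 10^9 bytes, and ignores activations, caches, quantization scale factors and any layers kept at higher precision. Active-parameter count mainly affects compute per token, so a weights-only footprint is usually estimated from the total parameter count.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Weights-only memory estimate in GB (10**9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 16-bit weights
fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 4-bit weights, e.g. an NVFP4-style format

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {fp4_gb:.1f} GB")
```

Because the estimate leaves out everything except raw weights, it would be expected to land below the roughly 18 GB that Google quotes for real-world use.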

Google was equally clear about the trade-off: output quality is lower than with standard autoregressive Gemma 4 models. The company said users who need maximum quality should use Gemma 4 instead.

That makes DiffusionGemma less a replacement for Google’s existing open models than a tool for narrower workflows where speed matters more than top-end output. In the blog post, Google said the model is aimed at researchers and developers working on interactive local tasks such as in-line editing, rapid iteration, non-linear text structures and code infilling.

“Our newest open experimental model delivers up to 4x faster inference on dedicated GPUs and opens the door to exploring speed-critical, interactive local workflows,” the authors wrote.

The launch also fits into a broader effort inside Google to push beyond token-by-token generation. Google DeepMind publicly presented Gemini Diffusion in 2025 as an experimental diffusion text model. DiffusionGemma appears to be the company’s next step: an open-weight developer model with concrete guidance on hardware and software support.

Google said the experimental weights are available now on Hugging Face, a claim made in the company’s blog post that has not been independently verified here. Google also listed support for MLX, vLLM and Hugging Face Transformers, with official llama.cpp support described as upcoming.

For now, Google is drawing a sharp line around what DiffusionGemma is meant to be: an experimental open model for fast local generation, not the company’s flagship choice when output quality is the priority.

Tags: #google, #ai, #diffusionmodels, #llm

Stocks: GOOGL