DiffusionGemma — AI Glossary

DiffusionGemma is an experimental, open-weights multimodal model developed by Google DeepMind. Based on the 26B A4B Mixture-of-Experts (MoE) Gemma 4 architecture, it changes how generative text models operate: rather than generating output one token at a time, DiffusionGemma uses discrete text diffusion to generate and refine large blocks of tokens in parallel.

This block-autoregressive approach reaches over 1,100 tokens per second on a single H100 GPU (FP8), roughly 4 times faster than comparable autoregressive models.

Architecture & Core Innovations

DiffusionGemma combines an encoder-decoder architecture with an efficient sparse MoE design. It has 25.2B total parameters, of which only 3.8B are active per token. When quantized, the model fits comfortably within 18 GB of VRAM.
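As a rough sanity check on the memory figure, the sketch below estimates weight-only storage at a few bit widths. It assumes that all experts must stay resident, so the footprint scales with total rather than active parameters. The chosen bit widths are illustrative assumptions; the source does not say which quantization scheme the 18 GB figure refers to, and the estimate ignores the KV cache and activations.

```python
TOTAL_PARAMS = 25.2e9   # total parameters (from the text above)
ACTIVE_PARAMS = 3.8e9   # active parameters per token (from the text above)

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight-only footprint in GB (10^9 bytes); excludes KV cache and activations."""
    return num_params * bits_per_param / 8 / 1e9

# Illustrative bit widths (assumed, not taken from the source).
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions only the 4-bit estimate lands below 18 GB, which would leave a few gigabytes of headroom for runtime state.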

Key innovations include:

Discrete Text Diffusion: The model processes 256-token blocks, called “canvases”. It generates text by iteratively denoising these blocks in parallel using bidirectional attention. Once a canvas is fully denoised, it is appended to the KV cache, and the next canvas is generated.
Adaptive Inference-Time Computation: Simpler tasks, such as code generation or formatting, require fewer denoising steps, so the model's tokens-per-second throughput rises automatically when the prompt is less complex.
Interleaved Multimodality: The model natively handles text, video, and image inputs within a single prompt for context-heavy tasks. Video inputs of up to 60 seconds are supported when sampled at 1 frame per second.
Variable Image Resolution: DiffusionGemma supports dynamic visual token budgets of 70, 140, 280, 560, and 1120 tokens. Lower budgets are optimized for rapid video frame processing or high-speed captioning, while higher budgets provide the fine-grained detail required for precise OCR, chart comprehension, and document parsing.
Thinking Mode: Like its autoregressive counterpart, DiffusionGemma includes a built-in reasoning mode triggered by inserting a <|think|> token in the system prompt. When enabled, the model emits an internal <|channel>thought\n block before its final response, enabling step-by-step logic.
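The canvas-by-canvas loop described under Discrete Text Diffusion can be illustrated with a toy simulation. This is a sketch of the control flow only, not the model's actual sampler: `toy_denoiser`, `keep_per_step`, and the confidence-based commit rule are hypothetical stand-ins, and no real model or KV cache is involved.

```python
import random

MASK = None          # placeholder for a noised (undecided) position
CANVAS_SIZE = 256    # block size from the text above

def toy_denoiser(context, canvas, rng):
    """Hypothetical stand-in for the model: proposes a (token, confidence)
    pair for every canvas position. A real model would attend
    bidirectionally over the whole canvas."""
    return [(f"tok{len(context) + i}", rng.random()) for i in range(len(canvas))]

def denoise_canvas(context, rng, max_steps=48, keep_per_step=16):
    """Iteratively fill one canvas; returns (tokens, steps_used)."""
    canvas = [MASK] * CANVAS_SIZE
    for step in range(1, max_steps + 1):
        proposals = toy_denoiser(context, canvas, rng)
        # Commit the most confident still-masked positions; the rest
        # stay masked and are re-predicted on the next step.
        open_positions = [i for i, tok in enumerate(canvas) if tok is MASK]
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:keep_per_step]:
            canvas[i] = proposals[i][0]
        if all(tok is not MASK for tok in canvas):
            return canvas, step
    # Step budget exhausted: fill what is left with the latest proposals.
    final = toy_denoiser(context, canvas, rng)
    filled = [tok if tok is not MASK else final[i][0] for i, tok in enumerate(canvas)]
    return filled, max_steps

def generate(num_canvases, seed=0):
    """Generate canvases one after another, each conditioned on the previous ones."""
    rng = random.Random(seed)
    context = []  # stands in for the KV cache of finished canvases
    for _ in range(num_canvases):
        canvas, _steps = denoise_canvas(context, rng)
        context.extend(canvas)  # a finished canvas is appended before the next starts
    return context
```

Generation is parallel inside a canvas and sequential across canvases, which is why the approach is described as block-autoregressive.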

Performance & Trade-offs

Because DiffusionGemma prioritizes speed and parallel generation, it exhibits a documented trade-off in dense reasoning capabilities when compared to the standard, sequential Gemma 4 baseline.

The chart below compares the instruction-tuned DiffusionGemma 26B A4B against the standard Gemma 4 26B A4B autoregressive model across major evaluation metrics:

Figure: Reasoning vs Speed Trade-Off: DiffusionGemma vs Gemma 4 (performance on reasoning benchmarks, %)

DiffusionGemma is best suited for speed-critical and interactive local workflows, including real-time in-line editing, rapid prototyping, and non-linear generative UI. It is not intended as a direct replacement for autoregressive models in reasoning-intensive tasks such as complex AIME mathematics.

Training Dataset & Preprocessing

The pre-training dataset for DiffusionGemma encompasses a diverse, large-scale collection of web documents, code, mathematics, and images, with a cutoff date of January 2025.

Web Documents: This covers over 140 languages to ensure a broad range of linguistic styles and topics.
Code & Mathematics: This exposes the model to programming syntax and logical reasoning, improving its ability to generate code and answer symbolic queries.
Data Filtering: Rigorous preprocessing was applied, including CSAM filtering and automated sensitive data filtering to exclude harmful, illegal, or personal information in alignment with responsible AI policies.

Local Server Deployment

For production-level serving, DiffusionGemma is officially supported by vLLM and SGLang.

Serving with vLLM

You can quickly deploy an OpenAI-compatible API server using vLLM:

# Install vLLM
pip install vllm

# Start the server
vllm serve "google/diffusiongemma-26B-A4B-it"

# Make a request with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/diffusiongemma-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": "Explain the concept of entropy."
			}
		]
	}'
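The same request can be issued from Python. The sketch below mirrors the curl call using only the standard library; it assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style chat completion body. The helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "google/diffusiongemma-26B-A4B-it"

def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build the same POST request as the curl example above."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
#   print(chat("Explain the concept of entropy."))
```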

Serving with SGLang

SGLang provides optimized performance for multimodal models:

# Install SGLang
pip install sglang

# Start the SGLang server
python3 -m sglang.launch_server \
    --model-path "google/diffusiongemma-26B-A4B-it" \
    --host 0.0.0.0 \
    --port 30000

Usage, Ethics, and Limitations

While extremely fast, the model is built with specific use cases and limitations in mind.

Intended Applications

Content Creation & NLP Research: Fast text generation for marketing copy, interactive chatbots, code generation, and research.
Image Data Extraction: Rapid OCR, UI comprehension, and visual document parsing.
Interactive UI: Real-time, non-linear text generation and in-line code editing where low latency is critical.

Known Limitations

Factual Accuracy & Context: DiffusionGemma relies on statistical patterns rather than a factual knowledge base. It may generate outdated or incorrect statements and struggle with subtle linguistic nuances like sarcasm.
Reasoning Complexity: It excels at tasks with clear prompts and instructions but may struggle with highly complex, open-ended logical reasoning compared to standard autoregressive models.

Ethics and Safety

Developed under rigorous safety guidelines, DiffusionGemma was evaluated for hate speech, dangerous content, and bias. Across text and image modalities, the model showed major improvements in content safety with minimal policy violations, which supports responsible open-weights deployment.

Best Practices & Sampling

To achieve the reported performance, Google DeepMind recommends using specific diffusion sampling settings, primarily the Entropy Bound (EB) sampler:

Denoising Steps: Maximum of 48 steps.
Temperature Schedule: Linear decay from 0.8 down to 0.4.
Token Selection: The sampler selects the lowest-entropy tokens (Mutual Information bound < 0.1) and fully renoises the rest.
Adaptive Stopping: Generation terminates early if the average model entropy over the canvas falls below 0.005 and the highest-probability token predictions remain identical across two consecutive denoising steps.
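Two of these settings, the temperature schedule and the adaptive stopping test, are simple enough to sketch directly. The code below is an illustrative reading of the list above, not the reference sampler. It assumes the schedule is indexed by denoising step, that entropy is measured in nats, and that "two consecutive steps" means comparing the current step with the previous one. The token-selection rule is not sketched because the source does not specify it in enough detail.

```python
import math

MAX_STEPS = 48              # maximum denoising steps (from the list above)
T_START, T_END = 0.8, 0.4   # temperature schedule endpoints (from the list above)
ENTROPY_STOP = 0.005        # mean-entropy threshold (from the list above)

def temperature(step, max_steps=MAX_STEPS):
    """Linear decay from T_START at step 0 to T_END at the final step."""
    if max_steps <= 1:
        return T_END
    frac = step / (max_steps - 1)
    return T_START + (T_END - T_START) * frac

def entropy(probs):
    """Shannon entropy (nats, assumed) of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def argmax(dist):
    """Index of the highest-probability token."""
    return max(range(len(dist)), key=dist.__getitem__)

def should_stop(prev_probs, curr_probs):
    """Early stop: mean canvas entropy below the threshold AND the top
    prediction at every position unchanged since the previous step."""
    mean_entropy = sum(entropy(p) for p in curr_probs) / len(curr_probs)
    stable = [argmax(p) for p in prev_probs] == [argmax(p) for p in curr_probs]
    return mean_entropy < ENTROPY_STOP and stable
```

Here `prev_probs` and `curr_probs` are lists with one probability distribution per canvas position, taken from two consecutive denoising steps.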

When setting up multimodal prompts, Google recommends placing visual content, such as images or videos, before the text prompt for optimal cross-attention processing.
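A minimal sketch of this ordering, together with the visual token arithmetic for video, might look as follows. The content-part schema (`{"type": "image", "url": ...}`) follows the common Transformers chat-template convention and is an assumption here, as is the reading that the visual token budget applies per sampled frame. The helper names are hypothetical.

```python
VISUAL_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)  # supported budgets (from the text)
MAX_VIDEO_SECONDS = 60                             # video limit (from the text)
FRAMES_PER_SECOND = 1                              # sampling rate (from the text)

def build_multimodal_message(image_refs, text):
    """Place every visual part before the text part, as recommended.
    The part schema is an assumed convention, not confirmed by the source."""
    content = [{"type": "image", "url": ref} for ref in image_refs]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

def video_visual_tokens(seconds, budget_per_frame):
    """Visual tokens for a clip, assuming the budget applies per sampled frame."""
    if budget_per_frame not in VISUAL_TOKEN_BUDGETS:
        raise ValueError(f"unsupported budget: {budget_per_frame}")
    if not 0 < seconds <= MAX_VIDEO_SECONDS:
        raise ValueError(f"clip must be 1 to {MAX_VIDEO_SECONDS} seconds long")
    return seconds * FRAMES_PER_SECOND * budget_per_frame
```

Under the per-frame assumption, a full 60-second clip costs between 4,200 visual tokens at the lowest budget and 67,200 at the highest, which suggests why lower budgets suit video.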

Multi-Turn Conversations

When handling multi-turn conversations with thinking enabled, the historical model output should include only the final response. Internal thoughts or reasoning from previous model turns must not be appended to the context window before the next user turn begins.
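One way to enforce this is to strip the thought block from each raw model output before storing it in the history. The sketch below uses the opening marker `<|channel>thought\n` mentioned earlier in this entry. The closing marker `<channel|>` is an assumption made for illustration; check the model's chat template for the actual delimiter. The helper names are hypothetical.

```python
import re

THOUGHT_OPEN = "<|channel>thought\n"  # opening marker (from this entry)
THOUGHT_CLOSE = "<channel|>"          # ASSUMPTION: hypothetical closing marker

def strip_thoughts(model_output, close=THOUGHT_CLOSE):
    """Remove every thought block, keeping only the final response text."""
    pattern = re.escape(THOUGHT_OPEN) + r".*?" + re.escape(close)
    return re.sub(pattern, "", model_output, flags=re.DOTALL).strip()

def append_turn(history, user_text, raw_model_output):
    """Return a new history with one exchange added, thoughts removed."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": strip_thoughts(raw_model_output)},
    ]
```

The history returned by `append_turn` can then be passed, together with the next user message, to the chat template for the following turn.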

Run DiffusionGemma locally via Transformers

from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor

MODEL_ID = "google/diffusiongemma-26B-A4B-it"

# Load model and processor
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

# Build the chat prompt; with return_dict=True this yields a dict of tensors
messages = [{"role": "user", "content": "Explain discrete text diffusion."}]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate text; special tokens are kept in the decoded string
output = model.generate(**inputs, max_new_tokens=512)
text = processor.decode(output[0], skip_special_tokens=False)
print(text)
