Gemma 4 Model Card

source
google-deepmind · multimodal · open-models · reasoning · function-calling

Official documentation for Gemma 4, Google DeepMind’s open multimodal language model family released in 2026. Four model sizes spanning mobile to server deployment, with architectural innovations for efficiency and reasoning.

Model variants

| Model | Total params | Active params | Context | Modalities |
|---|---|---|---|---|
| E2B | 5.1B (2.3B effective) | 2.3B | 128K | Text, Image, Audio |
| E4B | 8B (4.5B effective) | 4.5B | 128K | Text, Image, Audio |
| 26B A4B | 25.2B (MoE) | 3.8B | 256K | Text, Image |
| 31B | 30.7B (dense) | 30.7B | 256K | Text, Image |

“E” = effective parameters. E2B/E4B use Per-Layer Embeddings (PLE), which reduces active parameters but inflates total parameter count. “A” = active parameters. The MoE variant activates only 3.8B of its 25.2B parameters per forward pass.

Architecture

All models use a hybrid attention mechanism: local sliding-window attention (512- or 1024-token windows) interleaved with full global attention layers, and the final layer is always global. This delivers the speed of lightweight local attention with the long-range context awareness required for long-horizon tasks.

Global layers use unified Keys and Values plus Proportional RoPE (p-RoPE) to optimize memory for long contexts.

The 26B A4B model implements Mixture-of-Experts with 8 active experts per token, 128 total experts, and 1 shared expert. This achieves near-31B performance at near-4B inference cost.
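The interleaving described above can be sketched as a simple layer layout. Note the 5:1 local-to-global ratio and the 1024-token window below are illustrative assumptions for the sketch; the card does not state the actual ratio.

```python
# Sketch of a hybrid attention layout: runs of local sliding-window layers
# interleaved with full global-attention layers, final layer always global.
def layer_attention_pattern(num_layers, local_per_global=5):
    pattern = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local(1024)"
        for i in range(num_layers)
    ]
    pattern[-1] = "global"  # per the card, the final layer is always global
    return pattern

print(layer_attention_pattern(12))
```

Global layers are the expensive ones (full-context KV cache), which is why keeping them sparse, and sharing/optimizing their keys and values, dominates long-context memory cost.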

Reasoning and thinking mode

Gemma 4 includes built-in reasoning capability: models can generate step-by-step internal reasoning before answering. Thinking mode is controlled via the <|think|> token in the system prompt and the enable_thinking parameter in the chat template.

When thinking is enabled, output structure is:

<|channel|>thought
[Internal reasoning]
<|/channel|>
[Final answer]

For E2B/E4B, disabling thinking produces empty thought blocks. For 26B A4B and 31B, thinking is always generated but can be suppressed in output.

Best practice: exclude thinking content from conversation history in multi-turn exchanges to prevent context bloat.
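A minimal sketch of that practice, written loosely against the channel delimiters shown above (the exact special tokens are tokenizer-specific, so the pattern is deliberately permissive):

```python
import re

# Remove the thought block from an assistant reply before appending the
# reply to multi-turn history, per the best-practice note above.
THOUGHT_RE = re.compile(r"<\|?/?channel\|?>thought.*?<\|?/?channel\|?>\s*", re.DOTALL)

def strip_thinking(reply: str) -> str:
    return THOUGHT_RE.sub("", reply).strip()

reply = "<|channel|>thought\nLet me work this out...\n<|/channel|>\nThe answer is 42."
print(strip_thinking(reply))  # -> The answer is 42.
```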

Multimodal capabilities

Image understanding: Object detection, document/PDF parsing, screen and UI understanding, chart comprehension, OCR (multilingual), handwriting recognition, and pointing. Variable image resolution supported via configurable visual token budget (70, 140, 280, 560, 1120 tokens). Higher budgets preserve fine-grained detail for OCR and small text; lower budgets enable faster inference for classification and video processing.

Video understanding: Video is processed as a sequence of frames, up to 60 seconds at 1 frame per second (i.e., at most 60 frames).

Audio (E2B/E4B only): Automatic speech recognition (ASR) and automatic speech translation (AST, speech to translated text). Maximum clip length 30 seconds. Supports multiple languages (evaluated on the CoVoST dataset).

Interleaved input: Mix text and images in any order within a single prompt. Modality order best practice: place images and audio before text.
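Following that ordering advice, an interleaved user turn can be sketched as plain data. The content-part schema below mirrors the common Hugging Face multimodal chat format; the exact keys may vary by processor, so treat them as assumptions.

```python
# Build a user message that places media before text, per the guidance above.
def build_user_turn(image_path, question):
    return {
        "role": "user",
        "content": [
            {"type": "image", "path": image_path},  # image first
            {"type": "text", "text": question},     # text after the media
        ],
    }

turn = build_user_turn("chart.png", "What trend does this chart show?")
print([part["type"] for part in turn["content"]])  # -> ['image', 'text']
```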

Function calling and agentic use

Native function calling support for structured tool use. Gemma 4 posts strong coding results (80.0% on LiveCodeBench v6 and a Codeforces Elo of 2150 for the 31B model) and supports autonomous agentic workflows.

Native system prompt support via the system role enables structured, controllable conversations — a departure from previous Gemma generations.
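To make the tool-use flow concrete, here is a hypothetical tool declaration in the JSON-schema style many chat templates accept. The card confirms native function calling but does not pin down a schema, so the field names here are assumptions.

```python
import json

# Hypothetical tool: the runtime advertises this schema to the model,
# parses the structured call the model emits, executes it, and returns
# the result to the model as a tool message.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

print(json.dumps(get_weather_tool, indent=2))
```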

Benchmark performance

Selected results for instruction-tuned models:

| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% |

On the benchmarks above, the 26B A4B MoE retains roughly 96–99% of the 31B dense scores while running at approximately 4B inference cost.
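The relative figure can be read straight off the table; a quick computation over the six rows above:

```python
# 26B A4B (MoE) score as a fraction of the 31B (dense) score, per benchmark,
# taken from the instruction-tuned results table above.
scores = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
}
ratios = {name: moe / dense for name, (dense, moe) in scores.items()}
for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {r:.1%}")
```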

Training data and safety

Pre-training dataset includes web documents (140+ languages), code, mathematics, and images. Training cutoff: January 2025.

Data preprocessing: CSAM filtering at multiple stages, sensitive data filtering (PII, credentials), content quality and safety filtering per Google AI Responsibility policies.

Safety evaluations conducted in partnership with internal safety and responsible AI teams. Testing included automated and human evaluations covering child safety, dangerous content, sexually explicit content, hate speech, and harassment. Gemma 4 shows major improvements over Gemma 3 across all content safety categories while maintaining low false refusal rates.

Deployment considerations

E2B and E4B are designed for on-device execution (laptops, high-end phones). The 26B A4B model targets consumer GPUs and workstations. The 31B model requires server-class hardware but delivers frontier-level performance.

All models are released under the Apache 2.0 license, enabling commercial use. They are available via Hugging Face Transformers, with compatibility for vLLM, SGLang, and other standard inference tooling.

Recommended sampling settings, standardized across all use cases:

  • temperature=1.0
  • top_p=0.95
  • top_k=64
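As a sketch, these settings map directly onto Transformers-style generate() keyword arguments; note that sampling must be enabled for them to take effect (model loading is omitted here).

```python
# The card's standard sampling configuration, expressed as generate() kwargs.
GENERATION_KWARGS = {
    "do_sample": True,   # sampling must be on for temperature/top_p/top_k
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

# Usage sketch: outputs = model.generate(**inputs, **GENERATION_KWARGS)
print(GENERATION_KWARGS)
```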
