Gemma 4 Model Card

source
google-deepmind · multimodal · open-models · reasoning · function-calling

Official documentation for Gemma 4, Google DeepMind’s open multimodal language model family released in 2026. Four model sizes spanning mobile to server deployment, with architectural innovations for efficiency and reasoning.

Model variants

| Model | Total params | Active params | Context | Modalities |
|---|---|---|---|---|
| E2B | 5.1B (2.3B effective) | 2.3B | 128K | Text, Image, Audio |
| E4B | 8B (4.5B effective) | 4.5B | 128K | Text, Image, Audio |
| 26B A4B | 25.2B (MoE) | 3.8B | 256K | Text, Image |
| 31B | 30.7B (dense) | 30.7B | 256K | Text, Image |

“E” = effective parameters. E2B/E4B use Per-Layer Embeddings (PLE), which reduces active parameters but inflates total parameter count. “A” = active parameters. The MoE variant activates only 3.8B of its 25.2B parameters per forward pass.

Architecture

All models use a hybrid attention mechanism: local sliding-window attention (512- or 1024-token windows) interleaved with full global attention layers, and the final layer is always global. This delivers the speed of lightweight local attention with the long-range context awareness required for long-horizon tasks.

Global layers use unified Keys and Values plus Proportional RoPE (p-RoPE) to optimize memory for long contexts.

The 26B A4B model implements Mixture-of-Experts with 8 active experts per token, 128 total experts, and 1 shared expert. This achieves near-31B performance at near-4B inference cost.
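The interleaving described above can be sketched as a simple layer layout. Note the 5:1 local-to-global ratio and the 1024-token window below are illustrative assumptions for the sketch; the card does not state the actual ratio.

```python
# Sketch of a hybrid attention layout: runs of local sliding-window layers
# interleaved with full global-attention layers, final layer always global.
def layer_attention_pattern(num_layers, local_per_global=5):
    pattern = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local(1024)"
        for i in range(num_layers)
    ]
    pattern[-1] = "global"  # per the card, the final layer is always global
    return pattern

print(layer_attention_pattern(12))
```

Global layers are the expensive ones (full-context KV cache), which is why keeping them sparse, and sharing/optimizing their keys and values, dominates long-context memory cost.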

Reasoning and thinking mode

Gemma 4 includes built-in reasoning capability: models can generate step-by-step internal reasoning before answering. Thinking mode is controlled via the <|think|> token in the system prompt and the enable_thinking parameter in the chat template.

When thinking is enabled, output structure is:

<|channel|>thought
[Internal reasoning]
<|/channel|>
[Final answer]

For E2B/E4B, disabling thinking produces empty thought blocks. For 26B A4B and 31B, thinking is always generated but can be suppressed in output.

Best practice: exclude thinking content from conversation history in multi-turn exchanges to prevent context bloat.
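A minimal sketch of that practice, written loosely against the channel delimiters shown above (the exact special tokens are tokenizer-specific, so the pattern is deliberately permissive):

```python
import re

# Remove the thought block from an assistant reply before appending the
# reply to multi-turn history, per the best-practice note above.
THOUGHT_RE = re.compile(r"<\|?/?channel\|?>thought.*?<\|?/?channel\|?>\s*", re.DOTALL)

def strip_thinking(reply: str) -> str:
    return THOUGHT_RE.sub("", reply).strip()

reply = "<|channel|>thought\nLet me work this out...\n<|/channel|>\nThe answer is 42."
print(strip_thinking(reply))  # -> The answer is 42.
```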

Multimodal capabilities

Image understanding: Object detection, document/PDF parsing, screen and UI understanding, chart comprehension, OCR (multilingual), handwriting recognition, and pointing. Variable image resolution supported via configurable visual token budget (70, 140, 280, 560, 1120 tokens). Higher budgets preserve fine-grained detail for OCR and small text; lower budgets enable faster inference for classification and video processing.

Video understanding: Video is processed as a sequence of frames, up to 60 seconds at 1 frame per second (i.e., at most 60 frames).

Audio (E2B/E4B only): Automatic speech recognition (ASR) and automatic speech translation (AST, speech to translated text). Maximum clip length 30 seconds. Supports multiple languages (evaluated on the CoVoST dataset).

Interleaved input: Mix text and images in any order within a single prompt. Modality order best practice: place images and audio before text.
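Following that ordering advice, an interleaved user turn can be sketched as plain data. The content-part schema below mirrors the common Hugging Face multimodal chat format; the exact keys may vary by processor, so treat them as assumptions.

```python
# Build a user message that places media before text, per the guidance above.
def build_user_turn(image_path, question):
    return {
        "role": "user",
        "content": [
            {"type": "image", "path": image_path},  # image first
            {"type": "text", "text": question},     # text after the media
        ],
    }

turn = build_user_turn("chart.png", "What trend does this chart show?")
print([part["type"] for part in turn["content"]])  # -> ['image', 'text']
```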

Function calling and agentic use

Native function calling support for structured tool use. Gemma 4 posts strong coding results (80.0% on LiveCodeBench v6 and a Codeforces Elo of 2150 for the 31B model) and supports autonomous agentic workflows.

Native system prompt support via the system role enables structured, controllable conversations — a departure from previous Gemma generations.
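To make the tool-use flow concrete, here is a hypothetical tool declaration in the JSON-schema style many chat templates accept. The card confirms native function calling but does not pin down a schema, so the field names here are assumptions.

```python
import json

# Hypothetical tool: the runtime advertises this schema to the model,
# parses the structured call the model emits, executes it, and returns
# the result to the model as a tool message.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

print(json.dumps(get_weather_tool, indent=2))
```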

Benchmark performance

Selected results for instruction-tuned models:

| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% |

On the benchmarks above, the 26B A4B MoE retains roughly 96–99% of the 31B dense scores while running at approximately 4B inference cost.
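The relative figure can be read straight off the table; a quick computation over the six rows above:

```python
# 26B A4B (MoE) score as a fraction of the 31B (dense) score, per benchmark,
# taken from the instruction-tuned results table above.
scores = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
}
ratios = {name: moe / dense for name, (dense, moe) in scores.items()}
for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {r:.1%}")
```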

Training data and safety

Pre-training dataset includes web documents (140+ languages), code, mathematics, and images. Training cutoff: January 2025.

Data preprocessing: CSAM filtering at multiple stages, sensitive data filtering (PII, credentials), content quality and safety filtering per Google AI Responsibility policies.

Safety evaluations conducted in partnership with internal safety and responsible AI teams. Testing included automated and human evaluations covering child safety, dangerous content, sexually explicit content, hate speech, and harassment. Gemma 4 shows major improvements over Gemma 3 across all content safety categories while maintaining low false refusal rates.

Deployment considerations

E2B and E4B are designed for on-device execution (laptops, high-end phones). The 26B A4B model targets consumer GPUs and workstations. The 31B model requires server-class hardware but delivers frontier-level performance.

All models are released under the Apache 2.0 license, enabling commercial use. They are available via Hugging Face Transformers, with compatibility for vLLM, SGLang, and other standard inference tooling.

Recommended sampling settings, standardized across all use cases:

  • temperature=1.0
  • top_p=0.95
  • top_k=64
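As a sketch, these settings map directly onto Transformers-style generate() keyword arguments; note that sampling must be enabled for them to take effect (model loading is omitted here).

```python
# The card's standard sampling configuration, expressed as generate() kwargs.
GENERATION_KWARGS = {
    "do_sample": True,   # sampling must be on for temperature/top_p/top_k
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

# Usage sketch: outputs = model.generate(**inputs, **GENERATION_KWARGS)
print(GENERATION_KWARGS)
```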
