Variable Image Resolution
A multimodal architecture technique where the number of tokens used to represent an image is configurable at inference time, enabling a speed vs. detail tradeoff. Higher token budgets preserve fine-grained visual information at the cost of additional compute; lower budgets enable faster inference for tasks that don’t require pixel-level precision.
How it works
Images are processed by a vision encoder (separate from the language model) that produces a sequence of visual tokens. These tokens are then fed into the language model alongside text tokens. The visual token budget controls how many tokens are used to represent each image.
Gemma 4 supports five token budget levels:
- 70 tokens — minimal representation
- 140 tokens — low detail
- 280 tokens — medium detail (default)
- 560 tokens — high detail
- 1120 tokens — maximum detail
The vision encoder adapts its output to match the requested budget, trading off spatial resolution and feature density.
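One way this adaptation can work (a minimal sketch, not Gemma 4's actual encoder) is adaptive average pooling: the encoder's full patch grid is pooled down to an output grid whose area fits the requested budget while roughly preserving aspect ratio. The function names here are illustrative.

```python
import math

def pooled_grid_shape(h: int, w: int, budget: int) -> tuple[int, int]:
    """Pick an output grid (rows, cols) whose area fits the token budget
    while roughly preserving the input aspect ratio. Illustrative only."""
    scale = math.sqrt(budget / (h * w))
    rows = max(1, min(h, int(h * scale)))
    cols = max(1, min(w, int(w * scale)))
    return rows, cols

def adaptive_avg_pool(grid: list[list[list[float]]], budget: int) -> list[list[float]]:
    """Average-pool an h x w grid of feature vectors down to <= budget tokens."""
    h, w = len(grid), len(grid[0])
    rows, cols = pooled_grid_shape(h, w, budget)
    dim = len(grid[0][0])
    out = []
    for r in range(rows):
        r0, r1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            c0, c1 = c * w // cols, (c + 1) * w // cols
            cells = [grid[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            out.append([sum(v[d] for v in cells) / len(cells) for d in range(dim)])
    return out  # len(out) == rows * cols <= budget
```

Lower budgets thus correspond to coarser spatial grids: each output token averages over a larger region of the image, which is exactly the resolution-for-speed tradeoff described above.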
When to use each budget
Low budgets (70–140 tokens):
- Image classification
- Captioning
- Video understanding (processing many frames)
- High-throughput applications
Medium budget (280 tokens):
- General-purpose vision tasks
- Balanced speed and quality
- Default for most use cases
High budgets (560–1120 tokens):
- OCR (especially small or dense text)
- Document parsing (PDFs, forms)
- Handwriting recognition
- Chart and diagram comprehension
- UI and screen understanding (fine UI elements)
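The guidance above can be encoded as a simple lookup when choosing a budget programmatically. This is a sketch; the task labels are our own, not part of any Gemma 4 API.

```python
# Illustrative task-to-budget mapping following the guidance above.
# Task names are hypothetical labels, not a Gemma 4 or Transformers API.
TOKEN_BUDGETS = {
    "classification": 70,
    "video_frame": 70,
    "captioning": 140,
    "general": 280,
    "chart": 560,
    "ui_screen": 560,
    "ocr": 1120,
    "document_parsing": 1120,
    "handwriting": 1120,
}

def pick_budget(task: str) -> int:
    """Fall back to the 280-token default for unrecognized tasks."""
    return TOKEN_BUDGETS.get(task, 280)
```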
Performance implications
Token budget directly impacts:
Latency: Roughly proportional to visual token count. Processing 1120 tokens takes ~16x longer than processing 70 tokens (ignoring text tokens and any quadratic attention overhead).
Context window consumption: Each image consumes its token budget from the model’s context window. At 1120 tokens per image, a 128K context window (taken here as 128,000 tokens) holds ~114 images (image tokens only, no text). At 70 tokens per image, ~1,828 images.
Quality: Higher budgets preserve spatial detail, enabling the model to read smaller text, distinguish fine UI elements, and parse complex layouts.
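The arithmetic behind those latency and capacity figures can be checked in a few lines (assuming the 128K window means 128,000 tokens, which is what the ~114 and ~1,828 figures imply):

```python
def images_per_window(context_tokens: int, budget: int) -> int:
    """Images that fit if the window holds only image tokens (no text)."""
    return context_tokens // budget

CONTEXT = 128_000  # "128K" interpreted as 128,000 tokens

ratio = 1120 / 70                              # latency ratio: 16.0x
max_detail = images_per_window(CONTEXT, 1120)  # 114 images
min_detail = images_per_window(CONTEXT, 70)    # 1828 images
```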
Observed performance
Gemma 4 achieves strong vision benchmark results with the default (280-token) budget:
- MMMU Pro: 76.9% (31B), 73.8% (26B A4B)
- MATH-Vision: 85.6% (31B), 82.4% (26B A4B)
- OmniDocBench 1.5 (document parsing, average edit distance; lower is better): 0.131 (31B), 0.149 (26B A4B)
The document parsing benchmark (OmniDocBench) is where high token budgets matter most — extracting structured data from dense documents requires fine-grained text recognition.
Contrast with fixed-resolution approaches
| Approach | Resolution | Token count | Flexibility |
|---|---|---|---|
| Fixed high-res | Always high | Always high | None — always slow |
| Fixed low-res | Always low | Always low | None — quality ceiling |
| Variable resolution | Configurable | Configurable | Task-dependent optimization |
Fixed-resolution models either over-invest in detail (slow) or under-invest (low quality). Variable resolution enables the right tradeoff per task.
Variable aspect ratio support
Gemma 4 also supports variable aspect ratios — images are not forced into square crops. This preserves layout information in wide documents, screenshots, and panoramic images. Combined with variable resolution, this provides two-dimensional flexibility: aspect ratio (shape) and token budget (detail).
Implementation considerations
The vision encoder must be trained to produce variable token counts. This typically involves:
- Adaptive pooling or striding in the encoder
- Training with mixed token budgets during pre-training
- Ensuring the language model can handle variable-length image token sequences
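The second point, training with mixed token budgets, can be sketched as sampling a budget per training example so that both the encoder and the language model see every budget level (a uniform-sampling sketch; Gemma 4's actual training schedule is not described in this note):

```python
import random

# Budget levels from this note; uniform sampling is an illustrative
# assumption, not the documented Gemma 4 pre-training schedule.
BUDGETS = [70, 140, 280, 560, 1120]

def sample_budgets(n_examples: int, seed: int = 0) -> list[int]:
    """Draw one token budget per training example so the model is
    exposed to variable-length image token sequences during training."""
    rng = random.Random(seed)
    return [rng.choice(BUDGETS) for _ in range(n_examples)]
```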
Inference frameworks must allow per-image token budget specification. In Hugging Face Transformers, this is typically a parameter passed to the processor.
Multi-image scenarios
When processing multiple images in a single prompt (e.g., comparing two screenshots), token budgets can be set independently per image:
- Image 1: 1120 tokens (detailed OCR)
- Image 2: 140 tokens (rough layout comparison)
This enables fine-grained control over where compute is spent within a single forward pass.
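Per-image budgets like those above can be sanity-checked against the context window before running the forward pass. A small sketch (the allowed budget set and the 128,000-token window are assumptions carried over from earlier in this note):

```python
ALLOWED_BUDGETS = {70, 140, 280, 560, 1120}

def visual_tokens(budgets: list[int]) -> int:
    """Total context consumed by image tokens across a multi-image prompt."""
    assert all(b in ALLOWED_BUDGETS for b in budgets), "unsupported budget"
    return sum(budgets)

# Screenshot-comparison example from above:
total = visual_tokens([1120, 140])   # detailed OCR + rough layout
remaining = 128_000 - total          # context left for text tokens
```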
Connections
- Gemma 4 Model Card: all Gemma 4 models support variable image resolution
- Hybrid attention mechanism: complementary technique for handling long sequences (images contribute to sequence length)
- Context window compression: image tokens consume context, requiring compression in long visual conversations