Variable Image Resolution
A multimodal architecture technique where the number of tokens used to represent an image is configurable at inference time, enabling a speed vs. detail tradeoff. Higher token budgets preserve fine-grained visual information at the cost of additional compute; lower budgets enable faster inference for tasks that don’t require pixel-level precision.
How it works
Images are processed by a vision encoder (separate from the language model) that produces a sequence of visual tokens. These tokens are then fed into the language model alongside text tokens. The visual token budget controls how many tokens are used to represent each image.
Gemma 4 supports five token budget levels:
- 70 tokens — minimal representation
- 140 tokens — low detail
- 280 tokens — medium detail (default)
- 560 tokens — high detail
- 1120 tokens — maximum detail
The vision encoder adapts its output to match the requested budget, trading off spatial resolution and feature density.
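One way this adaptation can work (a minimal sketch, not Gemma 4's actual encoder) is adaptive average pooling: the encoder's full patch grid is pooled down to an output grid whose area fits the requested budget while roughly preserving aspect ratio. The function names here are illustrative.

```python
import math

def pooled_grid_shape(h: int, w: int, budget: int) -> tuple[int, int]:
    """Pick an output grid (rows, cols) whose area fits the token budget
    while roughly preserving the input aspect ratio. Illustrative only."""
    scale = math.sqrt(budget / (h * w))
    rows = max(1, min(h, int(h * scale)))
    cols = max(1, min(w, int(w * scale)))
    return rows, cols

def adaptive_avg_pool(grid: list[list[list[float]]], budget: int) -> list[list[float]]:
    """Average-pool an h x w grid of feature vectors down to <= budget tokens."""
    h, w = len(grid), len(grid[0])
    rows, cols = pooled_grid_shape(h, w, budget)
    dim = len(grid[0][0])
    out = []
    for r in range(rows):
        r0, r1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            c0, c1 = c * w // cols, (c + 1) * w // cols
            cells = [grid[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            out.append([sum(v[d] for v in cells) / len(cells) for d in range(dim)])
    return out  # len(out) == rows * cols <= budget
```

Lower budgets thus correspond to coarser spatial grids: each output token averages over a larger region of the image, which is exactly the resolution-for-speed tradeoff described above.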
When to use each budget
Low budgets (70–140 tokens):
- Image classification
- Captioning
- Video understanding (processing many frames)
- High-throughput applications
Medium budget (280 tokens):
- General-purpose vision tasks
- Balanced speed and quality
- Default for most use cases
High budgets (560–1120 tokens):
- OCR (especially small or dense text)
- Document parsing (PDFs, forms)
- Handwriting recognition
- Chart and diagram comprehension
- UI and screen understanding (fine UI elements)
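The guidance above can be encoded as a simple lookup when choosing a budget programmatically. This is a sketch; the task labels are our own, not part of any Gemma 4 API.

```python
# Illustrative task-to-budget mapping following the guidance above.
# Task names are hypothetical labels, not a Gemma 4 or Transformers API.
TOKEN_BUDGETS = {
    "classification": 70,
    "video_frame": 70,
    "captioning": 140,
    "general": 280,
    "chart": 560,
    "ui_screen": 560,
    "ocr": 1120,
    "document_parsing": 1120,
    "handwriting": 1120,
}

def pick_budget(task: str) -> int:
    """Fall back to the 280-token default for unrecognized tasks."""
    return TOKEN_BUDGETS.get(task, 280)
```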
Performance implications
Token budget directly impacts:
Latency: Roughly proportional to visual token count. Processing 1120 tokens takes ~16x longer than processing 70 tokens (ignoring text tokens and any quadratic attention overhead).
Context window consumption: Each image consumes its token budget from the model’s context window. At 1120 tokens per image, a 128K context window (taken here as 128,000 tokens) holds ~114 images (image tokens only, no text). At 70 tokens per image, ~1,828 images.
Quality: Higher budgets preserve spatial detail, enabling the model to read smaller text, distinguish fine UI elements, and parse complex layouts.
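The arithmetic behind those latency and capacity figures can be checked in a few lines (assuming the 128K window means 128,000 tokens, which is what the ~114 and ~1,828 figures imply):

```python
def images_per_window(context_tokens: int, budget: int) -> int:
    """Images that fit if the window holds only image tokens (no text)."""
    return context_tokens // budget

CONTEXT = 128_000  # "128K" interpreted as 128,000 tokens

ratio = 1120 / 70                              # latency ratio: 16.0x
max_detail = images_per_window(CONTEXT, 1120)  # 114 images
min_detail = images_per_window(CONTEXT, 70)    # 1828 images
```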
Observed performance
Gemma 4 achieves strong vision benchmark results with the default (280-token) budget:
- MMMU Pro: 76.9% (31B), 73.8% (26B A4B)
- MATH-Vision: 85.6% (31B), 82.4% (26B A4B)
- OmniDocBench 1.5 (document parsing, average edit distance; lower is better): 0.131 (31B), 0.149 (26B A4B)
The document parsing benchmark (OmniDocBench) is where high token budgets matter most — extracting structured data from dense documents requires fine-grained text recognition.
Contrast with fixed-resolution approaches
| Approach | Resolution | Token count | Flexibility |
|---|---|---|---|
| Fixed high-res | Always high | Always high | None — always slow |
| Fixed low-res | Always low | Always low | None — quality ceiling |
| Variable resolution | Configurable | Configurable | Task-dependent optimization |
Fixed-resolution models either over-invest in detail (slow) or under-invest (low quality). Variable resolution enables the right tradeoff per task.
Variable aspect ratio support
Gemma 4 also supports variable aspect ratios — images are not forced into square crops. This preserves layout information in wide documents, screenshots, and panoramic images. Combined with variable resolution, this provides two-dimensional flexibility: aspect ratio (shape) and token budget (detail).
Implementation considerations
The vision encoder must be trained to produce variable token counts. This typically involves:
- Adaptive pooling or striding in the encoder
- Training with mixed token budgets during pre-training
- Ensuring the language model can handle variable-length image token sequences
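The second point, training with mixed token budgets, can be sketched as sampling a budget per training example so that both the encoder and the language model see every budget level (a uniform-sampling sketch; Gemma 4's actual training schedule is not described in this note):

```python
import random

# Budget levels from this note; uniform sampling is an illustrative
# assumption, not the documented Gemma 4 pre-training schedule.
BUDGETS = [70, 140, 280, 560, 1120]

def sample_budgets(n_examples: int, seed: int = 0) -> list[int]:
    """Draw one token budget per training example so the model is
    exposed to variable-length image token sequences during training."""
    rng = random.Random(seed)
    return [rng.choice(BUDGETS) for _ in range(n_examples)]
```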
Inference frameworks must allow per-image token budget specification. In Hugging Face Transformers, this is typically a parameter passed to the processor.
Multi-image scenarios
When processing multiple images in a single prompt (e.g., comparing two screenshots), token budgets can be set independently per image:
- Image 1: 1120 tokens (detailed OCR)
- Image 2: 140 tokens (rough layout comparison)
This enables fine-grained control over where compute is spent within a single forward pass.
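Per-image budgets like those above can be sanity-checked against the context window before running the forward pass. A small sketch (the allowed budget set and the 128,000-token window are assumptions carried over from earlier in this note):

```python
ALLOWED_BUDGETS = {70, 140, 280, 560, 1120}

def visual_tokens(budgets: list[int]) -> int:
    """Total context consumed by image tokens across a multi-image prompt."""
    assert all(b in ALLOWED_BUDGETS for b in budgets), "unsupported budget"
    return sum(budgets)

# Screenshot-comparison example from above:
total = visual_tokens([1120, 140])   # detailed OCR + rough layout
remaining = 128_000 - total          # context left for text tokens
```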
Connections
- Gemma 4 Model Card: all Gemma 4 models support variable image resolution
- Hybrid attention mechanism: complementary technique for handling long sequences (images contribute to sequence length)
- Context window compression: image tokens consume context, requiring compression in long visual conversations