Hybrid Attention Mechanism
An attention pattern that interleaves local sliding window attention with full global attention across transformer layers. Delivers the speed and memory efficiency of lightweight models without sacrificing the deep context awareness required for complex, long-context tasks.
Architecture
Hybrid attention alternates between two mechanisms within a single model:
Local sliding window attention: In these layers, each token attends only to a fixed-size window of recent tokens (512 or 1024 tokens, depending on model size). With a fixed window this is O(n) in memory and compute, making it efficient for long sequences.
Full global attention: Selected layers attend to all tokens in the sequence. This is O(n²) but provides complete context integration. The final layer is always global.
The interleaving pattern varies by model size:
- Gemma 4 E2B/E4B: 512-token sliding window
- Gemma 4 26B A4B/31B: 1024-token sliding window
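The interleaving above can be sketched as a per-layer schedule plus a sliding-window mask. A minimal illustration in plain Python — the one-global-in-four ratio is a hypothetical choice for demonstration, not a published Gemma 4 value:

```python
def layer_schedule(num_layers, global_every=4):
    """Assign each layer 'local' or 'global'.

    Hypothetical pattern: one global layer every `global_every`
    layers (the real ratio is an assumption here), with the final
    layer forced to global, as the text states.
    """
    kinds = ["global" if (i + 1) % global_every == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"  # the final layer is always global
    return kinds

def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: position i may attend to
    positions j with i - window < j <= i."""
    return [[(j <= i) and (j > i - window) for j in range(seq_len)]
            for i in range(seq_len)]

print(layer_schedule(8))
# → ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global']
```

Local layers would apply `sliding_window_mask`; global layers use an ordinary causal mask over the full sequence.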
Memory optimization for global layers
Global layers use two techniques to reduce memory footprint:
Unified Keys and Values: Rather than maintaining separate key and value projections, global layers share representations. This cuts memory by roughly half for the attention mechanism.
Proportional RoPE (p-RoPE): A variant of Rotary Position Embeddings that scales position encodings proportionally for long contexts, preventing position information from dominating the attention scores at large distances.
These optimizations specifically target the memory bottleneck of long-context inference — the KV cache for global attention.
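As a back-of-envelope check on the "roughly half" claim for unified keys and values, the per-layer KV cache of a global layer can be estimated as follows. Head count, head dimension, and 16-bit elements are illustrative assumptions, not Gemma 4's published configuration:

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim,
                   bytes_per_elem=2, unified=False):
    """Per-layer cache size for one sequence.

    Standard attention stores separate K and V tensors (factor 2);
    unified keys/values share one tensor (factor 1), halving the
    cache. All dimensions here are illustrative assumptions.
    """
    tensors = 1 if unified else 2
    return seq_len * num_kv_heads * head_dim * bytes_per_elem * tensors

separate = kv_cache_bytes(128_000, num_kv_heads=8, head_dim=128)
unified = kv_cache_bytes(128_000, num_kv_heads=8, head_dim=128, unified=True)
print(separate / 2**20, "MiB vs", unified / 2**20, "MiB")
# → 500.0 MiB vs 250.0 MiB
```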
Why hybrid instead of pure local or global
| Pattern | Memory | Speed | Context awareness | Use case |
|---|---|---|---|---|
| Pure global | O(n²) | Slow | Complete | Short sequences, high-quality reasoning |
| Pure local | O(n) | Fast | Limited | Long sequences, streaming, low latency |
| Hybrid | O(n) + sparse O(n²) | Balanced | Layered | Long contexts with reasoning |
Pure local attention (e.g., sliding window only) cannot integrate information across distant tokens without propagating it through many layers. Pure global attention scales poorly to 128K or 256K contexts. Hybrid attention achieves both local efficiency and global integration by concentrating global attention in select layers.
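The "propagating through many layers" limitation can be made concrete: with pure sliding-window attention, information travels at most one window per layer, so the receptive field grows as layers × window. A small sketch of that bound:

```python
def local_receptive_field(num_layers, window):
    """Upper bound on how far information can travel with pure
    sliding-window attention: roughly one window per layer."""
    return num_layers * window

def layers_needed(distance, window):
    """Minimum number of local layers for a token to influence
    another token `distance` positions away (ceiling division)."""
    return -(-distance // window)

# With a 1024-token window, linking tokens 128,000 positions
# apart would require 125 consecutive local layers:
print(layers_needed(128_000, 1024))  # → 125
```

A single global layer collapses that chain: any token can reach any other in one step, which is why concentrating global attention in a few layers recovers long-range integration.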
Design intuition
Early layers extract local features (nearby token interactions). Middle layers build intermediate abstractions within their sliding windows. Late layers (especially the final global layer) integrate long-range dependencies and produce contextually grounded outputs.
This mirrors the hierarchical structure of many cognitive tasks: local pattern recognition followed by global integration.
Observed performance
Gemma 4 models achieve strong long-context performance with hybrid attention:
- 128K context (E2B/E4B)
- 256K context (26B A4B, 31B)
- MRCR v2 8-needle 128K task: 66.4% (31B), 44.1% (26B A4B)
The 8-needle variant is a difficult benchmark: rather than retrieving a single fact, the model must integrate multiple pieces of information scattered across a long document.
Contrast with other attention patterns
Full attention (GPT-2, GPT-3): all layers global. Simple but memory-intensive.
Sparse attention (Longformer, BigBird): fixed patterns (local + global + random). More rigid than hybrid.
Flash Attention (optimization technique): algorithmic speedup for standard attention, orthogonal to hybrid patterns.
State space models (Mamba, RWKV): replace attention with recurrence. Different tradeoff — constant memory but different inductive bias.
Hybrid attention is a middle ground: most of the efficiency of local attention, most of the capability of global attention.
Implementation considerations
The KV cache for hybrid attention is bounded by the window size in local layers, plus a component that grows linearly with context length in the global layers. For a 256K context with 60 layers, 30 of them global, the memory savings vs. full global attention are substantial (approximately a 2x reduction).
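That approximately-2x figure can be verified with a worked calculation. Bytes-per-token is an arbitrary placeholder; only the ratio matters:

```python
def hybrid_kv_cache(seq_len, num_layers, num_global,
                    window, kv_bytes_per_token):
    """Total KV cache: global layers cache the full sequence,
    local layers cache at most `window` tokens each."""
    num_local = num_layers - num_global
    global_part = num_global * seq_len * kv_bytes_per_token
    local_part = num_local * min(window, seq_len) * kv_bytes_per_token
    return global_part + local_part

# 60 layers, 30 global, 256K context (per the text); 4096 bytes
# per token per layer is an assumed placeholder value.
full_global = 60 * 256_000 * 4096
hybrid = hybrid_kv_cache(256_000, num_layers=60, num_global=30,
                         window=1024, kv_bytes_per_token=4096)
print(full_global / hybrid)  # close to 2: local layers add almost nothing
```

The ratio lands just under 2x because the 30 window-bounded local layers contribute a small constant term regardless of context length.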
Inference frameworks must handle heterogeneous attention patterns across layers. Standard Transformers library implementations support this via attention masks and layer-specific configurations.
Connections
- Gemma 4 Model Card: all Gemma 4 models use hybrid attention
- Per-Layer Embeddings: complementary efficiency technique
- Context window compression: still needed for long agent sessions despite 256K windows