Context Window Compression
A technique for maintaining long-running agent conversations within fixed context windows by selectively summarizing older turns while preserving recent context and key decisions.
The problem
LLM context windows are finite. Long agent sessions with tool-calling loops consume context rapidly — each tool call adds both the request and result as conversation turns. A 20-turn session with file reads and terminal output can easily exceed 100K tokens. Without compression, the agent either crashes or loses early context.
The algorithm
The standard approach protects both ends and compresses the middle:
- Protect the head — system prompt + first user/assistant exchange (establishes identity and task framing)
- Protect the tail — recent context by token budget (e.g. last ~20K tokens). This is the agent’s working memory for the current task.
- Prune tool results — cheap first pass: truncate or remove large tool outputs from middle turns (no LLM call needed)
- Summarize the middle — auxiliary LLM (cheap, fast model) produces a structured summary of the compressed turns
- Replace middle turns — swap original turns with a single [CONTEXT COMPACTION] message containing the summary
- Iteratively update — on subsequent compressions, the previous summary becomes input to the next, accumulating context
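The steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular agent's implementation: `count_tokens` and `summarize` are hypothetical stand-ins for a real tokenizer and the auxiliary-model call, and the three-message protected head is an assumption.

```python
# Minimal sketch of the compression pass: protect head and tail,
# replace the middle with a single compaction message.
TAIL_BUDGET = 20_000  # tokens of recent context to protect

def count_tokens(msg: dict) -> int:
    return len(msg["content"]) // 4  # rough heuristic: ~4 chars per token

def summarize(msgs: list[dict]) -> str:
    return f"{len(msgs)} turns compressed"  # placeholder for the LLM call

def compress(history: list[dict]) -> list[dict]:
    head = history[:3]  # system prompt + first user/assistant exchange
    # Walk backwards from the end to find the protected tail by token budget.
    used, tail_start = 0, len(history)
    while tail_start > len(head) and used + count_tokens(history[tail_start - 1]) <= TAIL_BUDGET:
        tail_start -= 1
        used += count_tokens(history[tail_start])
    middle = history[len(head):tail_start]
    if not middle:
        return history  # nothing to compress
    note = {"role": "user", "content": "[CONTEXT COMPACTION]\n" + summarize(middle)}
    return head + [note] + history[tail_start:]
```

A real implementation would run the tool-result pruning pass before calling `summarize`, since that often frees enough space to skip the LLM call entirely.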
Structured summary template
The summary must preserve what matters for continued work. Hermes Agent uses a structured template:
- Goal — what the agent is trying to accomplish
- Progress — what has been done so far
- Decisions — choices made and their rationale
- Files — files read or modified (critical for coding agents)
- Next Steps — what was planned before compression
This structure prevents the common failure mode where compression loses track of which files were edited or which approaches were already tried and rejected.
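A summarization prompt following this template might look like the sketch below. The wording is illustrative, not Hermes Agent's actual prompt; only the five section names come from the template above.

```python
# Structured summary prompt; section headers match the template.
SUMMARY_PROMPT = """Summarize the conversation below for a coding agent.
Use exactly these sections:

## Goal
What the agent is trying to accomplish.

## Progress
What has been done so far.

## Decisions
Choices made and their rationale, including approaches tried and rejected.

## Files
Every file read or modified, with exact paths.

## Next Steps
What was planned before this compression.

Conversation:
{conversation}
"""

def build_summary_request(turns: list[str]) -> str:
    return SUMMARY_PROMPT.format(conversation="\n---\n".join(turns))
```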
Budget heuristics
From Hermes Agent’s implementation:
- Trigger threshold: compress when context usage reaches 50% of window
- Minimum summary size: 2,000 tokens (below this, compression isn’t worth it)
- Summary ratio: 20% of compressed content
- Ceiling: 12,000 tokens absolute max for any single summary
- Cooldown on failure: 600 seconds before retrying (prevents retry spam if auxiliary model is down)
Design considerations
Auxiliary model choice — use a cheap, fast model for compression (not the primary model). The summary doesn’t need to be creative, just accurate and structured. This keeps compression costs low.
Interrupt recovery — compression should be idempotent. If interrupted mid-compression, the next trigger should produce the same result. Hermes achieves this by persisting tool results to disk before compression.
What not to compress — system prompts, memory context, and the most recent turns. The system prompt contains identity and instructions that must persist. Recent turns contain the agent’s current working state.
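The interrupt-recovery idea can be sketched as a write-once spool: each tool result is persisted to disk under a stable id before compression begins, so a retried compression sees identical input. The directory layout and helper names here are illustrative, not Hermes's actual mechanism.

```python
# Write-once persistence of tool results, keyed by tool-call id.
# A compression retry reads the same spooled data, making the pass idempotent.
import json
import os

SPOOL_DIR = "tool_results"  # assumed spool location

def persist_tool_result(call_id: str, result: str) -> str:
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{call_id}.json")
    if not os.path.exists(path):  # write-once: repeat runs keep the first copy
        with open(path, "w") as f:
            json.dump({"call_id": call_id, "result": result}, f)
    return path

def load_tool_result(call_id: str) -> str:
    with open(os.path.join(SPOOL_DIR, f"{call_id}.json")) as f:
        return json.load(f)["result"]
```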
Alternatives
- Sliding window — drop oldest turns without summarization. Simpler but loses all historical context.
- RAG over conversation — index turns in a vector store, retrieve relevant ones. More complex, better recall for specific facts, but expensive per-turn.
- Hierarchical summarization — summarize groups of turns at multiple levels of abstraction. Used by some research systems but adds complexity.
Context window compression is the pragmatic middle ground: simple enough to implement reliably, structured enough to preserve decision context.
Staged summarization
OpenClaw extends the basic algorithm with staged summarization for cases where the history to compress exceeds the summarization model’s own context window. Messages are split into chunks by token share, each chunk is summarized independently, then partial summaries are merged. computeAdaptiveChunkRatio() adjusts chunk sizes based on average message size relative to the context window — when messages are large (>10% of context), chunks get smaller.
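The chunk-then-merge flow can be sketched as follows. This is an illustrative analogue, not OpenClaw's actual computeAdaptiveChunkRatio(): the shrink rule follows the description above (messages averaging over 10% of the window get smaller chunks), but the specific ratios 0.25 and 0.50 are assumptions.

```python
# Staged summarization: split by token budget, summarize chunks, merge.
def adaptive_chunk_ratio(avg_msg_tokens: int, window: int) -> float:
    # Large messages (>10% of the window) get smaller chunks.
    if avg_msg_tokens > 0.10 * window:
        return 0.25
    return 0.50

def staged_summarize(messages, window, count_tokens, summarize):
    total = sum(count_tokens(m) for m in messages)
    avg = total // max(len(messages), 1)
    budget = int(adaptive_chunk_ratio(avg, window) * window)
    # Split messages into chunks that each fit the per-chunk token budget.
    chunks, current, used = [], [], 0
    for msg in messages:
        t = count_tokens(msg)
        if current and used + t > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(msg)
        used += t
    if current:
        chunks.append(current)
    # Summarize each chunk independently, then merge the partial summaries.
    partials = [summarize(c) for c in chunks]
    return summarize(partials) if len(partials) > 1 else partials[0]
```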
The fallback cascade handles summarization failures gracefully: full summarization → partial summarization (excluding oversized messages) → structural note (“Context contained N messages, summary unavailable”).
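The cascade reads naturally as nested try/fallback steps; a sketch, with `full_summarize` and `partial_summarize` as hypothetical callables:

```python
# Fallback cascade: full -> partial (drop oversized messages) -> structural note.
def summarize_with_fallback(messages, full_summarize, partial_summarize,
                            max_msg_tokens, count_tokens):
    try:
        return full_summarize(messages)
    except Exception:
        pass  # fall through to partial summarization
    try:
        fitting = [m for m in messages if count_tokens(m) <= max_msg_tokens]
        if fitting:
            return partial_summarize(fitting)
    except Exception:
        pass  # fall through to the structural note
    # Last resort: no LLM output, but the history length is preserved.
    return f"Context contained {len(messages)} messages, summary unavailable"
```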
Identifier preservation
A critical failure mode in LLM-driven summarization: the model corrupts identifiers. UUIDs get shortened, file paths get reconstructed from memory, hashes get truncated, API keys get partially redacted. OpenClaw addresses this with an explicit identifier preservation policy that instructs the summarization model to preserve “all opaque identifiers exactly as written (no shortening or reconstruction), including UUIDs, hashes, IDs, tokens, API keys, hostnames, IPs, ports, URLs, and file names.” The policy is configurable with modes: strict, custom, off.
Security in compaction
Tool results are stripped of sensitive details before being fed into the summarization prompt. This prevents untrusted tool output (which may contain prompt injection attempts) from being processed by the auxiliary model during compaction.
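A stripping pass of this kind might be a set of redaction patterns run over tool output before it enters the summarization prompt. The patterns below are assumptions for illustration, not OpenClaw's actual rules; a real implementation would use a broader secret scanner.

```python
# Redact likely-sensitive spans from tool output before summarization.
import re

SENSITIVE_PATTERNS = [
    # key=value style credentials (api_key, token, secret, password)
    re.compile(r"(?i)(api[_-]?key|token|secret|password)\s*[:=]\s*\S+"),
    # PEM-encoded private key blocks
    re.compile(r"-----BEGIN [A-Z ]+PRIVATE KEY-----[\s\S]*?-----END [A-Z ]+PRIVATE KEY-----"),
]

def redact_tool_result(text: str) -> str:
    for pat in SENSITIVE_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```

Note that redaction reduces what the auxiliary model can leak or be manipulated into echoing, but it is not a complete defense against prompt injection in tool output.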