Context Window Compression
A technique for maintaining long-running agent conversations within fixed context windows by selectively summarizing older turns while preserving recent context and key decisions.
The problem
LLM context windows are finite. Long agent sessions with tool-calling loops consume context rapidly — each tool call adds both the request and result as conversation turns. A 20-turn session with file reads and terminal output can easily exceed 100K tokens. Without compression, the agent either crashes or loses early context.
The algorithm
The standard approach protects both ends and compresses the middle:
- Protect the head — system prompt + first user/assistant exchange (establishes identity and task framing)
- Protect the tail — recent context by token budget (e.g. last ~20K tokens). This is the agent’s working memory for the current task.
- Prune tool results — cheap first pass: truncate or remove large tool outputs from middle turns (no LLM call needed)
- Summarize the middle — auxiliary LLM (cheap, fast model) produces a structured summary of the compressed turns
- Replace middle turns — swap original turns with a single [CONTEXT COMPACTION] message containing the summary
- Iteratively update — on subsequent compressions, the previous summary becomes input to the next, accumulating context
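The steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular agent's implementation: `count_tokens` and `summarize` are hypothetical stand-ins for a real tokenizer and the auxiliary-model call, and the three-message protected head is an assumption.

```python
# Minimal sketch of the compression pass: protect head and tail,
# replace the middle with a single compaction message.
TAIL_BUDGET = 20_000  # tokens of recent context to protect

def count_tokens(msg: dict) -> int:
    return len(msg["content"]) // 4  # rough heuristic: ~4 chars per token

def summarize(msgs: list[dict]) -> str:
    return f"{len(msgs)} turns compressed"  # placeholder for the LLM call

def compress(history: list[dict]) -> list[dict]:
    head = history[:3]  # system prompt + first user/assistant exchange
    # Walk backwards from the end to find the protected tail by token budget.
    used, tail_start = 0, len(history)
    while tail_start > len(head) and used + count_tokens(history[tail_start - 1]) <= TAIL_BUDGET:
        tail_start -= 1
        used += count_tokens(history[tail_start])
    middle = history[len(head):tail_start]
    if not middle:
        return history  # nothing to compress
    note = {"role": "user", "content": "[CONTEXT COMPACTION]\n" + summarize(middle)}
    return head + [note] + history[tail_start:]
```

A real implementation would run the tool-result pruning pass before calling `summarize`, since that often frees enough space to skip the LLM call entirely.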
Structured summary template
The summary must preserve what matters for continued work. Hermes Agent uses a structured template:
- Goal — what the agent is trying to accomplish
- Progress — what has been done so far
- Decisions — choices made and their rationale
- Files — files read or modified (critical for coding agents)
- Next Steps — what was planned before compression
This structure prevents the common failure mode where compression loses track of which files were edited or which approaches were already tried and rejected.
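A summarization prompt following this template might look like the sketch below. The wording is illustrative, not Hermes Agent's actual prompt; only the five section names come from the template above.

```python
# Structured summary prompt; section headers match the template.
SUMMARY_PROMPT = """Summarize the conversation below for a coding agent.
Use exactly these sections:

## Goal
What the agent is trying to accomplish.

## Progress
What has been done so far.

## Decisions
Choices made and their rationale, including approaches tried and rejected.

## Files
Every file read or modified, with exact paths.

## Next Steps
What was planned before this compression.

Conversation:
{conversation}
"""

def build_summary_request(turns: list[str]) -> str:
    return SUMMARY_PROMPT.format(conversation="\n---\n".join(turns))
```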
Budget heuristics
From Hermes Agent’s implementation:
- Trigger threshold: compress when context usage reaches 50% of window
- Minimum summary size: 2,000 tokens (below this, compression isn’t worth it)
- Summary ratio: 20% of compressed content
- Ceiling: 12,000 tokens absolute max for any single summary
- Cooldown on failure: 600 seconds before retrying (prevents retry spam if auxiliary model is down)
Design considerations
Auxiliary model choice — use a cheap, fast model for compression (not the primary model). The summary doesn’t need to be creative, just accurate and structured. This keeps compression costs low.
Interrupt recovery — compression should be idempotent. If interrupted mid-compression, the next trigger should produce the same result. Hermes achieves this by persisting tool results to disk before compression.
What not to compress — system prompts, memory context, and the most recent turns. The system prompt contains identity and instructions that must persist. Recent turns contain the agent’s current working state.
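The interrupt-recovery idea can be sketched as a write-once spool: each tool result is persisted to disk under a stable id before compression begins, so a retried compression sees identical input. The directory layout and helper names here are illustrative, not Hermes's actual mechanism.

```python
# Write-once persistence of tool results, keyed by tool-call id.
# A compression retry reads the same spooled data, making the pass idempotent.
import json
import os

SPOOL_DIR = "tool_results"  # assumed spool location

def persist_tool_result(call_id: str, result: str) -> str:
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{call_id}.json")
    if not os.path.exists(path):  # write-once: repeat runs keep the first copy
        with open(path, "w") as f:
            json.dump({"call_id": call_id, "result": result}, f)
    return path

def load_tool_result(call_id: str) -> str:
    with open(os.path.join(SPOOL_DIR, f"{call_id}.json")) as f:
        return json.load(f)["result"]
```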
Alternatives
- Sliding window — drop oldest turns without summarization. Simpler but loses all historical context.
- RAG over conversation — index turns in a vector store, retrieve relevant ones. More complex, better recall for specific facts, but expensive per-turn.
- Hierarchical summarization — summarize groups of turns at multiple levels of abstraction. Used by some research systems but adds complexity.
Context window compression is the pragmatic middle ground: simple enough to implement reliably, structured enough to preserve decision context.
Staged summarization
OpenClaw extends the basic algorithm with staged summarization for cases where the history to compress exceeds the summarization model’s own context window. Messages are split into chunks by token share, each chunk is summarized independently, then partial summaries are merged. computeAdaptiveChunkRatio() adjusts chunk sizes based on average message size relative to the context window — when messages are large (>10% of context), chunks get smaller.
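The chunk-then-merge flow can be sketched as follows. This is an illustrative analogue, not OpenClaw's actual computeAdaptiveChunkRatio(): the shrink rule follows the description above (messages averaging over 10% of the window get smaller chunks), but the specific ratios 0.25 and 0.50 are assumptions.

```python
# Staged summarization: split by token budget, summarize chunks, merge.
def adaptive_chunk_ratio(avg_msg_tokens: int, window: int) -> float:
    # Large messages (>10% of the window) get smaller chunks.
    if avg_msg_tokens > 0.10 * window:
        return 0.25
    return 0.50

def staged_summarize(messages, window, count_tokens, summarize):
    total = sum(count_tokens(m) for m in messages)
    avg = total // max(len(messages), 1)
    budget = int(adaptive_chunk_ratio(avg, window) * window)
    # Split messages into chunks that each fit the per-chunk token budget.
    chunks, current, used = [], [], 0
    for msg in messages:
        t = count_tokens(msg)
        if current and used + t > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(msg)
        used += t
    if current:
        chunks.append(current)
    # Summarize each chunk independently, then merge the partial summaries.
    partials = [summarize(c) for c in chunks]
    return summarize(partials) if len(partials) > 1 else partials[0]
```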
The fallback cascade handles summarization failures gracefully: full summarization → partial summarization (excluding oversized messages) → structural note (“Context contained N messages, summary unavailable”).
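The cascade reads naturally as nested try/fallback steps; a sketch, with `full_summarize` and `partial_summarize` as hypothetical callables:

```python
# Fallback cascade: full -> partial (drop oversized messages) -> structural note.
def summarize_with_fallback(messages, full_summarize, partial_summarize,
                            max_msg_tokens, count_tokens):
    try:
        return full_summarize(messages)
    except Exception:
        pass  # fall through to partial summarization
    try:
        fitting = [m for m in messages if count_tokens(m) <= max_msg_tokens]
        if fitting:
            return partial_summarize(fitting)
    except Exception:
        pass  # fall through to the structural note
    # Last resort: no LLM output, but the history length is preserved.
    return f"Context contained {len(messages)} messages, summary unavailable"
```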
Identifier preservation
A critical failure mode in LLM-driven summarization: the model corrupts identifiers. UUIDs get shortened, file paths get reconstructed from memory, hashes get truncated, API keys get partially redacted. OpenClaw addresses this with an explicit identifier preservation policy that instructs the summarization model to preserve “all opaque identifiers exactly as written (no shortening or reconstruction), including UUIDs, hashes, IDs, tokens, API keys, hostnames, IPs, ports, URLs, and file names.” The policy is configurable with modes: strict, custom, off.
Security in compaction
Tool results are stripped of sensitive details before being fed into the summarization prompt. This prevents untrusted tool output (which may contain prompt injection attempts) from being processed by the auxiliary model during compaction.
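A stripping pass of this kind might be a set of redaction patterns run over tool output before it enters the summarization prompt. The patterns below are assumptions for illustration, not OpenClaw's actual rules; a real implementation would use a broader secret scanner.

```python
# Redact likely-sensitive spans from tool output before summarization.
import re

SENSITIVE_PATTERNS = [
    # key=value style credentials (api_key, token, secret, password)
    re.compile(r"(?i)(api[_-]?key|token|secret|password)\s*[:=]\s*\S+"),
    # PEM-encoded private key blocks
    re.compile(r"-----BEGIN [A-Z ]+PRIVATE KEY-----[\s\S]*?-----END [A-Z ]+PRIVATE KEY-----"),
]

def redact_tool_result(text: str) -> str:
    for pat in SENSITIVE_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```

Note that redaction reduces what the auxiliary model can leak or be manipulated into echoing, but it is not a complete defense against prompt injection in tool output.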