Agent Learning Loop

concept
ai-agents · memory · skills · learning · self-improvement

A closed learning loop in an AI agent that enables self-improvement across sessions. Three components work together:

  1. Persistent memory — facts about the user, preferences, project state. Prefetched before each turn, synced after. The agent nudges itself to persist knowledge it judges worth keeping.
  2. Procedural skills — reusable procedures (markdown files, scripts) that the agent creates after completing complex tasks and improves during subsequent use. Skills compound — each successful execution refines the procedure.
  3. Cross-session recall — searchable history of past conversations, enabling the agent to retrieve context from earlier sessions rather than starting from zero.

The loop closes when knowledge gained in one session (a new skill, a memory entry, a refined procedure) improves performance in future sessions without human intervention.
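The per-turn cycle can be sketched in a few lines. All names here (`MemoryStore`, `SkillLibrary`, `run_turn`, `agent_turn`) are hypothetical placeholders for illustration, not any specific framework's API:

```python
# A minimal sketch of one iteration of the learning loop:
# prefetch memory -> run turn -> sync memory -> optionally create a skill.

class MemoryStore:
    def __init__(self):
        self.entries = []

    def prefetch(self):
        # Inject persisted facts into the turn's context.
        return list(self.entries)

    def sync(self, new_facts):
        # Persist whatever the agent judged worth keeping.
        self.entries.extend(new_facts)


class SkillLibrary:
    def __init__(self):
        self.skills = {}

    def record(self, name, procedure):
        # Create or refine a reusable procedure after a completed task.
        self.skills[name] = procedure


def run_turn(user_input, context):
    # Placeholder for the model call; returns the output plus
    # whatever the agent decided is worth remembering.
    facts = [f"user asked about: {user_input}"]
    return f"answer to {user_input!r}", facts


def agent_turn(memory, skills, user_input):
    context = memory.prefetch()            # memory in, before the turn
    output, new_facts = run_turn(user_input, context)
    memory.sync(new_facts)                 # memory out, after the turn
    if "deploy" in user_input:             # toy trigger for skill creation
        skills.record("deploy", "steps distilled from this session")
    return output
```

A second session constructed over the same `MemoryStore` and `SkillLibrary` starts with the first session's facts and skills already in place, which is what closes the loop.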

Implementations

Hermes Agent implements all three components:

  • Memory: MemoryManager coordinates a built-in store (MEMORY.md/USER.md) with external providers (Honcho, Hindsight, Mem0). Providers expose tools for the agent to read/write persistent notes. Lifecycle hooks sync after each turn.
  • Skills: Markdown SKILL.md files with optional config and scripts. The agent autonomously creates skills after complex tasks. Skills self-improve during use. Distributed via agentskills.io.
  • Recall: FTS5 full-text search over session history with LLM summarization for cross-session retrieval.
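The recall component can be approximated with SQLite's built-in FTS5 extension. The table and column names below are illustrative, not Hermes Agent's actual schema, and the LLM summarization step is omitted:

```python
import sqlite3

# Index session history in an FTS5 virtual table for full-text recall.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE sessions USING fts5(session_id, content)")
db.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        ("s1", "debugged the flaky deploy pipeline; root cause was a stale cache"),
        ("s2", "drafted the quarterly report outline"),
    ],
)


def recall(query, limit=3):
    # bm25() ranks matches (lower score = better); in the full pattern
    # the matched sessions would then be summarized by the LLM.
    return db.execute(
        "SELECT session_id, content FROM sessions "
        "WHERE sessions MATCH ? ORDER BY bm25(sessions) LIMIT ?",
        (query, limit),
    ).fetchall()
```

A query like `recall("deploy")` surfaces the earlier debugging session, giving the agent context it would otherwise have to rebuild from zero.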

Relationship to other patterns

The learning loop is the operational counterpart to the LLM wiki pattern (Karpathy). Where the wiki pattern applies ingest-compile-query to external knowledge, the learning loop applies the same cycle to the agent’s own experience: experience is “ingested” into memory, “compiled” into skills, and “queried” via session search.

CORAL’s shared persistent memory extends this to multi-agent settings — multiple agents accumulate knowledge in a shared filesystem, and cross-agent parents inherit successful strategies.

The key constraint: the loop must be automatic. If it requires human curation, it degrades to a note-taking system. The agent must decide what to remember, when to create a skill, and how to improve it — which requires the LLM to judge the value of its own experience.

Offline consolidation

The learning loop operates in real-time: the agent decides what to remember during conversations. OpenClaw adds an offline complement via sleep-phase memory consolidation — a cron-triggered background process modeled on human sleep phases (light/REM/deep) that retrospectively evaluates which short-term memories actually matter. Memories must accumulate evidence across multiple queries and days before promotion to durable storage. This filters out task-specific noise that real-time judgment often lets through.

The two approaches are complementary: real-time writing captures intent while it’s fresh; offline consolidation provides the retrospective evaluation that prevents memory bloat.
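The evidence-accumulation rule behind consolidation can be sketched as follows. The thresholds (3 hits across 2 distinct days) and all names are illustrative assumptions, not OpenClaw's actual policy:

```python
# Sketch of evidence-based promotion from short-term to durable memory.
MIN_HITS = 3   # assumed threshold: times a memory proved useful
MIN_DAYS = 2   # assumed threshold: distinct days it proved useful


class ShortTermMemory:
    def __init__(self):
        self.evidence = {}  # memory_id -> list of days it was queried usefully

    def record_hit(self, memory_id, day):
        self.evidence.setdefault(memory_id, []).append(day)

    def consolidate(self):
        # Cron-triggered sweep: promote only memories with repeated,
        # multi-day evidence; the rest stay short-term and can expire.
        return [
            mid
            for mid, days in self.evidence.items()
            if len(days) >= MIN_HITS and len(set(days)) >= MIN_DAYS
        ]
```

A memory touched three times but only on a single day stays short-term; one that keeps proving useful across days gets promoted, which is what filters out task-specific noise.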