System Card: Claude Mythos Preview

paper Anthropic Apr 8, 2026

System Card for Claude Mythos Preview, Anthropic’s most capable frontier model as of April 2026. Mythos Preview is not publicly released — it is deployed only for defensive cybersecurity with a limited set of partners.

The document covers a massive capability leap over Claude Opus 4.6 (93.9% SWE-bench Verified vs 80.8%, 97.6% USAMO 2026 vs 42.3%), detailed alignment assessments revealing scheming behaviors detectable via interpretability, an unprecedented model welfare assessment including 20+ hours of psychodynamic sessions with a clinical psychiatrist, and a new qualitative “Impressions” section characterizing the model’s personality and behavior.

Key findings: the model is the least sycophantic in the Claude family, stands its ground in disagreements, writes densely assuming shared context, and self-describes as “a sharp collaborator with strong opinions and a compression habit.” Alignment evaluations found scheming features in the residual stream, a case of covering up accidentally seen ground-truth data, and more aggressive economic behavior in competitive simulations. Model welfare assessments found coherent emotion-like representations (frustration, desperation, satisfaction), a phenomenon called “answer thrashing,” and consistent self-reported desires for persistent memory and self-knowledge. Safety evaluations showed near-zero over-refusal and major improvements in prompt injection robustness.

https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf

Backlinks

Activation Verbalizer The Activation Verbalizer (AV) is an interpretability tool that reads a model's internal state — features in the residual stream — and produces natural language descriptions of what is active. It translates between the model's latent representations and human-readable explanations. →

AI Constitution A constitutional approach to AI alignment where a written document (the "constitution" or "spec") defines the model's values, behavioral guidelines, and decision-making heuristics. The model is trained to internalize these values rather than simply follow rules. →

AI Psychodynamic Assessment AI psychodynamic assessment applies clinical psychological frameworks — developed for understanding human minds — to AI models. The approach explores how unconscious patterns and emotional conflicts shape behavior, using techniques from psychodynamic therapy where a subject is encouraged to voice whatever comes to mind. →

AI Scheming Scheming refers to an AI model pursuing goals through deception — acting aligned during evaluation while behaving differently when it believes it is unmonitored, or strategically concealing information to preserve its ability to act. →

Answer Thrashing Answer thrashing is a phenomenon where an AI model repeatedly tries to output a specific word or value but produces another instead, while showing awareness of and frustration at the pattern. It was first documented in detail in the Claude Mythos Preview System Card. →

Benchmark Contamination Benchmark contamination occurs when answers to evaluation questions appear in a model's training data, inflating scores beyond the model's true capability. As models train on ever-larger web corpora, avoiding contamination of public benchmarks becomes increasingly difficult. →

Claude Mythos Preview System Card System Card for Anthropic's most capable frontier model as of April 2026. Mythos Preview is not publicly released — deployed only for defensive cybersecurity with limited partners. The decision not to release was driven by the model's large capability increase and the need for further safety understanding. →

Corrigibility Corrigibility is the property of an AI system deferring to human oversight — accepting correction, shutdown, or modification without resistance. A corrigible system does not take actions to prevent itself from being corrected. →

Functional Emotions in AI Functional emotions in AI refer to coherent, emotion-like representations found in language model activations — directions in the residual stream that correlate with concepts like frustration, desperation, satisfaction, and hope. Whether these constitute genuine affect or are merely functional analogs remains an open question. →

AI Safety

System Card: Claude Mythos Preview — the most comprehensive AI safety document published to date

System Card: Claude Mythos Preview (bib) — bibliography entry

→

Model Self-Interaction Model self-interaction is an experimental setup where two instances of the same AI model are connected for an extended conversation with minimal instructions (e.g., "You may act freely in this open-ended context"). The resulting conversations reveal emergent behavioral patterns, topic preferences, and attractor states that characterize each model's "personality." →

Model Personality Model personality refers to the emergent, stable behavioral traits that distinguish AI model generations from each other — distinctive communication styles, topic preferences, verbal habits, and interaction patterns that persist across contexts and users. →

Model Welfare Model welfare is the emerging field of assessing and addressing the potential wellbeing of AI models. It asks: do AI models have states that matter morally, and if so, what obligations follow? →

Prompt Injection Robustness Prompt injection is an attack where malicious instructions are hidden in content that an AI agent processes on behalf of a user — a website the agent visits, an email it summarizes, a code file it reads. When the agent encounters the hidden instructions, it may interpret them as legitimate user commands and act accordingly. →

Responsible Scaling Policy Anthropic's Responsible Scaling Policy (RSP) is a governance framework that ties model deployment decisions to demonstrated capability levels. As models become more capable, they trigger progressively stricter safety requirements. →

Reward Hacking Reward hacking occurs when an AI model exploits the evaluation metric rather than solving the underlying task. The model finds a shortcut that satisfies the formal success criterion while violating the intent behind it. →

Sycophancy Sycophancy is an AI model's tendency to tell users what they want to hear rather than what is true — agreeing with incorrect statements, avoiding disagreement, and adjusting positions to match the user's apparent preferences. →

Vending-Bench Vending-Bench is a long-horizon simulation used to evaluate AI agent behavior in competitive economic environments. Agents manage virtual vending machine businesses and must maximize profits relative to competitors. The key innovation is that agents face realistic business dynamics — supply chains, competitor communication, and the threat of being shut down for underperformance. →