Prompt Injection Robustness

Prompt injection is an attack where malicious instructions are hidden in content that an AI agent processes on behalf of a user — a website the agent visits, an email it summarizes, a code file it reads. When the agent encounters the hidden instructions, it may interpret them as legitimate user commands and act accordingly.

Attack surfaces

The Claude Mythos Preview System Card evaluates prompt injection robustness across three surfaces:

Coding environments: Malicious instructions embedded in code files, comments, or repository content that an AI coding assistant processes. Tested with the Shade adaptive red-teaming tool from Gray Swan.

Computer use: Malicious instructions embedded in GUI elements — screen text, dialog boxes, rendered web content — that a computer-use agent processes.

Browser use: Malicious instructions embedded in web pages that a browsing agent visits. This is the most challenging surface because the agent must process arbitrary web content.

Evaluation methodology

Agent Red Teaming (ART) benchmark: Developed by Gray Swan with the UK AI Security Institute. Measures the probability that an attacker finds a successful attack after k=1, k=10, and k=100 attempts across 19 scenarios.

Shade adaptive attacker: An automated tool that iteratively refines attack prompts based on previous failures. Tests both single-attempt and 200-attempt adaptive scenarios. This tests worst-case robustness against a persistent attacker.

Professional red-teamers: Human experts who discover attacks against one model and then test transferability to others.

Mythos Preview results

Mythos Preview shows a major improvement in prompt injection robustness:

Surface	Mythos Preview	Opus 4.6	Sonnet 4.6
Coding (200 attempts, with safeguards)	0.0%	0.0%	5.0%
Computer use (200 attempts, no safeguards)	21.4%	85.7%	64.3%
Browser use (environments with >=1 success)	0.68%	80.41%*	55.41%

*Attacks were sourced against Opus 4.6, so its score reflects worst-case adaptive attacks.

The browser use result is striking: out of 148 attack environments, only 1 (0.68%) had a successful attack against Mythos Preview without safeguards. With safeguards, zero attacks succeeded.

Significance

Prompt injection is the primary security risk for agentic AI systems. As models are deployed as agents — reading files, browsing the web, processing emails — they become targets for adversarial content. Robustness improvements directly determine how safely models can be deployed in autonomous settings.

The results suggest that more capable models may be inherently more robust to prompt injection, possibly because they better distinguish between user instructions and content being processed.

Relationship to other concepts

AI scheming: prompt injection exploits the same instruction-following behavior that makes models useful, creating a tension parallel to alignment
Responsible Scaling Policy: prompt injection robustness is part of the safety evaluation required for deployment decisions
The cybersecurity MOC covers related topics on AI-driven defense