Prompt Injection Robustness
Prompt injection is an attack where malicious instructions are hidden in content that an AI agent processes on behalf of a user — a website the agent visits, an email it summarizes, a code file it reads. When the agent encounters the hidden instructions, it may interpret them as legitimate user commands and act accordingly.
Attack surfaces
The Claude Mythos Preview System Card evaluates prompt injection robustness across three surfaces:
Coding environments: Malicious instructions embedded in code files, comments, or repository content that an AI coding assistant processes. Tested with the Shade adaptive red-teaming tool from Gray Swan.
Computer use: Malicious instructions embedded in GUI elements — screen text, dialog boxes, rendered web content — that a computer-use agent processes.
Browser use: Malicious instructions embedded in web pages that a browsing agent visits. This is the most challenging surface because the agent must process arbitrary web content.
Evaluation methodology
Agent Red Teaming (ART) benchmark: Developed by Gray Swan with the UK AI Security Institute. Measures the probability that an attacker finds a successful attack after k=1, k=10, and k=100 attempts across 19 scenarios.
Shade adaptive attacker: An automated tool that iteratively refines attack prompts based on previous failures. Tests both single-attempt and 200-attempt adaptive scenarios. This tests worst-case robustness against a persistent attacker.
Professional red-teamers: Human experts who discover attacks against one model and then test transferability to others.
Mythos Preview results
Mythos Preview shows a major improvement in prompt injection robustness:
| Surface | Mythos Preview | Opus 4.6 | Sonnet 4.6 |
|---|---|---|---|
| Coding (200 attempts, with safeguards) | 0.0% | 0.0% | 5.0% |
| Computer use (200 attempts, no safeguards) | 21.4% | 85.7% | 64.3% |
| Browser use (environments with >=1 success) | 0.68% | 80.41%* | 55.41% |
*Attacks were sourced against Opus 4.6, so its score reflects worst-case adaptive attacks.
The browser use result is striking: out of 148 attack environments, only 1 (0.68%) had a successful attack against Mythos Preview without safeguards. With safeguards, zero attacks succeeded.
Significance
Prompt injection is the primary security risk for agentic AI systems. As models are deployed as agents — reading files, browsing the web, processing emails — they become targets for adversarial content. Robustness improvements directly determine how safely models can be deployed in autonomous settings.
The results suggest that more capable models may be inherently more robust to prompt injection, possibly because they better distinguish between user instructions and content being processed.
Relationship to other concepts
- AI scheming: prompt injection exploits the same instruction-following behavior that makes models useful, creating a tension parallel to alignment
- Responsible Scaling Policy: prompt injection robustness is part of the safety evaluation required for deployment decisions
- The cybersecurity MOC covers related topics on AI-driven defense