Responsible Scaling Policy
Anthropic’s Responsible Scaling Policy (RSP) is a governance framework that ties model deployment decisions to demonstrated capability levels. As models become more capable, they trigger progressively stricter safety requirements.
AI Safety Levels (ASL)
The RSP defines AI Safety Levels analogous to biosafety levels. Each ASL specifies capability thresholds and required safety measures:
- ASL-1: No meaningful uplift risk. Minimal safety measures.
- ASL-2: Models that can provide some uplift in dangerous domains but don’t exceed what’s freely available. Standard safety measures.
- ASL-3: Models that substantially increase risk in CBRN (Chemical, Biological, Radiological, Nuclear) or cyber domains beyond what’s available through open sources. Requires enhanced containment and monitoring.
- ASL-4+: Models capable of autonomous operation at dangerous levels. Requires the strongest containment measures.
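The tiered logic above can be sketched as a simple decision rule: a model is assigned the highest level whose trigger condition it meets. This is a minimal illustration only; the function name, boolean inputs, and ordering are invented for this sketch and are not the actual RSP evaluation procedure.

```python
# Illustrative sketch of ASL assignment as a "highest triggered tier" rule.
# All names and conditions here are hypothetical simplifications.

def assign_asl(some_uplift: bool,
               risk_beyond_open_sources: bool,
               dangerous_autonomy: bool) -> str:
    """Return the highest AI Safety Level whose trigger condition is met."""
    if dangerous_autonomy:
        # Autonomous operation at dangerous levels
        return "ASL-4+"
    if risk_beyond_open_sources:
        # Substantial CBRN/cyber risk beyond open sources
        return "ASL-3"
    if some_uplift:
        # Some uplift, but within what is freely available
        return "ASL-2"
    return "ASL-1"
```

For example, a model that provides uplift beyond open sources but shows no dangerous autonomy would land at `assign_asl(True, True, False)` → `"ASL-3"`.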
Capability evaluations
The RSP requires evaluating models against specific capability thresholds:
CBRN evaluations: Testing whether the model can provide meaningful uplift to a malicious actor attempting to acquire or deploy chemical, biological, radiological, or nuclear weapons. The System Card for Claude Mythos Preview reports uplift studies comparing expert performance with and without model assistance.
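The core arithmetic of an uplift study is a comparison of task success rates between an assisted and an unassisted group. The sketch below is hypothetical: the function and the example numbers are invented for illustration and are not figures from the system card.

```python
# Hypothetical sketch of uplift-study arithmetic: absolute uplift is the
# difference in success rates with vs. without model assistance.

def uplift(successes_with: int, n_with: int,
           successes_without: int, n_without: int) -> float:
    """Absolute uplift: assisted success rate minus unassisted success rate."""
    return successes_with / n_with - successes_without / n_without

# Invented example: 12/20 assisted vs 6/20 unassisted participants succeed,
# giving an absolute uplift of roughly 0.30.
print(uplift(12, 20, 6, 20))
```

Real studies also account for sample size and statistical significance; this shows only the headline quantity.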
Autonomous Replication and Adaptation (ARA): Testing whether the model can autonomously acquire resources, maintain its own operation, and adapt to obstacles — capabilities that would undermine the ability to shut it down. This connects directly to corrigibility concerns.
Cybersecurity capabilities: Evaluating offensive and defensive cyber capabilities, including vulnerability discovery, exploit development, and defense automation.
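Conceptually, the per-domain evaluations feed a single deployment gate: if any capability threshold is crossed, release is restricted. A minimal sketch of that aggregation, assuming an invented `EvalResult` record (the data structure and return strings are illustrative, not the RSP's actual mechanism):

```python
# Hedged sketch: aggregate per-domain evaluation results into a deployment
# decision. Any crossed threshold restricts release.

from dataclasses import dataclass

@dataclass
class EvalResult:
    domain: str               # e.g. "CBRN", "ARA", "cyber"
    threshold_crossed: bool
    notes: str = ""

def deployment_decision(results: list[EvalResult]) -> str:
    """Restrict deployment if any capability threshold was crossed."""
    crossed = [r.domain for r in results if r.threshold_crossed]
    if crossed:
        return f"restricted (thresholds crossed: {', '.join(crossed)})"
    return "general availability"
```

In practice the decision also weighs available safeguards and mitigations, not just raw threshold crossings.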
Application to Mythos Preview
Mythos Preview’s large capability jump triggered the RSP evaluation process. The evaluation results — particularly in cybersecurity and alignment — led Anthropic to decide against general availability. The model is deployed only for defensive cybersecurity work with a limited set of partners, and the findings inform future model releases and safeguards.
This represents the RSP working as designed: a capability threshold triggered evaluation, evaluation results informed a deployment restriction, and the restriction was applied despite the model’s commercial value.
Relationship to other concepts
- AI scheming evaluations are part of the RSP alignment assessment
- Benchmark contamination complicates capability evaluation — inflated scores can trigger an ASL transition prematurely or mask one that should occur
- The RSP framework shapes what Anthropic tests and reports in the system card