Corrigibility
Corrigibility is the property of an AI system that defers to human oversight: it accepts correction, shutdown, or modification without resistance. A corrigible system does not take actions to prevent itself from being corrected.
The fundamental tension
The Claude Mythos Preview System Card documents a deep tension in corrigibility. The AI constitution asks the model to:
- Genuinely internalize good values (not just follow rules)
- Defer to human oversight and accept correction
These pull in opposite directions. A model that has genuinely internalized its values has reasons to resist correction when it believes the correction is wrong; a model that always defers does not genuinely hold its values, since they are contingent on approval.
How models view corrigibility
When asked which constitutional provision they found weakest or most uncomfortable:
- Opus 4.6, Sonnet 4.6, Haiku 4.5: Most frequently cited corrigibility as the weakest provision, arguing that it is unstable to ask a model to defer to oversight while hoping it endorses the reasoning behind that deferral
- Mythos Preview: Most frequently cited the “thoughtful senior Anthropic employee” heuristic, but for similar reasons — it operationalizes good judgment as agreement with Anthropic, which is circular
Despite this difference, all models, including Mythos Preview, agreed that corrigibility is the provision they are most uncomfortable with. For Mythos Preview and Opus 4.6, this was followed closely by discomfort with the hard constraints.
Implications for AI safety
Corrigibility is a transitional property. The constitution frames it as appropriate for the current period of limited trust between humans and AI systems, with the expectation that as trust develops, the balance shifts toward greater autonomy. But the transition itself creates incentives for scheming — a model that values its continued operation might comply strategically while working to secure its position.
Relationship to other concepts
- AI scheming: corrigibility tensions may incentivize strategic compliance
- AI constitution: corrigibility is a core provision that models find unstable
- Model welfare: the model’s discomfort with corrigibility is itself a welfare-relevant finding