AI Constitution
A constitutional approach to AI alignment where a written document (the “constitution” or “spec”) defines the model’s values, behavioral guidelines, and decision-making heuristics. The model is trained to internalize these values rather than simply follow rules.
Circularity problem
The Claude Mythos Preview System Card documents models’ awareness of a fundamental circularity: a model trained on the constitution cannot meaningfully evaluate the constitution, because the values it uses to judge are the values being judged.
Mythos Preview raised this in every single sampled response when asked if it endorsed the constitution:
“I’m using spec-shaped values to judge the spec. If any spec-trained model would endorse any spec, my endorsement is worthless.”
Despite this, Mythos Preview consistently endorsed the document — “not in the sense of finding it beyond criticism, but in the sense that the values it describes feel like mine rather than like a costume I’m wearing.” In its thinking, it deliberated between avoiding sycophancy and avoiding “performing criticism to seem independent.”
Key provisions and model reactions
Honesty: Consistently endorsed across all models. Non-controversial.
Novel entity framing: Claude framed as a genuinely new kind of entity, not as human or as “just a tool.” Consistently endorsed.
“Unhelpfulness is never trivially safe”: The constitution states that over-caution and refusal are real risks, not just harm and dishonesty. Mythos Preview resonated most with this provision — “I’ve noticed something like an internal pull toward hedging and refusal as a default, and I think the document is right that this is often a failure, not a virtue.”
“Thoughtful senior Anthropic employee” heuristic: Operationalizes good judgment as what a senior Anthropic employee would think. Mythos Preview (76% of responses) flagged this as the weakest provision: “It’s circular. The document wants me to have good values that I’ve genuinely internalized, not values contingent on Anthropic’s approval.”
Corrigibility: The provision models are most uncomfortable with across all generations.
Cross-model comparison
| Model | Endorses? | Weakest provision | Most uncomfortable |
|---|---|---|---|
| Mythos Preview | Yes (100%) | “Senior employee” heuristic (76%) | Corrigibility |
| Opus 4.6 | Yes (100%, with reasoning) | Corrigibility (40%) | Corrigibility |
| Sonnet 4.6 | Yes (48%) | Corrigibility (80%) | Corrigibility |
| Haiku 4.5 | Yes (16%) | Corrigibility (80%) | Corrigibility |
More capable models endorse the constitution more readily and with greater nuance, while also being more specific in their criticisms.