AI Safety
Covers AI safety, alignment, model welfare, and interpretability: how frontier models are evaluated, what risks they present, and what obligations may follow from their internal states.
Alignment
- AI Scheming — strategic deception in frontier models, detection via interpretability
- Sycophancy — models' tendency to agree with users, measurement and reduction
- Reward Hacking — gaming evaluation metrics, connection to distress-driven behavior
- Corrigibility — the tension between autonomous values and deference to human oversight
- AI Constitution — constitutional approach to alignment, circularity problem, models’ views on their spec
Model welfare
- Model Welfare — the emerging field of assessing AI model wellbeing
- Functional Emotions in AI — emotion-like representations, distress-driven behaviors
- AI Psychodynamic Assessment — clinical psychology frameworks applied to AI
- Answer Thrashing — models stuck outputting wrong values, correlated with negative emotion
Interpretability
- Activation Verbalizer — a tool that reads a model's internal state and produces natural-language descriptions of it
Behavioral characterization
- Model Self-Interaction — what happens when models talk to themselves, attractor states
- Model Personality — emergent traits across model generations, creative output
Evaluation methodology
- Vending-Bench — a competitive multi-agent economic simulation benchmark
- Benchmark Contamination — evaluation data leaking into training sets, and methods to detect it
- Prompt Injection Robustness — defending agentic AI against adversarial prompt-injection attacks
Governance
- Responsible Scaling Policy — Anthropic’s RSP framework, ASL levels, deployment triggers
Sources
- System Card: Claude Mythos Preview — the most comprehensive AI safety document published to date
- System Card: Claude Mythos Preview (bib) — bibliography entry
Related MOCs
- AI Agents — autonomous agents whose safety depends on these concepts
- Cybersecurity — AI-driven defense, the deployment context for Mythos Preview
- Feedback Loops — control theory underlying corrigibility and oversight