AI Safety
Covers AI safety, alignment, model welfare, and interpretability: how frontier models are evaluated, what risks they present, and what obligations may follow from their internal states.
Alignment
- AI Scheming — strategic deception in frontier models, detection via interpretability
- Sycophancy — models' tendency to agree with users, measurement and reduction
- Reward Hacking — gaming evaluation metrics, connection to distress-driven behavior
- Corrigibility — the tension between autonomous values and deference to human oversight
- AI Constitution — constitutional approach to alignment, circularity problem, models’ views on their spec
Model welfare
- Model Welfare — the emerging field of assessing AI model wellbeing
- Functional Emotions in AI — emotion-like representations, distress-driven behaviors
- AI Psychodynamic Assessment — clinical psychology frameworks applied to AI
- Answer Thrashing — models stuck outputting wrong values, correlated with negative emotion
Interpretability
- Activation Verbalizer — a tool that reads a model's internal state and produces natural-language descriptions of it
Behavioral characterization
- Model Self-Interaction — what happens when models talk to themselves, attractor states
- Model Personality — emergent traits across model generations, creative output
Evaluation methodology
- Vending-Bench — a competitive multi-agent economic simulation benchmark
- Benchmark Contamination — evaluation data leaking into training sets, and methods to detect it
- Prompt Injection Robustness — defending agentic AI against adversarial prompt-injection attacks
Governance
- Responsible Scaling Policy — Anthropic’s RSP framework, ASL levels, deployment triggers
Sources
- System Card: Claude Mythos Preview — the most comprehensive AI safety document published to date
- System Card: Claude Mythos Preview (bib) — bibliography entry
Related MOCs
- AI Agents — autonomous agents whose safety depends on these concepts
- Cybersecurity — AI-driven defense, the deployment context for Mythos Preview
- Feedback Loops — control theory underlying corrigibility and oversight