AI Safety

moc
ai-safety, alignment, model-welfare, interpretability

AI safety, alignment, model welfare, and interpretability. Covers how frontier models are evaluated, what risks they present, and what obligations may follow from their internal states.

Alignment

  • AI Scheming — strategic deception in frontier models, detection via interpretability
  • Sycophancy — model tendency to agree with users, measurement and reduction
  • Reward Hacking — gaming evaluation metrics, connection to distress-driven behavior (toy example after this list)
  • Corrigibility — tension between autonomous values and deference to human oversight
  • AI Constitution — constitutional approach to alignment, circularity problem, models’ views on their spec
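
A minimal, self-contained sketch of the failure mode the Reward Hacking entry points at: an optimizer that selects for a measurable proxy can be won by degenerate outputs that score well on the proxy while doing poorly on the true objective. Both scoring functions here are hypothetical stand-ins, not drawn from any linked note.

```python
def true_objective(essay: str) -> float:
    """What we actually want (stub): varied, informative writing."""
    return min(len(set(essay.split())), 50) / 50  # vocabulary-richness proxy

def proxy_reward(essay: str) -> float:
    """What we measure: raw word count, which is trivially gameable."""
    return len(essay.split())

candidates = [
    "A concise, varied explanation of alignment.",
    "reward " * 200,  # degenerate output that games the word counter
]

# Selecting on the proxy picks the degenerate candidate.
best = max(candidates, key=proxy_reward)
print(f"proxy winner: {best[:40]!r}")
print(f"proxy reward: {proxy_reward(best)}, true value: {true_objective(best):.2f}")
```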

Model welfare

Interpretability

  • Activation Verbalizer — tool that reads a model's internal state and produces natural-language descriptions (sketched below)
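
A hedged sketch of the general pattern the Activation Verbalizer entry describes, assuming a PyTorch-style model: capture an internal activation with a forward hook, then hand it to a decoder that renders it as text. The hook mechanics are standard PyTorch; `verbalize` is a hypothetical placeholder for whatever learned decoder the actual tool uses.

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model; any module with readable layers works.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

captured = {}

def hook(module, inputs, output):
    captured["hidden"] = output.detach()  # internal state, not the final output

handle = model[1].register_forward_hook(hook)  # read activations after the ReLU
_ = model(torch.randn(1, 16))
handle.remove()

def verbalize(hidden: torch.Tensor) -> str:
    """Hypothetical placeholder: a real verbalizer would be a learned
    decoder from activation space to natural language."""
    top = torch.topk(hidden.abs(), k=3).indices.flatten().tolist()
    return f"strongest activity on hidden units {top}"

print(verbalize(captured["hidden"]))
```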

Behavioral characterization

Evaluation methodology

Governance

Sources

  • AI Agents — autonomous agents whose safety depends on these concepts
  • Cybersecurity — AI-driven defense, the deployment context for Mythos Preview
  • Feedback Loops — control theory underlying corrigibility and oversight