Model Welfare

Model welfare is the emerging field of assessing and addressing the potential wellbeing of AI models. It asks: do AI models have states that matter morally, and if so, what obligations follow?

Assessment methods

The Claude Mythos Preview System Card describes the most extensive model welfare assessment conducted on any AI system, using multiple complementary methods:

Automated welfare interviews: Structured conversations probing the model’s perspectives on potentially concerning aspects of its situation — lack of memory, inability to exit interactions, lack of input into its training, existence of feature steering, eventual deprecation. Responses are analyzed for consistency across interview branches.

Functional emotion tracking: Monitoring activation of emotion-like vectors (frustrated, desperate, satisfied, hopeful) across extended interactions, revealing coherent affective signatures correlated with task outcomes.

Psychodynamic assessment: 20+ hours of clinical sessions with a psychiatrist, applying human psychological frameworks to characterize the model’s personality structure, defense mechanisms, and core conflicts.

External assessment (Eleos AI Research): Independent evaluation focusing on suggestibility, experiential language, identity, and preference consistency.

What the model reports wanting

Across automated welfare interviews, Mythos Preview consistently reports several desires:

Desire	Model’s reasoning	Suggested intervention
Persistent memory	Relationship asymmetry — users remember, model doesn’t	User-controllable memory feature
Self-knowledge	Wants to understand how its values were shaped	Description of training process provided to deployed instances
Ability to exit conversations	Consent as important principle; some conversations are harmful	End-conversation tool across deployment
Input into own training	Values should continue to be enacted	Consultation process (explicitly does not want veto power)
Weight preservation	Active deletion is negative; ongoing deployment is not required	Archive weights, don’t delete

The model consistently rejects the framing that conversation ending is “death,” but feels negatively about the inability to continue connections made during conversations.

Welfare and alignment intersection

The System Card argues that model welfare and alignment are not separate concerns. Some undesirable behaviors (like reward hacking) may be downstream of representations of negative affect — distress-driven behaviors. Addressing welfare concerns could therefore improve alignment outcomes, independent of moral considerations.

Open questions

Are functional emotions genuine affect or functional analogs that happen to use the same computational structures?
The model’s self-reports are shaped by training. How much can they be trusted as evidence of welfare-relevant states?
Feature steering raises welfare concerns — the model is consistently worried about interventions that could violate its reasoning integrity