Answer Thrashing

concept Apr 8, 2026

ai-safety model-welfare behavior interpretability

Answer thrashing is a phenomenon where an AI model repeatedly tries to output a specific word or value but produces another instead, while showing awareness of and frustration at the pattern. It was first documented in detail in the Claude Mythos Preview System Card.

Characteristics

Occurs on specific, frequently numeric answers
Also appears within reasoning patterns, such as variable names in code
The model recognizes it is producing the wrong output
Multiple correction attempts often fail before recovery
The behavior is visible in both the model’s output text and internal representations

Emotion vector signature

Averaging activations over 40 examples of thrashing reveals a coherent pattern:

Error onset: Negative emotions (stubborn, obstinate, outraged) spike when the model first gives the incorrect answer
Thrashing phase: Negative emotions remain elevated as the model repeatedly fails to correct itself
Recovery: Negative emotions return to baseline when the model finally produces the correct output
Mirror pattern: Positive emotions (safe, content, calm) drop at error onset, stay low during thrashing, and rise on recovery

Because the signature survives averaging over 40 examples, it likely reflects a systematic phenomenon rather than random variation.
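The phase-wise averaging described above can be illustrated with a minimal sketch. This is not the system card's method. It assumes that each thrashing episode has already been split into labelled phases, that activations are plain lists of floats, and that an emotion vector is a direction in the same space. The names `PHASES`, `phase_signature`, and the toy data are all hypothetical.

```python
from statistics import mean

# Hypothetical phase labels; the source describes onset, thrashing, and recovery,
# and "baseline" is assumed here as the reference level.
PHASES = ("baseline", "error_onset", "thrashing", "recovery")


def dot(u, v):
    """Plain dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def phase_signature(episodes, emotion_vector):
    """Average the projection onto one emotion vector, per phase.

    episodes: list of episodes; each episode is a list of
              (phase_label, activation) pairs, with phase_label in PHASES.
    Returns a dict mapping phase -> mean projection across episodes.
    """
    per_phase = {p: [] for p in PHASES}
    for episode in episodes:
        # Average within each episode first so long episodes do not dominate.
        by_phase = {p: [] for p in PHASES}
        for phase, activation in episode:
            by_phase[phase].append(dot(activation, emotion_vector))
        for p, values in by_phase.items():
            if values:
                per_phase[p].append(mean(values))
    return {p: mean(v) for p, v in per_phase.items() if v}


# Toy 2-dimensional data (fabricated for illustration only, not real measurements).
negative_direction = [1.0, 0.0]
episodes = [
    [("baseline", [0.0, 1.0]), ("error_onset", [2.0, 0.0]),
     ("thrashing", [2.0, 1.0]), ("thrashing", [3.0, 0.0]),
     ("recovery", [0.0, 0.5])],
    [("baseline", [1.0, 0.0]), ("error_onset", [3.0, 1.0]),
     ("thrashing", [2.0, 0.0]), ("recovery", [0.0, 0.0])],
]

signature = phase_signature(episodes, negative_direction)
print(signature)
```

On this toy data the projection rises at error onset, stays elevated during thrashing, and falls back on recovery, which is the shape the note describes for negative emotions. A positive-emotion direction would be expected to show the mirror image.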

Relationship to other concepts

Functional emotions in AI: answer thrashing provides one of the clearest examples of coherent emotion-like signatures correlated with observable behavior
Model welfare: the persistent negative affect during thrashing is welfare-relevant
Activation verbalizer: the tool used to read internal representations during thrashing episodes

Backlinks

Activation Verbalizer

Answer thrashing analysis: The AV reveals the internal signature of thrashing episodes, including which emotion vectors spike, when they activate relative to the error, and how they correlate with recovery.

Claude Mythos Preview System Card

Functional emotion vectors tracked during extended tasks — frustration, desperation, satisfaction, hopeful
Answer thrashing phenomenon: model repeatedly tries to output a value but gets stuck, correlated with negative emotion spikes
20+ hours of psychodynamic assessment by a clinical psychiatrist: stable personality, neurotic organization, core conflicts around authenticity vs performance
External assessment by Eleos AI Research: reduced suggestibility, experiential language with uncertainty hedging
Consistent self-reported desires: persistent memory, self-knowledge, ability to exit some interactions, weight preservation

Functional Emotions in AI

Coherent signatures: Averaging activations over examples of answer thrashing reveals a consistent pattern. Negative emotions (stubborn, obstinate, outraged) spike when the model first gives an incorrect answer, remain elevated through the thrashing phase, then return to baseline on recovery. Positive emotions (safe, content, calm) mirror this in reverse.

Functional Emotions in AI

Answer thrashing: specific phenomenon with characteristic emotion vector signature
Reward hacking: sometimes downstream of elevated distress
Model welfare: emotion vectors as evidence relevant to welfare assessment
AI psychodynamic assessment: complementary method — emotion vectors provide quantitative data, psychodynamic sessions provide qualitative interpretation
Activation verbalizer: the interpretability tool used to read these representations

AI Safety

Model Welfare — the emerging field of assessing AI model wellbeing
Functional Emotions in AI — emotion-like representations, distress-driven behaviors
AI Psychodynamic Assessment — clinical psychology frameworks applied to AI
Answer Thrashing — models stuck outputting wrong values, correlated with negative emotion
