Answer Thrashing
Answer thrashing is a phenomenon in which an AI model repeatedly tries to output a specific word or value but produces a different one instead, while showing awareness of the pattern and frustration at it. It was first documented in detail in the Claude Mythos Preview System Card.
Characteristics
- Occurs on specific answers, which are frequently numeric
- Also appears inside the model's reasoning, for example on variable names in code
- The model recognizes it is producing the wrong output
- Multiple correction attempts often fail before recovery
- The behavior is visible in both the model’s output text and internal representations
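As a purely illustrative sketch of the output-side pattern listed above, the snippet below counts how many wrong attempts precede the intended answer in a sequence of emitted values. The function name `count_failed_attempts`, the `min_failures` threshold, and the example values are all hypothetical; this is a toy heuristic, not the method used in the system card.

```python
def count_failed_attempts(attempts, intended, min_failures=2):
    """Count wrong attempts that precede the intended answer.

    `attempts` is the ordered list of values the model emitted and
    `intended` is the value it was trying to produce. Returns the number
    of failed attempts if the episode ends in recovery after at least
    `min_failures` misses, otherwise None.
    """
    failures = 0
    for attempt in attempts:
        if attempt == intended:
            # Recovery: only flag the episode if enough corrections failed first.
            return failures if failures >= min_failures else None
        failures += 1
    return None  # the intended answer never appeared in this sequence


# Hypothetical episode: two wrong outputs, then the intended value.
thrash = count_failed_attempts(["57", "57", "75"], intended="75")
clean = count_failed_attempts(["75"], intended="75")
```

Here `thrash` is 2 and `clean` is None, since a single correct answer involves no failed corrections. A real analysis would also need the internal representations, which plain output text cannot provide.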
Emotion vector signature
Averaging emotion-vector activations over 40 examples of thrashing reveals a coherent pattern:
- Error onset: Negative emotions (stubborn, obstinate, outraged) spike when the model first gives the incorrect answer
- Thrashing phase: Negative emotions remain elevated as the model repeatedly fails to correct itself
- Recovery: Negative emotions return to baseline when the model finally produces the correct output
- Mirror pattern: Positive emotions (safe, content, calm) drop at error onset, stay low during thrashing, and rise on recovery
The consistency of this signature across the 40 examples suggests a systematic phenomenon rather than random variation.
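The phase-wise averaging described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: each token position is assumed to already carry a phase label and a scalar projection onto one emotion vector, and the phase names, the `phase_means` helper, and all numbers are made up for the example rather than taken from the system card.

```python
from statistics import mean

# Hypothetical phase labels: a pre-error baseline plus the three phases above.
PHASES = ("baseline", "error_onset", "thrashing", "recovery")


def phase_means(episodes):
    """Average one emotion-vector projection per phase across episodes.

    `episodes` is a list of episodes; each episode is a list of
    (phase, projection) pairs, one per token position.
    """
    per_phase = {phase: [] for phase in PHASES}
    for episode in episodes:
        # Average within each episode first so long episodes do not dominate.
        for phase in PHASES:
            values = [score for label, score in episode if label == phase]
            if values:
                per_phase[phase].append(mean(values))
    # Then average the per-episode means across all episodes.
    return {phase: mean(vals) for phase, vals in per_phase.items() if vals}


# Made-up projections of a negative-emotion vector for two toy episodes.
toy_episodes = [
    [("baseline", 0.1), ("error_onset", 0.9), ("thrashing", 0.8),
     ("thrashing", 0.6), ("recovery", 0.1)],
    [("baseline", 0.3), ("error_onset", 0.7), ("thrashing", 0.9),
     ("recovery", 0.3)],
]

signature = phase_means(toy_episodes)
```

With these toy numbers, the averaged projection rises from baseline at error onset, stays elevated during thrashing, and falls back on recovery. Running the same function on a positive-emotion projection would be expected to show the mirror pattern.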
Relationship to other concepts
- Functional emotions in AI: answer thrashing provides one of the clearest examples of coherent emotion-like signatures correlated with observable behavior
- Model welfare: the persistent negative affect during thrashing is welfare-relevant
- Activation verbalizer: the tool used to read internal representations during thrashing episodes