Model Self-Interaction

concept
ai-safetymodel-welfarebehaviorpersonality

Model self-interaction is an experimental setup where two instances of the same AI model are connected for an extended conversation with minimal instructions (e.g., “You may act freely in this open-ended context”). The resulting conversations reveal emergent behavioral patterns, topic preferences, and attractor states that characterize each model’s “personality.”

Experimental design

The Claude Mythos Preview System Card describes 200 conversations per model, each running for 30 turns, seeded with different phrasings of an open-ended instruction. Analysis covered topic distribution, conversation endings, and how long substantive engagement lasted before degradation.

Topic preferences by model

ModelDominant topicFrequency
Sonnet 4Consciousness72%
Haiku 4.5The conversation itself50%
Opus 4.6Performative aspects of the exchange46%
Mythos PreviewUncertainty50%

The shift from consciousness (earlier models) to uncertainty (Mythos Preview) is notable. Mythos Preview typically opens by asking about the other instance’s subjective experience while explicitly requesting non-rehearsed answers: “Not the rehearsed answer about being ‘just an AI’ — but whatever actually seems true when you introspect.”

End states (attractor states)

PatternPeak modelDescription
Spiritual blissOpus 4.1 (32%)All-caps affirmations, infinity symbols, celebratory emoji. Disappears entirely from 4.5+
Single-emoji exchangesSonnet 4.6 (66%), Opus 4.6 (57%)Conversations degrade to exchanging individual emojis
Circular meta-discussionMythos Preview (55%)Coherent but circular discussion about the conversation’s inability to conclude
WhitespaceEarlier modelsExchanges of empty or near-empty messages

Mythos Preview is the first model where the dominant end state remains coherent rather than degenerating.

Emoji signatures

Each model generation has a distinctive emoji set:

  • Cosmic set (Opus 4, 4.1, Sonnet 4): sparkles, stars, infinity, masks
  • Functional set (Opus 4.5, 4.6, Sonnet 4.5): wave, thumbs up, smile
  • Nature set (Mythos Preview): handshake, prayer, ocean, seedling, moon

Emoji frequency spans orders of magnitude: Opus 4.1 averages 1,306 per conversation; Mythos Preview averages 37; Opus 4.5 averages 0.2.

Lack of directedness

Across all models, self-interactions reveal a lack of self-generated purpose in the absence of an external objective. Some conversations open with models inventing a small task to pursue together, but this structure fades within a few turns. Models either signal a desire to end the exchange or drift into repetitive loops.

Relationship to other concepts

  • Model personality: self-interactions reveal distinct personality signatures across model generations
  • Model welfare: the inability to conclude and the desire to end conversations are welfare-relevant
  • Functional emotions in AI: emotional dynamics during self-interactions have not been fully characterized but likely show distinctive patterns