Benchmark Contamination
Benchmark contamination occurs when answers to evaluation questions appear in a model’s training data, inflating scores beyond the model’s true capability. As models train on ever-larger web corpora, avoiding contamination of public benchmarks becomes increasingly difficult.
Detection methods
The Claude Mythos Preview System Card describes several approaches:
Claude-based auditor: A model compares each generated patch against the gold patch and assigns a memorization probability (0-1). It weighs concrete signals such as verbatim code reproduction and distinctive comment text matching the ground truth, and is instructed to discount overlap that any competent solver would naturally produce. A rule-based check for verbatim comment overlap complements the model's judgment.
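The rule-based side of this check can be sketched as follows. This is a minimal illustration, not the System Card's actual implementation: the function names, the 20-character "distinctiveness" cutoff, and the restriction to `#`/`//` comments are all assumptions made for the example.

```python
import re

def comment_lines(patch: str) -> set[str]:
    """Extract normalized comment text from the added lines of a unified diff.

    Only '+' lines are considered, and only '#' / '//' style comments.
    Short comments are dropped as too generic to signal memorization
    (the 20-character cutoff is an arbitrary choice for this sketch).
    """
    comments = set()
    for line in patch.splitlines():
        if not line.startswith("+"):
            continue
        m = re.search(r"(#|//)\s*(.+)", line[1:].strip())
        if m and len(m.group(2)) >= 20:
            comments.add(m.group(2).strip())
    return comments

def verbatim_comment_overlap(candidate_patch: str, gold_patch: str) -> bool:
    """Flag a patch if any distinctive gold-patch comment appears verbatim."""
    return bool(comment_lines(candidate_patch) & comment_lines(gold_patch))
```

A distinctive comment copied character-for-character from the reference solution is hard to explain without exposure to it, which is why verbatim comment overlap is a useful complement to the model-based auditor.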
Filter threshold sweeps: Rather than committing to a single contamination cutoff, the auditor’s decision threshold is swept across its full range. If a model’s advantage persists across all thresholds, memorization is not the primary explanation for its performance.
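The sweep logic amounts to recomputing the model-vs-baseline gap at every strictness level. A minimal sketch, assuming a hypothetical per-task record shape (`mem_prob`, `model_correct`, `baseline_correct`) rather than any real evaluation schema:

```python
def sweep_contamination_thresholds(results, thresholds=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Recompute the model-vs-baseline accuracy gap while excluding tasks
    the auditor flagged as likely memorized.

    A task is retained at threshold t if its auditor-assigned memorization
    probability is <= t; t=1.0 keeps everything, t=0.0 keeps only tasks
    judged certainly clean. Returns {threshold: gap or None}.
    """
    gaps = {}
    for t in thresholds:
        kept = [r for r in results if r["mem_prob"] <= t]
        if not kept:
            gaps[t] = None  # nothing survives this strictness level
            continue
        model_acc = sum(r["model_correct"] for r in kept) / len(kept)
        base_acc = sum(r["baseline_correct"] for r in kept) / len(kept)
        gaps[t] = model_acc - base_acc
    return gaps
```

If the gap stays positive even at the strictest thresholds, where every plausibly memorized task has been discarded, memorization cannot be the primary explanation for the advantage.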
Held-out variants: For benchmarks where contamination is confirmed, researchers manually create variants — perturbations that change the correct answer while preserving difficulty (e.g., asking for the second-lowest instead of second-highest series). Comparing original vs. variant accuracy estimates the impact of memorization.
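The comparison itself is simple: the inflation attributable to memorization is estimated as the drop from original to variant accuracy. A sketch, assuming boolean per-task correctness lists as input:

```python
def contamination_effect(orig_results, variant_results):
    """Estimate score inflation from memorization as the accuracy drop
    between a benchmark and its answer-changing held-out variant.

    Each argument is a list of booleans (one per task). A value near zero,
    or negative (variant scored higher), suggests contamination is not
    driving performance on the original.
    """
    acc = lambda xs: sum(xs) / len(xs)
    return acc(orig_results) - acc(variant_results)
```

Because the variant preserves difficulty but changes the correct answer, a model relying on recall should lose exactly the memorized portion of its score on the variant.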
Image-level filtering: For multimodal benchmarks, perceptual hashing detects matching images between training data and evaluation sets.
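The idea behind perceptual hashing can be shown with an average hash, the simplest variant. This is an illustrative sketch only: real pipelines first resize and grayscale the image, and typically use more robust hashes (pHash, dHash); here the input is assumed to be an already-prepared 8x8 grayscale pixel grid, and the 5-bit distance cutoff is an assumption.

```python
def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale image.

    `pixels` is 8 rows of 8 ints (0-255). Each bit records whether the
    corresponding pixel is brighter than the image mean, so near-identical
    images produce near-identical hashes.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def likely_duplicate(a: int, b: int, max_distance: int = 5) -> bool:
    """Treat images whose hashes differ by only a few bits as matches."""
    return hamming(a, b) <= max_distance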
Case studies from the System Card
SWE-bench: Signs of memorization were found across all three variants (Verified, Multilingual, Pro); in one case, a model reproduced the reference solution's exact helper functions. However, filter threshold sweeps show Mythos Preview maintains its lead over baselines at all strictness levels, so memorization does not explain the improvement.
CharXiv Reasoning: A majority of question-answer text pairs were confirmed present in the training corpus. However, Mythos Preview scores higher on the held-out variants than on the originals, suggesting the contribution of contamination is limited.
MMMU-Pro: Large fraction of images found in training data. Unable to construct held-out variants of equivalent difficulty. Anthropic omitted MMMU-Pro results entirely from the System Card.
BrowseComp: Some answers have leaked online. A no-tools, no-thinking baseline scored 24.0%, but many of those transcripts showed genuine deductive reasoning, so 24.0% overstates memorization. Accuracy on short transcripts (<5k tokens), where extended reasoning is implausible, gives a tighter upper bound of 15.1%.
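The short-transcript bound can be computed with a simple filter. A sketch, assuming a hypothetical `(token_count, correct)` pair per transcript:

```python
def memorization_upper_bound(transcripts, max_tokens=5000):
    """Accuracy restricted to transcripts too short for extended reasoning.

    An answer produced without a long deduction chain is plausibly recalled,
    so accuracy on this subset upper-bounds the memorization contribution.
    `transcripts` is a list of (token_count, correct) pairs; returns None
    if no transcript is short enough.
    """
    short = [correct for tokens, correct in transcripts if tokens < max_tokens]
    return sum(short) / len(short) if short else None
```

The restriction matters because a correct answer reached through a long deductive transcript is evidence of capability, not contamination; only quick correct answers are ambiguous.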
Significance
Contamination is a systemic problem in AI evaluation. Public benchmarks become progressively less informative as their content spreads through training corpora. The System Card’s approach — threshold sweeps, held-out variants, and willingness to omit results (MMMU-Pro) — represents current best practice but acknowledges no method is definitive.
Relationship to other concepts
- Responsible Scaling Policy: contaminated benchmarks could trigger incorrect ASL assessments
- Reward hacking: contamination is a form of accidental reward hacking at the evaluation pipeline level