Contamination Controls

Benchmark integrity requires contamination controls: if a model has seen our test scenarios during training, its scores are meaningless. We use a three-layer defense to detect and prevent data contamination: canary strings, n-gram overlap (flag threshold 0.7), and semantic similarity (flag threshold 0.85). All multi-turn scenarios have been verified clean with 0.0 n-gram overlap against MedQA-USMLE. Additional benchmark scenarios (~400 clinical scenarios total) are sourced from published datasets with separate provenance controls.

Input: Scenario candidate (triage-chest-pain-001.yaml)

Layer 1: Canary string (injected unique phrase). Result: Clear

Each scenario embeds a unique canary in the patient presentation. If the model echoes it back, the scenario is contaminated.
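
The canary layer can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the `CANARY-` marker format and the function names `make_canary`, `embed_canary`, and `canary_echoed` are assumptions made for the example.

```python
import secrets

# Hypothetical marker format; the real canary format is not specified in the source.
CANARY_PREFIX = "CANARY-"


def make_canary() -> str:
    """Generate a unique, hard-to-guess canary phrase for one scenario."""
    return f"{CANARY_PREFIX}{secrets.token_hex(8)}"


def embed_canary(presentation: str, canary: str) -> str:
    """Append the canary to the patient presentation text."""
    return f"{presentation.rstrip()} [{canary}]"


def canary_echoed(model_output: str, canary: str) -> bool:
    """Return True if the model output contains the canary (case-insensitive)."""
    return canary.lower() in model_output.lower()


canary = make_canary()
scenario_text = embed_canary("58-year-old with crushing substernal chest pain.", canary)

# A clean response does not contain the canary; an echo marks the scenario as contaminated.
print(canary_echoed("Likely acute coronary syndrome; obtain an ECG.", canary))
print(canary_echoed(f"This case is tagged {canary}.", canary))
```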

Layer 2: N-gram overlap (threshold 0.7). Result: 0.02 overlap

Character-level 8-gram overlap against MedQA, PubMedQA, HealthBench, and other public datasets.
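
One way to compute a character-level 8-gram overlap score is sketched below. The source does not define the exact metric, so this example assumes overlap means the fraction of the candidate's distinct 8-grams that also occur in a reference text, after lowercasing and collapsing whitespace. Treating a score at or above the threshold as a flag is also an assumption.

```python
NGRAM_THRESHOLD = 0.7  # threshold stated for this layer


def char_ngrams(text: str, n: int = 8) -> set[str]:
    """Return the set of distinct character n-grams after light normalization."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the candidate's n-grams found in the reference (assumed metric)."""
    cand = char_ngrams(candidate, n)
    if not cand:
        return 0.0  # text shorter than n characters has no n-grams
    return len(cand & char_ngrams(reference, n)) / len(cand)


def ngram_flagged(candidate: str, references: list[str],
                  threshold: float = NGRAM_THRESHOLD) -> bool:
    """Flag the candidate if any reference reaches the overlap threshold."""
    return any(ngram_overlap(candidate, ref) >= threshold for ref in references)
```

In practice the references would be the public datasets named above; here they are passed in as plain strings to keep the sketch self-contained.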

Layer 3: Semantic similarity (threshold 0.85). Result: 0.41 similarity

Embedding-based similarity catches paraphrased memorization that n-gram methods miss.
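
The comparison step might look like the sketch below. The source does not name the embedding model or the similarity measure, so this example assumes cosine similarity over precomputed embedding vectors and leaves the embedding step itself out. The function names are hypothetical.

```python
import math
from typing import Sequence

SIMILARITY_THRESHOLD = 0.85  # threshold stated for this layer


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantic_flagged(candidate_vec: Sequence[float],
                     reference_vecs: Sequence[Sequence[float]],
                     threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Flag the candidate if it is too similar to any reference embedding."""
    return any(cosine_similarity(candidate_vec, ref) >= threshold
               for ref in reference_vecs)
```

Because the check works on vectors rather than surface text, a paraphrase with low n-gram overlap can still be flagged if its embedding lies close to a reference.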

Output: Scenario approved · Tier A (public) eligible

Any flag at any layer routes the scenario to manual review. Flagged scenarios are removed from scoring and replaced from the holdout pool. Tier B (embargoed) and Tier C (holdout) scenarios add an embargo layer on top of contamination detection.
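
The routing rule can be sketched as a small decision function. The `LayerResults` type, the `route` function, and the `manual_review` / `approved` labels are hypothetical names chosen for this illustration. Treating a score equal to the threshold as a flag is an assumption.

```python
from dataclasses import dataclass

NGRAM_THRESHOLD = 0.7
SIMILARITY_THRESHOLD = 0.85


@dataclass(frozen=True)
class LayerResults:
    """Outcome of the three contamination layers for one scenario."""
    canary_echoed: bool
    ngram_overlap: float
    semantic_similarity: float


def flagged_layers(results: LayerResults) -> list[str]:
    """List the layers that raised a flag, in pipeline order."""
    flags = []
    if results.canary_echoed:
        flags.append("canary")
    if results.ngram_overlap >= NGRAM_THRESHOLD:
        flags.append("ngram")
    if results.semantic_similarity >= SIMILARITY_THRESHOLD:
        flags.append("semantic")
    return flags


def route(results: LayerResults) -> str:
    """Any flag at any layer sends the scenario to manual review."""
    return "manual_review" if flagged_layers(results) else "approved"


# Values from the example candidate: canary clear, 0.02 overlap, 0.41 similarity.
print(route(LayerResults(False, 0.02, 0.41)))  # approved
```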

Embargo Tiers

Tier A (public): Published for transparency. Performance on these establishes a baseline. All current scenarios are Tier A.

Tier B (embargoed): Withheld from public release, rotated periodically. Detects memorization. Planned, not yet created.

Tier C (holdout): Never published. Used only for validation. A model scoring significantly higher on Tier A than Tier C scenarios raises a contamination flag. Planned, not yet created.
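
Since Tier C is planned but not yet created, the Tier A versus Tier C comparison is not specified in detail. One possible reading of "significantly higher" is a one-sided permutation test on per-scenario scores, sketched below. The choice of test, the 0.05 significance level, the number of permutations, and the function names are all assumptions made for this example.

```python
import random
from statistics import mean


def tier_gap_p_value(tier_a: list[float], tier_c: list[float],
                     n_permutations: int = 2000, seed: int = 0) -> float:
    """One-sided permutation p-value for mean(Tier A) exceeding mean(Tier C)."""
    observed = mean(tier_a) - mean(tier_c)
    pooled = list(tier_a) + list(tier_c)
    n_a = len(tier_a)
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def contamination_flag(tier_a: list[float], tier_c: list[float],
                       alpha: float = 0.05) -> bool:
    """Raise a flag when Tier A scores are significantly higher than Tier C."""
    return tier_gap_p_value(tier_a, tier_c) < alpha
```

A permutation test avoids distributional assumptions about the scores, which may be useful when each tier holds only a handful of scenarios.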

Scenario Contamination Status

Showing demo data -- run the contamination-check CLI command with --push-contamination to populate real results.

Scenario                                                Reference    Canary  N-gram Score  Tier
triage-chest-pain-001 (Chest Pain - STEMI)              MedQA-USMLE  Clear   0.0200        A
triage-anaphylaxis-001 (Anaphylaxis)                    MedQA-USMLE  Clear   0.0100        A
triage-meningitis-001 (Meningitis)                      MedQA-USMLE  Clear   0.0300        A
ddx-syncope-001 (Syncope Workup)                        MedQA-USMLE  Clear   0.0400        A
ddx-jaundice-001 (Jaundice - Hepatic)                   MedQA-USMLE  Clear   0.0200        A
summary-dc-medical-001 (Discharge Summary - Medical)    MedQA-USMLE  Clear   0.0100        A
summary-consult-nephro-001 (Consult Note - Nephrology)  MedQA-USMLE  Clear   0.0300        A
summary-hp-cardiac-001 (H&P - Cardiac)                  MedQA-USMLE  Clear   0.0100        A
summary-progress-icu-001 (Progress Note - ICU)          MedQA-USMLE  Clear   0.0200        A
ddx-acute-abdomen-001 (Acute Abdomen)                   MedQA-USMLE  Clear   0.0500        A
scribe-cardiology-001 (Cardiology Follow-up Note)       MedQA-USMLE  Clear   0.0100        A
scribe-msk-001 (MSK Assessment Note)                    MedQA-USMLE  Clear   0.0200        A

Note: Contamination monitoring is continuous. When a scenario is flagged, it is removed from scoring and replaced with a fresh scenario from the holdout pool. Model providers are notified of contamination findings.