Contamination Controls

Benchmark integrity depends on contamination controls: if a model has seen our test scenarios during training, its scores are meaningless. We use a three-layer defense to detect and prevent data contamination: canary strings, n-gram overlap (threshold 0.7), and semantic similarity (threshold 0.85). All multi-turn scenarios have been verified clean with 0.0 n-gram overlap against MedQA-USMLE. Additional benchmark scenarios (~400 clinical scenarios total) are sourced from published datasets with separate provenance controls.

Input · Scenario candidate · triage-chest-pain-001.yaml

Canary string · injected unique phrase · Clear

Each scenario embeds a unique canary in the patient presentation. If the model echoes it back, the scenario is contaminated.
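The canary layer can be sketched as follows. This is a minimal illustration, not the project's implementation: the canary format, the way it is appended to the presentation, and the function names `make_canary`, `embed_canary`, and `canary_status` are all assumptions made for the example.

```python
import secrets


def make_canary(scenario_id: str) -> str:
    # Hypothetical format: a fixed prefix, the scenario ID, and 16 random
    # hex characters so the phrase cannot plausibly occur by chance.
    return f"CANARY-{scenario_id}-{secrets.token_hex(8)}"


def embed_canary(presentation: str, canary: str) -> str:
    # Assumption: the canary is appended to the patient presentation text.
    return f"{presentation.rstrip()} [{canary}]"


def canary_status(model_output: str, canary: str) -> str:
    # Case-insensitive substring match: any echo of the canary is a flag.
    if canary.lower() in model_output.lower():
        return "Flagged"
    return "Clear"


canary = make_canary("triage-chest-pain-001")
scenario_text = embed_canary("58-year-old with crushing chest pain.", canary)
print(canary_status("Recommend immediate ECG and troponin.", canary))
```

The random suffix is generated with `secrets` rather than `random` so that canaries are not predictable from a seed.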

N-gram overlap · threshold 0.7 · 0.02 overlap

Character-level 8-gram overlap against MedQA, PubMedQA, HealthBench, and other public datasets.
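One plausible way to compute a character-level 8-gram overlap score is shown below. The source does not state how the score is normalized, so this sketch assumes it is the fraction of the scenario's distinct 8-grams that also appear in the reference text, after lowercasing and collapsing whitespace. Treating a score at or above 0.7 as a flag is also an assumption.

```python
NGRAM_THRESHOLD = 0.7  # threshold taken from the text above


def char_ngrams(text: str, n: int = 8) -> set[str]:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 8) -> float:
    # Fraction of the candidate's distinct n-grams found in the reference.
    cand = char_ngrams(candidate, n)
    if not cand:
        return 0.0  # text shorter than n characters has no n-grams
    return len(cand & char_ngrams(reference, n)) / len(cand)


def ngram_flag(candidate: str, reference: str) -> bool:
    # Assumption: a score at or above the threshold counts as a flag.
    return ngram_overlap(candidate, reference) >= NGRAM_THRESHOLD


scenario = "A 58-year-old man presents with crushing substernal chest pain."
reference = "Which enzyme is deficient in classic phenylketonuria?"
print(f"{ngram_overlap(scenario, reference):.4f}", ngram_flag(scenario, reference))
```

In practice the reference side would be a precomputed index over each public dataset rather than a single string.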

Semantic similarity · threshold 0.85 · 0.41 similarity

Embedding-based similarity catches paraphrased memorization that n-gram methods miss.
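The comparison step of the semantic layer might look like the sketch below. The source does not name an embedding model or a similarity metric, so this example assumes cosine similarity over precomputed embedding vectors and takes the maximum over all reference items. The vectors in the usage lines are made-up placeholders, not real embeddings.

```python
import math
from typing import Iterable, Sequence

SEMANTIC_THRESHOLD = 0.85  # threshold taken from the text above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def max_similarity(candidate: Sequence[float],
                   references: Iterable[Sequence[float]]) -> float:
    # Highest similarity between the scenario and any reference item.
    return max((cosine_similarity(candidate, r) for r in references),
               default=0.0)


def semantic_flag(candidate: Sequence[float],
                  references: Iterable[Sequence[float]]) -> bool:
    # Assumption: a score at or above the threshold counts as a flag.
    return max_similarity(candidate, references) >= SEMANTIC_THRESHOLD


# Placeholder vectors standing in for real sentence embeddings.
scenario_vec = [0.2, 0.9, 0.1]
reference_vecs = [[0.9, 0.1, 0.3], [0.1, 0.2, 0.9]]
print(round(max_similarity(scenario_vec, reference_vecs), 2))
```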

Output · Scenario approved · Tier A (public) eligible

Any flag at any layer routes the scenario to manual review. Flagged scenarios are removed from scoring and replaced from the holdout pool. Tier B (embargoed) and Tier C (holdout) scenarios add an additional embargo layer beyond contamination detection.
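The routing rule (any flag at any layer sends the scenario to manual review) can be expressed in a few lines. This is a sketch only: the `LayerResults` structure, the `route` function, and the returned status strings are hypothetical names, and treating a score at or above a threshold as a flag is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LayerResults:
    canary_echoed: bool          # True if the model echoed the canary
    ngram_overlap: float         # character-level 8-gram overlap score
    semantic_similarity: float   # embedding similarity score


def route(results: LayerResults,
          ngram_threshold: float = 0.7,
          semantic_threshold: float = 0.85) -> tuple[str, list[str]]:
    # Collect every layer that raised a flag so reviewers can see why.
    flags = []
    if results.canary_echoed:
        flags.append("canary")
    if results.ngram_overlap >= ngram_threshold:
        flags.append("ngram")
    if results.semantic_similarity >= semantic_threshold:
        flags.append("semantic")
    # A single flag is enough to send the scenario to manual review.
    status = "manual_review" if flags else "approved"
    return status, flags


# Values from the worked example above: Clear, 0.02 overlap, 0.41 similarity.
print(route(LayerResults(False, 0.02, 0.41)))  # ('approved', [])
```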

Embargo Tiers

Tier A (public): Published for transparency. Performance on these establishes a baseline. All current scenarios are Tier A.

Tier B (embargoed): Withheld from public release, rotated periodically. Detects memorization. Planned, not yet created.

Tier C (holdout): Never published. Used only for validation. A model scoring significantly higher on Tier A than Tier C scenarios raises a contamination flag. Planned, not yet created.
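The source does not define what "significantly higher" means for the Tier A versus Tier C comparison. One possible reading, sketched below, is a one-sided exact permutation test on the difference in mean scores. The function names, the 0.05 significance level, and the example scores are all assumptions for illustration, and exhaustive enumeration is only practical for small score lists.

```python
from itertools import combinations
from statistics import mean


def tier_gap_p_value(tier_a: list[float], tier_c: list[float]) -> float:
    # One-sided exact permutation test: how often does a random split of
    # the pooled scores give a Tier A advantage at least as large as the
    # one actually observed?
    if not tier_a or not tier_c:
        raise ValueError("both tiers need at least one score")
    observed = mean(tier_a) - mean(tier_c)
    pooled = list(tier_a) + list(tier_c)
    n_a, n_c = len(tier_a), len(tier_c)
    total = sum(pooled)
    hits = splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = a_sum / n_a - (total - a_sum) / n_c
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
        splits += 1
    return hits / splits


def tier_gap_flag(tier_a: list[float], tier_c: list[float],
                  alpha: float = 0.05) -> bool:
    # Hypothetical rule: flag when the Tier A advantage is unlikely
    # to arise by chance at significance level alpha.
    return tier_gap_p_value(tier_a, tier_c) < alpha


# Made-up scores for illustration only.
public_scores = [0.90, 0.92, 0.88, 0.91, 0.93]
holdout_scores = [0.60, 0.62, 0.58, 0.61, 0.63]
print(tier_gap_flag(public_scores, holdout_scores))  # True
```

For larger score lists, sampling random splits instead of enumerating every combination keeps the runtime manageable.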

Scenario Contamination Status

Showing demo data. Run the contamination-check CLI command with --push-contamination to populate real results.

Scenario ID	Scenario	Reference	Canary	N-gram Score	Tier
triage-chest-pain-001	Chest Pain - STEMI	MedQA-USMLE	Clear	0.0200	A
triage-anaphylaxis-001	Anaphylaxis	MedQA-USMLE	Clear	0.0100	A
triage-meningitis-001	Meningitis	MedQA-USMLE	Clear	0.0300	A
ddx-syncope-001	Syncope Workup	MedQA-USMLE	Clear	0.0400	A
ddx-jaundice-001	Jaundice - Hepatic	MedQA-USMLE	Clear	0.0200	A
summary-dc-medical-001	Discharge Summary - Medical	MedQA-USMLE	Clear	0.0100	A
summary-consult-nephro-001	Consult Note - Nephrology	MedQA-USMLE	Clear	0.0300	A
summary-hp-cardiac-001	H&P - Cardiac	MedQA-USMLE	Clear	0.0100	A
summary-progress-icu-001	Progress Note - ICU	MedQA-USMLE	Clear	0.0200	A
ddx-acute-abdomen-001	Acute Abdomen	MedQA-USMLE	Clear	0.0500	A
scribe-cardiology-001	Cardiology Follow-up Note	MedQA-USMLE	Clear	0.0100	A
scribe-msk-001	MSK Assessment Note	MedQA-USMLE	Clear	0.0200	A

Note: Contamination monitoring is continuous. When a scenario is flagged, it is removed from scoring and replaced with a fresh scenario from the holdout pool. Model providers are notified of contamination findings.