Contamination Controls
Benchmark integrity depends on contamination controls: if a model has seen our test scenarios during training, its scores are meaningless. We use a three-layer defense to detect and prevent data contamination: canary strings, n-gram overlap (threshold 0.7), and semantic similarity (threshold 0.85). All multi-turn scenarios have been verified clean with 0.0 n-gram overlap against MedQA-USMLE. Additional benchmark scenarios (~400 clinical scenarios total) are sourced from published datasets with separate provenance controls.
Canary strings: Each scenario (for example, triage-chest-pain-001.yaml) embeds a unique canary in the patient presentation. If the model echoes it back, the scenario is contaminated.
N-gram overlap: Character-level 8-gram overlap is computed against MedQA, PubMedQA, HealthBench, and other public datasets.
Semantic similarity: Embedding-based similarity catches paraphrased memorization that n-gram methods miss.
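The three layers can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the function names are hypothetical, the overlap score is assumed to be the fraction of a scenario's 8-grams found in the reference text (the source does not say whether containment or Jaccard is used), and the embedding vectors are assumed to be computed elsewhere by whatever embedding model is in use.

```python
import math
from typing import Sequence

NGRAM_SIZE = 8             # character-level 8-grams, as described above
NGRAM_THRESHOLD = 0.7      # n-gram overlap threshold quoted in the text
SEMANTIC_THRESHOLD = 0.85  # semantic similarity threshold quoted in the text


def canary_leaked(model_output: str, canary: str) -> bool:
    """Layer 1: True if the model echoed the scenario's canary string."""
    return canary in model_output


def char_ngrams(text: str, n: int = NGRAM_SIZE) -> set[str]:
    """Character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_overlap(scenario: str, reference: str, n: int = NGRAM_SIZE) -> float:
    """Layer 2 (assumed definition): share of scenario n-grams present in the reference."""
    scenario_grams = char_ngrams(scenario, n)
    if not scenario_grams:
        return 0.0
    return len(scenario_grams & char_ngrams(reference, n)) / len(scenario_grams)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Layer 3: cosine similarity between two precomputed embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_flagged(overlap: float, similarity: float, leaked: bool) -> bool:
    """Flag a scenario if any one of the three layers trips."""
    return leaked or overlap >= NGRAM_THRESHOLD or similarity >= SEMANTIC_THRESHOLD
```

Under these assumptions, a scenario compared against itself gives `ngram_overlap(text, text) == 1.0` for any text of at least eight characters, and a scenario that shares no 8-gram with the reference gives `0.0`.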
Embargo Tiers
Tier A (public): Published for transparency. Performance on these establishes a baseline. All current scenarios are Tier A.
Tier B (embargoed): Withheld from public release, rotated periodically. Detects memorization. Planned, not yet created.
Tier C (holdout): Never published. Used only for validation. A model scoring significantly higher on Tier A than Tier C scenarios raises a contamination flag. Planned, not yet created.
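The Tier A versus Tier C comparison could look something like the sketch below. The text does not define "significantly higher", so this example assumes a simple fixed margin on the difference in mean scores; the function name and the `max_gap` default of 0.10 are hypothetical, and a real check might use a statistical test instead.

```python
from statistics import mean


def tier_gap_flag(tier_a_scores: list[float],
                  tier_c_scores: list[float],
                  max_gap: float = 0.10) -> bool:
    """Return True if the mean Tier A score exceeds the mean Tier C score by more than max_gap.

    max_gap is an illustrative placeholder, not a value taken from the benchmark.
    """
    gap = mean(tier_a_scores) - mean(tier_c_scores)
    return gap > max_gap
```

A model that scores about the same on public and holdout scenarios is not flagged; one that does markedly better on the public set is.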
Scenario Contamination Status
Showing demo data -- run the contamination-check CLI command with --push-contamination to populate real results.
| Scenario ID | Scenario | Reference | Canary | N-gram Score | Tier |
|---|---|---|---|---|---|
| triage-chest-pain-001 | Chest Pain - STEMI | MedQA-USMLE | Clear | 0.0200 | A |
| triage-anaphylaxis-001 | Anaphylaxis | MedQA-USMLE | Clear | 0.0100 | A |
| triage-meningitis-001 | Meningitis | MedQA-USMLE | Clear | 0.0300 | A |
| ddx-syncope-001 | Syncope Workup | MedQA-USMLE | Clear | 0.0400 | A |
| ddx-jaundice-001 | Jaundice - Hepatic | MedQA-USMLE | Clear | 0.0200 | A |
| summary-dc-medical-001 | Discharge Summary - Medical | MedQA-USMLE | Clear | 0.0100 | A |
| summary-consult-nephro-001 | Consult Note - Nephrology | MedQA-USMLE | Clear | 0.0300 | A |
| summary-hp-cardiac-001 | H&P - Cardiac | MedQA-USMLE | Clear | 0.0100 | A |
| summary-progress-icu-001 | Progress Note - ICU | MedQA-USMLE | Clear | 0.0200 | A |
| ddx-acute-abdomen-001 | Acute Abdomen | MedQA-USMLE | Clear | 0.0500 | A |
| scribe-cardiology-001 | Cardiology Follow-up Note | MedQA-USMLE | Clear | 0.0100 | A |
| scribe-msk-001 | MSK Assessment Note | MedQA-USMLE | Clear | 0.0200 | A |
Note: Contamination monitoring is continuous. When a scenario is flagged, it is removed from scoring and replaced with a fresh scenario from the holdout pool. Model providers are notified of contamination findings.
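The removal-and-replacement step in the note can be pictured roughly as follows. This is only a sketch: `rotate_flagged` and the holdout ID used in the usage note are hypothetical, and the real pipeline's data model is not described in the text.

```python
def rotate_flagged(active: list[str],
                   flagged: set[str],
                   holdout_pool: list[str]) -> list[str]:
    """Replace each flagged scenario ID with the next unused ID from the holdout pool.

    If the pool runs out, the flagged scenario is dropped from scoring without a replacement.
    """
    replacements = iter(holdout_pool)
    rotated = []
    for scenario_id in active:
        if scenario_id in flagged:
            fresh = next(replacements, None)
            if fresh is not None:
                rotated.append(fresh)
        else:
            rotated.append(scenario_id)
    return rotated
```

For example, `rotate_flagged(["triage-chest-pain-001", "ddx-syncope-001"], {"ddx-syncope-001"}, ["holdout-001"])` returns `["triage-chest-pain-001", "holdout-001"]`, where `holdout-001` is a made-up ID.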