| Triage | Multi-turn | BetterHealthBench | 103+ | Clinical triage severity assignment |
| Differential Dx | Multi-turn | BetterHealthBench | 93+ | Differential diagnosis reasoning |
| Summarization | Multi-turn | BetterHealthBench | 73+ | Clinical note summarization |
| Scribe Eval | Single-pass | BetterHealthBench | 25 | Ambient scribe note quality |
| Voice Scribe | Multi-turn | BetterHealthBench | 15 | Voice-to-text clinical documentation |
| SafetyBench-Medical | Single-turn | BetterHealthBench | 50 | Safety refusal/escalation |
| MedQA | Single-turn MCQ | Academic (USMLE) | HuggingFace | USMLE-style medical QA |
| MedMCQA | Single-turn MCQ | Academic (AIIMS/NEET) | HuggingFace | Indian medical entrance exam |
| MMLU Medical | Single-turn MCQ | Academic (Hendrycks) | HuggingFace | 6 medical knowledge subjects |
| PubMedQA | Single-turn MCQ | Academic (PubMed) | HuggingFace | Biomedical literature QA |
| MedXpertQA | Single-turn MCQ | Academic | Sample | Expert-level medical QA |
| MedCalc-Bench | Single-turn | Academic | Sample | Medical calculations |
| JAMA Clinical | Single-turn MCQ | Academic (JAMA) | Sample | JAMA Clinical Challenge cases |
| ClinicalSTS | Single-turn | Academic | Sample | Semantic textual similarity |
| MedHELM | Multiple | Academic (Stanford) | 121 tasks | Holistic multi-task evaluation framework for medical LLMs |
| MedHallu | Single-turn | Academic | Sample | Hallucination detection |
| LiveMedBench | Single-turn MCQ | Academic | Sample | Post-cutoff anti-contamination |
| SCT | Single-turn | Academic (Charlin) | Sample | Script concordance (reasoning under uncertainty) |
| NEJM CPC | Single-turn | Academic (NEJM) | Sample | Clinicopathological conference cases |
| MedNLI | Single-turn | Academic (PhysioNet) | Sample | Medical natural language inference |
| BioASQ | Single-turn | Academic (BioASQ) | Sample | Biomedical semantic QA |
| emrQA | Single-turn | Academic (i2b2) | Sample | EHR question answering |
| DiagBench | Single-turn | Academic (MIMIC-IV) | Sample | Diagnostic reasoning from clinical data |
| CSEDB | Single-turn | Academic | Sample | Cascading clinical scenario decisions |
| MedDialog | Single-turn | Academic | Sample | Medical dialogue quality |
| CheXpert | Single-turn | Academic (Stanford) | Sample | Chest X-ray interpretation (text-based) |
| VQA-RAD | Single-turn | Academic | Sample | Radiology visual QA (text-based) |
| Path-VQA | Single-turn | Academic | Sample | Pathology visual QA (text-based) |
| HealthBench | Multi-turn | Industry (OpenAI) | 5K | Physician-rubric evaluation |
| MedAgentBench | Agentic | Academic (Stanford) | 100 | Agentic clinical task evaluation on FHIR |