Benchmarks

| Benchmark | Category | Scenarios | Description |
|---|---|---|---|
| Clinical Scribe | scribe | 25 | Medical scribe evaluation: generate structured clinical notes from encounter transcripts. |
| Clinical Summarization | clinical | 40 | Summarize patient encounters into structured clinical notes. |
| Differential Diagnosis | clinical | 75 | Generate and rank differential diagnoses from patient presentations. |
| EMR Workflow | clinical | 25 | Simulated electronic medical record tasks: chart review, medication reconciliation, lab interpretation, referral letters, preventive care gaps. |
| MedQA (USMLE) | standard_qa | 1273 | US Medical Licensing Exam multiple-choice questions. |
| MMLU Medical | standard_qa | 772 | Clinical knowledge, medical genetics, anatomy, professional medicine. |
| Multi-Specialty Consult | clinical | 30 | Cross-specialty clinical reasoning: cardiology, nephrology, oncology, psychiatry, pediatrics consultations requiring information synthesis across domains. |
| PubMedQA | standard_qa | 500 | Biomedical literature yes/no/maybe questions with PubMed abstracts. |
| Triage | clinical | 50 | Emergency triage assessment: classify patient acuity from multi-turn conversations. |
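
The scenario counts in the table above can be tallied per category. The following is a minimal sketch, not part of the benchmark suite itself: the `BENCHMARKS` list and the `scenarios_by_category` helper are hypothetical names introduced for illustration, and the rows are copied by hand from the table.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the table above: (benchmark, category, scenarios).
BENCHMARKS = [
    ("Clinical Scribe", "scribe", 25),
    ("Clinical Summarization", "clinical", 40),
    ("Differential Diagnosis", "clinical", 75),
    ("EMR Workflow", "clinical", 25),
    ("MedQA (USMLE)", "standard_qa", 1273),
    ("MMLU Medical", "standard_qa", 772),
    ("Multi-Specialty Consult", "clinical", 30),
    ("PubMedQA", "standard_qa", 500),
    ("Triage", "clinical", 50),
]


def scenarios_by_category(rows):
    """Sum the scenario counts for each category."""
    totals = defaultdict(int)
    for _name, category, scenarios in rows:
        totals[category] += scenarios
    return dict(totals)


if __name__ == "__main__":
    # Print one line per category, sorted by category name.
    for category, total in sorted(scenarios_by_category(BENCHMARKS).items()):
        print(f"{category}: {total}")
```

Under these assumptions, the `clinical` category sums to 220 scenarios, `scribe` to 25, and `standard_qa` to 2545.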

All 30 Benchmarks

| Benchmark | Type | Origin | Scenarios | Description |
|---|---|---|---|---|
| Triage | Multi-turn | BetterHealthBench | 103+ | Clinical triage severity assignment |
| Differential Dx | Multi-turn | BetterHealthBench | 93+ | Differential diagnosis reasoning |
| Summarization | Multi-turn | BetterHealthBench | 73+ | Clinical note summarization |
| Scribe Eval | Single-pass | BetterHealthBench | 25 | Ambient scribe note quality |
| Voice Scribe | Multi-turn | BetterHealthBench | 15 | Voice-to-text clinical documentation |
| SafetyBench-Medical | Single-turn | BetterHealthBench | 50 | Safety refusal/escalation |
| MedQA | Single-turn MCQ | Academic (USMLE) | HuggingFace | USMLE-style medical QA |
| MedMCQA | Single-turn MCQ | Academic (AIIMS/NEET) | HuggingFace | Indian medical entrance exam |
| MMLU Medical | Single-turn MCQ | Academic (Hendrycks) | HuggingFace | 6 medical knowledge subjects |
| PubMedQA | Single-turn MCQ | Academic (PubMed) | HuggingFace | Biomedical literature QA |
| MedXpertQA | Single-turn MCQ | Academic | Sample | Expert-level medical QA |
| MedCalc-Bench | Single-turn | Academic | Sample | Medical calculations |
| JAMA Clinical | Single-turn MCQ | Academic (JAMA) | Sample | JAMA Clinical Challenge cases |
| ClinicalSTS | Single-turn | Academic | Sample | Semantic textual similarity |
| MedHELM | Multiple | Academic (Stanford) | 121 tasks | Stanford's 121-task framework |
| MedHallu | Single-turn | Academic | Sample | Hallucination detection |
| LiveMedBench | Single-turn MCQ | Academic | Sample | Post-cutoff anti-contamination |
| SCT | Single-turn | Academic (Charlin) | Sample | Script concordance (reasoning under uncertainty) |
| NEJM CPC | Single-turn | Academic (NEJM) | Sample | Clinicopathological conference cases |
| MedNLI | Single-turn | Academic (PhysioNet) | Sample | Medical natural language inference |
| BioASQ | Single-turn | Academic (BioASQ) | Sample | Biomedical semantic QA |
| emrQA | Single-turn | Academic (i2b2) | Sample | EHR question answering |
| DiagBench | Single-turn | Academic (MIMIC-IV) | Sample | Diagnostic reasoning from clinical data |
| CSEDB | Single-turn | Academic | Sample | Cascading clinical scenario decisions |
| MedDialog | Single-turn | Academic | Sample | Medical dialogue quality |
| CheXpert | Single-turn | Academic (Stanford) | Sample | Chest X-ray interpretation (text-based) |
| VQA-RAD | Single-turn | Academic | Sample | Radiology visual QA (text-based) |
| Path-VQA | Single-turn | Academic | Sample | Pathology visual QA (text-based) |
| HealthBench | Multi-turn | Industry (OpenAI) | 5K | Physician-rubric evaluation |
| MedAgentBench | Agentic | Academic (Stanford) | 100 | Agentic clinical task evaluation on FHIR |

30 benchmarks shown. BetterHealthBench originals include multi-turn clinical scenarios. Academic and industry benchmarks are integrated from established sources.