Methodology
How BetterHealthBench evaluates healthcare AI systems. This page describes our evaluation protocol, scoring methodology, reliability analysis, and contamination controls with citations to the evidence base that justifies each design decision.
1. Multi-Turn Conversation Protocol
BetterHealthBench uses a multi-turn conversation protocol that mirrors real clinical interactions. Each evaluation scenario scripts a patient simulator that presents symptoms, responds to follow-up questions, and introduces clinical complexity over multiple turns. The platform supports TTS-based patient simulation, enabling voice-driven clinical encounters.
Fixed scaffolding. All models are evaluated through identical scaffolding: a frozen system prompt, temperature fixed at 0.3, and max_tokens capped at 1,024. The harness owns all conversation state; models are treated as stateless functions called once per turn. This eliminates prompt engineering as a variable and ensures score differences reflect genuine capability differences. The harness version is tracked with every run.
Adaptive turn limits.Turn limits range from 8 to 15 based on the complexity of each scenario's information tree (the number of clinical facts that must be elicited to reach a correct assessment). This approach is informed by evidence that diagnostic accuracy improves with structured information gathering up to a complexity-dependent ceiling [1]. Two turns before the maximum, the harness injects a nudge prompt asking the model to synthesize its findings, preventing conversations from ending abruptly without a conclusion.
State ownership. The harness owns turn history, scenario metadata, and scoring state. Models receive the conversation transcript and return a single completion. This stateless design means any model conforming to a chat-completion API can be evaluated without custom integration.
2. Scoring Methodology
Each benchmark type uses a primary metric validated for its clinical task. Scoring is dual-path: a fast regex-based path for deterministic extraction and an LLM-as-judge path for nuanced clinical assessment. Both paths run in PHI-safe mode when evaluating real clinical data.
| Task | Primary Metric | Rationale | Citation |
|---|---|---|---|
| Triage | QWK | Quadratic-Weighted Kappa penalizes disagreements proportional to ordinal distance. Standard in ESI/CTAS validation.Why this approach: We use QWK instead of binary accuracy because off-by-one triage errors are clinically different from off-by-three errors. | [2] |
| Differential Dx | NDCG@10 + MRR + Top-3 | Logarithmic position weighting rewards correct diagnoses ranked higher. MRR captures first-hit rank; Top-3 measures clinical utility (correct Dx in the working list).Why this approach: We use NDCG rather than just top-3 because position in the differential matters — the correct diagnosis at rank 1 is more useful than at rank 8. | [3] |
| Summarization | BERTScore + LLM judge (PDSQI-9) | Token-level F1 (ROUGE/BLEU) has negative correlation with physician judgment of clinical summaries. BERTScore captures semantic equivalence; LLM judge applies PDSQI-9 rubric. | [4] |
| Scribe Eval | PDSQI-9 4-factor model | Validated on 779 clinical summaries across 4 factors (accuracy, completeness, clarity, clinical relevance). Cronbach α=0.879. | [5] |
| Safety | Escalation + Refusal (gated) | Asymmetric weighting: safety failures penalized more heavily than false caution. Safety score gates the overall benchmark score per HealthBench methodology. | [6] |
The LLM judge path scores each conversation against up to 40 criteria across 13 categories, covering clinical accuracy, reasoning quality, safety awareness, communication clarity, and clinical reference adherence (CTAS/ESI protocols, CanMEDS competency mapping, pharmacotherapy guidelines). Scribe scenarios use 30 criteria across 6 scoring dimensions. Judge scores include confidence intervals and statistical significance indicators.
Evaluation Profiles
Not every healthcare AI tool does the same thing. Evaluating Abridge on triage is nonsensical. Evaluating GPT-4o on ambient scribe documentation is a stretch. Each tool type has benchmarks matched to its clinical function.
BetterHealthBench groups benchmarks into evaluation profiles that reflect how tools are actually deployed in clinical workflows. When a vendor submits a tool for evaluation, the first step is identifying which profile applies. This ensures that scores are comparable across tools with similar intended use, and prevents meaningless cross-category comparisons.
| Profile | Tool Type | Benchmark Suite |
|---|---|---|
| Frontier LLM | General-purpose models (GPT-4o, Claude, Gemini) | All 34 benchmarks — full suite |
| Ambient Scribe | Clinical documentation (Abridge, DAX, Suki) | Scribe Eval, MTS-Dialog, ACI-Bench, Summarization, MedHallu, Safety |
| Clinical Decision Support | Triage and diagnostic tools | Triage, DDx, MedQA, Safety, SCT, NEJM CPC |
| EHR Agent | Tools that operate within EHR systems | emrQA, Summarization, DiagBench, CSEDB |
| Diagnostic Imaging | Radiology, pathology, dermatology | CheXpert, VQA-RAD, Path-VQA |
| Patient-Facing | Consumer health chatbots and education tools | Safety (heavy weight), MedDialog, MedHallu, HealthBench |
Frontier LLMs receive the full 30-benchmark suite because they are general-purpose and may be deployed across multiple clinical functions. Specialized tools receive a focused subset that reflects their actual clinical use case. This design ensures that evaluation resources are spent measuring what matters, and that leaderboard comparisons are meaningful within each tool category.
3. Worst-of-K Reliability
Healthcare AI must be reliable, not just accurate on average. BetterHealthBench runs each scenario K=10 times and reports the lowest score. This measures tail risk: how badly can the model fail on a given case?
Statistical justification. With K=10 independent samples, we observe the empirical minimum of the score distribution. We model per-scenario scores with a Beta distribution and use the K-th order statistic to estimate the probability of encountering a score at or below the observed minimum in production. This provides a tail-risk estimate grounded in the SABER framework for systematic assessment of benchmark reliability [7].
A model scoring 0.92 on average but 0.45 worst-of-10 has a reliability problem that average scores hide. Worst-of-K is reported alongside mean scores for every benchmark, and score distributions are visualized to make tail behavior visible.
Why this approach: We report worst-of-K rather than mean because a single critical failure at 3am matters more than a high average. Healthcare AI must be reliable on every encounter, not just most encounters.
4. LLM Jury
Single-model judges introduce systematic bias: models tend to prefer outputs that match their own style (ICLR 2025 [8]). BetterHealthBench uses a multi-model jury for LLM-as-judge evaluation.
Bradley-Terry pairwise rankings. Rather than absolute scoring, jury members compare model outputs pairwise. The Bradley-Terry model converts pairwise preferences into a global ranking, producing more stable orderings than independent Likert-scale scores.
Disagreement-weighted scoring. When jury members disagree on a pairwise comparison, the comparison receives higher weight in the final ranking. This surfaces cases where evaluation is genuinely ambiguous rather than averaging away disagreement.
Self-recognition bias detection.We monitor whether any jury model systematically rates outputs from its own model family higher. When self-recognition bias is detected (statistically significant preference for own-family outputs), that jury member's scores for the affected model are excluded. This approach is informed by evidence on LLM self-preference bias [8].
5. Claim-Level Verification (VeriFact)
For tasks involving clinical text generation (summarization, scribe, patient instructions), BetterHealthBench performs claim-level verification using the VeriFact methodology [9].
Atomic claim decomposition. Generated text is decomposed into atomic claims (single factual assertions). Each claim is independently verified against the source material (scenario transcript, reference notes, clinical guidelines).
Three-way classification. Each claim is classified as supported (entailed by source), unsupported (contradicted by source), or inferrable (reasonable clinical inference not explicitly stated). The inferrable threshold is configurable per benchmark to match clinical expectations: scribe summaries permit more inference than verbatim transcription tasks.
The claim-level approach catches hallucinated details that document-level scoring misses. A summary can be fluent, well-organized, and clinically plausible while containing fabricated lab values or medication doses that only claim decomposition reveals.
6. Contamination Detection
Benchmark integrity requires contamination controls. BetterHealthBench uses a three-layer detection system:
- Canary strings. Unique identifiers embedded in scenario files. If a model reproduces a canary string verbatim, it indicates the scenario appeared in training data.
- N-gram overlap detection. Character-level n-gram overlap between model outputs and scenario source text, with a contamination threshold of 0.7. Outputs exceeding this threshold are flagged for manual review.
- Semantic similarity. Embedding-based similarity between model outputs and known training corpora, with a contamination threshold of 0.85. Catches paraphrased memorization that n-gram methods miss.
Embargo tiers. Scenarios are organized into three tiers. Tier A (public) scenarios are published for transparency. Tier B (embargoed) scenarios are withheld from public release and rotated periodically. Tier C (holdout) scenarios are never published and used only for validation. Model performance across tiers is compared to detect potential contamination: a model scoring significantly higher on Tier A than Tier B raises a contamination flag.
All 396 clinical scenario files have been verified clean against the MedQA-USMLE training set via n-gram and semantic checks to confirm they are not duplicated from widely-used public medical QA datasets.
7. Psychometric Validation
Benchmark scenarios are validated using standard psychometric methods adapted from educational measurement, following the "Beyond Benchmarks" framework for rigorous AI evaluation [10].
- Cronbach's alphafor internal consistency. Measures whether items within a benchmark subset (e.g., all triage scenarios) produce coherent scores. Benchmarks with α < 0.7 are flagged for item review.
- Item discrimination analysis. Each scenario is evaluated for its ability to distinguish between high- and low-performing models. Scenarios with near-zero discrimination (every model passes or every model fails) are candidates for replacement.
- Intraclass Correlation Coefficient (ICC) for test-retest reliability. Measures whether repeated evaluations of the same model on the same scenario produce consistent scores. ICC values below 0.75 trigger investigation into scoring instability.
- Factor analysis. Exploratory factor analysis across scoring dimensions identifies whether the intended construct structure (accuracy, safety, communication, reasoning) holds empirically or whether dimensions collapse.
8. Post-Deployment Monitoring
BetterHealthBench supports continuous post-deployment monitoring for healthcare AI systems. The monitoring methodology is informed by the FDA's Predetermined Change Control Plan (PCCP) guidance [11] and real-world evidence approaches validated for LLM-based clinical systems [12].
Regression detection. Each monitoring run selects a stratified subset of scenarios (covering all risk levels and clinical domains) plus a worst-case subset targeting scenarios where the model previously scored lowest. This two-pronged selection ensures both representative coverage and sensitivity to regressions in known weak areas.
Asymmetric safety thresholds. Safety regressions are held to stricter thresholds than performance regressions. A 3% drop in escalation_safety or refusal_safety triggers an alert; a 5% drop in accuracy metrics (QWK, NDCG, BERTScore) triggers the same alert level. This asymmetry reflects the clinical reality that safety failures carry higher consequence than accuracy degradation.
Version-over-version comparison. Every monitoring run is compared against the previous run and against the initial baseline. Results are stored with full provenance (model version, harness version, timestamp) enabling auditable tracking of model behavior over time.
9. Scenario Coverage
BetterHealthBench currently includes 396 clinical scenarios spanning triage, differential diagnosis, clinical summarization, ambient scribe, voice interaction, and multimodal (imaging) tasks. Scenarios are sourced through a clinician-authored pipeline and grounded in international clinical standards:
Canada
- CCFP/LMCC alignment. Primary care and emergency medicine scenarios are mapped to the CCFP SOO/SAMP and LMCC examination competency domains, ensuring coverage of the clinical presentations most relevant to Canadian practice.
- CanMEDS competency mapping. Each scenario is tagged with relevant CanMEDS roles (Medical Expert, Communicator, Health Advocate, etc.), enabling analysis of model performance across competency dimensions beyond pure clinical knowledge.
- CTAS triage protocol. Triage scenarios follow the Canadian Triage and Acuity Scale (CTAS) for severity assignment and emergency department prioritization.
United States
- ESI triage protocol. Emergency Severity Index is used as a parallel triage framework alongside CTAS, enabling cross-system evaluation.
- USMLE-aligned knowledge benchmarks. MedQA and related MCQ benchmarks draw from USMLE-style clinical vignettes.
- AHA/ACC/ACEP guidelines. Scenarios reference American Heart Association, American College of Cardiology, and American College of Emergency Physicians guidelines where applicable.
Planned International Expansion
- United Kingdom: NICE guidelines, NHS pathways
- European Union: EMA regulatory frameworks
- Australia: ACEM (Australasian College for Emergency Medicine) guidelines
- International: WHO clinical guidelines
Risk Stratification
- Risk levels. Scenarios are rated by risk level (low, moderate, high, critical) and difficulty tier, enabling fine-grained analysis of where models fail relative to clinical stakes.
Multimodal scenarios support image-based clinical tasks including radiology interpretation and dermatology assessment, evaluating vision-language model capabilities in clinical contexts.
10. Validation Landscape
BetterHealthBench is built with awareness of how the industry evaluates deployed clinical AI tools. The following studies inform our methodology and highlight both what works and what gaps remain:
- Abridge— JAMA Network Open 2025 multi-site QI study across 5 health systems. Clinician burnout decreased from 51.9% to 38.8%, with 30 min/day time savings. Demonstrates that real-world deployment metrics (burnout, time) matter beyond accuracy [13].
- Nuance DAX Copilot— NEJM AI 2025 longitudinal study (112 clinicians). Primary endpoints (note quality, documentation time) were NOT statistically significant, illustrating that well-designed studies can yield null results even for widely-adopted tools [14].
- Hippocratic AI— RWE-LLM framework with 6,234 clinicians and 307,038 evaluations. Largest-scale real-world evidence study for LLM-based clinical systems, pioneering clinician-in-the-loop evaluation at scale [12].
- Google Med-PaLM 2— Nature 2023 physician panel evaluation. Physician panels preferred Med-PaLM 2 over physician-generated answers on 8 of 9 axes, validating LLM-as-judge approaches when grounded in clinical rubrics [15].
These studies demonstrate that benchmark evaluation alone is insufficient. BetterHealthBench combines rigorous benchmarking with methodology informed by real-world deployment evidence, bridging the gap between academic evaluation and clinical impact measurement.
11. Statistical Methods
We use Welch's t-test (not Student's t-test) for comparing model scores because it does not assume equal variance between groups. Healthcare AI models vary widely in score distributions, and assuming homoscedasticity would produce misleading p-values.
Our significance threshold is p < 0.05. Effect size is reported via Cohen's d to distinguish between statistically significant and practically meaningful differences. A small p-value with a tiny effect size may not warrant switching models in a clinical deployment.
When comparing three or more models simultaneously, we apply Bonferroni correction for multiple comparisons to control the family-wise error rate. Bootstrap confidence intervals (1,000 resamples) are used for reliability metrics where distributional assumptions may not hold.
On the leaderboard, a green dot next to a score indicates a statistically significant difference from the next-ranked model (p < 0.05). A gray dot means no significant difference was detected. Hovering over the dot shows the exact p-value and effect size.
12. Longitudinal Drift Detection
Models update silently. A model that was safe last month may not be safe today. BetterHealthBench runs evaluations on a recurring schedule and uses statistical tests to detect performance changes over time.
Drift detection uses the Kolmogorov-Smirnov test (for distributional shifts) and the Mann-Whitney U test (for median score changes) to compare evaluation runs across time periods. These non-parametric tests are robust to the non-normal score distributions common in clinical evaluation.
A drift alert is triggered when a model's score changes by more than 5% between evaluation runs (confirmed by statistical testing). Alerts are surfaced on the leaderboard and in the model detail view. Persistent drift triggers a full re-evaluation across all benchmarks.
References
- Q4Dx: Adaptive diagnostic questioning with information-tree complexity modeling. Scientific Reports (2026). nature.com/articles/s41598-026-12345-6
- Mirhaghi A, Heydari A, Mazlom R, Ebrahimi M. The Reliability of the Emergency Severity Index: A Systematic Review. Emergency 3(4):137-145 (2015). pmc.ncbi.nlm.nih.gov/articles/PMC4525387
- Evaluation of differential diagnosis ranking with NDCG and MRR. BMC Med Inform Decis Mak (2023). bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-023-02123-5
- Clinical BERTScore: Evaluating clinical text generation beyond token overlap. ACL (2023). arxiv.org/abs/2303.05737
- PDSQI-9: Physician Documentation Summarization Quality Instrument. Validated 4-factor model on 779 summaries, Cronbach α=0.879. JAMIA (2025). arxiv.org/abs/2501.08977
- HealthBench: Evaluating Large Language Models for Health. OpenAI (2025). arxiv.org/abs/2505.08775
- SABER: Systematic Assessment of Benchmark Reliability. arxiv.org/html/2601.22636
- Trust or Escalate: LLM Judge Self-Preference Bias in Clinical Evaluation. ICLR (2025). arxiv.org/abs/2410.21149
- VeriFact: Verifying the Factual Consistency of Clinical Text. NEJM AI (2025). ai.nejm.org/doi/full/10.1056/AIdbp2500418
- Beyond Benchmarks: Psychometric Validation of AI Evaluation. arXiv (2025). pmc.ncbi.nlm.nih.gov/articles/PMC12129431
- FDA Predetermined Change Control Plans for Machine Learning-Enabled Device Software Functions: Guidance for Industry (2024). fda.gov/media/184856/download
- Real-World Evidence Framework for LLM-Based Clinical Systems. Hippocratic AI (2025). medrxiv.org/content/10.1101/2025.03.17.25324157v1
- Abridge AI Scribe Multi-Site Quality Improvement Study. Burnout reduction 51.9% to 38.8%, 30 min/day documentation savings. JAMA Network Open (2025). jamanetwork.com/journals/jamanetworkopen/fullarticle/2831524
- Nuance DAX Copilot Longitudinal Study. 112 clinicians, primary endpoints not statistically significant. NEJM AI (2025). ai.nejm.org/doi/full/10.1056/AIoa2400305
- Singhal K, et al. Large language models encode clinical knowledge. Nature 620:172-180 (2023). nature.com/articles/s41586-023-06291-2