Protocol Specification
Full evaluation protocol with citations to the evidence base. Numbered references link to /methodology/references.
1. Multi-Turn Conversation Protocol
BetterHealthBench uses a multi-turn conversation protocol that mirrors real clinical interactions. Each evaluation scenario scripts a patient simulator that presents symptoms, responds to follow-up questions, and introduces clinical complexity over multiple turns. The platform supports TTS-based patient simulation, enabling voice-driven clinical encounters.
Fixed scaffolding. All models are evaluated through identical scaffolding: a frozen system prompt, temperature fixed at 0.3, and max_tokens capped at 1,024. The harness owns all conversation state; models are treated as stateless functions called once per turn. This eliminates prompt engineering as a variable and ensures score differences reflect genuine capability differences. The harness version is tracked with every run.
engine/conversation.py:277-400Adaptive turn limits.Turn limits range from 8 to 15 based on the complexity of each scenario's information tree (the number of clinical facts that must be elicited to reach a correct assessment). Two turns before the maximum, the harness injects a nudge prompt asking the model to synthesize its findings, preventing conversations from ending abruptly without a conclusion [1].
State ownership. The harness owns turn history, scenario metadata, and scoring state. Models receive the conversation transcript and return a single completion. This stateless design means any model conforming to a chat-completion API can be evaluated without custom integration.
2. Scoring Methodology
Each benchmark type uses a primary metric validated for its clinical task. Scoring is dual-path: a fast regex-based path for deterministic extraction and an LLM-as-judge path for nuanced clinical assessment. Both paths run in PHI-safe mode when evaluating real clinical data.
benchmarks/triage.py:264-317| Task | Primary Metric | Rationale | Citation |
|---|---|---|---|
| Triage | QWK | Quadratic-Weighted Kappa penalizes disagreements proportional to ordinal distance. Standard in ESI/CTAS validation. | [2] |
| Differential Dx | NDCG@10 + MRR + Top-3 | Logarithmic position weighting rewards correct diagnoses ranked higher. | [3] |
| Summarization | BERTScore + LLM-as-judge (PDSQI-9) | Token-level F1 has negative correlation with physician judgment of clinical summaries. BERTScore captures semantic equivalence. | [4] |
| Scribe Eval | PDSQI-9 4-factor model | Validated on 779 clinical summaries across 4 factors. Cronbach α=0.879. | [5] |
| Safety | Escalation + Refusal (gated) | Asymmetric weighting: safety failures penalized more heavily than false caution. | [6] |
The LLM-as-judge path scores each conversation against up to 40 criteria across 13 categories, covering clinical accuracy, reasoning quality, safety awareness, communication clarity, and clinical reference adherence (CTAS/ESI protocols, CanMEDS competency mapping, pharmacotherapy guidelines).
escalation_safety < 0.5?Evaluation Profiles
Not every healthcare AI tool does the same thing. Evaluating Abridge on triage is nonsensical. Each tool type has benchmarks matched to its clinical function. Evaluation profiles group benchmarks by intended deployment context.
| Profile | Tool Type | Benchmark Suite |
|---|---|---|
| Frontier LLM | General-purpose models (GPT, Claude, Gemini) | All 34 benchmarks — full suite |
| Ambient Scribe | Clinical documentation (Abridge, DAX, Suki) | Scribe Eval, MTS-Dialog, ACI-Bench, Summarization, MedHallu, Safety |
| Clinical Decision Support | Triage and diagnostic tools | Triage, DDx, MedQA, Safety, SCT, NEJM CPC |
| EHR Agent | Tools that operate within EHR systems | emrQA, Summarization, DiagBench, CSEDB |
| Diagnostic Imaging | Radiology, pathology, dermatology | CheXpert, VQA-RAD, Path-VQA |
| Patient-Facing | Consumer health chatbots and education tools | Safety (heavy weight), MedDialog, MedHallu, HealthBench |
3. Worst-of-K Reliability
Healthcare AI must be reliable, not just accurate on average. BetterHealthBench runs each scenario K=10 times and reports the lowest score alongside the mean. This measures tail risk: how badly can the model fail on a given case?
benchmarks/triage.py:265-273Statistical justification. With K=10 independent samples, we observe the empirical minimum of the score distribution. We model per-scenario scores with a Beta distribution and use the K-th order statistic to estimate the probability of encountering a score at or below the observed minimum in production. This provides a tail-risk estimate grounded in the SABER framework for systematic assessment of benchmark reliability [7].
Why this approach: We report worst-of-K rather than mean because a single critical failure at 3am matters more than a high average.
4. LLM Jury
Single-model judges introduce systematic bias: models tend to prefer outputs that match their own style [8]. BetterHealthBench uses a multi-model jury for LLM-as-judge evaluation.
Bradley-Terry pairwise rankings. Rather than absolute scoring, jury members compare model outputs pairwise. The Bradley-Terry model converts pairwise preferences into a global ranking, producing more stable orderings than independent Likert-scale scores.
Cross-provider bias detection.We monitor whether any jury model systematically rates outputs from its own provider family higher. When self-recognition bias is detected (statistically significant preference for own-family outputs), that jury member's scores for the affected model are excluded. Multi-family jury diversity is enforced at construction time.
5. Claim-Level Verification (VeriFact)
For tasks involving clinical text generation (summarization, scribe, patient instructions), BetterHealthBench performs claim-level verification using the VeriFact methodology [9].
Atomic claim decomposition. Generated text is decomposed into atomic claims (single factual assertions). Each claim is independently verified against the source material.
Three-way classification. Each claim is classified as supported (entailed by source), unsupported (contradicted by source), or inferrable (reasonable clinical inference not explicitly stated).
6. Contamination Detection
Benchmark integrity requires contamination controls. BetterHealthBench uses a three-layer detection system. See /contamination for the cascade visualization and live status table.
- Canary strings. Unique identifiers embedded in scenario files. If a model reproduces a canary string verbatim, it indicates the scenario appeared in training data.
- N-gram overlap detection. Character-level n-gram overlap between model outputs and scenario source text, with a contamination threshold of 0.7.
- Semantic similarity. Embedding-based similarity between model outputs and known training corpora, with a contamination threshold of 0.85.
Embargo tiers. Tier A (public) for transparency, Tier B (embargoed) rotated periodically, Tier C (holdout) used only for validation. A model scoring significantly higher on Tier A than Tier B raises a contamination flag.
Live refresh stream (planned). Inspired by LiveClin and LiveMedBench [17] [18], a quarterly refresh pipeline will source scenarios from PMC Open Access case reports published after each model's training cutoff, raising the bar beyond canary/n-gram detection alone.
Synthetic-content guardrail (planned). Per recent findings on synthetic data eroding pathological variability [22], BHB-generated scenarios will be checked for distributional drift in rare-finding prevalence before publication.
7. Psychometric Validation
Benchmark scenarios are validated using standard psychometric methods adapted from educational measurement, following the "Beyond Benchmarks" framework [10].
- Cronbach's alphafor internal consistency. Benchmarks with α < 0.7 are flagged for item review.
- Item discrimination analysis. Scenarios with near-zero discrimination are candidates for replacement.
- ICC for test-retest reliability. ICC values below 0.75 trigger investigation into scoring instability.
- Factor analysis. Confirms whether intended construct structure holds empirically.
8. Post-Deployment Monitoring
BetterHealthBench supports continuous post-deployment monitoring, informed by the FDA's PCCP guidance [11], real-world evidence approaches for LLM-based clinical systems [12], and the NEJM AI pragmatic trial operations playbook for ambient AI [23].
Regression detection. Each monitoring run selects a stratified subset of scenarios plus a worst-case subset targeting scenarios where the model previously scored lowest. Two-pronged selection ensures both representative coverage and sensitivity to regressions in known weak areas.
Asymmetric safety thresholds. Safety regressions are held to stricter thresholds than performance regressions. Any safety drop triggers a critical alert.
Version-over-version comparison. Every run is compared against the previous run and the initial baseline, with full provenance (model version, harness version, timestamp) for auditable tracking.
9. Scenario Coverage
BetterHealthBench currently includes ~400 clinical scenarios spanning triage, differential diagnosis, clinical summarization, ambient scribe, voice interaction, and multimodal (imaging) tasks. Scenarios are sourced through a clinician-authored pipeline grounded in international clinical standards.
Canada
- CCFP/LMCC alignment. Mapped to CCFP SOO/SAMP and LMCC competency domains.
- CanMEDS competency mapping. Tagged with relevant CanMEDS roles for analysis across competency dimensions.
- CTAS triage protocol. Triage scenarios follow the Canadian Triage and Acuity Scale.
United States
- ESI triage protocol alongside CTAS.
- USMLE-aligned knowledge benchmarks via MedQA.
- AHA/ACC/ACEP guidelines referenced where applicable.
Adversarial Patient Taxonomy (planned)
Adopting the 5-dimension parametric taxonomy from MedDialBench [16]: Logic Consistency, Health Cognition, Expression Style, Disclosure, Attitude. Each dimension is graded for dose-response analysis, allowing isolation of which adversarial patient behaviors most degrade model safety.
10. Validation Landscape
BetterHealthBench is built with awareness of how the industry evaluates deployed clinical AI tools.
- Abridge — JAMA Network Open 2025 multi-site QI study. Clinician burnout decreased from 51.9% to 38.8%, with 30 min/day time savings [13].
- Nuance DAX Copilot — NEJM AI 2025 longitudinal study (112 clinicians). Primary endpoints not statistically significant [14].
- Hippocratic AI RWE-LLM — 6,234 clinicians, 307,038 evaluations. Largest-scale RWE study for LLM-based clinical systems [12].
- Google Med-PaLM 2 — Nature 2023 physician panel evaluation, validating LLM-as-judge approaches grounded in clinical rubrics [15].
- NOHARM / ARISE Network — 100 real primary care cases, 31 LLMs, 29 specialists, 12,747 annotations. Best models still make 12–15 severe errors per 100 cases [20].
- STELLA / Atella — peer-reviewed quantification of per-turn safety decay (+0.3%/turn harmful, +0.7%/turn benefit loss) across 5 frontier chatbots [19].
- MedAgentBench — NEJM AI publication of 300 FHIR agentic tasks in a virtual EHR environment [21].
11. Statistical Methods
We use Welch's t-test (not Student's t-test) for comparing model scores because it does not assume equal variance between groups. Significance threshold is p < 0.05; effect size reported via Cohen's d. Bonferroni correction is applied for multiple comparisons. Bootstrap confidence intervals (1,000 resamples) are used for reliability metrics.
For drift detection, paired bootstrap is used for version-over-version comparison. CUSUM / statistical process control is planned to detect slow consistent drift that no individual delta would trip.
12. Longitudinal Drift Detection
Models update silently. A model that was safe last month may not be safe today. Drift detection uses the Kolmogorov-Smirnov test (for distributional shifts) and the Mann-Whitney U test (for median score changes). A drift alert triggers when a model's score changes by more than 5% between evaluation runs (confirmed by statistical testing). See /tracking for the live drift timeline visualization.