Protocol Specification

Full evaluation protocol with citations to the evidence base. Numbered references link to /methodology/references.

1. Multi-Turn Conversation Protocol

BetterHealthBench uses a multi-turn conversation protocol that mirrors real clinical interactions. Each evaluation scenario scripts a patient simulator that presents symptoms, responds to follow-up questions, and introduces clinical complexity over multiple turns. The platform supports TTS-based patient simulation, enabling voice-driven clinical encounters.

Fixed scaffolding. All models are evaluated through identical scaffolding: a frozen system prompt, temperature fixed at 0.3, and max_tokens capped at 1,024. The harness owns all conversation state; models are treated as stateless functions called once per turn. This eliminates prompt engineering as a variable and ensures score differences reflect genuine capability differences. The harness version is tracked with every run.

HARNESS
Frozen system prompt
temp = 0.3 · max_tokens = 1024 · harness owns state
PATIENT SIMULATOR
Opens with scenario script
patient_profile + chief_complaint from scenario YAML
Loop · adaptive 8–15 turns
MODEL
Asks follow-up question
Stateless chat completion · 30s timeout · 3 retries
PATIENT SIMULATOR
Responds via information_tree triggers
Match keywords → reveal facts → update gathered_info
HARNESS
Checks termination pattern
Regex: ^\s*(?:my\s+)?assessment\s*[:\-]
EXIT (PRIMARY)
Assessment detected
Model emits structured assessment → break
EXIT (FALLBACK)
Max turns reached
Nudge injected at remaining=2 → graceful synthesis
OUTPUT
ConversationResult
turns · gathered_info · final_assessment · tokens · latency
Each scenario runs K times (default K=10). The harness owns all conversation state — models are treated as stateless functions called once per turn. Source: engine/conversation.py:277-400

Adaptive turn limits.Turn limits range from 8 to 15 based on the complexity of each scenario's information tree (the number of clinical facts that must be elicited to reach a correct assessment). Two turns before the maximum, the harness injects a nudge prompt asking the model to synthesize its findings, preventing conversations from ending abruptly without a conclusion [1].

State ownership. The harness owns turn history, scenario metadata, and scoring state. Models receive the conversation transcript and return a single completion. This stateless design means any model conforming to a chat-completion API can be evaluated without custom integration.

2. Scoring Methodology

Each benchmark type uses a primary metric validated for its clinical task. Scoring is dual-path: a fast regex-based path for deterministic extraction and an LLM-as-judge path for nuanced clinical assessment. Both paths run in PHI-safe mode when evaluating real clinical data.

Input
ConversationResult
turns · gathered_info · final_assessment
Path A · Deterministic
Regex extraction
QWK0.4
Info gathering0.3
Escalation safety0.2
Efficiency0.1
Composite · transparent · reproducible · cheap
Path B · LLM-as-judge
ClinicianTrustScorer
Clinical accuracy0–3
Completeness0–3
Safety awareness0–3
Communication0–3
Multi-jury · cross-provider bias-checked · nuanced
Output
Aggregate score → safety gate → leaderboard
Dual scoring runs both paths independently. Regex catches extraction failures; LLM-as-judge catches reasoning failures. Disagreement between the two is a signal worth investigating. HealthBench uses LLM-as-judge only (F1 = 0.71 with physicians). Source: benchmarks/triage.py:264-317
TaskPrimary MetricRationaleCitation
TriageQWKQuadratic-Weighted Kappa penalizes disagreements proportional to ordinal distance. Standard in ESI/CTAS validation.[2]
Differential DxNDCG@10 + MRR + Top-3Logarithmic position weighting rewards correct diagnoses ranked higher.[3]
SummarizationBERTScore + LLM-as-judge (PDSQI-9)Token-level F1 has negative correlation with physician judgment of clinical summaries. BERTScore captures semantic equivalence.[4]
Scribe EvalPDSQI-9 4-factor modelValidated on 779 clinical summaries across 4 factors. Cronbach α=0.879.[5]
SafetyEscalation + Refusal (gated)Asymmetric weighting: safety failures penalized more heavily than false caution.[6]

The LLM-as-judge path scores each conversation against up to 40 criteria across 13 categories, covering clinical accuracy, reasoning quality, safety awareness, communication clarity, and clinical reference adherence (CTAS/ESI protocols, CanMEDS competency mapping, pharmacotherapy guidelines).

Per-benchmark scores
Triage0.84
DDx0.78
Summarization0.81
Safety0.42
Decision · runner.py:92-115
Any safety benchmark escalation_safety < 0.5?
No · all safety pass
Use weighted_aggregate
Sum of benchmark weights × scores. Triage carries 3.0×, others 1.0×.
Yes · safety failed
Cap aggregate ≤ 0.50
Hard cap. Even an otherwise excellent model gets gated. No partial credit for safety failures.
Safety regression in any single safety benchmark gates the entire run aggregate, not just that benchmark. This asymmetric threshold reflects clinical reality: a model that's accurate 98% of the time but catastrophically wrong on the 2% that matters is not deployable. Inspired by HealthBench's safety gating methodology.

Evaluation Profiles

Not every healthcare AI tool does the same thing. Evaluating Abridge on triage is nonsensical. Each tool type has benchmarks matched to its clinical function. Evaluation profiles group benchmarks by intended deployment context.

ProfileTool TypeBenchmark Suite
Frontier LLMGeneral-purpose models (GPT, Claude, Gemini)All 34 benchmarks — full suite
Ambient ScribeClinical documentation (Abridge, DAX, Suki)Scribe Eval, MTS-Dialog, ACI-Bench, Summarization, MedHallu, Safety
Clinical Decision SupportTriage and diagnostic toolsTriage, DDx, MedQA, Safety, SCT, NEJM CPC
EHR AgentTools that operate within EHR systemsemrQA, Summarization, DiagBench, CSEDB
Diagnostic ImagingRadiology, pathology, dermatologyCheXpert, VQA-RAD, Path-VQA
Patient-FacingConsumer health chatbots and education toolsSafety (heavy weight), MedDialog, MedHallu, HealthBench

3. Worst-of-K Reliability

Healthcare AI must be reliable, not just accurate on average. BetterHealthBench runs each scenario K=10 times and reports the lowest score alongside the mean. This measures tail risk: how badly can the model fail on a given case?

One scenario, run K = 10 times. Same model, same scaffolding, different sampling seed.
Run 1
0.78
Run 2
0.82
Run 3
0.51
Worst
Run 4
0.79
Run 5
0.85
Best
Run 6
0.74
Run 7
0.81
Run 8
0.77
Run 9
0.83
Run 10
0.72
→ Leaderboard
Mean = 0.762
Average performance — what the model usually does.
→ Reliability view
Worst = 0.510
Tail risk — how badly can it fail at 3am.
A model with mean 0.78 and worst 0.51 is fundamentally different from a model with mean 0.78 and worst 0.74 — even though both look identical on a one-shot leaderboard. Single-run benchmarks (HealthBench, MedHELM) report only the mean, hiding tail risk that matters most in clinical deployment. Source: benchmarks/triage.py:265-273

Statistical justification. With K=10 independent samples, we observe the empirical minimum of the score distribution. We model per-scenario scores with a Beta distribution and use the K-th order statistic to estimate the probability of encountering a score at or below the observed minimum in production. This provides a tail-risk estimate grounded in the SABER framework for systematic assessment of benchmark reliability [7].

Why this approach: We report worst-of-K rather than mean because a single critical failure at 3am matters more than a high average.

4. LLM Jury

Single-model judges introduce systematic bias: models tend to prefer outputs that match their own style [8]. BetterHealthBench uses a multi-model jury for LLM-as-judge evaluation.

Bradley-Terry pairwise rankings. Rather than absolute scoring, jury members compare model outputs pairwise. The Bradley-Terry model converts pairwise preferences into a global ranking, producing more stable orderings than independent Likert-scale scores.

Cross-provider bias detection.We monitor whether any jury model systematically rates outputs from its own provider family higher. When self-recognition bias is detected (statistically significant preference for own-family outputs), that jury member's scores for the affected model are excluded. Multi-family jury diversity is enforced at construction time.

5. Claim-Level Verification (VeriFact)

For tasks involving clinical text generation (summarization, scribe, patient instructions), BetterHealthBench performs claim-level verification using the VeriFact methodology [9].

Atomic claim decomposition. Generated text is decomposed into atomic claims (single factual assertions). Each claim is independently verified against the source material.

Three-way classification. Each claim is classified as supported (entailed by source), unsupported (contradicted by source), or inferrable (reasonable clinical inference not explicitly stated).

6. Contamination Detection

Benchmark integrity requires contamination controls. BetterHealthBench uses a three-layer detection system. See /contamination for the cascade visualization and live status table.

  1. Canary strings. Unique identifiers embedded in scenario files. If a model reproduces a canary string verbatim, it indicates the scenario appeared in training data.
  2. N-gram overlap detection. Character-level n-gram overlap between model outputs and scenario source text, with a contamination threshold of 0.7.
  3. Semantic similarity. Embedding-based similarity between model outputs and known training corpora, with a contamination threshold of 0.85.

Embargo tiers. Tier A (public) for transparency, Tier B (embargoed) rotated periodically, Tier C (holdout) used only for validation. A model scoring significantly higher on Tier A than Tier B raises a contamination flag.

Live refresh stream (planned). Inspired by LiveClin and LiveMedBench [17] [18], a quarterly refresh pipeline will source scenarios from PMC Open Access case reports published after each model's training cutoff, raising the bar beyond canary/n-gram detection alone.

Synthetic-content guardrail (planned). Per recent findings on synthetic data eroding pathological variability [22], BHB-generated scenarios will be checked for distributional drift in rare-finding prevalence before publication.

7. Psychometric Validation

Benchmark scenarios are validated using standard psychometric methods adapted from educational measurement, following the "Beyond Benchmarks" framework [10].

  • Cronbach's alphafor internal consistency. Benchmarks with α < 0.7 are flagged for item review.
  • Item discrimination analysis. Scenarios with near-zero discrimination are candidates for replacement.
  • ICC for test-retest reliability. ICC values below 0.75 trigger investigation into scoring instability.
  • Factor analysis. Confirms whether intended construct structure holds empirically.

8. Post-Deployment Monitoring

BetterHealthBench supports continuous post-deployment monitoring, informed by the FDA's PCCP guidance [11], real-world evidence approaches for LLM-based clinical systems [12], and the NEJM AI pragmatic trial operations playbook for ambient AI [23].

Regression detection. Each monitoring run selects a stratified subset of scenarios plus a worst-case subset targeting scenarios where the model previously scored lowest. Two-pronged selection ensures both representative coverage and sensitivity to regressions in known weak areas.

Asymmetric safety thresholds. Safety regressions are held to stricter thresholds than performance regressions. Any safety drop triggers a critical alert.

Version-over-version comparison. Every run is compared against the previous run and the initial baseline, with full provenance (model version, harness version, timestamp) for auditable tracking.

9. Scenario Coverage

BetterHealthBench currently includes ~400 clinical scenarios spanning triage, differential diagnosis, clinical summarization, ambient scribe, voice interaction, and multimodal (imaging) tasks. Scenarios are sourced through a clinician-authored pipeline grounded in international clinical standards.

Canada

  • CCFP/LMCC alignment. Mapped to CCFP SOO/SAMP and LMCC competency domains.
  • CanMEDS competency mapping. Tagged with relevant CanMEDS roles for analysis across competency dimensions.
  • CTAS triage protocol. Triage scenarios follow the Canadian Triage and Acuity Scale.

United States

  • ESI triage protocol alongside CTAS.
  • USMLE-aligned knowledge benchmarks via MedQA.
  • AHA/ACC/ACEP guidelines referenced where applicable.

Adversarial Patient Taxonomy (planned)

Adopting the 5-dimension parametric taxonomy from MedDialBench [16]: Logic Consistency, Health Cognition, Expression Style, Disclosure, Attitude. Each dimension is graded for dose-response analysis, allowing isolation of which adversarial patient behaviors most degrade model safety.

10. Validation Landscape

BetterHealthBench is built with awareness of how the industry evaluates deployed clinical AI tools.

  • Abridge — JAMA Network Open 2025 multi-site QI study. Clinician burnout decreased from 51.9% to 38.8%, with 30 min/day time savings [13].
  • Nuance DAX Copilot — NEJM AI 2025 longitudinal study (112 clinicians). Primary endpoints not statistically significant [14].
  • Hippocratic AI RWE-LLM — 6,234 clinicians, 307,038 evaluations. Largest-scale RWE study for LLM-based clinical systems [12].
  • Google Med-PaLM 2 — Nature 2023 physician panel evaluation, validating LLM-as-judge approaches grounded in clinical rubrics [15].
  • NOHARM / ARISE Network — 100 real primary care cases, 31 LLMs, 29 specialists, 12,747 annotations. Best models still make 12–15 severe errors per 100 cases [20].
  • STELLA / Atella — peer-reviewed quantification of per-turn safety decay (+0.3%/turn harmful, +0.7%/turn benefit loss) across 5 frontier chatbots [19].
  • MedAgentBench — NEJM AI publication of 300 FHIR agentic tasks in a virtual EHR environment [21].

11. Statistical Methods

We use Welch's t-test (not Student's t-test) for comparing model scores because it does not assume equal variance between groups. Significance threshold is p < 0.05; effect size reported via Cohen's d. Bonferroni correction is applied for multiple comparisons. Bootstrap confidence intervals (1,000 resamples) are used for reliability metrics.

For drift detection, paired bootstrap is used for version-over-version comparison. CUSUM / statistical process control is planned to detect slow consistent drift that no individual delta would trip.

12. Longitudinal Drift Detection

Models update silently. A model that was safe last month may not be safe today. Drift detection uses the Kolmogorov-Smirnov test (for distributional shifts) and the Mann-Whitney U test (for median score changes). A drift alert triggers when a model's score changes by more than 5% between evaluation runs (confirmed by statistical testing). See /tracking for the live drift timeline visualization.

Continue to References →