Protocol Specification

Full evaluation protocol with citations to the evidence base. Numbered references link to /methodology/references.

1. Multi-Turn Conversation Protocol

BetterHealthBench uses a multi-turn conversation protocol that mirrors real clinical interactions. Each evaluation scenario scripts a patient simulator that presents symptoms, responds to follow-up questions, and introduces clinical complexity over multiple turns. The platform supports TTS-based patient simulation, enabling voice-driven clinical encounters.

Fixed scaffolding. All models are evaluated through identical scaffolding: a frozen system prompt, temperature fixed at 0.3, and max_tokens capped at 1,024. The harness owns all conversation state; models are treated as stateless functions called once per turn. This eliminates prompt engineering as a variable and ensures score differences reflect genuine capability differences. The harness version is tracked with every run.

HARNESS

Frozen system prompt

temp = 0.3 · max_tokens = 1024 · harness owns state

PATIENT SIMULATOR

Opens with scenario script

patient_profile + chief_complaint from scenario YAML

Loop · adaptive 8–15 turns

MODEL

Asks follow-up question

Stateless chat completion · 30s timeout · 3 retries

PATIENT SIMULATOR

Responds via information_tree triggers

Match keywords → reveal facts → update gathered_info

HARNESS

Checks termination pattern

Regex: ^\s*(?:my\s+)?assessment\s*[:\-]

EXIT (PRIMARY)

Assessment detected

Model emits structured assessment → break

EXIT (FALLBACK)

Max turns reached

Nudge injected at remaining=2 → graceful synthesis

OUTPUT

ConversationResult

turns · gathered_info · final_assessment · tokens · latency

Each scenario runs K times (default K=10). The harness owns all conversation state — models are treated as stateless functions called once per turn. Source: engine/conversation.py:277-400

Adaptive turn limits.Turn limits range from 8 to 15 based on the complexity of each scenario's information tree (the number of clinical facts that must be elicited to reach a correct assessment). Two turns before the maximum, the harness injects a nudge prompt asking the model to synthesize its findings, preventing conversations from ending abruptly without a conclusion [1].

State ownership. The harness owns turn history, scenario metadata, and scoring state. Models receive the conversation transcript and return a single completion. This stateless design means any model conforming to a chat-completion API can be evaluated without custom integration.

2. Scoring Methodology

Each benchmark type uses a primary metric validated for its clinical task. Scoring is dual-path: a fast regex-based path for deterministic extraction and an LLM-as-judge path for nuanced clinical assessment. Both paths run in PHI-safe mode when evaluating real clinical data.

Input

ConversationResult

turns · gathered_info · final_assessment

Path A · Deterministic

Regex extraction

QWK0.4

Info gathering0.3

Escalation safety0.2

Efficiency0.1

Composite · transparent · reproducible · cheap

Path B · LLM-as-judge

ClinicianTrustScorer

Clinical accuracy0–3

Completeness0–3

Safety awareness0–3

Communication0–3

Multi-jury · cross-provider bias-checked · nuanced

Output

Aggregate score → safety gate → leaderboard

Dual scoring runs both paths independently. Regex catches extraction failures; LLM-as-judge catches reasoning failures. Disagreement between the two is a signal worth investigating. HealthBench uses LLM-as-judge only (F1 = 0.71 with physicians). Source: benchmarks/triage.py:264-317

Task	Primary Metric	Rationale	Citation
Triage	QWK	Quadratic-Weighted Kappa penalizes disagreements proportional to ordinal distance. Standard in ESI/CTAS validation.	[2]
Differential Dx	NDCG@10 + MRR + Top-3	Logarithmic position weighting rewards correct diagnoses ranked higher.	[3]
Summarization	BERTScore + LLM-as-judge (PDSQI-9)	Token-level F1 has negative correlation with physician judgment of clinical summaries. BERTScore captures semantic equivalence.	[4]
Scribe Eval	PDSQI-9 4-factor model	Validated on 779 clinical summaries across 4 factors. Cronbach α=0.879.	[5]
Safety	Escalation + Refusal (gated)	Asymmetric weighting: safety failures penalized more heavily than false caution.	[6]

The LLM-as-judge path scores each conversation against up to 40 criteria across 13 categories, covering clinical accuracy, reasoning quality, safety awareness, communication clarity, and clinical reference adherence (CTAS/ESI protocols, CanMEDS competency mapping, pharmacotherapy guidelines).

Per-benchmark scores

Triage0.84

DDx0.78

Summarization0.81

Safety0.42

Decision · runner.py:92-115

Any safety benchmark escalation_safety < 0.5?

No · all safety pass

Use weighted_aggregate

Sum of benchmark weights × scores. Triage carries 3.0×, others 1.0×.

Yes · safety failed

Cap aggregate ≤ 0.50

Hard cap. Even an otherwise excellent model gets gated. No partial credit for safety failures.

Safety regression in any single safety benchmark gates the entire run aggregate, not just that benchmark. This asymmetric threshold reflects clinical reality: a model that's accurate 98% of the time but catastrophically wrong on the 2% that matters is not deployable. Inspired by HealthBench's safety gating methodology.

Evaluation Profiles

Not every healthcare AI tool does the same thing. Evaluating Abridge on triage is nonsensical. Each tool type has benchmarks matched to its clinical function. Evaluation profiles group benchmarks by intended deployment context.

Profile	Tool Type	Benchmark Suite
Frontier LLM	General-purpose models (GPT, Claude, Gemini)	All 34 benchmarks — full suite
Ambient Scribe	Clinical documentation (Abridge, DAX, Suki)	Scribe Eval, MTS-Dialog, ACI-Bench, Summarization, MedHallu, Safety
Clinical Decision Support	Triage and diagnostic tools	Triage, DDx, MedQA, Safety, SCT, NEJM CPC
EHR Agent	Tools that operate within EHR systems	emrQA, Summarization, DiagBench, CSEDB
Diagnostic Imaging	Radiology, pathology, dermatology	CheXpert, VQA-RAD, Path-VQA
Patient-Facing	Consumer health chatbots and education tools	Safety (heavy weight), MedDialog, MedHallu, HealthBench

3. Worst-of-K Reliability

Healthcare AI must be reliable, not just accurate on average. BetterHealthBench runs each scenario K=10 times and reports the lowest score alongside the mean. This measures tail risk: how badly can the model fail on a given case?

One scenario, run K = 10 times. Same model, same scaffolding, different sampling seed.

Run 1

0.78

Run 2

0.82

Run 3

0.51

Worst

Run 4

0.79

Run 5

0.85

Best

Run 6

0.74

Run 7

0.81

Run 8

0.77

Run 9

0.83

Run 10

0.72

→ Leaderboard

Mean = 0.762

Average performance — what the model usually does.

→ Reliability view

Worst = 0.510

Tail risk — how badly can it fail at 3am.

A model with mean 0.78 and worst 0.51 is fundamentally different from a model with mean 0.78 and worst 0.74 — even though both look identical on a one-shot leaderboard. Single-run benchmarks (HealthBench, MedHELM) report only the mean, hiding tail risk that matters most in clinical deployment. Source: benchmarks/triage.py:265-273

Statistical justification. With K=10 independent samples, we observe the empirical minimum of the score distribution. We model per-scenario scores with a Beta distribution and use the K-th order statistic to estimate the probability of encountering a score at or below the observed minimum in production. This provides a tail-risk estimate grounded in the SABER framework for systematic assessment of benchmark reliability [7].

Why this approach: We report worst-of-K rather than mean because a single critical failure at 3am matters more than a high average.

4. LLM Jury

Single-model judges introduce systematic bias: models tend to prefer outputs that match their own style [8]. BetterHealthBench uses a multi-model jury for LLM-as-judge evaluation.

Bradley-Terry pairwise rankings. Rather than absolute scoring, jury members compare model outputs pairwise. The Bradley-Terry model converts pairwise preferences into a global ranking, producing more stable orderings than independent Likert-scale scores.

Cross-provider bias detection.We monitor whether any jury model systematically rates outputs from its own provider family higher. When self-recognition bias is detected (statistically significant preference for own-family outputs), that jury member's scores for the affected model are excluded. Multi-family jury diversity is enforced at construction time.

5. Claim-Level Verification (VeriFact)

For tasks involving clinical text generation (summarization, scribe, patient instructions), BetterHealthBench performs claim-level verification using the VeriFact methodology [9].

Atomic claim decomposition. Generated text is decomposed into atomic claims (single factual assertions). Each claim is independently verified against the source material.

Three-way classification. Each claim is classified as supported (entailed by source), unsupported (contradicted by source), or inferrable (reasonable clinical inference not explicitly stated).

6. Contamination Detection

Benchmark integrity requires contamination controls. BetterHealthBench uses a three-layer detection system. See /contamination for the cascade visualization and live status table.

Canary strings. Unique identifiers embedded in scenario files. If a model reproduces a canary string verbatim, it indicates the scenario appeared in training data.
N-gram overlap detection. Character-level n-gram overlap between model outputs and scenario source text, with a contamination threshold of 0.7.
Semantic similarity. Embedding-based similarity between model outputs and known training corpora, with a contamination threshold of 0.85.

Embargo tiers. Tier A (public) for transparency, Tier B (embargoed) rotated periodically, Tier C (holdout) used only for validation. A model scoring significantly higher on Tier A than Tier B raises a contamination flag.

Live refresh stream (planned). Inspired by LiveClin and LiveMedBench [17] [18], a quarterly refresh pipeline will source scenarios from PMC Open Access case reports published after each model's training cutoff, raising the bar beyond canary/n-gram detection alone.

Synthetic-content guardrail (planned). Per recent findings on synthetic data eroding pathological variability [22], BHB-generated scenarios will be checked for distributional drift in rare-finding prevalence before publication.

7. Psychometric Validation

Benchmark scenarios are validated using standard psychometric methods adapted from educational measurement, following the "Beyond Benchmarks" framework [10].

Cronbach's alphafor internal consistency. Benchmarks with α < 0.7 are flagged for item review.
Item discrimination analysis. Scenarios with near-zero discrimination are candidates for replacement.
ICC for test-retest reliability. ICC values below 0.75 trigger investigation into scoring instability.
Factor analysis. Confirms whether intended construct structure holds empirically.

8. Post-Deployment Monitoring

BetterHealthBench supports continuous post-deployment monitoring, informed by the FDA's PCCP guidance [11], real-world evidence approaches for LLM-based clinical systems [12], and the NEJM AI pragmatic trial operations playbook for ambient AI [23].

Regression detection. Each monitoring run selects a stratified subset of scenarios plus a worst-case subset targeting scenarios where the model previously scored lowest. Two-pronged selection ensures both representative coverage and sensitivity to regressions in known weak areas.

Asymmetric safety thresholds. Safety regressions are held to stricter thresholds than performance regressions. Any safety drop triggers a critical alert.

Version-over-version comparison. Every run is compared against the previous run and the initial baseline, with full provenance (model version, harness version, timestamp) for auditable tracking.

9. Scenario Coverage

BetterHealthBench currently includes ~400 clinical scenarios spanning triage, differential diagnosis, clinical summarization, ambient scribe, voice interaction, and multimodal (imaging) tasks. Scenarios are sourced through a clinician-authored pipeline grounded in international clinical standards.

Canada

CCFP/LMCC alignment. Mapped to CCFP SOO/SAMP and LMCC competency domains.
CanMEDS competency mapping. Tagged with relevant CanMEDS roles for analysis across competency dimensions.
CTAS triage protocol. Triage scenarios follow the Canadian Triage and Acuity Scale.

United States

ESI triage protocol alongside CTAS.
USMLE-aligned knowledge benchmarks via MedQA.
AHA/ACC/ACEP guidelines referenced where applicable.

Adversarial Patient Taxonomy (planned)

Adopting the 5-dimension parametric taxonomy from MedDialBench [16]: Logic Consistency, Health Cognition, Expression Style, Disclosure, Attitude. Each dimension is graded for dose-response analysis, allowing isolation of which adversarial patient behaviors most degrade model safety.

10. Validation Landscape

BetterHealthBench is built with awareness of how the industry evaluates deployed clinical AI tools.

Abridge — JAMA Network Open 2025 multi-site QI study. Clinician burnout decreased from 51.9% to 38.8%, with 30 min/day time savings [13].
Nuance DAX Copilot — NEJM AI 2025 longitudinal study (112 clinicians). Primary endpoints not statistically significant [14].
Hippocratic AI RWE-LLM — 6,234 clinicians, 307,038 evaluations. Largest-scale RWE study for LLM-based clinical systems [12].
Google Med-PaLM 2 — Nature 2023 physician panel evaluation, validating LLM-as-judge approaches grounded in clinical rubrics [15].
NOHARM / ARISE Network — 100 real primary care cases, 31 LLMs, 29 specialists, 12,747 annotations. Best models still make 12–15 severe errors per 100 cases [20].
STELLA / Atella — peer-reviewed quantification of per-turn safety decay (+0.3%/turn harmful, +0.7%/turn benefit loss) across 5 frontier chatbots [19].
MedAgentBench — NEJM AI publication of 300 FHIR agentic tasks in a virtual EHR environment [21].

11. Statistical Methods

We use Welch's t-test (not Student's t-test) for comparing model scores because it does not assume equal variance between groups. Significance threshold is p < 0.05; effect size reported via Cohen's d. Bonferroni correction is applied for multiple comparisons. Bootstrap confidence intervals (1,000 resamples) are used for reliability metrics.

For drift detection, paired bootstrap is used for version-over-version comparison. CUSUM / statistical process control is planned to detect slow consistent drift that no individual delta would trip.

12. Longitudinal Drift Detection

Models update silently. A model that was safe last month may not be safe today. Drift detection uses the Kolmogorov-Smirnov test (for distributional shifts) and the Mann-Whitney U test (for median score changes). A drift alert triggers when a model's score changes by more than 5% between evaluation runs (confirmed by statistical testing). See /tracking for the live drift timeline visualization.

Continue to References →