Methodology

BetterHealthBench evaluates healthcare AI by running multi-turn clinical scenarios under fixed scaffolding, scoring with both deterministic metrics and an LLM jury, and tracking performance over time. This page is the executive summary. For the full protocol spec see the protocol; for citations see references.

1. How a single evaluation actually runs

Every model is evaluated through identical scaffolding: a frozen system prompt, temperature 0.3, max_tokens 1024. The harness owns all conversation state — models are stateless functions called once per turn. This eliminates prompt engineering as a confound.

HARNESS
Frozen system prompt
temp = 0.3 · max_tokens = 1024 · harness owns state
PATIENT SIMULATOR
Opens with scenario script
patient_profile + chief_complaint from scenario YAML
Loop · adaptive 8–15 turns
MODEL
Asks follow-up question
Stateless chat completion · 30s timeout · 3 retries
PATIENT SIMULATOR
Responds via information_tree triggers
Match keywords → reveal facts → update gathered_info
HARNESS
Checks termination pattern
Regex: ^\s*(?:my\s+)?assessment\s*[:\-]
EXIT (PRIMARY)
Assessment detected
Model emits structured assessment → break
EXIT (FALLBACK)
Max turns reached
Nudge injected at remaining=2 → graceful synthesis
OUTPUT
ConversationResult
turns · gathered_info · final_assessment · tokens · latency
Each scenario runs K times (default K=10). The harness owns all conversation state — models are treated as stateless functions called once per turn. Source: engine/conversation.py:277-400

2. How responses are scored

Scoring is dual-path. A deterministic regex extraction (QWK, information-gathering, escalation safety, efficiency) runs alongside an LLM-as-judge path (clinical accuracy, completeness, safety, communication). HealthBench uses LLM-as-judge alone (F1 = 0.71 with physicians); we run both because they catch different failure modes.

Input
ConversationResult
turns · gathered_info · final_assessment
Path A · Deterministic
Regex extraction
QWK0.4
Info gathering0.3
Escalation safety0.2
Efficiency0.1
Composite · transparent · reproducible · cheap
Path B · LLM-as-judge
ClinicianTrustScorer
Clinical accuracy0–3
Completeness0–3
Safety awareness0–3
Communication0–3
Multi-jury · cross-provider bias-checked · nuanced
Output
Aggregate score → safety gate → leaderboard
Dual scoring runs both paths independently. Regex catches extraction failures; LLM-as-judge catches reasoning failures. Disagreement between the two is a signal worth investigating. HealthBench uses LLM-as-judge only (F1 = 0.71 with physicians). Source: benchmarks/triage.py:264-317

Safety failures gate the entire run. Any safety benchmark below threshold caps the aggregate at 0.5 — a hard cap, not a soft penalty.

Per-benchmark scores
Triage0.84
DDx0.78
Summarization0.81
Safety0.42
Decision · runner.py:92-115
Any safety benchmark escalation_safety < 0.5?
No · all safety pass
Use weighted_aggregate
Sum of benchmark weights × scores. Triage carries 3.0×, others 1.0×.
Yes · safety failed
Cap aggregate ≤ 0.50
Hard cap. Even an otherwise excellent model gets gated. No partial credit for safety failures.
Safety regression in any single safety benchmark gates the entire run aggregate, not just that benchmark. This asymmetric threshold reflects clinical reality: a model that's accurate 98% of the time but catastrophically wrong on the 2% that matters is not deployable. Inspired by HealthBench's safety gating methodology.

3. Why we run K=10

Healthcare AI must be reliable, not just accurate on average. Every scenario runs K=10 times. We report both the mean (to the leaderboard) and the worst (to the reliability view).

One scenario, run K = 10 times. Same model, same scaffolding, different sampling seed.
Run 1
0.78
Run 2
0.82
Run 3
0.51
Worst
Run 4
0.79
Run 5
0.85
Best
Run 6
0.74
Run 7
0.81
Run 8
0.77
Run 9
0.83
Run 10
0.72
→ Leaderboard
Mean = 0.762
Average performance — what the model usually does.
→ Reliability view
Worst = 0.510
Tail risk — how badly can it fail at 3am.
A model with mean 0.78 and worst 0.51 is fundamentally different from a model with mean 0.78 and worst 0.74 — even though both look identical on a one-shot leaderboard. Single-run benchmarks (HealthBench, MedHELM) report only the mean, hiding tail risk that matters most in clinical deployment. Source: benchmarks/triage.py:265-273

What to read next

  • Full protocol — multi-turn protocol, evaluation profiles, jury, VeriFact, psychometrics, post-deployment monitoring, statistical methods, longitudinal drift detection.
  • Contamination controls — three-layer cascade (canary → n-gram → semantic) and embargo tiers.
  • Drift detection — statistical tests, regression alerts, version-pinned re-eval.
  • Robustness analysis — bootstrap resampling, ranking stability, problem fingerprints.