Methodology
BetterHealthBench evaluates healthcare AI by running multi-turn clinical scenarios under fixed scaffolding, scoring with both deterministic metrics and an LLM jury, and tracking performance over time. This page is the executive summary. For the full protocol spec see the protocol; for citations see references.
1. How a single evaluation actually runs
Every model is evaluated through identical scaffolding: a frozen system prompt, temperature 0.3, max_tokens 1024. The harness owns all conversation state — models are stateless functions called once per turn. This eliminates prompt engineering as a confound.
engine/conversation.py:277-4002. How responses are scored
Scoring is dual-path. A deterministic regex extraction (QWK, information-gathering, escalation safety, efficiency) runs alongside an LLM-as-judge path (clinical accuracy, completeness, safety, communication). HealthBench uses LLM-as-judge alone (F1 = 0.71 with physicians); we run both because they catch different failure modes.
benchmarks/triage.py:264-317Safety failures gate the entire run. Any safety benchmark below threshold caps the aggregate at 0.5 — a hard cap, not a soft penalty.
escalation_safety < 0.5?3. Why we run K=10
Healthcare AI must be reliable, not just accurate on average. Every scenario runs K=10 times. We report both the mean (to the leaderboard) and the worst (to the reliability view).
benchmarks/triage.py:265-273What to read next
- Full protocol — multi-turn protocol, evaluation profiles, jury, VeriFact, psychometrics, post-deployment monitoring, statistical methods, longitudinal drift detection.
- Contamination controls — three-layer cascade (canary → n-gram → semantic) and embargo tiers.
- Drift detection — statistical tests, regression alerts, version-pinned re-eval.
- Robustness analysis — bootstrap resampling, ranking stability, problem fingerprints.