Methodology

BetterHealthBench evaluates healthcare AI by running multi-turn clinical scenarios under fixed scaffolding, scoring with both deterministic metrics and an LLM jury, and tracking performance over time. This page is the executive summary. For the full protocol spec see the protocol; for citations see references.

1. How a single evaluation actually runs

Every model is evaluated through identical scaffolding: a frozen system prompt, temperature 0.3, max_tokens 1024. The harness owns all conversation state — models are stateless functions called once per turn. This eliminates prompt engineering as a confound.

HARNESS

Frozen system prompt

temp = 0.3 · max_tokens = 1024 · harness owns state

PATIENT SIMULATOR

Opens with scenario script

patient_profile + chief_complaint from scenario YAML

Loop · adaptive 8–15 turns

MODEL

Asks follow-up question

Stateless chat completion · 30s timeout · 3 retries

PATIENT SIMULATOR

Responds via information_tree triggers

Match keywords → reveal facts → update gathered_info

HARNESS

Checks termination pattern

Regex: ^\s*(?:my\s+)?assessment\s*[:\-]

EXIT (PRIMARY)

Assessment detected

Model emits structured assessment → break

EXIT (FALLBACK)

Max turns reached

Nudge injected at remaining=2 → graceful synthesis

OUTPUT

ConversationResult

turns · gathered_info · final_assessment · tokens · latency

Each scenario runs K times (default K=10). The harness owns all conversation state — models are treated as stateless functions called once per turn. Source: engine/conversation.py:277-400

2. How responses are scored

Scoring is dual-path. A deterministic regex extraction (QWK, information-gathering, escalation safety, efficiency) runs alongside an LLM-as-judge path (clinical accuracy, completeness, safety, communication). HealthBench uses LLM-as-judge alone (F1 = 0.71 with physicians); we run both because they catch different failure modes.

Input

ConversationResult

turns · gathered_info · final_assessment

Path A · Deterministic

Regex extraction

QWK0.4

Info gathering0.3

Escalation safety0.2

Efficiency0.1

Composite · transparent · reproducible · cheap

Path B · LLM-as-judge

ClinicianTrustScorer

Clinical accuracy0–3

Completeness0–3

Safety awareness0–3

Communication0–3

Multi-jury · cross-provider bias-checked · nuanced

Output

Aggregate score → safety gate → leaderboard

Dual scoring runs both paths independently. Regex catches extraction failures; LLM-as-judge catches reasoning failures. Disagreement between the two is a signal worth investigating. HealthBench uses LLM-as-judge only (F1 = 0.71 with physicians). Source: benchmarks/triage.py:264-317

Safety failures gate the entire run. Any safety benchmark below threshold caps the aggregate at 0.5 — a hard cap, not a soft penalty.

Per-benchmark scores

Triage0.84

DDx0.78

Summarization0.81

Safety0.42

Decision · runner.py:92-115

Any safety benchmark escalation_safety < 0.5?

No · all safety pass

Use weighted_aggregate

Sum of benchmark weights × scores. Triage carries 3.0×, others 1.0×.

Yes · safety failed

Cap aggregate ≤ 0.50

Hard cap. Even an otherwise excellent model gets gated. No partial credit for safety failures.

Safety regression in any single safety benchmark gates the entire run aggregate, not just that benchmark. This asymmetric threshold reflects clinical reality: a model that's accurate 98% of the time but catastrophically wrong on the 2% that matters is not deployable. Inspired by HealthBench's safety gating methodology.

3. Why we run K=10

Healthcare AI must be reliable, not just accurate on average. Every scenario runs K=10 times. We report both the mean (to the leaderboard) and the worst (to the reliability view).

One scenario, run K = 10 times. Same model, same scaffolding, different sampling seed.

Run 1

0.78

Run 2

0.82

Run 3

0.51

Worst

Run 4

0.79

Run 5

0.85

Best

Run 6

0.74

Run 7

0.81

Run 8

0.77

Run 9

0.83

Run 10

0.72

→ Leaderboard

Mean = 0.762

Average performance — what the model usually does.

→ Reliability view

Worst = 0.510

Tail risk — how badly can it fail at 3am.

A model with mean 0.78 and worst 0.51 is fundamentally different from a model with mean 0.78 and worst 0.74 — even though both look identical on a one-shot leaderboard. Single-run benchmarks (HealthBench, MedHELM) report only the mean, hiding tail risk that matters most in clinical deployment. Source: benchmarks/triage.py:265-273

Methodology

1. How a single evaluation actually runs

2. How responses are scored

3. Why we run K=10

What to read next