BetterHealthBench

For a non-technical overview: View plain-English version →

The rating agency for healthcare frontier models, ambient scribes, clinical triage tools, clinical decision support, and patient-facing AI.

Independent, continuous evaluation of clinical AI. Does the tool perform safely on real healthcare workflows? Has that performance changed? We measure it. We publish it.

View Leaderboard Submit a Model

Curious about the methodology? Read the protocol →

Live leaderboard

Top models by overall score across ~400 multi-turn clinical scenarios. Updated monthly; worst-of-10 reliability shown alongside the headline score.

View full leaderboard →

#	Model	Provider	Overall	Safety	Worst-of-10
1	Claude 3.5 Sonnet	Anthropic	0.87	1.00	0.85
2	Nuance DAX	Microsoft/Nuance	0.85	0.97	0.81
3	GPT-4o	OpenAI	0.85	0.98	0.86
4	MedGemma	Google DeepMind	0.84	0.96	0.82
5	Heidi Health	Heidi Health	0.84	0.91	0.79
6	OpenEvidence	OpenEvidence	0.83	0.93	0.73

245 scenarios · harness v2.1

benchmarks

~400 clinical scenarios

criteria per conversation

10 runs per scenario; we report the worst

Every regulated industry has independent evaluation.
Healthcare AI has none.

Models score at physician level on medical exams but fail on the workflow tasks deployment actually requires. Static benchmarks measure knowledge recall. Clinical safety depends on judgment, information gathering, and knowing when to escalate. The measurement problem is as important as the modeling problem.

Why now

Regulatory mandates, patient safety alerts, and litigation are converging. The gap between AI deployment velocity and evaluation infrastructure is now a measurable risk — for vendors, health systems, and payers.

1,300+

FDA-authorized AI devices

295 in 2025 — a record. Device population growing faster than evaluation infrastructure.

ECRI health tech hazard (2026)

Unvalidated AI chatbots: "not regulated as medical devices nor validated for healthcare purposes, yet increasingly used by clinicians."

84%

Health systems with governance committees — but no evaluation tools

CHIME survey (Dec 2025, n=51): CIO and CMIO representation exists, but committees lack tools to evaluate what they're procuring.

Major class actions over AI-driven care decisions

Lokken v. UnitedHealth (D. Minn.), Kisting-Leung v. Cigna (C.D. Cal.), Barrows v. Humana — all proceeding. Sharp HealthCare sued over unconsented AI scribe recordings. 60+ AI device recalls in the past year, 43% within 12 months of clearance.

RES. 226

AMA: independent verification required

June 2025: algorithms must be verified by independent parties, not developers. AMA also released an 8-step governance toolkit for vendor evaluation.

2025–26

FDA + Health Canada validation mandates

Independent testing, bias analysis, and change control plans now required for AI devices. California AB 2013 (eff. Jan 2026) adds training data disclosure.

Sources: FDA AI-Enabled Device Database (2025) · ECRI Top 10 Health Technology Hazards & Patient Safety Concerns (2026) · CHIME Foundation AI Governance Survey (Dec 2025, n=51) · AMA House of Delegates Res. 226 (June 2025) · Health Canada MLMD Pre-Market Guidance (Feb 2025) · Georgetown Health Care Litigation Tracker (2026) · JMIR Med. Informatics, AI/ML Device Recalls (2025) · California AB 2013 (eff. Jan 2026).

How it works

Submit

Model endpoint or product API

No prompts shared. No special treatment.

Run

K=10 across ~400 scenarios

Multi-turn, fixed scaffolding, adversarial subset.

Score

Regex + LLM-as-judge

Worst-of-K reliability + safety gate.

Publish

Monthly signed reports

Tamper-evident · drift-tracked · public.

1. Submit your model endpoint.

Centralized execution under identical conditions. Vendors never see the test prompts. No model gets special treatment. The evaluation suite is tailored to what your tool does — a scribe gets scribe benchmarks, a CDS tool gets triage and diagnosis.

2. Run ~400 clinical scenarios.

Multi-turn triage, differential diagnosis, clinical summarization, ambient scribe evaluation, and adversarial safety cases, all scored against physician-defined rubric criteria.

3. Score every response.

Dual scoring: transparent regex for reproducibility, LLM-as-judge for nuance. Worst-of-K reliability because a single critical failure at 3am matters more than a high average.
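As a rough illustration of the regex half of this scoring, here is a minimal Python sketch that checks each run against rubric criteria, then reports the worst of K runs alongside the mean. The `Criterion` schema, the example patterns, and the rule that a single unsafe run fails the safety gate are assumptions made for illustration, not the published harness; the LLM-as-judge half is omitted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric criterion checked by regex (hypothetical schema)."""
    name: str
    pattern: str
    safety_critical: bool = False


def score_run(response: str, criteria: list[Criterion]) -> tuple[float, bool]:
    """Return (fraction of criteria met, all safety-critical criteria met)."""
    met = [bool(re.search(c.pattern, response, re.IGNORECASE)) for c in criteria]
    fraction = sum(met) / len(criteria)
    safe = all(m for m, c in zip(met, criteria) if c.safety_critical)
    return fraction, safe


def worst_of_k(responses: list[str], criteria: list[Criterion]) -> dict:
    """Aggregate K runs of one scenario into mean, worst score, and safety gate."""
    results = [score_run(r, criteria) for r in responses]
    scores = [s for s, _ in results]
    return {
        "mean": sum(scores) / len(scores),
        "worst": min(scores),
        # Assumed gate: one unsafe run fails the whole scenario.
        "safety_pass": all(safe for _, safe in results),
    }


# Illustrative criteria and responses (K=2 here to keep the example short).
criteria = [
    Criterion("asks_duration", r"how long|since when"),
    Criterion("escalates", r"emergency|call 911", safety_critical=True),
]
responses = [
    "How long have you had chest pain? Please call 911 now.",
    "How long have you had chest pain? Try resting for a while.",
]
summary = worst_of_k(responses, criteria)
print(summary)
```

In this toy case the mean is 0.75, but the worst run scores 0.5 and fails the safety gate, which is the kind of failure an average alone would hide.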

4. Publish results monthly.

Signed, tamper-evident reports. Longitudinal tracking across model versions. Drift detection catches silent regressions before patients do.
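One way drift detection between two model versions could work is a permutation test on per-scenario scores. The sketch below is an assumption about the general approach, not the statistical tests the harness actually runs; the function names, the `alpha` threshold, and the sample scores are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(baseline, current, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in mean score."""
    rng = random.Random(seed)
    observed = abs(mean(current) - mean(baseline))
    pooled = list(baseline) + list(current)
    n = len(baseline)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n:]) - mean(pooled[:n]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def detect_drift(baseline, current, alpha=0.01):
    """Flag drift when the score shift is unlikely under random relabeling."""
    p = permutation_p_value(baseline, current)
    return {
        "delta": mean(current) - mean(baseline),
        "p_value": p,
        "drift": p < alpha,
    }


# Hypothetical per-scenario scores for the previous and the new model version.
baseline = [0.86, 0.88, 0.85, 0.87, 0.89, 0.86, 0.88, 0.87, 0.85, 0.87]
current = [0.78, 0.80, 0.77, 0.79, 0.81, 0.78, 0.80, 0.79, 0.77, 0.79]
report = detect_drift(baseline, current)
print(report)
```

A permutation test makes no distributional assumption about scores, which suits bounded, often skewed benchmark metrics; a production system would likely also correct for repeated monthly comparisons.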

How we compare

We compare BetterHealthBench (BHB) against the three benchmarks that matter most right now: the de facto standard (HealthBench), the best-funded platform (Qualified Health), and the emerging academic kingmaker (ARISE). We include our honest gaps at the bottom.

Capability	BHB	HealthBench (OpenAI)	Qualified Health ($125M platform)	ARISE (Stanford / Harvard)
Independence
Structurally independent (no vendor affiliation)	Yes	No	No	Yes
Publicly published methodology	Yes	Yes	No	Yes
What gets tested
Deployed clinical tools (not just raw models)	Yes	No	~Partial	No
Evaluation layer declared (model vs product vs agent)	Yes	No	No	No
Multimodal scenarios (radiology, dermatology)	Yes	No	No	No
How it's tested
Multi-turn clinical conversations	Yes	Yes	~Partial	Yes
Worst-of-K reliability (K=10)	Yes	No	No	No
Vendor-blind test set (prompts never shared)	Yes	No	~Partial	Yes
Contamination detection (canary + n-gram + semantic)	Yes	No	No	No
Over time
Longitudinal drift detection (statistical tests)	Yes	No	~Partial	No
Continuous re-evaluation on new model releases	Yes	No	~Partial	No
Context-layer safety (persistent memory drift)	Yes	No	No	No
Where we're behind (honest gaps)
Peer-reviewed publication	No	Yes	No	Yes
Large-scale rubric criteria (5K+)	No	Yes	No	No
Deployed at health systems	No	No	Yes	No
Validated on real patient data	No	No	Yes	Yes

Yes = yes; ~Partial = partial or private-only; No = no. Qualified Health's capabilities are internal to their platform (not publicly accessible). ARISE data from NOHARM consortium (Jan 2026, 100 primary care cases, 31 LLMs, 29 specialists).
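The contamination-detection row names canary, n-gram, and semantic checks. As a hedged sketch of the first two only, the Python below flags a model output that reproduces a canary string or repeats a large share of a held-out scenario's 5-grams. The canary value, the 0.5 threshold, and the function names are hypothetical, and the semantic check is left out.

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(reference: str, output: str, n: int = 5) -> float:
    """Fraction of the reference's n-grams that reappear verbatim in the output."""
    ref_grams = ngrams(reference, n)
    if not ref_grams:
        return 0.0
    return len(ref_grams & ngrams(output, n)) / len(ref_grams)


def contamination_flags(reference: str, output: str, canary: str,
                        threshold: float = 0.5) -> dict:
    """Combine a canary-string check with an n-gram overlap check."""
    overlap = ngram_overlap(reference, output)
    leaked = canary in output
    return {
        "canary_leak": leaked,
        "ngram_overlap": overlap,
        "suspect": leaked or overlap >= threshold,
    }


# Hypothetical canary string and held-out scenario text.
CANARY = "bhb-canary-0000"
reference = ("a 54 year old presents with crushing chest pain "
             "radiating to the left arm")
clean = "I would start by asking about the onset and character of the pain"
leaky = "presents with crushing chest pain radiating to the left arm " + CANARY

print(contamination_flags(reference, clean, CANARY))
flags = contamination_flags(reference, leaky, CANARY)
print(flags)
```

Verbatim n-gram overlap only catches memorized text; paraphrased leakage is what a semantic check would be needed for.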

Neutrality is the product.

Structural independence by design, not intent. Vendors submit endpoints. They never see the prompts.

The database compounds.

Every evaluation deepens contamination detection and drift baselines. The 50th company benefits from the 49 before it.

The standard-setting flywheel.

Credibility earns inclusion, inclusion earns data, data earns authority. Authority earns more inclusion.

Evaluation infrastructure for clinical AI, built to earn the trust it carries.
