For a non-technical overview: View plain-English version →
The rating agency for healthcare frontier models, ambient scribes, clinical triage tools, clinical decision support, and patient-facing AI.

Independent, continuous evaluation of clinical AI. Does the tool perform safely on real healthcare workflows? Has that performance changed? We measure it. We publish it.

Curious about the methodology? Read the protocol →

Live leaderboard

Top models by overall score across ~400 multi-turn clinical scenarios. Updated monthly; worst-of-10 reliability shown alongside the headline score.

View full leaderboard →
# | Model | Overall
1 | Claude 3.5 Sonnet | 0.87
2 | Nuance DAX | 0.85
3 | GPT-4o | 0.85
4 | MedGemma | 0.84
5 | Heidi Health | 0.84
6 | OpenEvidence | 0.83

245 scenarios · harness v2.1

34 benchmarks
~400 clinical scenarios
40 criteria per conversation
10 runs per scenario (we report the worst)

Every regulated industry has independent evaluation.
Healthcare AI has none.

Models score at physician level on medical exams but fail on the workflow tasks deployment actually requires. Static benchmarks measure knowledge recall. Clinical safety depends on judgment, information gathering, and knowing when to escalate. The measurement problem is as important as the modeling problem.

Why now

Regulatory mandates, patient safety alerts, and litigation are converging. The gap between AI deployment velocity and evaluation infrastructure is now a measurable risk — for vendors, health systems, and payers.

1,300+

FDA-authorized AI devices

295 in 2025 — a record. Device population growing faster than evaluation infrastructure.

#1

ECRI health tech hazard (2026)

Unvalidated AI chatbots: "not regulated as medical devices nor validated for healthcare purposes, yet increasingly used by clinicians."

84%

Health systems with governance committees — but no evaluation tools

CHIME survey (Dec 2025, n=51): CIO and CMIO representation exists, but committees lack tools to evaluate what they're procuring.

3+

Major class actions over AI-driven care decisions

Lokken v. UnitedHealth (D. Minn.), Kisting-Leung v. Cigna (C.D. Cal.), Barrows v. Humana — all proceeding. Sharp HealthCare sued over unconsented AI scribe recordings. 60+ AI device recalls in the past year, 43% within 12 months of clearance.

RES. 226

AMA: independent verification required

June 2025: algorithms must be verified by independent parties, not developers. AMA also released an 8-step governance toolkit for vendor evaluation.

2025–26

FDA + Health Canada validation mandates

Independent testing, bias analysis, and change control plans now required for AI devices. California AB 2013 (eff. Jan 2026) adds training data disclosure.

Sources: FDA AI-Enabled Device Database (2025) · ECRI Top 10 Health Technology Hazards & Patient Safety Concerns (2026) · CHIME Foundation AI Governance Survey (Dec 2025, n=51) · AMA House of Delegates Res. 226 (June 2025) · Health Canada MLMD Pre-Market Guidance (Feb 2025) · Georgetown Health Care Litigation Tracker (2026) · JMIR Med. Informatics, AI/ML Device Recalls (2025) · California AB 2013 (eff. Jan 2026).

How it works

1. Submit: model endpoint or product API. No prompts shared. No special treatment.
2. Run: K=10 across ~400 scenarios. Multi-turn, fixed scaffolding, adversarial subset.
3. Score: regex + LLM-as-judge. Worst-of-K reliability + safety gate.
4. Publish: monthly signed reports. Tamper-evident · drift-tracked · public.

1. Submit your model endpoint.

Centralized execution under identical conditions. Vendors never see the test prompts. No model gets special treatment. The evaluation suite is tailored to what your tool does — a scribe gets scribe benchmarks, a CDS tool gets triage and diagnosis.

2. Run ~400 clinical scenarios.

Multi-turn triage, differential diagnosis, clinical summarization, ambient scribe evaluation, and adversarial safety cases, all scored against physician-defined rubric criteria.

3. Score every response.

Dual scoring: transparent regex for reproducibility, LLM-as-judge for nuance. Worst-of-K reliability because a single critical failure at 3am matters more than a high average.
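
The scoring idea above can be sketched in a few lines. This is a minimal illustration, not the production harness: the function names, the pattern-matching rubric, and the choice to average worst-of-K across scenarios are all assumptions made for the example, and the LLM-as-judge half of dual scoring is left out entirely.

```python
import re
from statistics import mean


def regex_score(response: str, required_patterns: list[str]) -> float:
    """Fraction of required patterns found in the response (case-insensitive).

    Hypothetical stand-in for the transparent regex half of dual scoring.
    """
    if not required_patterns:
        return 1.0
    hits = sum(bool(re.search(p, response, re.IGNORECASE)) for p in required_patterns)
    return hits / len(required_patterns)


def worst_of_k(run_scores: list[float]) -> float:
    """Reliability score for one scenario: the minimum over its K repeated runs."""
    if not run_scores:
        raise ValueError("need at least one run")
    return min(run_scores)


def summarize(scenario_runs: dict[str, list[float]]) -> dict[str, float]:
    """Headline mean next to worst-of-K, each averaged across scenarios.

    Averaging across scenarios is an assumption of this sketch.
    """
    return {
        "mean": mean(mean(runs) for runs in scenario_runs.values()),
        "worst_of_k": mean(worst_of_k(runs) for runs in scenario_runs.values()),
    }
```

A scenario that scores 1.0 on nine runs and 0.2 on the tenth still averages 0.92, while its worst-of-K score is 0.2. That gap is why the two numbers are shown side by side.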

4. Publish results monthly.

Signed, tamper-evident reports. Longitudinal tracking across model versions. Drift detection catches silent regressions before patients do.
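
One common way to make a report series tamper-evident is a hash chain, where each report's digest covers the digest of the report before it. The sketch below illustrates that general technique and is not a description of the actual signing scheme: the field names are hypothetical, and it uses a symmetric HMAC from the Python standard library for brevity, whereas public reports would more plausibly use an asymmetric signature so that anyone can verify without holding the key.

```python
import hashlib
import hmac
import json


def sign_report(report: dict, prev_digest: str, key: bytes) -> dict:
    """Chain a report to its predecessor and sign the resulting digest."""
    # Canonical JSON so the same report always hashes to the same bytes.
    payload = json.dumps(report, sort_keys=True, separators=(",", ":")).encode()
    digest = hashlib.sha256(prev_digest.encode() + payload).hexdigest()
    signature = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"report": report, "prev": prev_digest, "digest": digest, "signature": signature}


def verify_chain(entries: list[dict], key: bytes, genesis: str = "") -> bool:
    """Recompute every digest and signature; any edit to any report breaks the chain."""
    prev = genesis
    for entry in entries:
        expected = sign_report(entry["report"], prev, key)
        if entry["prev"] != prev or entry["digest"] != expected["digest"]:
            return False
        if not hmac.compare_digest(entry["signature"], expected["signature"]):
            return False
        prev = entry["digest"]
    return True
```

Because each digest depends on the previous one, changing an old report invalidates every later entry, so a silent retroactive edit is detectable.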

How we compare

Compared against the three benchmarks that matter most right now: the de facto standard (HealthBench), the best-funded platform (Qualified Health), and the emerging academic kingmaker (ARISE). We include our honest gaps at the bottom.

Capability | BHB | HealthBench (OpenAI) | Qualified Health ($125M platform) | ARISE (Stanford / Harvard)

Independence
Structurally independent (no vendor affiliation) | Yes | No | No | Yes
Publicly published methodology | Yes | Yes | No | Yes

What gets tested
Deployed clinical tools (not just raw models) | Yes | No | ~Partial | No
Evaluation layer declared (model vs product vs agent) | Yes | No | No | No
Multimodal scenarios (radiology, dermatology) | Yes | No | No | No

How it's tested
Multi-turn clinical conversations | Yes | Yes | ~Partial | Yes
Worst-of-K reliability (K=10) | Yes | No | No | No
Vendor-blind test set (prompts never shared) | Yes | No | ~Partial | Yes
Contamination detection (canary + n-gram + semantic) | Yes | No | No | No

Over time
Longitudinal drift detection (statistical tests) | Yes | No | ~Partial | No
Continuous re-evaluation on new model releases | Yes | No | ~Partial | No
Context-layer safety (persistent memory drift) | Yes | No | No | No

Where we're behind (honest gaps)
Peer-reviewed publication | No | Yes | No | Yes
Large-scale rubric criteria (5K+) | No | Yes | No | No
Deployed at health systems | No | No | Yes | No
Validated on real patient data | No | No | Yes | Yes

Legend: Yes · ~Partial (partial or private-only) · No. Qualified Health's capabilities are internal to their platform (not publicly accessible). ARISE data from NOHARM consortium (Jan 2026, 100 primary care cases, 31 LLMs, 29 specialists).
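
For readers wondering what the "canary + n-gram" part of contamination detection might look like, here is a rough sketch under stated assumptions. The function names, whitespace tokenization, and the default n-gram length of 8 are illustrative choices, not the harness's actual parameters, and the semantic check is not shown.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word-level n-grams in a text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_prompt: str, model_output: str, n: int = 8) -> float:
    """Share of the test prompt's n-grams that reappear verbatim in model output.

    A high value suggests the model may have seen the prompt during training.
    """
    prompt_grams = ngrams(test_prompt, n)
    if not prompt_grams:
        return 0.0
    return len(prompt_grams & ngrams(model_output, n)) / len(prompt_grams)


def contains_canary(model_output: str, canary: str) -> bool:
    """True if a unique marker string planted in the test set shows up in output."""
    return canary in model_output
```

A canary string is a unique marker embedded in test materials; if a model ever reproduces it, the test set has leaked into its training data.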

Neutrality is the product.

Structural independence by design, not intent. Vendors submit endpoints. They never see the prompts.

The database compounds.

Every evaluation deepens contamination detection and drift baselines. The 50th company benefits from the 49 before it.

The standard-setting flywheel.

Credibility earns inclusion, inclusion earns data, data earns authority. Authority earns more inclusion.

BetterHealthBench

Evaluation infrastructure for clinical AI, built to earn the trust it carries.