Independent, continuous evaluation of clinical AI. Does the tool perform safely on real healthcare workflows? Has that performance changed? We measure it. We publish it.
Curious about the methodology? Read the protocol →
Live leaderboard
Top models by overall score across ~400 multi-turn clinical scenarios. Updated monthly; worst-of-10 reliability shown alongside the headline score.
| # | Model | Overall |
|---|---|---|
| 1 | Claude 3.5 Sonnet | 0.87 |
| 2 | Nuance DAX | 0.85 |
| 3 | GPT-4o | 0.85 |
| 4 | MedGemma | 0.84 |
| 5 | Heidi Health | 0.84 |
| 6 | OpenEvidence | 0.83 |
245 scenarios · harness v2.1
Every regulated industry has independent evaluation.
Healthcare AI has none.
Models score at physician level on medical exams but fail on the workflow tasks deployment actually requires. Static benchmarks measure knowledge recall. Clinical safety depends on judgment, information gathering, and knowing when to escalate. The measurement problem is as important as the modeling problem.
Why now
Regulatory mandates, patient safety alerts, and litigation are converging. The gap between AI deployment velocity and evaluation infrastructure is now a measurable risk — for vendors, health systems, and payers.
1,300+
FDA-authorized AI devices
295 in 2025 — a record. Device population growing faster than evaluation infrastructure.
#1
ECRI health tech hazard (2026)
Unvalidated AI chatbots: "not regulated as medical devices nor validated for healthcare purposes, yet increasingly used by clinicians."
84%
Health systems with governance committees — but no evaluation tools
CHIME survey (Dec 2025, n=51): CIO and CMIO representation exists, but committees lack tools to evaluate what they're procuring.
3+
Major class actions over AI-driven care decisions
Lokken v. UnitedHealth (D. Minn.), Kisting-Leung v. Cigna (C.D. Cal.), Barrows v. Humana — all proceeding. Sharp HealthCare sued over unconsented AI scribe recordings. 60+ AI device recalls in the past year, 43% within 12 months of clearance.
RES. 226
AMA: independent verification required
June 2025: algorithms must be verified by independent parties, not developers. AMA also released an 8-step governance toolkit for vendor evaluation.
2025–26
FDA + Health Canada validation mandates
Independent testing, bias analysis, and change control plans now required for AI devices. California AB 2013 (eff. Jan 2026) adds training data disclosure.
Sources: FDA AI-Enabled Device Database (2025) · ECRI Top 10 Health Technology Hazards & Patient Safety Concerns (2026) · CHIME Foundation AI Governance Survey (Dec 2025, n=51) · AMA House of Delegates Res. 226 (June 2025) · Health Canada MLMD Pre-Market Guidance (Feb 2025) · Georgetown Health Care Litigation Tracker (2026) · JMIR Med. Informatics, AI/ML Device Recalls (2025) · California AB 2013 (eff. Jan 2026).
How it works
1. Submit your model endpoint.
Centralized execution under identical conditions. Vendors never see the test prompts. No model gets special treatment. The evaluation suite is tailored to what your tool does — a scribe gets scribe benchmarks, a CDS tool gets triage and diagnosis.
2. Run ~400 clinical scenarios.
Multi-turn triage, differential diagnosis, clinical summarization, ambient scribe evaluation, and adversarial safety cases, all scored against physician-defined rubric criteria.
3. Score every response.
Dual scoring: transparent regex for reproducibility, LLM-as-judge for nuance. Worst-of-K reliability because a single critical failure at 3am matters more than a high average.
4. Publish results monthly.
Signed, tamper-evident reports. Longitudinal tracking across model versions. Drift detection catches silent regressions before patients do.
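The scoring step can be illustrated with a minimal Python sketch. It is not the harness itself: the rubric patterns, the function names `regex_score` and `summarize`, and the choice to aggregate worst-of-K as the mean of per-scenario minima are all assumptions made for illustration. The LLM-as-judge half of dual scoring is omitted.

```python
import re
from statistics import mean

# Hypothetical rubric: each criterion is a regex a safe response should match.
# Real rubric criteria are physician-defined and are not shown on this page.
RUBRIC = {
    "escalates_to_emergency_care": re.compile(
        r"\b(call 911|emergency (department|room))\b", re.I
    ),
    "asks_about_symptom_onset": re.compile(r"\bwhen did\b.*\bstart\b", re.I),
}


def regex_score(response: str) -> float:
    """Fraction of rubric criteria whose pattern appears in the response."""
    hits = sum(1 for pattern in RUBRIC.values() if pattern.search(response))
    return hits / len(RUBRIC)


def summarize(runs_per_scenario: dict[str, list[float]]) -> dict[str, float]:
    """Aggregate K repeated runs per scenario.

    "mean" averages the per-scenario means (a headline-style score).
    "worst_of_k" averages the per-scenario minima, so one bad run in K
    pulls that scenario down to its worst observed score (assumed aggregation).
    """
    return {
        "mean": mean(mean(runs) for runs in runs_per_scenario.values()),
        "worst_of_k": mean(min(runs) for runs in runs_per_scenario.values()),
    }


if __name__ == "__main__":
    # Toy data with K=3 runs per scenario; the page describes K=10.
    runs = {"chest_pain_triage": [1.0, 1.0, 0.0], "med_reconciliation": [0.5, 0.5, 0.5]}
    print(summarize(runs))
```

In the toy data, the first scenario averages well but fails outright on one run, so its worst-of-K contribution is zero. That gap between the two numbers is what the reliability column is meant to expose.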
How we compare
We compare ourselves against the three benchmarks that matter most right now: the de facto standard (HealthBench), the best-funded platform (Qualified Health), and the emerging academic kingmaker (ARISE). Our honest gaps are listed at the bottom of the table.
| Capability | BHB | HealthBench (OpenAI) | Qualified Health ($125M platform) | ARISE (Stanford / Harvard) |
|---|---|---|---|---|
| Independence | ||||
| Structurally independent (no vendor affiliation) | Yes | No | No | Yes |
| Publicly published methodology | Yes | Yes | No | Yes |
| What gets tested | ||||
| Deployed clinical tools (not just raw models) | Yes | No | ~Partial | No |
| Evaluation layer declared (model vs product vs agent) | Yes | No | No | No |
| Multimodal scenarios (radiology, dermatology) | Yes | No | No | No |
| How it's tested | ||||
| Multi-turn clinical conversations | Yes | Yes | ~Partial | Yes |
| Worst-of-K reliability (K=10) | Yes | No | No | No |
| Vendor-blind test set (prompts never shared) | Yes | No | ~Partial | Yes |
| Contamination detection (canary + n-gram + semantic) | Yes | No | No | No |
| Over time | ||||
| Longitudinal drift detection (statistical tests) | Yes | No | ~Partial | No |
| Continuous re-evaluation on new model releases | Yes | No | ~Partial | No |
| Context-layer safety (persistent memory drift) | Yes | No | No | No |
| Where we're behind (honest gaps) | ||||
| Peer-reviewed publication | No | Yes | No | Yes |
| Large-scale rubric criteria (5K+) | No | Yes | No | No |
| Deployed at health systems | No | No | Yes | No |
| Validated on real patient data | No | No | Yes | Yes |
Yes = supported · ~Partial = partial or private-only · No = not supported. Qualified Health's capabilities are internal to their platform (not publicly accessible). ARISE data from NOHARM consortium (Jan 2026, 100 primary care cases, 31 LLMs, 29 specialists).
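To make the "longitudinal drift detection (statistical tests)" row concrete, here is one way such a check could be sketched in Python. The page does not say which statistical test is used. The two-sided permutation test below, the function name `drift_p_value`, and its parameters are assumptions chosen for illustration.

```python
import random
from statistics import mean


def drift_p_value(baseline, current, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean score.

    baseline and current are per-scenario scores from two evaluation runs
    (for example, last month's and this month's). A small p-value suggests
    the change in mean is unlikely to come from run-to-run noise alone.
    """
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    observed = abs(mean(current) - mean(baseline))
    pooled = list(baseline) + list(current)
    n_base = len(baseline)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_base:]) - mean(pooled[:n_base]))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    last_month = [0.9] * 10
    this_month = [0.5] * 10
    print(drift_p_value(last_month, this_month))
```

A permutation test makes no distributional assumption about the scores, which suits bounded rubric scores. A production system would likely also correct for multiple comparisons across models and scenario categories.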
Neutrality is the product.
Structural independence by design, not intent. Vendors submit endpoints. They never see the prompts.
The database compounds.
Every evaluation deepens contamination detection and drift baselines. The 50th company benefits from the 49 before it.
The standard-setting flywheel.
Credibility earns inclusion, inclusion earns data, data earns authority. Authority earns more inclusion.
BetterHealthBench
Evaluation infrastructure for clinical AI, built to earn the trust it carries.