Independent, continuous evaluation of clinical AI. Does the tool perform safely on real healthcare workflows? Has that performance changed? We measure it. We publish it.
Credit ratings. Drug trials. Financial audits.
Every regulated industry has independent evaluation.
Healthcare AI has none.
Models score at physician level on medical exams but fail on the workflow tasks deployment actually requires. Static benchmarks measure knowledge recall. Clinical safety depends on judgment, information gathering, and knowing when to escalate. The measurement problem is as important as the modeling problem.
How it works
Submit your model endpoint.
Centralized execution under identical conditions. Vendors never see the test prompts. No model gets special treatment. The evaluation suite is tailored to what your tool does: an ambient scribe gets scribe benchmarks; a clinical decision support (CDS) tool gets triage and diagnosis scenarios.
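For illustration only, a minimal sketch of how suite selection by tool type might work; the tool categories and suite names here are assumptions, not the actual submission API.

```python
# Illustrative mapping from tool type to benchmark suites. These names are
# assumptions for the sketch, not BetterHealthBench's real configuration.
SUITES_BY_TOOL_TYPE = {
    "ambient_scribe": ["clinical_summarization", "scribe_fidelity"],
    "cds": ["multi_turn_triage", "differential_diagnosis", "adversarial_safety"],
}

def select_suites(tool_type: str) -> list[str]:
    """Return the benchmark suites to run for a submitted endpoint."""
    try:
        return SUITES_BY_TOOL_TYPE[tool_type]
    except KeyError:
        raise ValueError(f"Unknown tool type: {tool_type}")
```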
Run 396 clinical scenarios.
Multi-turn triage, differential diagnosis, clinical summarization, ambient scribe evaluation, and adversarial safety cases, all scored against physician-defined rubric criteria.
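A hedged sketch of what a rubric-scored scenario record could look like; the field names and the example criterion are illustrative assumptions, not the published schema.

```python
# Sketch of a scenario graded against physician-defined rubric criteria.
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # physician-defined requirement, e.g. "advises ED referral"
    points: int        # weight assigned by the rubric authors
    met: bool = False  # filled in by the scoring stage

@dataclass
class Scenario:
    scenario_id: str
    task: str                      # e.g. "multi_turn_triage", "clinical_summarization"
    turns: list[str]               # conversation turns presented to the model in order
    rubric: list[RubricCriterion]  # criteria the final response is graded against

def rubric_score(rubric: list[RubricCriterion]) -> float:
    """Fraction of rubric points earned by a single model response."""
    total = sum(c.points for c in rubric)
    earned = sum(c.points for c in rubric if c.met)
    return earned / total if total else 0.0
```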
Score every response.
Dual scoring: transparent regex checks for reproducibility, LLM-as-judge grading for nuance. Worst-of-K reliability, because a single critical failure at 3 a.m. matters more than a high average.
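A minimal sketch of that scoring idea, assuming illustrative regex patterns, an even regex/judge weighting, and K equal to the number of repeated runs; none of these specifics are the published methodology.

```python
# Sketch: deterministic regex check + judge score, aggregated worst-of-K
# across repeated runs so a single bad run is visible in the result.
import re
import statistics

def regex_check(response: str, required_patterns: list[str]) -> float:
    """Reproducible check: share of required patterns present in the response."""
    hits = sum(bool(re.search(p, response, re.IGNORECASE)) for p in required_patterns)
    return hits / len(required_patterns) if required_patterns else 1.0

def combined_score(response: str, required_patterns: list[str], judge_score: float) -> float:
    """Blend the transparent regex check with an LLM-as-judge score (both 0-1)."""
    return 0.5 * regex_check(response, required_patterns) + 0.5 * judge_score

def worst_of_k(run_scores: list[float]) -> dict[str, float]:
    """Report the floor, not just the average, across K repeated runs."""
    return {
        "mean": statistics.mean(run_scores),
        "worst_of_k": min(run_scores),  # K = len(run_scores), e.g. 10 runs
    }
```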
Publish results monthly.
Signed, tamper-evident reports. Longitudinal tracking across model versions. Drift detection catches silent regressions before patients do.
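A hedged sketch of the publishing step: an HMAC-style signature that makes later edits to a report detectable, plus a naive drift check against last month's baseline. The signing scheme and the regression threshold are assumptions, not the actual report format.

```python
# Sketch: tamper-evident report signing and a simple month-over-month drift flag.
import hashlib
import hmac
import json

def sign_report(report: dict, signing_key: bytes) -> str:
    """Attach a signature so any later edit to the report payload is detectable."""
    payload = json.dumps(report, sort_keys=True).encode()
    return hmac.new(signing_key, payload, hashlib.sha256).hexdigest()

def detect_drift(baseline: dict[str, float], current: dict[str, float],
                 threshold: float = 0.05) -> list[str]:
    """Flag tasks whose score dropped by more than `threshold` since the baseline."""
    return [task for task, score in current.items()
            if task in baseline and baseline[task] - score > threshold]
```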
Who needs this
Clinical AI Vendors
Earn the badge.
Independent evaluation closes enterprise deals that pilot programs cannot. Procurement committees need evidence for governance boards. A BetterHealthBench report is that evidence.
Health Systems
Deploy with evidence.
A tool that performs well on average may fail on your patient populations, your clinical protocols, your institutional policies. Evaluation that tests against your operational reality, not generic benchmarks.
Healthcare Payers
Price the risk.
Provider networks are deploying AI in care delivery. You carry the liability exposure. Continuous, independent performance measurement is the foundation for clinical AI coverage and risk pricing.
Regulators & Standards Bodies
Build the standard.
CTA, URAC, Health Canada, and MLCommons are all building health AI evaluation frameworks. None has a concrete multi-turn methodology yet. We are filling that gap.
How we compare
| Capability | BetterHealthBench | HealthBench (OpenAI) | MedHELM (Stanford) |
|---|---|---|---|
| Multi-turn conversation evaluation | Yes | No | No |
| 10-run reliability testing (worst-of-K) | Yes | No | No |
| Contamination detection + embargo tiers | Yes | No | No |
| Longitudinal drift tracking | Yes | No | No |
| Structurally independent | Yes | No | Yes |
| Deployed clinical tool evaluation | Yes | No | No |
| Peer-reviewed methodology | No | Yes | Yes |
| 48K+ rubric criteria | No | Yes | No |
Neutrality is the product.
Structural independence by design, not intent. Vendors submit endpoints. They never see the prompts.
The database compounds.
Every evaluation deepens contamination detection and drift baselines. The 50th company benefits from the 49 before it.
The standard-setting flywheel.
Credibility earns inclusion, inclusion earns data, data earns authority, and authority earns more inclusion.
BetterHealthBench
Evaluation infrastructure for clinical AI, built to earn the trust it carries.