Data Use Policy

Last updated April 14, 2026

This policy explains how BetterHealthBench sources its evaluation data, what we do with the outputs models generate during evaluation, and the protections we apply to keep the benchmark trustworthy and PHI-free.

Where our scenarios come from

  • Synthetic clinical scenarios authored by practicing clinicians and validated by peer review.
  • Public medical education sources (e.g. licensing exam-style content and published case vignettes) adapted for multi-turn evaluation.
  • Published research datasets used only under their respective licenses.

We do not source scenarios from real patient encounters. No dataset on BetterHealthBench contains identifiable patient information.

PHI protection

  • Our evaluation harness supports a phi_safe mode for LLM-as-judge scoring that prevents raw conversation text from leaving your environment.
  • The store_raw_text flag on the result tracker is off by default for any integration that touches clinical text.
  • Submitters must not upload real PHI. Detected PHI will be purged and the submission rejected.
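
As a rough illustration of how the two flags above might behave, here is a minimal Python sketch. Only the names phi_safe and store_raw_text come from this policy; the TrackerSettings class, the build_record and may_send_raw_text_externally helpers, and the record layout are hypothetical and are not the harness's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackerSettings:
    """Hypothetical settings object; not the real harness API."""
    phi_safe: bool                 # set explicitly; no default is assumed here
    store_raw_text: bool = False   # off by default, as stated in the policy


def build_record(settings, scenario_id, transcript, score):
    """Build a result record, keeping raw text only if explicitly enabled."""
    record = {"scenario_id": scenario_id, "score": score}
    if settings.store_raw_text:
        record["transcript"] = transcript
    return record


def may_send_raw_text_externally(settings):
    """Assumed reading of phi_safe: raw text stays local when it is on."""
    return not settings.phi_safe
```

With these assumptions, build_record(TrackerSettings(phi_safe=True), "sc-1", "some text", 0.5) returns a record containing only the scenario ID and the score.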

Model outputs

Outputs generated during evaluation (model responses, tool calls, scoring artifacts) are stored so that results are reproducible and auditable. Aggregate scores are published; raw transcripts are published only when the underlying scenario license permits it.

Contamination controls

A portion of every benchmark is held out and rotated. We monitor for dataset leakage and publish contamination checks alongside scores. See the contamination controls documentation for details.

Third-party processing

When we run LLM-as-judge scoring against third-party APIs, we send only the minimum text required, redact obvious identifiers, and prefer providers with zero-retention agreements for evaluation traffic.
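
To make "redact obvious identifiers" concrete, the following Python sketch replaces a few easily recognized patterns with placeholders before text is sent to a judge. The patterns, the placeholder labels, and the redact function are assumptions chosen for illustration. They are not the rules BetterHealthBench actually applies, and pattern matching of this kind catches only the most obvious identifiers.

```python
import re

# Illustrative patterns only; order matters because the SSN-like pattern
# is tried before the more general phone-number pattern.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]


def redact(text):
    """Replace each matched identifier with its placeholder label."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact("Contact jane.doe@example.com or 555-123-4567."))
# prints "Contact [EMAIL] or [PHONE]."
```

Names, dates, and free-text details would need more than regular expressions, which is one reason to send only the minimum text in the first place.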

Requests and deletion

Submitters may request removal of their model's results by contacting submissions@betterhealthbench.org. We will remove the entry but may retain the fact that a prior evaluation occurred to preserve the integrity of historical leaderboards.

Questions

data@betterhealthbench.org