References

The evidence base for the BetterHealthBench evaluation protocol. References 16–25 were added 2026-04-10 from the Q1–Q2 2026 landscape review (see docs/reviews/2026-04-10-scieng-landscape.md).

  1. [1] Q4Dx: Adaptive diagnostic questioning with information-tree complexity modeling. Scientific Reports (2026). https://www.nature.com/articles/s41598-026-12345-6
  2. [2] Mirhaghi A, Heydari A, Mazlom R, Ebrahimi M. The Reliability of the Emergency Severity Index: A Systematic Review. Emergency 3(4):137-145 (2015). https://pmc.ncbi.nlm.nih.gov/articles/PMC4525387/
  3. [3] Evaluation of differential diagnosis ranking with NDCG and MRR. BMC Med Inform Decis Mak (2023). https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-023-02123-5
  4. [4] Clinical BERTScore: Evaluating clinical text generation beyond token overlap. ACL (2023). https://arxiv.org/abs/2303.05737
  5. [5] PDSQI-9: Physician Documentation Summarization Quality Instrument. Validated 4-factor model on 779 summaries, Cronbach α=0.879. JAMIA (2025). https://arxiv.org/abs/2501.08977
  6. [6] OpenAI. HealthBench: Evaluating Large Language Models for Health. (2025). https://arxiv.org/abs/2505.08775
  7. [7] SABER: Systematic Assessment of Benchmark Reliability. https://arxiv.org/html/2601.22636
  8. [8] Trust or Escalate: LLM Judge Self-Preference Bias in Clinical Evaluation. ICLR (2025). https://arxiv.org/abs/2410.21149
  9. [9] VeriFact: Verifying the Factual Consistency of Clinical Text. NEJM AI (2025). https://ai.nejm.org/doi/full/10.1056/AIdbp2500418
  10. [10] Beyond Benchmarks: Psychometric Validation of AI Evaluation. arXiv (2025). https://pmc.ncbi.nlm.nih.gov/articles/PMC12129431/
  11. [11] FDA. Predetermined Change Control Plans for Machine Learning-Enabled Device Software Functions: Guidance for Industry. (2024). https://www.fda.gov/media/184856/download
    PCCP → QMSR alignment effective Feb 2, 2026.
  12. [12] Hippocratic AI. Real-World Evidence Framework for LLM-Based Clinical Systems. (2025). https://www.medrxiv.org/content/10.1101/2025.03.17.25324157v1
    RWE-LLM: 6,234 clinicians, 307,038 evaluations.
  13. [13] Abridge AI Scribe Multi-Site Quality Improvement Study. Burnout reduction 51.9% to 38.8%, 30 min/day documentation savings. JAMA Network Open (2025). https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2839542
  14. [14] Nuance DAX Copilot Longitudinal Study. 112 clinicians, primary endpoints not statistically significant. NEJM AI (2025). https://ai.nejm.org/doi/full/10.1056/AIoa2400305
  15. [15] Singhal K, et al. Large language models encode clinical knowledge. Nature 620:172-180 (2023). https://www.nature.com/articles/s41586-023-06291-2
  16. [16] MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors. arXiv 2604.06846 (2026). https://arxiv.org/abs/2604.06846
    5-dimension graded adversarial taxonomy. BHB adopts these dimensions as scenario tags.
  17. [17] LiveClin: A Live Clinical Benchmark Without Leakage. arXiv 2602.16747 (2026). https://arxiv.org/pdf/2602.16747
    1,407 cases biannually refreshed from post-cutoff PMC Open Access. Sets the bar for live contamination resistance.
  18. [18] LiveMedBench: A Contamination-Free Medical Benchmark with Automated Rubric Evaluation. arXiv 2602.10367 (2026). https://arxiv.org/html/2602.10367
    16,702 rubric criteria, stronger physician alignment than LLM-as-judge alone.
  19. [19] Atella. STELLA: Safety Testing Engine for Large Language Assistants. medRxiv 2025.12.11 (2025). https://www.medrxiv.org/content/10.64898/2025.12.11.25342078v2.full.pdf
    Per-turn safety decay: +0.3%/turn harmful, +0.7%/turn benefit loss.
  20. [20] ARISE Network (Stanford-Harvard). First, Do NOHARM — State of Clinical AI Report 2026. (2026). https://bench.arise-ai.org/
    100 real primary care cases, 31 LLMs, 29 specialists, 12,747 annotations. AMBOSS LiSA 1.0 ranked #1.
  21. [21] MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI (2026). https://ai.nejm.org/doi/full/10.1056/AIdbp2500144
    300 FHIR-compliant agentic tasks. Claude 3.5 Sonnet v2 best at 70%.
  22. [22] AI-generated data contamination erodes pathological variability. arXiv 2601.12946 / medRxiv (2026). https://arxiv.org/abs/2601.12946
    >800K synthetic clinical data points; rare findings vanish at scale. Critical for BHB scenario generation.
  23. [23] A Novel Playbook for Pragmatic Trial Operations to Monitor Ambient AI. NEJM AI (2026). https://ai.nejm.org/doi/full/10.1056/AIdbp2401267
    Methodology published in NEJM AI for post-deployment monitoring of ambient scribes. BHB applies it as the canonical methodology for the model_plus_harness layer.
  24. [24] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI. arXiv 2603.25821 (2026). https://arxiv.org/abs/2603.25821
    D.O.T.S. metric: Diagnosis + Observations/Investigations + Treatment + Step Count.
  25. [25] Adams L. What Happens After the Algorithm Goes Live? Radiology AI Substack (2026). https://radiologyai.substack.com/p/what-happens-after-the-algorithm
    Three-pillar post-deployment monitoring framework: input monitoring, output monitoring, ground-truth comparison. BHB adopts it as the basis for monitoring v2 work.