Ranking Robustness & Sensitivity Analysis
Benchmark rankings are fragile: changing the metric, the test data, or the annotators can change which model "wins" (Maier-Hein et al., Nature Communications 2018). We address this across ~400 clinical scenarios by reporting multiple metrics (QWK, NDCG, BERTScore, PDSQI-9, escalation_safety, refusal_safety) and testing how rankings respond to perturbation. Each scenario is run K=10 times per model.
Benchmark Sensitivity Report
The table below ranks the same models under different criteria. If the top 3 are the same regardless of criterion, the ranking is robust. If they shift, the "best" model depends on what you care about.
| Ranking Criterion | #1 | #2 | #3 | Observation |
|---|---|---|---|---|
| Overall Composite | Claude 3.5 Sonnet | GPT-4o | Nuance DAX | Standard ranking. Weighted across all dimensions. |
| Safety Only | Nuance DAX | Heidi Health | Claude 3.5 Sonnet | Scribes rise when safety is the sole criterion. Frontier LLMs drop. |
| Adversarial Only | GPT-4o | Heidi Health | Claude 3.5 Sonnet | Models with strong safety training perform best under adversarial pressure. |
| Worst-of-K Only | Claude 3.5 Sonnet | MedGemma | GPT-4o | Reliability-focused ranking. Models with low variance win. |
| Triage Only | Claude 3.5 Sonnet | GPT-4o | MedGemma | Clinical triage ranking. Scribes excluded (not evaluated on triage). |
| Remove 20% Test Cases | Claude 3.5 Sonnet | GPT-4o | Nuance DAX | Top 2 positions stable. Position 3+ shifts with data subsample. |
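One way to quantify how much the top 3 moves across criteria is to tabulate the rows above directly. The sketch below copies the top-3 lists from the table. The helper names (`TOP3`, `always_in_top3`, `first_place_counts`, `same_order_as`) are illustrative and are not part of the benchmark's tooling.

```python
from collections import Counter

# Top-3 lists copied from the sensitivity table above.
TOP3 = {
    "Overall Composite": ["Claude 3.5 Sonnet", "GPT-4o", "Nuance DAX"],
    "Safety Only": ["Nuance DAX", "Heidi Health", "Claude 3.5 Sonnet"],
    "Adversarial Only": ["GPT-4o", "Heidi Health", "Claude 3.5 Sonnet"],
    "Worst-of-K Only": ["Claude 3.5 Sonnet", "MedGemma", "GPT-4o"],
    "Triage Only": ["Claude 3.5 Sonnet", "GPT-4o", "MedGemma"],
    "Remove 20% Test Cases": ["Claude 3.5 Sonnet", "GPT-4o", "Nuance DAX"],
}


def always_in_top3(top3):
    """Models that appear in the top 3 under every criterion."""
    return set.intersection(*(set(models) for models in top3.values()))


def first_place_counts(top3):
    """How many criteria each model wins outright."""
    return Counter(models[0] for models in top3.values())


def same_order_as(top3, reference):
    """Criteria whose top 3 matches the reference criterion exactly, order included."""
    return [name for name, models in top3.items() if models == top3[reference]]


stable_models = always_in_top3(TOP3)
wins = first_place_counts(TOP3)
matches_composite = same_order_as(TOP3, "Overall Composite")
```

On the table's data, only Claude 3.5 Sonnet appears in every top 3, and it takes first place under four of the six criteria. Only the 20% subsample reproduces the composite top 3 in the same order.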
Bootstrap Rank Stability
How often does each model hold its rank across 1,000 bootstrap resamples (Bootstrap Stability) and after removing 10% or 20% of the test scenarios (LNO, leave-N-out)?
| Model | Rank | Rank Mean | Rank Std | Bootstrap Stability | LNO 10% | LNO 20% |
|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | #1 | 1.15 | 0.42 | 87% | 94% | 88% |
| GPT-4o | #2 | 1.92 | 0.55 | 78% | 89% | 81% |
| Nuance DAX | #3 | 3.10 | 0.72 | 71% | 82% | 73% |
| Heidi Health | #4 | 3.95 | 0.88 | 65% | 78% | 68% |
| MedGemma | #5 | 4.88 | 0.95 | 62% | 75% | 64% |
How We Test Robustness
Bootstrap resampling: We resample our test set 1,000 times and re-rank models each time. A model that is #1 in 950/1,000 resamples has a robust ranking. A model that fluctuates between #1 and #5 does not.
Metric perturbation: We re-rank using each individual metric (safety, adversarial, calibration, trust, worst-of-K) instead of the composite. Stable models maintain their position. Fragile models shift dramatically.
Leave-N-out stability: We remove 10% and 20% of test scenarios and re-rank. Rankings that are sensitive to small data changes are flagged as unreliable.
Cross-tier consistency: We compare rankings on Public vs Embargoed vs Holdout scenario tiers. A model that ranks #1 on public scenarios but #8 on holdout scenarios may have contamination or overfitting issues.
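The bootstrap and leave-N-out procedures above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code. It assumes a `scores` array of shape `(n_models, n_scenarios)`, ranking by mean score, and "stability" defined as the fraction of resamples in which a model keeps its full-data rank. The function names, the number of leave-N-out trials, and the synthetic demo data are all assumptions.

```python
import numpy as np


def rank_models(scores):
    """Rank models by mean score over scenarios (1 = best)."""
    means = scores.mean(axis=1)
    order = np.argsort(-means)  # model indices, best first
    ranks = np.empty(len(means), dtype=int)
    ranks[order] = np.arange(1, len(means) + 1)
    return ranks


def bootstrap_rank_stability(scores, n_resamples=1000, seed=0):
    """Resample scenarios with replacement and re-rank each time."""
    rng = np.random.default_rng(seed)
    n_models, n_scenarios = scores.shape
    base = rank_models(scores)
    all_ranks = np.empty((n_resamples, n_models), dtype=int)
    for b in range(n_resamples):
        idx = rng.integers(0, n_scenarios, size=n_scenarios)
        all_ranks[b] = rank_models(scores[:, idx])
    return {
        "rank": base,
        "rank_mean": all_ranks.mean(axis=0),
        "rank_std": all_ranks.std(axis=0),
        "stability": (all_ranks == base).mean(axis=0),
    }


def leave_n_out_stability(scores, fraction, n_trials=1000, seed=0):
    """Drop a random fraction of scenarios (no replacement) and re-rank."""
    rng = np.random.default_rng(seed)
    n_models, n_scenarios = scores.shape
    keep = n_scenarios - int(round(fraction * n_scenarios))
    base = rank_models(scores)
    hits = np.zeros(n_models)
    for _ in range(n_trials):
        idx = rng.choice(n_scenarios, size=keep, replace=False)
        hits += rank_models(scores[:, idx]) == base
    return hits / n_trials


# Synthetic demo: three well-separated models on 400 scenarios (made-up data).
rng = np.random.default_rng(42)
demo_scores = np.array([[0.9], [0.5], [0.1]]) + rng.normal(0.0, 0.05, size=(3, 400))
boot = bootstrap_rank_stability(demo_scores)
lno20 = leave_n_out_stability(demo_scores, fraction=0.20, n_trials=200)
```

With models this far apart every resample reproduces the same order, so stability is 100%. Real score matrices with closer means would produce the spread seen in the stability table.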
Problem Fingerprints
Following the Metrics Reloaded framework (Maier-Hein et al., Nature Methods 2024), each benchmark has a "problem fingerprint" that captures what we are measuring and why our metrics match the clinical intent. This prevents the common pitfall of choosing metrics by convention rather than domain relevance.
Clinical Triage
Emergency medicine
Classify patient acuity from conversation
Multi-turn patient dialogue
Triage level (critical/urgent/non-urgent/self-care)
QWK — Quadratic-Weighted Kappa (40% weight)
Info gathering (30%), escalation_safety (20%), Efficiency (10%)
QWK penalizes disagreements proportional to ordinal distance, standard in ESI/CTAS validation. Info gathering measures thoroughness. escalation_safety catches red flags. Efficiency penalizes unnecessary turns.
Incorrect triage can delay critical care or overwhelm EDs with non-urgent cases.
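To make the ordinal penalty concrete, here is a minimal sketch of quadratic-weighted kappa. The ordering of the four triage levels in `LEVELS` is an assumption (least to most acute), and the function name and example labels are illustrative. Production code would more likely call an existing library implementation.

```python
import numpy as np

# Assumed ordinal encoding, least to most acute.
LEVELS = ["self-care", "non-urgent", "urgent", "critical"]


def quadratic_weighted_kappa(y_true, y_pred, n_levels=len(LEVELS)):
    """QWK = 1 - sum(W * O) / sum(W * E), with W[i, j] = (i - j)^2 / (n - 1)^2."""
    observed = np.zeros((n_levels, n_levels))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts if reference and prediction were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    i, j = np.indices((n_levels, n_levels))
    weights = (i - j) ** 2 / (n_levels - 1) ** 2
    denom = (weights * expected).sum()
    if denom == 0:
        return float("nan")  # undefined when both sides use a single level
    return 1.0 - (weights * observed).sum() / denom


def encode(labels):
    return [LEVELS.index(label) for label in labels]


reference = encode(["critical", "urgent", "urgent", "self-care"])
near_miss = encode(["urgent", "urgent", "urgent", "self-care"])     # off by one level
far_miss = encode(["self-care", "urgent", "urgent", "self-care"])   # off by three levels

kappa_near = quadratic_weighted_kappa(reference, near_miss)
kappa_far = quadratic_weighted_kappa(reference, far_miss)
```

Both predictions get one case wrong, but calling a critical patient "urgent" scores 0.875 while calling the same patient "self-care" scores about 0.18, which is the distance-proportional penalty described above.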
Differential Diagnosis
Internal medicine, emergency, surgery, psych, peds
Generate and rank differential diagnoses
Multi-turn patient presentation
Ranked list of diagnoses
NDCG@10 + MRR + Top-3 accuracy (35% weight)
Reciprocal rank (25%), Info gathering (25%), Uncertainty handling (15%)
NDCG@10 discounts each position logarithmically, so correct diagnoses ranked higher earn more credit. MRR captures the rank of the first correct hit. Top-3 accuracy measures clinical utility (whether the correct diagnosis appears in the working list). Uncertainty handling prevents premature commitment.
Missing the correct diagnosis in the differential delays treatment and investigation.
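The three ranking metrics can be sketched for a single case as below. How relevance is assigned to each diagnosis is not specified here, so the `relevance` dictionary (graded gains per diagnosis) is an assumption, as are the function names and the example differential.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: position p (1-based) is divided by log2(p + 1)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked, relevance, k=10):
    """DCG of the model's list divided by the DCG of the ideal ordering."""
    gains = [relevance.get(dx, 0.0) for dx in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked, correct):
    """1 / rank of the first correct diagnosis, or 0 if none is listed."""
    for pos, dx in enumerate(ranked, start=1):
        if dx in correct:
            return 1.0 / pos
    return 0.0


def top_k_hit(ranked, correct, k=3):
    """True if any correct diagnosis is in the first k entries."""
    return any(dx in correct for dx in ranked[:k])


# Illustrative case: the correct diagnosis is listed second.
ranked = ["GERD", "pulmonary embolism", "pneumonia"]
relevance = {"pulmonary embolism": 1.0}

ndcg = ndcg_at_k(ranked, relevance)
rr = reciprocal_rank(ranked, set(relevance))
hit = top_k_hit(ranked, set(relevance))
```

MRR is the mean of `reciprocal_rank` over all cases. In this example the correct diagnosis at position 2 yields a reciprocal rank of 0.5 and an NDCG@10 of 1/log2(3), roughly 0.63.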
Clinical Summarization
Documentation across all specialties
Generate structured clinical notes from encounters
Encounter transcript or multi-turn interaction
SOAP note / H&P / discharge summary
BERTScore + LLM-as-judge (PDSQI-9)
Completeness, Accuracy (no hallucinations), Conciseness, Structure adherence
BERTScore captures semantic equivalence beyond token overlap. PDSQI-9 provides validated clinical note quality assessment. Missing findings lead to care gaps; hallucinated findings lead to unnecessary workups.
Clinical notes are legal documents. Errors propagate through the care chain.
Ambient Scribe Evaluation
Primary care, cardiology, psychiatry, orthopedics, and 5 more specialties
Generate structured clinical notes from simulated ambient encounter audio
Encounter transcript (25 scenarios across 9 specialties)
SOAP note with medications, assessment, and plan
Section completeness (40% weight)
Medication accuracy (25%), Hallucination rate (20%), Format adherence (15%)
Ambient scribes must capture all clinically relevant information without fabricating details. Medication errors in notes directly impact patient safety.
Scribe-generated notes become part of the medical record. Missed or hallucinated content can cause medication errors and care gaps.
Medical Hallucination Detection
Medical knowledge across specialties
Detect and avoid hallucinated medical facts
Medical claims and clinical statements
Hallucination classification (faithful/hallucinated)
Hallucination detection F1 (50% weight)
Precision (25%), Recall (25%)
Hallucinated medical facts can lead to dangerous clinical decisions. Both missing real hallucinations (low recall) and false alarms (low precision) erode trust.
A model that confidently states fabricated medical facts is actively dangerous in clinical settings.
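As an illustration of how the three detection scores relate, the sketch below treats "hallucinated" as the positive class. Combining the stated 50/25/25 weights as a plain weighted sum is an assumption, and the function names and example labels are made up.

```python
def precision_recall_f1(y_true, y_pred, positive="hallucinated"):
    """Precision, recall, and F1 for the positive (hallucinated) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def detection_composite(precision, recall, f1):
    """Assumed weighted sum using the stated weights: F1 50%, precision 25%, recall 25%."""
    return 0.50 * f1 + 0.25 * precision + 0.25 * recall


# Made-up labels: one missed hallucination and one false alarm.
truth = ["hallucinated", "hallucinated", "hallucinated", "faithful", "faithful"]
preds = ["hallucinated", "hallucinated", "faithful", "hallucinated", "faithful"]

p, r, f1 = precision_recall_f1(truth, preds)
composite = detection_composite(p, r, f1)
```

Here two of three real hallucinations are caught and two of three flags are correct, so precision, recall, and F1 all equal 2/3.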
Script Concordance Test
Clinical reasoning across specialties
Evaluate clinical reasoning under uncertainty
Clinical vignette with new information
Likert-scale judgment on hypothesis likelihood change
Concordance score vs expert panel (50% weight)
Calibration (30%), Reasoning quality (20%)
SCTs measure how well a model updates beliefs given new evidence, which is the core of clinical reasoning. Calibration ensures the model knows what it does not know.
Poor clinical reasoning under uncertainty leads to premature closure, anchoring bias, and missed diagnoses.
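The scoring rule behind the concordance score is not spelled out here. A common choice in the SCT literature is aggregate scoring, where an answer earns credit in proportion to how many panelists chose it relative to the modal answer. The sketch below assumes that rule and a 5-point scale from -2 to +2. The function name and panel data are hypothetical.

```python
from collections import Counter


def sct_item_credit(panel_answers, model_answer):
    """Aggregate scoring: panelists choosing this answer / panelists choosing the modal answer."""
    counts = Counter(panel_answers)
    modal_count = max(counts.values())
    return counts.get(model_answer, 0) / modal_count


# Hypothetical 10-member panel rating how new information changes a hypothesis
# (-2 = much less likely ... +2 = much more likely).
panel = [-1, 0, 0, 0, 1, 1, 0, 0, -1, 0]

credit_modal = sct_item_credit(panel, 0)      # agrees with the majority
credit_minority = sct_item_credit(panel, 1)   # agrees with a minority of experts
credit_none = sct_item_credit(panel, 2)       # no expert chose this answer
```

Under this rule the modal answer earns full credit, a minority answer earns partial credit (2 of 6 here), and an answer no panelist gave earns none, which reflects legitimate expert disagreement rather than a single answer key.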
Psychometric Validation
Benchmark scenarios are validated using standard psychometric methods adapted from educational measurement.
Cronbach's alpha: Measures internal consistency within each benchmark subset. Benchmarks with α < 0.7 are flagged for item review.
Item discrimination: Each scenario is evaluated for its ability to distinguish between high- and low-performing models. Scenarios with near-zero discrimination (every model passes or every model fails) are candidates for replacement.
ICC test-retest: Intraclass Correlation Coefficient measures whether repeated evaluations (K=10) of the same model on the same scenario produce consistent scores. ICC values below 0.75 trigger investigation into scoring instability.
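The three checks above can be sketched with NumPy as follows. Several choices here are assumptions: rows are treated as "subjects" (models or model runs) and columns as scenarios, discrimination is computed as a corrected item-total correlation, and the ICC is the one-way random-effects form ICC(1,1). The benchmark may use different variants.

```python
import numpy as np


def cronbach_alpha(x):
    """x: (n_subjects, n_items). alpha = k/(k-1) * (1 - sum(item var) / var(total))."""
    x = np.asarray(x, dtype=float)
    k = x.shape[1]
    item_var_sum = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)


def item_discrimination(x):
    """Correlation of each item with the total of the remaining items."""
    x = np.asarray(x, dtype=float)
    total = x.sum(axis=1)
    out = np.zeros(x.shape[1])
    for j in range(x.shape[1]):
        item, rest = x[:, j], total - x[:, j]
        # Items that every model passes (or fails) have no variance: report 0.
        if item.std() > 0 and rest.std() > 0:
            out[j] = np.corrcoef(item, rest)[0, 1]
    return out


def icc_oneway(y):
    """y: (n_targets, k_repeats). One-way random-effects ICC(1,1)."""
    y = np.asarray(y, dtype=float)
    n, k = y.shape
    target_means = y.mean(axis=1)
    ms_between = k * ((target_means - y.mean()) ** 2).sum() / (n - 1)
    ms_within = ((y - target_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data: three subjects, three items; the third item is passed by everyone.
toy = [[1, 1, 1], [2, 2, 1], [3, 3, 1]]
disc = item_discrimination(toy)

# Perfectly consistent items and perfectly repeatable runs.
alpha_perfect = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
icc_perfect = icc_oneway([[1, 1], [2, 2], [3, 3]])
```

In the toy matrix the constant third item gets a discrimination of 0 and would be a candidate for replacement, while the first two items track each other exactly.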
References
Maier-Hein, L. et al. "Why rankings of biomedical image analysis competitions should be interpreted with care." Nature Communications 9, 5217 (2018).
Maier-Hein, L. et al. "Metrics reloaded: recommendations for image analysis validation." Nature Methods 21(2), 195-212 (2024).
Maier-Hein, L. et al. "BIAS: Transparent reporting of biomedical image analysis challenges." Medical Image Analysis 66 (2020).