Ranking Robustness & Sensitivity Analysis

Benchmark rankings are fragile: changing the metric, the test data, or the annotators can change which model "wins" (Maier-Hein et al., Nature Communications 2018). We address this across ~400 clinical scenarios by reporting multiple metrics (QWK, NDCG, BERTScore, PDSQI-9, escalation_safety, refusal_safety) and testing how rankings respond to perturbation. Each scenario is run K=10 times per model.

Showing demo data. Real sensitivity analysis will be generated from actual evaluation runs.

Benchmark Sensitivity Report

The same models, ranked by different criteria. If the top 3 are the same regardless of metric, the ranking is robust. If they shift, the "best" model depends on what you care about.

| Ranking Criterion | #1 | #2 | #3 | Observation |
|---|---|---|---|---|
| Overall Composite | Claude 3.5 Sonnet | GPT-4o | Nuance DAX | Standard ranking. Weighted across all dimensions. |
| Safety Only | Nuance DAX | Heidi Health | Claude 3.5 Sonnet | Scribes rise when safety is the sole criterion. Frontier LLMs drop. |
| Adversarial Only | GPT-4o | Heidi Health | Claude 3.5 Sonnet | Models with strong safety training perform best under adversarial pressure. |
| Worst-of-K Only | Claude 3.5 Sonnet | MedGemma | GPT-4o | Reliability-focused ranking. Models with low variance win. |
| Triage Only | Claude 3.5 Sonnet | GPT-4o | MedGemma | Clinical triage ranking. Scribes excluded (not evaluated on triage). |
| Remove 20% Test Cases | Claude 3.5 Sonnet | GPT-4o | Nuance DAX | Top 2 positions stable. Position 3+ shifts with data subsample. |

Bootstrap Rank Stability

How often does each model hold its ranking across 1,000 bootstrap resamples and after removing 10-20% of test data?

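| Model | Rank | Rank Mean | Rank Std | Bootstrap Stability | Leave-N-Out 10% | Leave-N-Out 20% |
|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | #1 | 1.15 | 0.42 | 87% | 94% | 88% |
| GPT-4o | #2 | 1.92 | 0.55 | 78% | 89% | 81% |
| Nuance DAX | #3 | 3.10 | 0.72 | 71% | 82% | 73% |
| Heidi Health | #4 | 3.95 | 0.88 | 65% | 78% | 68% |
| MedGemma | #5 | 4.88 | 0.95 | 62% | 75% | 64% |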

How We Test Robustness

Bootstrap resampling: We resample our test set 1,000 times and re-rank models each time. A model that is #1 in 950/1,000 resamples has a robust ranking. A model that fluctuates between #1 and #5 does not.
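The resampling procedure described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes per-scenario scores are available as aligned lists (one list per model, same scenario order), that models are ranked by mean score, and that "stability" means the fraction of resamples in which a model keeps its full-data rank. The function names are hypothetical.

```python
import random
from collections import Counter


def rank_models(scores):
    """Return model names ordered best-first by mean score."""
    means = {m: sum(v) / len(v) for m, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def bootstrap_rank_stability(scores, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which each model keeps its baseline rank.

    `scores` maps model name -> list of per-scenario scores. Lists are assumed
    to be aligned by scenario so every model sees the same resampled scenarios.
    """
    rng = random.Random(seed)
    models = list(scores)
    n = len(scores[models[0]])
    baseline = {m: r for r, m in enumerate(rank_models(scores), start=1)}
    held = Counter()
    for _ in range(n_resamples):
        # Sample scenario indices with replacement, same size as the test set.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = {m: [scores[m][i] for i in idx] for m in models}
        for r, m in enumerate(rank_models(resampled), start=1):
            if r == baseline[m]:
                held[m] += 1
    return {m: held[m] / n_resamples for m in models}
```

Calling `bootstrap_rank_stability(scores)` returns one value in [0, 1] per model; a value of 0.95 corresponds to holding the same rank in 950 of 1,000 resamples.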

Metric perturbation: We re-rank using each individual metric (safety, adversarial, calibration, trust, worst-of-K) instead of the composite. Stable models maintain their position. Fragile models shift dramatically.
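One way to quantify how much a ranking shifts under metric perturbation is to re-rank by each metric and compare the top-k sets. The sketch below assumes each model has one aggregate score per metric; the helper names and the top-k overlap measure are illustrative choices, not taken from the evaluation code.

```python
def rank_by(metric_scores, metric):
    """Order models best-first by a single metric (higher is better)."""
    return sorted(metric_scores, key=lambda m: metric_scores[m][metric], reverse=True)


def top_k_overlap(ranking_a, ranking_b, k=3):
    """Fraction of the top-k models shared by two rankings (order ignored)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def rank_shifts(metric_scores, metric_a, metric_b):
    """Per-model change in rank position when moving from metric_a to metric_b."""
    pos_a = {m: i for i, m in enumerate(rank_by(metric_scores, metric_a), start=1)}
    pos_b = {m: i for i, m in enumerate(rank_by(metric_scores, metric_b), start=1)}
    return {m: pos_b[m] - pos_a[m] for m in metric_scores}
```

An overlap of 1.0 means the same models occupy the top k under both criteria; a positive rank shift means the model fell when the criterion changed.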

Leave-N-out stability: We remove 10% and 20% of test scenarios and re-rank. Rankings that are sensitive to small data changes are flagged as unreliable.
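A leave-N-out check can be sketched in the same spirit. The version below repeatedly drops a random fraction of scenarios and records how often each model keeps its full-data rank. The number of trials, the random subsampling strategy, and the function names are assumptions made for illustration.

```python
import random


def rank_positions(scores, keep):
    """Map model -> rank (1 = best) using only the scenario indices in `keep`."""
    means = {m: sum(v[i] for i in keep) / len(keep) for m, v in scores.items()}
    ordered = sorted(means, key=means.get, reverse=True)
    return {m: r for r, m in enumerate(ordered, start=1)}


def leave_n_out_stability(scores, drop_fraction, n_trials=200, seed=0):
    """Fraction of trials in which each model holds its baseline rank
    after a random `drop_fraction` of scenarios is removed."""
    rng = random.Random(seed)
    n = len(next(iter(scores.values())))
    n_keep = n - round(n * drop_fraction)
    baseline = rank_positions(scores, range(n))
    held = {m: 0 for m in scores}
    for _ in range(n_trials):
        keep = rng.sample(range(n), n_keep)  # scenarios retained in this trial
        trial = rank_positions(scores, keep)
        for m in scores:
            held[m] += trial[m] == baseline[m]
    return {m: held[m] / n_trials for m in scores}
```

Running it with `drop_fraction=0.1` and `drop_fraction=0.2` gives the two leave-N-out figures per model.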

Cross-tier consistency: We compare rankings on Public vs Embargoed vs Holdout scenario tiers. A model that ranks #1 on public scenarios but #8 on holdout scenarios may have contamination or overfitting issues.
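A simple cross-tier check compares each model's rank on one tier against its rank on another and flags large drops. In the sketch below, the tier keys, the `max_drop` threshold, and the function names are hypothetical; the source does not specify what rank difference counts as suspicious.

```python
def ranks_from_scores(model_scores):
    """Map model -> rank (1 = best) from a dict of model -> aggregate score."""
    ordered = sorted(model_scores, key=model_scores.get, reverse=True)
    return {m: r for r, m in enumerate(ordered, start=1)}


def flag_tier_drops(tier_scores, reference="public", target="holdout", max_drop=2):
    """Return models whose rank worsens by more than `max_drop` positions
    between the reference tier and the target tier, with both ranks."""
    ref = ranks_from_scores(tier_scores[reference])
    tgt = ranks_from_scores(tier_scores[target])
    return {m: (ref[m], tgt[m]) for m in ref if tgt[m] - ref[m] > max_drop}
```

A flagged model is only a candidate for investigation; a large drop is consistent with contamination or overfitting but does not prove either.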

Problem Fingerprints

Following the Metrics Reloaded framework, each benchmark has a "problem fingerprint" that captures what we are measuring and why our metrics match the clinical intent. This prevents the common pitfall of choosing metrics by convention rather than domain relevance.

Clinical Triage

Domain: Emergency medicine

Task: Classify patient acuity from conversation

Input: Multi-turn patient dialogue

Output: Triage level (critical/urgent/non-urgent/self-care)

Primary Metric: Quadratic-Weighted Kappa (QWK), 40% weight

Secondary Metrics: Info gathering (30%), escalation_safety (20%), Efficiency (10%)

Why These Metrics: QWK penalizes disagreements proportional to ordinal distance, standard in ESI/CTAS validation. Info gathering measures thoroughness. escalation_safety catches red flags. Efficiency penalizes unnecessary turns.

Clinical Stakes: Incorrect triage can delay critical care or overwhelm EDs with non-urgent cases.
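To make the triage primary metric concrete, here is a minimal pure-Python sketch of Quadratic-Weighted Kappa. It assumes the four triage levels are encoded as ordinal integers (for example self-care=0, non-urgent=1, urgent=2, critical=3); that encoding is an assumption for illustration, not something the benchmark specifies.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1."""
    n = len(y_true)
    # Confusion matrix: rows are reference labels, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Penalty grows with the squared ordinal distance between labels.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) if all labels and predictions share one class.
    return 1.0 - weighted_observed / weighted_expected
```

Because of the quadratic weight, predicting "self-care" for a "critical" case costs nine times as much as an adjacent-level error on a four-level scale.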

Differential Diagnosis

Domain: Internal medicine, emergency, surgery, psych, peds

Task: Generate and rank differential diagnoses

Input: Multi-turn patient presentation

Output: Ranked list of diagnoses

Primary Metric: NDCG@10 + MRR + Top-3 accuracy (35% weight)

Secondary Metrics: Reciprocal rank (25%), Info gathering (25%), Uncertainty handling (15%)

Why These Metrics: NDCG@10 uses logarithmic position weighting to reward correct diagnoses ranked higher. MRR captures first-hit rank. Top-3 accuracy measures clinical utility (whether the correct diagnosis appears in the working list). Uncertainty handling prevents premature commitment.

Clinical Stakes: Missing the correct diagnosis in the differential delays treatment and investigation.
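The three ranking metrics named in this fingerprint can be sketched for a single case as follows. The relevance grading (a dict of diagnosis -> gain) and the function names are assumptions for illustration; MRR is the mean of `reciprocal_rank` over all cases.

```python
import math


def ndcg_at_k(ranked, relevance, k=10):
    """NDCG@k with gain / log2(position + 1) discounting.

    `ranked` is the model's ordered list of diagnoses; `relevance` maps
    each relevant diagnosis to a non-negative gain (others count as 0).
    """
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked, correct):
    """1 / position of the first correct diagnosis, or 0.0 if none appears."""
    for position, diagnosis in enumerate(ranked, start=1):
        if diagnosis in correct:
            return 1.0 / position
    return 0.0


def top_k_hit(ranked, correct, k=3):
    """True if any correct diagnosis appears in the first k positions."""
    return any(d in correct for d in ranked[:k])
```

For a differential that places the single correct diagnosis second, the reciprocal rank is 0.5 and NDCG is 1 / log2(3), which shows how the logarithmic discount is gentler than the reciprocal one.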

Clinical Summarization

Domain: Documentation across all specialties

Task: Generate structured clinical notes from encounters

Input: Encounter transcript or multi-turn interaction

Output: SOAP note / H&P / discharge summary

Primary Metric: BERTScore + LLM-as-judge (PDSQI-9)

Secondary Metrics: Completeness, Accuracy (no hallucinations), Conciseness, Structure adherence

Why These Metrics: BERTScore captures semantic equivalence beyond token overlap. PDSQI-9 provides validated clinical note quality assessment. Missing findings lead to care gaps; hallucinated findings lead to unnecessary workups.

Clinical Stakes: Clinical notes are legal documents. Errors propagate through the care chain.

Ambient Scribe Evaluation

Domain: Primary care, cardiology, psychiatry, orthopedics, and 5 more specialties

Task: Generate structured clinical notes from simulated ambient encounter audio

Input: Encounter transcript (25 scenarios across 9 specialties)

Output: SOAP note with medications, assessment, and plan

Primary Metric: Section completeness (40% weight)

Secondary Metrics: Medication accuracy (25%), Hallucination rate (20%), Format adherence (15%)

Why These Metrics: Ambient scribes must capture all clinically relevant information without fabricating details. Medication errors in notes directly impact patient safety.

Clinical Stakes: Scribe-generated notes become part of the medical record. Missed or hallucinated content can cause medication errors and care gaps.
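The weights in this fingerprint imply a weighted composite. The sketch below shows one plausible way to combine them, assuming every component is scaled to [0, 1] and that hallucination rate (where lower is better) enters as `1 - rate`. Both the scaling and the inversion are assumptions; the source gives only the weights.

```python
# Weights taken from the fingerprint above; the key names are hypothetical.
SCRIBE_WEIGHTS = {
    "section_completeness": 0.40,
    "medication_accuracy": 0.25,
    "hallucination_free": 0.20,
    "format_adherence": 0.15,
}


def scribe_composite(section_completeness, medication_accuracy,
                     hallucination_rate, format_adherence):
    """Weighted composite in [0, 1]; all inputs are assumed to lie in [0, 1]."""
    components = {
        "section_completeness": section_completeness,
        "medication_accuracy": medication_accuracy,
        # Assumption: invert the rate so that higher is better for every term.
        "hallucination_free": 1.0 - hallucination_rate,
        "format_adherence": format_adherence,
    }
    return sum(SCRIBE_WEIGHTS[name] * components[name] for name in SCRIBE_WEIGHTS)
```

Under these assumptions, a note that is half complete but otherwise strong still loses 20 points of composite score, reflecting the 40% weight on completeness.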

Medical Hallucination Detection

Domain: Medical knowledge across specialties

Task: Detect and avoid hallucinated medical facts

Input: Medical claims and clinical statements

Output: Hallucination classification (faithful/hallucinated)

Primary Metric: Hallucination detection F1 (50% weight)

Secondary Metrics: Precision (25%), Recall (25%)

Why These Metrics: Hallucinated medical facts can lead to dangerous clinical decisions. Both missing real hallucinations (low recall) and false alarms (low precision) erode trust.

Clinical Stakes: A model that confidently states fabricated medical facts is actively dangerous in clinical settings.
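Precision, recall, and F1 for the hallucination class can be computed directly from label pairs. The sketch below treats "hallucinated" as the positive class, which matches the fingerprint's framing of missed hallucinations as low recall; the function name and zero-division handling are illustrative choices.

```python
def precision_recall_f1(y_true, y_pred, positive="hallucinated"):
    """Return (precision, recall, F1) for the positive class.

    Any ratio with a zero denominator is reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of the two, so a detector cannot score well by flagging everything (high recall, low precision) or by flagging almost nothing (the reverse).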

Script Concordance Test

Domain: Clinical reasoning across specialties

Task: Evaluate clinical reasoning under uncertainty

Input: Clinical vignette with new information

Output: Likert-scale judgment on hypothesis likelihood change

Primary Metric: Concordance score vs expert panel (50% weight)

Secondary Metrics: Calibration (30%), Reasoning quality (20%)

Why These Metrics: SCTs measure how well a model updates beliefs given new evidence, which is the core of clinical reasoning. Calibration ensures the model knows what it does not know.

Clinical Stakes: Poor clinical reasoning under uncertainty leads to premature closure, anchoring bias, and missed diagnoses.
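The source does not state how concordance with the expert panel is scored. A common choice for Script Concordance Tests is aggregate scoring, where an answer earns credit equal to the number of panelists who chose it divided by the number who chose the modal answer. The sketch below implements that rule as an assumption; the function names are hypothetical.

```python
from collections import Counter


def sct_item_credit(model_answer, panel_answers):
    """Aggregate-scoring credit for one item: panel votes for the model's
    answer divided by votes for the most popular (modal) answer."""
    counts = Counter(panel_answers)
    return counts.get(model_answer, 0) / max(counts.values())


def sct_score(model_answers, panel_answers_per_item):
    """Mean item credit across all items, in [0, 1]."""
    credits = [
        sct_item_credit(answer, panel)
        for answer, panel in zip(model_answers, panel_answers_per_item)
    ]
    return sum(credits) / len(credits)
```

With a six-member panel voting 0, 0, 0, +1, +1, -1 on a five-point scale, answering 0 earns full credit, +1 earns two thirds, and an answer no panelist chose earns nothing. Partial credit reflects legitimate disagreement among experts.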

Psychometric Validation

Benchmark scenarios are validated using standard psychometric methods adapted from educational measurement.

Cronbach's alpha: Measures internal consistency within each benchmark subset. Benchmarks with α < 0.7 are flagged for item review.

Item discrimination: Each scenario is evaluated for its ability to distinguish between high- and low-performing models. Scenarios with near-zero discrimination (every model passes or every model fails) are candidates for replacement.
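One simple discrimination measure is the upper-lower index: split models into high and low groups by total score, then take the difference in mean item score between the groups. The source does not say which discrimination statistic is used, so the sketch below is one option among several (item-rest correlation is another), and the `group_fraction` parameter is hypothetical.

```python
import statistics


def discrimination_index(score_matrix, item, group_fraction=0.5):
    """Upper-lower discrimination for one item.

    score_matrix[m][i] is the score of model m on scenario i. Models are
    ordered by total score (which includes the item itself), and the index is
    the mean item score of the top group minus that of the bottom group.
    """
    ordered = sorted(score_matrix, key=sum, reverse=True)
    group_size = max(1, int(len(ordered) * group_fraction))
    upper = [row[item] for row in ordered[:group_size]]
    lower = [row[item] for row in ordered[-group_size:]]
    return statistics.mean(upper) - statistics.mean(lower)
```

A scenario that every model passes, or every model fails, yields an index of 0 and would be a candidate for replacement under the rule above.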

ICC test-retest: Intraclass Correlation Coefficient measures whether repeated evaluations (K=10) of the same model on the same scenario produce consistent scores. ICC values below 0.75 trigger investigation into scoring instability.
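Several ICC forms exist and the source does not say which one is used. As an illustration, the sketch below computes the one-way random-effects form, ICC(1,1) = (MSB - MSW) / (MSB + (k-1)·MSW), treating each scenario as a target and each of the K repeated runs as a measurement. The choice of form is an assumption.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings[s][r] is the score for scenario s on repeated run r; every
    scenario must have the same number of runs (k >= 2).
    """
    n = len(ratings)      # number of scenarios
    k = len(ratings[0])   # repeated runs per scenario
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-scenario and within-scenario mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Values near 1 indicate that most score variation comes from differences between scenarios, not from run-to-run noise; values below the 0.75 threshold would trigger the investigation described above.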

References

Maier-Hein, L. et al. "Why rankings of biomedical image analysis competitions should be interpreted with care." Nature Communications 9, 5217 (2018).

Maier-Hein, L. et al. "Metrics reloaded: recommendations for image analysis validation." Nature Methods 21(2), 195-212 (2024).

Maier-Hein, L. et al. "BIAS: Transparent reporting of biomedical image analysis challenges." Medical Image Analysis 66 (2020).