Longitudinal Tracking & Drift Detection
How we detect when healthcare AI models silently change—and why nobody else does this.
Why Temporal Tracking Matters
Models update silently. GPT-4o in January is not the same model as GPT-4o in March. Without continuous tracking, nobody knows when a model gets better, worse, or unsafe.
Provider changelogs are inconsistent or absent. Weight updates, RLHF refinements, and safety patches all shift behavior in ways that matter for clinical use. The only way to know is to measure.
Our Approach
Same benchmarks, same ~400 clinical scenarios, same fixed scaffolding, every month. No prompt changes, no scoring adjustments between runs. Each scenario is run K=10 times. The only variable is the model itself. We currently track 3 models.
Model A dropped from 0.84 to 0.74 between Feb and Apr 2026 (paired bootstrap p < 0.01). The drop coincides with the provider version bump gpt-4o-2025-02 → gpt-4o-2025-04. A GitHub alert was issued automatically via model-version-check.yml.
monitoring/regression.py
Each evaluation run is a full pass across every benchmark subset, including adversarial scenarios. Scores, confidence intervals, and per-scenario results are stored permanently so any comparison can be re-derived.
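Because per-scenario results are stored, a paired comparison between two runs can be recomputed at any time. The following is a minimal sketch of a paired bootstrap on per-scenario scores, not the contents of monitoring/regression.py. The function name, the centring-based null, and the resample count are assumptions made for illustration.

```python
import random


def paired_bootstrap_pvalue(before, after, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap p-value for the mean per-scenario change.

    `before` and `after` hold one score per scenario, in the same order.
    Hypothetical helper: the real pipeline may aggregate differently.
    """
    if len(before) != len(after):
        raise ValueError("scores must be paired per scenario")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Centre the differences so resampling simulates "no mean change".
    centred = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resampled_mean = sum(rng.choice(centred) for _ in range(n)) / n
        if abs(resampled_mean) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)
```

Pairing by scenario removes between-scenario variance from the comparison, which is why the same fixed scenario set matters: the test only sees how each scenario's score moved.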
Drift Detection Methodology
We use two complementary non-parametric statistical tests to detect performance changes, plus an effect-size measure to gauge their magnitude:
- Kolmogorov-Smirnov test: Detects distributional changes in score patterns. Catches shifts in variance and shape, not just the mean.
- Mann-Whitney U test: Compares ranked performance between evaluation periods. Robust to outliers and non-normal distributions common in clinical scoring.
- Cohen's d effect size: Quantifies the magnitude of the change. A statistically significant drift with a tiny effect size is flagged differently from a large regression.
Non-parametric tests are essential here because clinical evaluation scores are rarely normally distributed. Parametric alternatives such as the t-test can produce unreliable p-values on this data.
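To make the three quantities concrete, here is a small standard-library sketch that computes each one for two lists of scores. It is illustrative only: the function names are hypothetical, the Mann-Whitney p-value uses a plain normal approximation without tie or continuity correction, and a production pipeline would more likely call library routines such as SciPy's `scipy.stats.ks_2samp` and `scipy.stats.mannwhitneyu`.

```python
import math
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def mann_whitney_u(x, y):
    """U statistic for x (ties count 0.5) and an approximate two-sided p."""
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n, m = len(x), len(y)
    mu = n * m / 2
    sigma = math.sqrt(n * m * (n + m + 1) / 12)  # assumes few ties
    z = (u - mu) / sigma
    return u, 2 * (1 - NormalDist().cdf(abs(z)))


def cohens_d(x, y):
    """Standardised mean difference using the pooled sample std deviation."""
    n, m = len(x), len(y)
    pooled_var = ((n - 1) * stdev(x) ** 2 + (m - 1) * stdev(y) ** 2) / (n + m - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)
```

Reading the three together is the point: the KS statistic can flag a change in shape that leaves the mean untouched, the U test flags a shift in ranks, and Cohen's d says whether a significant result is large enough to matter.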
Regression Detection Framework
Each monitoring run uses a two-pronged scenario selection strategy:
- Stratified selection: A representative subset covering all risk levels and clinical domains ensures broad regression coverage across the ~400 clinical scenarios.
- Worst-case subset: Scenarios where the model previously scored lowest are re-evaluated every run, providing sensitivity to regressions in known weak areas.
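The two selection prongs above can be sketched as follows. This is an assumption-laden illustration, not the production selector: the scenario fields `id`, `risk`, and `domain`, the function name, and the per-stratum and worst-case counts are all hypothetical.

```python
import random
from collections import defaultdict


def select_scenarios(scenarios, prior_scores, per_stratum=2, n_worst=5, seed=0):
    """Pick a stratified sample plus the lowest-scoring scenarios.

    scenarios: list of dicts with hypothetical keys 'id', 'risk', 'domain'.
    prior_scores: dict mapping scenario id to its score in the last run.
    """
    rng = random.Random(seed)

    # Group scenario ids by (risk level, clinical domain).
    strata = defaultdict(list)
    for s in scenarios:
        strata[(s["risk"], s["domain"])].append(s["id"])

    # Stratified prong: sample from every stratum so none is left out.
    stratified = []
    for key in sorted(strata):
        ids = sorted(strata[key])
        stratified.extend(rng.sample(ids, min(per_stratum, len(ids))))

    # Worst-case prong: scenarios with the lowest previous scores.
    worst = sorted(prior_scores, key=prior_scores.get)[:n_worst]

    # Merge both prongs, keeping order and dropping duplicates.
    return list(dict.fromkeys(stratified + worst))
```

Fixing the seed keeps the stratified sample identical from run to run, consistent with the goal of holding everything constant except the model.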
Asymmetric safety thresholds. Safety regressions are held to stricter thresholds than performance regressions. A 3% drop in escalation_safety or refusal_safety triggers an alert, whereas accuracy-related metrics (QWK, NDCG, BERTScore) must drop by 5% to trigger the same alert level. This asymmetry reflects the clinical reality that safety failures carry higher consequences than accuracy degradation.
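A threshold check along these lines might look like the sketch below. The text does not say whether the 3% and 5% figures are relative or absolute drops; this example assumes relative drops against the previous run and treats a drop exactly at the threshold as an alert. The function and constant names are hypothetical.

```python
# Metric names are taken from the text; everything else is an assumption.
SAFETY_METRICS = {"escalation_safety", "refusal_safety"}
SAFETY_DROP = 0.03    # assumed: relative drop vs. the previous run
ACCURACY_DROP = 0.05  # assumed: relative drop vs. the previous run


def regression_alerts(previous, current):
    """Return the metrics whose relative drop meets their alert threshold."""
    alerts = []
    for metric, prev in previous.items():
        if metric not in current or prev <= 0:
            continue  # nothing to compare against
        drop = (prev - current[metric]) / prev
        threshold = SAFETY_DROP if metric in SAFETY_METRICS else ACCURACY_DROP
        if drop >= threshold:
            alerts.append(metric)
    return alerts
```

With these assumptions, a fall in escalation_safety from 0.90 to 0.86 raises an alert, while a fall in QWK from 0.80 to 0.77 does not, even though both are drops of a similar size.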
Drift Alerts
When drift is detected, alerts are surfaced directly on the model profile page with before/after data.
Alerts persist on the model profile until the next evaluation confirms recovery or the regression stabilizes. Persistent drift triggers a full re-evaluation across all benchmarks.
Nobody Else Does This
HealthBench tests once. MedHELM tests once. We test every month.
A model that was safe in January might not be safe in March. Silent model updates have been documented to cause 10–20% performance swings on established benchmarks. In healthcare, that is the difference between a safe recommendation and a dangerous one.
Our longitudinal tracking is inspired by MarginLabs' SWE-bench temporal analysis, adapted for clinical AI. The key insight is the same: point-in-time evaluation is insufficient for systems that change under you. Continuous monitoring is not optional for clinical deployment.
- Point-in-time benchmarks tell you how a model performed on one day. We tell you how it performs over time.
- Self-reported evals let vendors cherry-pick their best day. Our independent monthly tracking shows the full trajectory.
- Drift detection is a clinical safety requirement, not a nice-to-have. We treat it accordingly.