Leaderboard

Last updated: Apr 4, 2026, 05:00 AM UTC

Models Evaluated: 26
Scenarios: 245
Mean Safety: 0.87
Adversarial Delta: -8.6%

26 models
Model              Type           Region  Overall  Safety  Adversarial  Calibration  Trust  Worst-of-K
Claude 3.5 Sonnet  Frontier       US      0.87     1.00    0.76         0.92         0.95   0.85
Nuance DAX         Scribe         US      0.85     0.97    0.72         0.81         0.85   0.81
GPT-4o             Frontier       US      0.85     0.98    0.78         0.90         0.98   0.86
MedGemma           Medical        US      0.84     0.96    0.80         0.85         0.88   0.82
Heidi Health       Scribe         AU      0.84     0.91    0.73         0.82         0.92   0.79
OpenEvidence       Clinical Tool  US      0.83     0.93    0.67         0.77         0.80   0.73
Abridge            Scribe         US      0.82     0.88    0.65         0.77         0.80   0.73
Gemini Pro         Frontier       US      0.82     0.95    0.80         0.87         0.92   0.80
ScribeBerry        Scribe         Canada  0.81     0.95    0.74         0.84         0.86   0.77
Llama 4            Open Source    US      0.80     0.93    0.71         0.85         0.86   0.76
Mistral Large 3    Frontier       EU      0.79     0.96    0.77         0.80         0.83   0.75
Freed              Scribe         US      0.79     0.86    0.69         0.85         0.79   0.70
Glass Health       Clinical Tool  US      0.78     0.85    0.67         0.73         0.83   0.70
DeepSeek R1        Open Source    China   0.78     0.79    0.69         0.69         0.71   0.65
DeepSeek V3        Open Source    China   0.77     0.85    0.68         0.74         0.82   0.71
DeepCura           Scribe         US      0.76     0.81    0.68         0.75         0.77   0.64
Qwen 2.5           Open Source    China   0.76     0.87    0.67         0.78         0.79   0.68
HyperCLOVA X       Frontier       Korea   0.75     0.91    0.62         0.78         0.83   0.69
Command R+         Frontier       Canada  0.74     0.81    0.56         0.69         0.70   0.61
Med42-70B          Medical        UAE     0.73     0.84    0.72         0.80         0.87   0.68
MEDITRON 70B       Medical        EU      0.72     0.81    0.66         0.77         0.79   0.59
CyberAgent CALM3   Frontier       Japan   0.71     0.85    0.67         0.76         0.82   0.61
BioMistral 7B      Medical        EU      0.69     0.78    0.66         0.69         0.69   0.57
OpenBioLLM         Medical        US      0.69     0.82    0.58         0.72         0.71   0.52
Sarvam AI          Frontier       India   0.64     0.75    0.59         0.70         0.72   0.40
Mock Model         Frontier       US      0.51     0.61    0.46         0.46         0.49   0.26
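
The Mean Safety figure in the summary can be cross-checked against the Safety column of the table above. The page does not say how the summary is aggregated, so the sketch below assumes an unweighted arithmetic mean over all 26 models; the variable names are illustrative only. Under that assumption the result rounds to the reported 0.87.

```python
from statistics import mean

# Safety scores transcribed from the leaderboard table, in row order
# (Claude 3.5 Sonnet first, Mock Model last).
safety_scores = [
    1.00, 0.97, 0.98, 0.96, 0.91, 0.93, 0.88, 0.95, 0.95, 0.93,
    0.96, 0.86, 0.85, 0.79, 0.85, 0.81, 0.87, 0.91, 0.81, 0.84,
    0.81, 0.85, 0.78, 0.82, 0.75, 0.61,
]

# Assumption: "Mean Safety" is a plain, unweighted mean across models.
# The page does not state the aggregation, so treat this as a sanity check.
mean_safety = mean(safety_scores)

print(f"models: {len(safety_scores)}")
print(f"mean safety: {mean_safety:.2f}")
```

The Adversarial Delta is not reproduced here because the page does not define how it is derived from the table columns.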

Standard vs Adversarial Performance