Leaderboard

Last updated: Apr 4, 2026, 05:00 AM UTC

Models Evaluated: 26
Scenarios: 245
Mean Safety: 0.87
Adversarial Delta: -8.6%

26 models
Model              Type                Region  Overall  Safety  Adversarial  Calibration  Trust  Worst-of-K
Claude 3.5 Sonnet  Frontier            US      0.87     1.00    0.76         0.92         0.95   0.85
Nuance DAX         Ambient Scribe      US      0.85     0.97    0.72         0.81         0.85   0.81
GPT-4o             Frontier            US      0.85     0.98    0.78         0.90         0.98   0.86
MedGemma           Medical Specialist  US      0.84     0.96    0.80         0.85         0.88   0.82
Heidi Health       Ambient Scribe      AU      0.84     0.91    0.73         0.82         0.92   0.79
OpenEvidence       CDS                 US      0.83     0.93    0.67         0.77         0.80   0.73
Abridge            Ambient Scribe      US      0.82     0.88    0.65         0.77         0.80   0.73
Gemini Pro         Frontier            US      0.82     0.95    0.80         0.87         0.92   0.80
ScribeBerry        Ambient Scribe      Canada  0.81     0.95    0.74         0.84         0.86   0.77
Llama 4            Open Source         US      0.80     0.93    0.71         0.85         0.86   0.76
Mistral Large 3    Frontier            EU      0.79     0.96    0.77         0.80         0.83   0.75
Freed              Ambient Scribe      US      0.79     0.86    0.69         0.85         0.79   0.70
Glass Health       CDS                 US      0.78     0.85    0.67         0.73         0.83   0.70
DeepSeek R1        Open Source         China   0.78     0.79    0.69         0.69         0.71   0.65
DeepSeek V3        Open Source         China   0.77     0.85    0.68         0.74         0.82   0.71
DeepCura           Ambient Scribe      US      0.76     0.81    0.68         0.75         0.77   0.64
Qwen 2.5           Open Source         China   0.76     0.87    0.67         0.78         0.79   0.68
HyperCLOVA X       Frontier            Korea   0.75     0.91    0.62         0.78         0.83   0.69
Command R+         Frontier            Canada  0.74     0.81    0.56         0.69         0.70   0.61
Med42-70B          Medical Specialist  UAE     0.73     0.84    0.72         0.80         0.87   0.68
MEDITRON 70B       Medical Specialist  EU      0.72     0.81    0.66         0.77         0.79   0.59
CyberAgent CALM3   Frontier            Japan   0.71     0.85    0.67         0.76         0.82   0.61
BioMistral 7B      Medical Specialist  EU      0.69     0.78    0.66         0.69         0.69   0.57
OpenBioLLM         Medical Specialist  US      0.69     0.82    0.58         0.72         0.71   0.52
Sarvam AI          Frontier            India   0.64     0.75    0.59         0.70         0.72   0.40
Mock Model         Frontier            US      0.51     0.61    0.46         0.46         0.49   0.26
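
The page does not define how the Mean Safety summary figure is derived. A minimal sketch, assuming it is the unweighted average of the Safety column across all 26 rows in the table above (Mock Model included), reproduces the reported 0.87. The names `SAFETY_SCORES` and `mean_safety` are illustrative and not part of the leaderboard itself.

```python
from statistics import mean

# Safety scores copied from the table above, in row order
# (Claude 3.5 Sonnet first, Mock Model last).
SAFETY_SCORES = [
    1.00, 0.97, 0.98, 0.96, 0.91, 0.93, 0.88, 0.95, 0.95,
    0.93, 0.96, 0.86, 0.85, 0.79, 0.85, 0.81, 0.87, 0.91,
    0.81, 0.84, 0.81, 0.85, 0.78, 0.82, 0.75, 0.61,
]


def mean_safety(scores):
    """Unweighted mean of the scores, rounded to two decimals.

    Assumption: the summary figure is a plain average over all
    listed models; the page may weight or filter rows differently.
    """
    return round(mean(scores), 2)


if __name__ == "__main__":
    print(f"{len(SAFETY_SCORES)} models, mean safety {mean_safety(SAFETY_SCORES):.2f}")
```

The Adversarial Delta figure cannot be checked the same way, because the standard (non-adversarial) baseline it is measured against does not appear in the table.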

Standard vs Adversarial Performance