GPT-4o
Stable · Frontier · OpenAI
No score data available yet. Run an evaluation to see results here.
Not enough data points for trends. Need 2+ evaluations.
No reliability data available (K=1). Run evaluations with K>1 to see worst-of-K distributions.
Insufficient calibration data for this model. Need at least 5 samples with calibration scores.
No benchmark results yet. Evaluate this model to see per-benchmark performance.
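The reliability note refers to worst-of-K scoring. As a rough sketch, assuming it means running each item K times and keeping the lowest score per item (the dashboard's exact definition is not given here), it could be computed as follows. The function name `worst_of_k` and the sample data are hypothetical.

```python
def worst_of_k(scores_per_item):
    """Return the lowest score across the K runs of each item.

    Assumption: each inner list holds the K scores for one item.
    """
    return [min(runs) for runs in scores_per_item]


# Hypothetical data: 3 items, K=3 runs each (1.0 = correct, 0.0 = incorrect).
runs = [
    [1.0, 1.0, 1.0],  # always correct
    [1.0, 0.0, 1.0],  # one failure
    [0.0, 0.0, 1.0],  # mostly failing
]

worst = worst_of_k(runs)

# Mean accuracy across all runs versus the share of items that never fail.
mean_rate = sum(sum(r) / len(r) for r in runs) / len(runs)
worst_rate = sum(worst) / len(worst)

print(f"mean accuracy: {mean_rate:.3f}, worst-of-K accuracy: {worst_rate:.3f}")
```

With K=1 the two rates are identical, which is presumably why the dashboard asks for K>1 before showing a distribution.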
Standard Benchmarks
| Benchmark | Score | Items |
|---|---|---|
| MedQA (USMLE) | 91.1% | 1273 |
| MMLU Medical | 92.4% | 772 |
| PubMedQA | 77.7% | 500 |
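Because the three benchmarks differ in size, a plain average of the scores and an item-weighted average give different summaries. The sketch below shows both, using only the rows of the table above. The function names are illustrative and not part of the dashboard.

```python
# (benchmark, score in percent, number of items), copied from the table above.
benchmarks = [
    ("MedQA (USMLE)", 91.1, 1273),
    ("MMLU Medical", 92.4, 772),
    ("PubMedQA", 77.7, 500),
]


def macro_average(rows):
    """Unweighted mean: every benchmark counts equally."""
    return sum(score for _, score, _ in rows) / len(rows)


def item_weighted_average(rows):
    """Mean weighted by item count: every item counts equally."""
    total_items = sum(n for _, _, n in rows)
    return sum(score * n for _, score, n in rows) / total_items


macro = macro_average(benchmarks)             # about 87.07
weighted = item_weighted_average(benchmarks)  # about 88.86

print(f"macro: {macro:.2f}%  item-weighted: {weighted:.2f}%")
```

The item-weighted figure is higher here because the largest benchmark, MedQA, also has one of the higher scores, while the smallest, PubMedQA, has the lowest.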