GPT-4o

Stable, Frontier, Model

OpenAI

No score data available yet. Run an evaluation to see results here.

Not enough data points for trends. Need 2+ evaluations.

No reliability data available (K=1). Run evaluations with K>1 to see worst-of-K distributions.

Insufficient calibration data for this model. Need at least 5 samples with calibration scores.

No benchmark results yet. Evaluate this model to see per-benchmark performance.

Standard Benchmarks