Benchmarks

  • AI2 Reasoning Challenge (arc) (acc/acc_norm)
  • AI2D
  • AIME (American Invitational Mathematics Examination) 2024
  • AIME 2025
  • ARC (AI2 Reasoning Challenge) easy
  • ARC Challenge (Abstraction and Reasoning Challenge)
  • ARC-AGI-2
  • AV-Odyssey Bench
  • Aider Polyglot
  • AlpacaEval
  • Arena-Hard
  • BFCL 3
  • BIG-Bench Hard (BBH)
  • Bird-SQL (Dev)
  • BoolQ
  • BrowseComp
  • COLLIE
  • CRUXEval
  • CharXiv-Reasoning
  • ChartQA
  • CoVoST2 (21 lang)
  • Codeforces
  • DROP (Discrete Reasoning Over Paragraphs)
  • DocVQA
  • ERQA
  • EgoSchema
  • EvalPlus
  • FACTS Grounding
  • FRAMES (Factuality, Retrieval, And reasoning MEasurement Set)
  • FrontierMath
  • GPQA (diamond)
  • GPQA (main)
  • GSM8K
  • Global MMLU (Lite)
  • HMMT 2025
  • HealthBench
  • HealthBench Hard
  • HealthBench Hard Hallucinations
  • HiddenMath
  • HumanEval
  • Humanity's Last Exam
  • INCLUDE
  • Instruction-Following Evaluation (IFEval)
  • LOFT
  • LiveBench
  • LiveCodeBench (v5)
  • LongBench 2
  • MATH
  • MATH-500
  • MBPP (Mostly Basic Python Problems Dataset)
  • MGSM (Multilingual Grade School Math Benchmark)
  • MM-MT-Bench
  • MMLU
  • MMLU Pro
  • MMLU-Redux
  • MMMLU (Multilingual MMLU)
  • MMMU
  • MMMU-Pro
  • MTOB (half book and full book)
  • MathVista
  • Multi-If
  • Multi-Round Co-reference Resolution (MRCR) (1M)
  • MultiPL-E
  • Multistep Soft Reasoning (MuSR)
  • OmniBench
  • Physical Interaction: Question Answering (piqa) (acc/acc_norm)
  • PopQA
  • RULER 128k
  • RULER 32k
  • SWE Lancer Diamond
  • SWE-bench Verified
  • Scale MultiChallenge
  • SimpleQA
  • Social Interaction QA (siqa) (acc/acc_norm)
  • SuperGPQA
  • TAU bench (Tool-Agent-User Interaction Benchmark)
  • Tau2-bench
  • Terminal-bench
  • TruthfulQA
  • USAMO 2025
  • Vibe-Eval (Reka)
  • VideoMMMU
  • commonsense_qa (acc/acc_norm)
  • hellaswag (acc/acc_norm)
  • openbookqa (acc/acc_norm)
  • seed-tts-eval
  • winogrande (acc/acc_norm)

Datasets

Models

  • Claude 3.7
  • Claude 4
  • DeepSeek-R1
  • GPT-4.5
  • GPT-5
  • Gemini 2.0
  • Gemini 2.5
  • Gemini 2.5 Pro
  • Grok 3
  • Grok 4
  • HuggingFace Open LLM Leaderboard
  • Instella
  • Llama 4
  • Llama 4 Maverick
  • Mistral Small 3.1
  • OLMo2 32B
  • Qwen2.5
  • Qwen2.5 Omni
  • Qwen3
  • o3
  • o4-mini

Organizations

People

o3

Model