| Name | Variant | Category | Published At | Dataset | Incoming Links | Outgoing Links |
| --- | --- | --- | --- | --- | --- | --- |
| AI2 Reasoning Challenge (arc) (acc/acc_norm) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 0 |
| AI2D | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| AIME (American Invitational Mathematics Examination) 2024 | Benchmark | Math | 2024-01-01 | | 0 | 8 |
| AIME 2025 | Benchmark | Math | 2024-01-01 | | 0 | 8 |
| ARC (AI2 Reasoning Challenge) easy | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| ARC Challenge (Abstraction and Reasoning Challenge) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| ARC-AGI-2 | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| AV-Odyssey Bench | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| Aider Polyglot | Benchmark | Coding | 2024-01-01 | | 0 | 5 |
| AlpacaEval | Benchmark | | 2024-01-01 | | 0 | 2 |
| Arena-Hard | Benchmark | | 2024-01-01 | | 0 | 2 |
| BFCL 3 | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 1 |
| BigBenchHard (BBH) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 4 |
| Bird-SQL (Dev) | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| BoolQ | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| BrowseComp | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 3 |
| COLLIE | Benchmark | Instruction following | 2024-01-01 | | 0 | 1 |
| CRUXEval | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| CharXiv-Reasoning | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 3 |
| ChartQA | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| Claude 3.7 | Model | | 2025-02-24 | | 8 | 0 |
| Claude 4 | Model | | 2025-05-22 | | 7 | 0 |
| CoVoST2 (21 lang) | Benchmark | Translation | 2024-01-01 | | 0 | 1 |
| Codeforces | Benchmark | Coding | 2024-01-01 | | 0 | 4 |
| DROP (Discrete Reasoning Over Paragraphs) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| DeepSeek-R1 | Model | | 2025-01-22 | | 14 | 0 |
| DocVQA | Benchmark | Perception | 2024-01-01 | | 0 | 2 |
| ERQA | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| EgoSchema | Benchmark | Perception | 2024-01-01 | | 0 | 2 |
| EvalPlus | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| FACTS Grounding | Benchmark | Factuality | 2024-01-01 | | 0 | 2 |
| FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| FrontierMath | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| GPQA (diamond) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 11 |
| GPQA (main) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| GPT4.5 | Model | | 2025-02-27 | | 6 | 0 |
| GPT5 | Model | | 2025-08-07 | | 19 | 0 |
| GSM8k | Benchmark | Math | 2024-01-01 | | 0 | 4 |
| Gemini 2.0 | Model | | 2024-12-11 | | 13 | 0 |
| Gemini 2.5 | Model | | 2025-03-25 | | 12 | 0 |
| Gemini 2.5 Pro | Model | | 2025-03-25 | | 2 | 0 |
| Global MMLU (Lite) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| Grok 3 | Model | | 2025-02-19 | | 9 | 0 |
| Grok 4 | Model | | 2025-07-09 | | 7 | 0 |
| HMMT 2025 | Benchmark | Math | 2024-01-01 | | 0 | 2 |
| HealthBench | Benchmark | Preference-Alignment | 2024-01-01 | | 0 | 1 |
| HealthBench Hard | Benchmark | Preference-Alignment | 2024-01-01 | | 0 | 1 |
| HealthBench Hard Hallucinations | Benchmark | Preference-Alignment | 2024-01-01 | | 0 | 1 |
| HiddenMath | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| HuggingFace Open LLM Leaderboard | Model | | 2025-01-01 | | 6 | 0 |
| HumanEval | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| Humanity's Last Exam | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| INCLUDE | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| Instella | Model | | 2025-03-05 | | 11 | 0 |
| Instruction-Following Evaluation (IFEval) | Benchmark | Instruction following | 2024-01-01 | | 0 | 4 |
| LOFT | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| LiveBench | Benchmark | | 2024-01-01 | | 0 | 1 |
| LiveCodeBench (v5) | Benchmark | Coding | 2024-01-01 | | 0 | 6 |
| Llama 4 | Model | | 2025-04-05 | | 1 | 0 |
| Llama 4 Maverick | Model | | 2025-04-05 | | 8 | 0 |
| LongBench 2 | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| MATH | Benchmark | Math | 2024-01-01 | | 0 | 5 |
| MATH-500 | Benchmark | Math | 2024-01-01 | | 0 | 2 |
| MBPP (Mostly Basic Python Problems Dataset) | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| MGSM (Multilingual Grade School Math Benchmark) | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| MM-MT-Bench | Benchmark | Instruction following | 2024-01-01 | | 0 | 1 |
| MMLU | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| MMLU Pro | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 8 |
| MMLU-Redux | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| MMMLU (Multilingual MMLU) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| MMMU | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 11 |
| MMMU-Pro | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| MTOB (half book and full book) | Benchmark | Translation | 2024-01-01 | | 0 | 1 |
| MathVista | Benchmark | Math | 2024-01-01 | | 0 | 4 |
| Mistral Small 3.1 | Model | | 2025-03-17 | | 17 | 0 |
| Multi-If | Benchmark | Instruction following | 2024-01-01 | | 0 | 1 |
| Multi-Round Co-reference Resolution (MRCR) (1M) | Benchmark | Long-context | 2024-01-01 | | 0 | 2 |
| MultiPL-E | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| Multistep Soft Reasoning (MuSR) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| OLMo2 32B | Model | | 2025-03-13 | | 9 | 0 |
| OmniBench | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| Physical Interaction: Question Answering (piqa) (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| PopQA | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| Qwen2.5 | Model | | 2025-03-27 | | 5 | 0 |
| Qwen2.5 Omni | Model | | 2025-03-27 | | 0 | 0 |
| Qwen3 | Model | | 2025-04-29 | | 24 | 0 |
| RULER 128k | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| RULER 32k | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| SWE Lancer Diamond | Benchmark | Coding | 2024-01-01 | | 0 | 3 |
| SWE-bench Verified | Benchmark | Coding | 2024-01-01 | | 0 | 8 |
| Scale MultiChallenge | Benchmark | Instruction following | 2024-01-01 | | 0 | 3 |
| SimpleQA | Benchmark | Factuality | 2024-01-01 | | 0 | 5 |
| Social Interaction QA (siqa) (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| SuperGPQA | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| TAU bench (Tool-Agent-User Interaction Benchmark) | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 4 |
| Tau2-bench | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 1 |
| Terminal-bench | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 1 |
| TruthfulQA | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| USAMO 2025 | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| Vibe-Eval (Reka) | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| VideoMMMU | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| commonsense_qa (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 0 |
| hellaswag (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| o3 | Model | | 2025-04-16 | | 14 | 0 |
| o4-mini | Model | | 2025-04-16 | | 14 | 0 |
| openbookqa (acc/acc_norm) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| seed-tts-eval | Benchmark | Imitation | 2024-01-01 | | 0 | 1 |
| winogrande (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |