| Name | Variant | Category | Published At | Dataset | Incoming Links | Outgoing Links |
|---|---|---|---|---|---|---|
| AI2 Reasoning Challenge (arc) (acc/acc_norm) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 0 |
| AI2D | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| AIME (American Invitational Mathematics Examination) 2024 | Benchmark | Math | 2024-01-01 | | 0 | 8 |
| AIME 2025 | Benchmark | Math | 2024-01-01 | | 0 | 8 |
| ARC (AI2 Reasoning Challenge) easy | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| ARC Challenge (Abstraction and Reasoning Challenge) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| ARC-AGI-2 | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| AV-Odyssey Bench | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| Aider Polyglot | Benchmark | Coding | 2024-01-01 | | 0 | 5 |
| AlpacaEval | Benchmark | | 2024-01-01 | | 0 | 2 |
| Arena-Hard | Benchmark | | 2024-01-01 | | 0 | 2 |
| BFCL 3 | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 1 |
| BigBenchHard (BBH) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 4 |
| Bird-SQL (Dev) | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| BoolQ | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| BrowseComp | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 3 |
| COLLIE | Benchmark | Instruction following | 2024-01-01 | | 0 | 1 |
| CRUXEval | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| CharXiv-Reasoning | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 3 |
| ChartQA | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| Claude 3.7 | Model | | 2025-02-24 | | 8 | 0 |
| Claude 4 | Model | | 2025-05-22 | | 7 | 0 |
| CoVoST2 (21 lang) | Benchmark | Translation | 2024-01-01 | | 0 | 1 |
| Codeforces | Benchmark | Coding | 2024-01-01 | | 0 | 4 |
| DROP (Discrete Reasoning Over Paragraphs) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| DeepSeek-R1 | Model | | 2025-01-22 | | 14 | 0 |
| DocVQA | Benchmark | Perception | 2024-01-01 | | 0 | 2 |
| ERQA | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| EgoSchema | Benchmark | Perception | 2024-01-01 | | 0 | 2 |
| EvalPlus | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| FACTS Grounding | Benchmark | Factuality | 2024-01-01 | | 0 | 2 |
| FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| FrontierMath | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| GPQA (diamond) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 11 |
| GPQA (main) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| GPT4.5 | Model | | 2025-02-27 | | 6 | 0 |
| GPT5 | Model | | 2025-08-07 | | 19 | 0 |
| GSM8k | Benchmark | Math | 2024-01-01 | | 0 | 4 |
| Gemini 2.0 | Model | | 2024-12-11 | | 13 | 0 |
| Gemini 2.5 | Model | | 2025-03-25 | | 12 | 0 |
| Gemini 2.5 Pro | Model | | 2025-03-25 | | 2 | 0 |
| Global MMLU (Lite) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| Grok 3 | Model | | 2025-02-19 | | 9 | 0 |
| Grok 4 | Model | | 2025-07-09 | | 7 | 0 |
| HMMT 2025 | Benchmark | Math | 2024-01-01 | | 0 | 2 |
| HealthBench | Benchmark | Preference-Alignment | 2024-01-01 | | 0 | 1 |
| HealthBench Hard | Benchmark | Preference-Alignment | 2024-01-01 | | 0 | 1 |
| HealthBench Hard Hallucinations | Benchmark | Preference-Alignment | 2024-01-01 | | 0 | 1 |
| HiddenMath | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| HuggingFace Open LLM Leaderboard | Model | | 2025-01-01 | | 6 | 0 |
| HumanEval | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| Humanity's Last Exam | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| INCLUDE | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| Instella | Model | | 2025-03-05 | | 11 | 0 |
| Instruction-Following Evaluation (IFEval) | Benchmark | Instruction following | 2024-01-01 | | 0 | 4 |
| LOFT | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| LiveBench | Benchmark | | 2024-01-01 | | 0 | 1 |
| LiveCodeBench (v5) | Benchmark | Coding | 2024-01-01 | | 0 | 6 |
| Llama 4 | Model | | 2025-04-05 | | 1 | 0 |
| Llama 4 Maverick | Model | | 2025-04-05 | | 8 | 0 |
| LongBench 2 | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| MATH | Benchmark | Math | 2024-01-01 | | 0 | 5 |
| MATH-500 | Benchmark | Math | 2024-01-01 | | 0 | 2 |
| MBPP (Mostly Basic Python Problems Dataset) | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| MGSM (Multilingual Grade School Math Benchmark) | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| MM-MT-Bench | Benchmark | Instruction following | 2024-01-01 | | 0 | 1 |
| MMLU | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| MMLU Pro | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 8 |
| MMLU-Redux | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| MMMLU (Multilingual MMLU) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 5 |
| MMMU | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 11 |
| MMMU-Pro | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| MTOB (half book and full book) | Benchmark | Translation | 2024-01-01 | | 0 | 1 |
| MathVista | Benchmark | Math | 2024-01-01 | | 0 | 4 |
| Mistral Small 3.1 | Model | | 2025-03-17 | | 17 | 0 |
| Multi-If | Benchmark | Instruction following | 2024-01-01 | | 0 | 1 |
| Multi-Round Co-reference Resolution (MRCR) (1M) | Benchmark | Long-context | 2024-01-01 | | 0 | 2 |
| MultiPL-E | Benchmark | Coding | 2024-01-01 | | 0 | 1 |
| Multistep Soft Reasoning (MuSR) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| OLMo2 32B | Model | | 2025-03-13 | | 9 | 0 |
| OmniBench | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| Physical Interaction: Question Answering (piqa) (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| PopQA | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| Qwen2.5 | Model | | 2025-03-27 | | 5 | 0 |
| Qwen2.5 Omni | Model | | 2025-03-27 | | 0 | 0 |
| Qwen3 | Model | | 2025-04-29 | | 24 | 0 |
| RULER 128k | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| RULER 32k | Benchmark | Long-context | 2024-01-01 | | 0 | 1 |
| SWE Lancer Diamond | Benchmark | Coding | 2024-01-01 | | 0 | 3 |
| SWE-bench Verified | Benchmark | Coding | 2024-01-01 | | 0 | 8 |
| Scale MultiChallenge | Benchmark | Instruction following | 2024-01-01 | | 0 | 3 |
| SimpleQA | Benchmark | Factuality | 2024-01-01 | | 0 | 5 |
| Social Interaction QA (siqa) (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| SuperGPQA | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| TAU bench (Tool-Agent-User Interaction Benchmark) | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 4 |
| Tau2-bench | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 1 |
| Terminal-bench | Benchmark | Agentic task execution | 2024-01-01 | | 0 | 1 |
| TruthfulQA | Benchmark | Factuality | 2024-01-01 | | 0 | 1 |
| USAMO 2025 | Benchmark | Math | 2024-01-01 | | 0 | 1 |
| Vibe-Eval (Reka) | Benchmark | Perception | 2024-01-01 | | 0 | 1 |
| VideoMMMU | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 2 |
| commonsense_qa (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 0 |
| hellaswag (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |
| o3 | Model | | 2025-04-16 | | 14 | 0 |
| o4-mini | Model | | 2025-04-16 | | 14 | 0 |
| openbookqa (acc/acc_norm) | Benchmark | Reasoning and knowledge | 2024-01-01 | | 0 | 1 |
| seed-tts-eval | Benchmark | Imitation | 2024-01-01 | | 0 | 1 |
| winogrande (acc/acc_norm) | Benchmark | Commonsense | 2024-01-01 | | 0 | 1 |