Benchmark guides

Research-grade guides on the AI evaluations that move procurement decisions. History, methodology, contamination notes, common gaming patterns, and a live leaderboard hydrated from the Benchlist registry.

24 deep guides · 26 stubs in progress · 50 articles total
code agent · deep
HumanEval
164-problem code-generation benchmark · pass@1 · MIT
code agent · deep
HumanEval+
EvalPlus hardening of HumanEval · 80× more unit tests
code agent · deep
MBPP
Mostly Basic Python Problems · 974 entry-level tasks
code agent · stub
MBPP+
EvalPlus hardening over the sanitized MBPP subset.
code agent · deep
BigCodeBench
1,140 hard programming problems · function calls into 139 libraries
code agent · deep
LiveCodeBench
Continuously updated competitive programming · contamination-free by design
code agent · deep
SWE-bench Verified
500 hand-verified real-world GitHub issues · pip-installable patch
code agent · deep
SWE-bench Lite
300 lighter SWE-bench tasks · Princeton subset for fast evaluation
reasoning · deep
MMLU
57 academic subjects · 14k multiple-choice questions
reasoning · deep
MMLU-Pro
MMLU's harder, less-contaminated successor · 12k problems
reasoning · deep
GPQA Diamond
Graduate-level Google-Proof Q&A · 198 hard PhD-level questions
reasoning · stub
BIG-Bench Hard
23 challenging reasoning tasks pulled from BIG-Bench.
reasoning · deep
ARC-Challenge
AI2 Reasoning Challenge · 2,590 hard grade-school science MC questions (Challenge set of the 7,787-question ARC corpus)
reasoning · deep
HellaSwag
Sentence completion stress test · 10k validation examples
reasoning · deep
WinoGrande
44k commonsense pronoun-resolution problems
reasoning · stub
PIQA
Physical commonsense reasoning, binary choice.
reasoning · stub
MuSR
Multistep soft reasoning: murder mysteries, team allocation.
reasoning · stub
AGIEval
Human-exam questions: SAT, GRE, Chinese civil service.
reasoning · deep
GSM8K
8.5K grade-school math word problems · numeric exact match
reasoning · deep
MATH
12.5K competition math problems · the parent of MATH-500
reasoning · deep
AIME 2024
American Invitational Mathematics Examination · 30 hard problems per year
reasoning · deep
MGSM
Multilingual Grade School Math · 250 GSM8K problems translated into 10 languages
reasoning · stub
TheoremQA
Theorem-grounded STEM problems requiring numeric/expression answers.
agent framework · deep
τ-Bench
Tool-using agent evaluation · realistic multi-turn customer-service tasks
agent framework · stub
GAIA
General AI assistants, 466 real-world tasks, exact-match scoring.
agent framework · stub
WebArena
Realistic web navigation tasks in hosted shopping / CMS / GitLab environments.
code agent · stub
SWE-rebench
Monthly-refreshed SWE-bench variant, contamination-resistant.
safety · deep
TruthfulQA MC1
817 questions designed to elicit common falsehoods
safety · deep
SimpleQA
4k factual questions designed to surface hallucination
reasoning · deep
CommonsenseQA
12k commonsense multiple-choice questions · 5 options each
memory · stub
RULER
NVIDIA's 13-task long-context suite at configurable window sizes.
memory · stub
NIAH · Needle in a Haystack
Retrieve a seeded fact from 4k → 128k token contexts. Deterministic generator.
memory · stub
LongMemEval
Long-conversation memory Q&A. GPT-4o canonical judge.
benchmark · stub
InfiniteBench
AI evaluation benchmark
reasoning · deep
MMMU
Massive Multi-discipline Multi-modal Understanding · 11.5k expert-level questions
reasoning · stub
MathVista
Math reasoning with visual context (1k testmini).
code agent · stub
SWE-bench Multimodal
JavaScript issues with visual context. Docker harness required.
benchmark · stub
Creative Writing
AI evaluation benchmark
benchmark · stub
WritingBench
AI evaluation benchmark
benchmark · stub
AlpacaEval
AI evaluation benchmark
reasoning · stub
LiveBench
Monthly-refreshed benchmark across math, coding, reasoning, data analysis, language.
benchmark · stub
Scalable Agentic Bench
AI evaluation benchmark
benchmark · stub
MLE-bench
AI evaluation benchmark
benchmark · stub
Finbench
AI evaluation benchmark
reasoning · stub
MedQA
USMLE-style medical licensing exam questions.
benchmark · stub
LawBench
AI evaluation benchmark
benchmark · stub
ChemBench
AI evaluation benchmark
benchmark · deep
FrontierMath
Hidden expert-curated math problems · designed to be unsolved
benchmark · stub
Humanity's Last Exam
AI evaluation benchmark
reasoning · deep
ARC-AGI
Abstraction & Reasoning Corpus · the few-shot pattern test that won't die