Synthesized from the database. All benchmarks and frameworks below are drawn directly from papers and articles saved in that database.

Prompt: Go through my database and write a comprehensive memo detailing the information you find on AI benchmarks, share the benchmarks you find, and then try to synthesize what makes a good benchmark and the commonly used metrics.


Overview

AI benchmarking has become one of the most contested and consequential sub-fields in machine learning. Benchmarks serve as the shared yardstick by which models are compared, resources are allocated, and capabilities are marketed to the public. Yet as a Nature article saved in the database bluntly put it: "AI now beats humans at basic tasks — new benchmarks are needed." This memo catalogues the benchmarks found in the database, organizes them by domain, and synthesizes the common threads around what makes a benchmark rigorous, durable, and useful.


Part I: The Benchmarks — A Catalogue

🧠 Language Understanding & General Reasoning

MMLU (Massive Multitask Language Understanding)

Frequently referenced across multiple papers (particularly the BALROG paper), MMLU is a broad multi-domain benchmark covering 57 subjects ranging from elementary mathematics to professional law. It became the de facto standard for measuring general-purpose language understanding and has been largely saturated by frontier models.
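Since MMLU is scored as multiple-choice accuracy, the metric itself is simple to state. The sketch below is an illustrative assumption about the setup (it is not the official evaluation harness): each question has one gold answer letter, and the headline number is overall accuracy, often broken down per subject.

```python
# Minimal sketch of MMLU-style scoring (an assumption, not the official
# harness): each question is multiple-choice, and the reported metric is
# plain accuracy, optionally grouped by subject.
from collections import defaultdict

def mmlu_accuracy(examples):
    """examples: list of dicts with 'subject', 'answer', 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["subject"]] += 1
        if ex["prediction"] == ex["answer"]:
            correct[ex["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject

# Hypothetical toy data for illustration only.
examples = [
    {"subject": "elementary_math", "answer": "B", "prediction": "B"},
    {"subject": "elementary_math", "answer": "C", "prediction": "A"},
    {"subject": "professional_law", "answer": "D", "prediction": "D"},
]
overall, per_subject = mmlu_accuracy(examples)
print(round(overall, 3))  # 0.667
```

Per-subject breakdowns matter because an aggregate score can mask large gaps between, say, elementary mathematics and professional law.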

SuperGLUE

The successor to GLUE, SuperGLUE tests general-purpose language understanding across tasks like question answering, coreference resolution, and natural language inference. Referenced in both the BALROG paper and an Import AI newsletter (which covered Baidu taking the SuperGLUE crown), it too has been largely saturated.

BIG-bench (Beyond the Imitation Game Benchmark)

BIG-bench expands the scope to diverse linguistic and cognitive challenges deliberately chosen to be hard for current models, a forward-looking attempt to stay ahead of saturation.