MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark consisting of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine.
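
Each question pairs a stem with four answer options and a single labeled correct choice, and performance is reported as accuracy (the original paper averages accuracy across the 57 subjects). As a minimal sketch of that format and scoring, assuming illustrative field names rather than the official dataset schema or evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class MMLUItem:
    """One MMLU-style question. Field names are illustrative,
    not the official dataset schema."""
    subject: str
    question: str
    choices: list[str]  # exactly four answer options (A-D)
    answer: int         # index (0-3) of the correct choice

def accuracy(items: list[MMLUItem], predictions: list[int]) -> float:
    """Fraction of items whose predicted choice index matches the label.
    (The original paper reports accuracy averaged per subject.)"""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)

# Toy usage with one item and a hypothetical model prediction.
items = [
    MMLUItem(
        subject="high_school_mathematics",
        question="What is the derivative of x**2?",
        choices=["x", "2x", "x**2", "2"],
        answer=1,  # "2x"
    )
]
print(f"accuracy = {accuracy(items, [1]):.1%}")  # accuracy = 100.0%
```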

It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.[1][2]

The MMLU was released by Dan Hendrycks and a team of researchers in 2020[3] and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy.[3]

The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy.[3]

As of 2024, some of the most powerful language models, such as o1, Gemini, and Claude 3, were reported to achieve scores around 90%.