In artificial intelligence, Humanity's Last Exam (HLE) is a benchmark for evaluating the capabilities of large language models.
It comprises 3,000 unambiguous, easily verifiable academic questions spanning mathematics, the humanities, and the natural sciences, contributed by nearly 1,000 subject-matter experts from over 500 institutions across 50 countries, and is designed to test expert-level performance on closed-ended academic questions.[citation needed]
Earlier benchmarks such as MMLU had become saturated, with leading models scoring above 90% accuracy; in response, HLE was introduced to provide a more challenging and comprehensive assessment tool.[citation needed]
The dataset is multimodal: approximately 10% of the questions require both image and text comprehension, while the remaining 90% are text-only.[citation needed]
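The modality split can be checked directly against the published dataset. The sketch below is a minimal illustration, assuming the dataset is distributed on Hugging Face under the cais/hle identifier with a "test" split and an "image" field that is empty for text-only questions; the identifier and field names are assumptions and may need adjusting to match the actual release.

```python
# Minimal sketch: estimate the text-only vs. multimodal share of HLE questions.
# Assumes the dataset is published on Hugging Face as "cais/hle" with a "test"
# split and an "image" field that is empty for text-only questions (both are
# assumptions). The dataset may be gated and require a Hugging Face access token.
from datasets import load_dataset

ds = load_dataset("cais/hle", split="test")

total = len(ds)
multimodal = sum(1 for row in ds if row.get("image"))
text_only = total - multimodal

print(f"total questions:      {total}")
print(f"multimodal questions: {multimodal} ({multimodal / total:.1%})")
print(f"text-only questions:  {text_only} ({text_only / total:.1%})")
```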
State-of-the-art LLMs have demonstrated low accuracy on HLE, highlighting substantial room for improvement.