Humanity's Last Exam 2026: Only 3 AI Models Passed – Here Are the Results
2026-02-22 | Benchmarks
Forget GPQA Diamond. Forget SWE-bench. There is now a test so hard that PhD researchers in 100+ fields collaborated to create questions specifically designed to be unsolvable by AI.

It's called **Humanity's Last Exam (HLE)**, and it was supposed to be the benchmark that AI couldn't beat for years.

It lasted four months.
---
What Is Humanity's Last Exam?
HLE was created by the Center for AI Safety in late 2025. The setup is brutal:
* **3,000 questions** written by experts across mathematics, law, medicine, philosophy, music theory, ancient languages, and obscure scientific disciplines
* Questions are designed to require **expert-level domain knowledge** combined with **multi-step reasoning**
* No multiple choice – all answers are free-form
* Vetted to ensure no question appears in any public training dataset
The passing threshold was set at 45%. When it launched in October 2025, the best AI scored **18.2%** (GPT-5). The creators expected the barrier to hold until at least 2027.
---
The February 2026 Results
| Rank | Model | HLE Score | Passed? | Notable |
| :---: | :--- | :---: | :---: | :--- |
| 🥇 1 | **Kimi k2.5** (Moonshot AI) | **50.2%** | ✅ | Agent Swarm mode enabled |
| 🥈 2 | **Claude Opus 4.6** (Anthropic) | **48.1%** | ✅ | Adaptive Thinking activated |
| 🥉 3 | **GPT-5.2 Pro** (OpenAI) | **46.8%** | ✅ | Deep Research mode |
| 4 | Grok 4.2 Heavy (xAI) | 44.4% | ❌ | Failed humanities by 3 points |
| 5 | Gemini 3 Pro (Google) | 43.7% | ❌ | Strong on science, weak on philosophy |
| 6 | GPT-5.2 (Standard) | 41.3% | ❌ | No Deep Research mode |
| 7 | DeepSeek R1 | 38.5% | ❌ | Best open-source score |
| 8 | Llama 4 (70B) | 29.1% | ❌ | Context window limitation |
---
Why Kimi k2.5 Won (And Nobody Expected It)
The biggest shockwave wasn't that AI passed at all – it was that a Chinese open-source model from **Moonshot AI** beat both OpenAI and Anthropic.
**How it did it: Agent Swarm.**

Kimi k2.5 doesn't brute-force answers. Instead, when faced with a complex question, it:

1. **Decomposes** the question into 3-5 sub-problems
2. **Spawns specialized sub-agents** – one for mathematical reasoning, one for domain knowledge retrieval, one for logical consistency checking
3. **Merges** the sub-agent outputs and performs contradiction resolution
4. **Self-critiques** the merged answer before submitting
This multi-agent approach cracked questions that monolithic models couldn't. For example, on a Byzantine music theory question requiring knowledge of both 8th-century notation systems AND modern harmonic analysis, Kimi's "History Agent" and "Music Theory Agent" independently solved their parts and merged a correct answer. GPT-5.2 hallucinated the notation system entirely.
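The four steps above can be sketched as a toy pipeline. Everything below is a hypothetical illustration – the function names, the naive "and"-based decomposition, and the fixed confidence scores are stand-ins, not Moonshot AI's actual Agent Swarm implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class SubAnswer:
    agent: str        # which specialist produced this
    claim: str        # the sub-answer text
    confidence: float

def decompose(question: str) -> list[str]:
    # Step 1: naive decomposition, splitting compound questions on "and".
    parts = re.split(r"\band\b", question, flags=re.IGNORECASE)
    return [p.strip(" ,.?") for p in parts if p.strip()]

SPECIALISTS = ["math-reasoning", "domain-knowledge", "consistency-check"]

def spawn_agents(sub_problems: list[str]) -> list[SubAnswer]:
    # Step 2: route each sub-problem to a specialist (stubbed out here).
    return [SubAnswer(SPECIALISTS[i % len(SPECIALISTS)],
                      f"answer to: {p}", 0.8)
            for i, p in enumerate(sub_problems)]

def merge(answers: list[SubAnswer]) -> str:
    # Step 3: contradiction-resolution stub: drop low-confidence claims,
    # then join what survives into one draft answer.
    kept = [a for a in answers if a.confidence >= 0.5]
    return "; ".join(a.claim for a in kept)

def self_critique(draft: str) -> str:
    # Step 4: a real system would re-score the draft; here we only
    # abstain when the merge produced nothing.
    return draft if draft else "abstain"

def agent_swarm(question: str) -> str:
    return self_critique(merge(spawn_agents(decompose(question))))

print(agent_swarm(
    "Identify the 8th-century notation system and "
    "relate it to modern harmonic analysis"
))
```

On the Byzantine music theory example, this structure is what lets two specialists each solve half the question before the results are merged.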
---
Where Grok 4.2 Failed
Grok 4.2's 44.4% was tantalizingly close to passing. Its failure pattern reveals an important limitation:
| Domain | Grok 4.2 Score | Average |
| :--- | :---: | :---: |
| **Mathematics** | **62%** | 48% |
| **Computer Science** | **58%** | 45% |
| Natural Sciences | 51% | 47% |
| Medicine | 42% | 40% |
| Philosophy | 28% | 35% |
| Humanities | 24% | 32% |
| Arts & Music | 19% | 28% |
Grok excels at STEM but collapses on subjective, culturally nuanced domains. Its training on X (Twitter) data gives it strong factual retrieval but weak interpretive reasoning. When asked to analyze a Dostoevsky passage through the lens of Kierkegaardian existentialism, it produced a technically accurate but emotionally tone-deaf response that evaluators marked as "missing the point entirely."
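To see how per-domain scores roll up into a single overall number, treat the overall score as a question-weighted average of the domain scores. The sketch below uses a hypothetical domain split (the real per-domain question counts aren't given here), chosen only so the weights sum to 3,000 questions and reproduce Grok's reported 44.4%:

```python
# Per-domain scores from the table above; the question counts are a
# hypothetical split of the 3,000-question suite, not HLE's real one.
domain_scores = {
    "Mathematics": 62, "Computer Science": 58, "Natural Sciences": 51,
    "Medicine": 42, "Philosophy": 28, "Humanities": 24, "Arts & Music": 19,
}
question_counts = {
    "Mathematics": 600, "Computer Science": 450, "Natural Sciences": 550,
    "Medicine": 450, "Philosophy": 350, "Humanities": 350, "Arts & Music": 250,
}

def overall_score(scores: dict, counts: dict) -> float:
    # Question-weighted average: each domain contributes in proportion
    # to how many questions it holds.
    total = sum(counts.values())
    return sum(scores[d] * counts[d] for d in scores) / total

print(round(overall_score(domain_scores, question_counts), 1))  # 44.4
```

The arithmetic makes the failure mode concrete: even a 62% mathematics score can't offset sub-30% humanities performance once those domains carry real weight.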
---
What This Means For Your AI Choice
HLE isn't just an academic exercise. It reveals which models can handle **genuinely novel problems** – the kind you encounter in real work:

* **If you need a reasoning partner for complex, multi-domain problems:** Kimi k2.5's Agent Swarm is the most capable system available, though it requires significant compute.
* **If you need reliable, enterprise-grade reasoning:** Claude Opus 4.6's Adaptive Thinking makes it the most practical choice for professional workflows.
* **If you need raw math and coding power:** Grok 4.2 is still dominant in STEM, even if it can't parse poetry.
* **If you need the best open-source option:** DeepSeek R1 at 38.5% outperforms models 10× its parameter count.
How HLE Compares to Other Benchmarks

HLE isn't just another leaderboard – it tests a fundamentally different capability than GPQA Diamond or SWE-bench:
| Benchmark | What It Tests | Saturation Point | HLE Correlation |
| :--- | :--- | :---: | :--- |
| GPQA Diamond | PhD-level science reasoning | ~92% (Gemini 3 Pro) | Moderate – HLE requires breadth, GPQA requires depth |
| SWE-bench | Real-world software engineering | 82.1% (Claude Sonnet 5) | Low – coding skill ≠ cross-domain reasoning |
| MATH-500 | Advanced mathematics | 97.3% (DeepSeek R1) | Moderate – HLE math section correlates |
| LMArena Elo | Human preference in conversation | 1545 (GPT-5.2 Pro) | Low – being "preferred" ≠ being "correct" |
| **HLE** | Expert-level cross-domain reasoning | **50.2%** (Kimi k2.5) | – |

**The key insight:** A model can score 97% on MATH-500 yet fail HLE because it can't connect mathematical reasoning to historical context or philosophical nuance. HLE rewards the rarest trait in AI: **intellectual versatility**.
---
Related Analysis
* **Which models rank highest overall?** See the full February 2026 AI Benchmarks report
* **How does Grok 4.2 fare against Claude?** Read our Grok 4.2 vs Claude Opus 4.6 vs Sonnet 5 breakdown
* **What about value for money?** DeepSeek R1 scored 38.5% at 1/10th the cost – see the DeepSeek R1 vs Grok 4.2 showdown
* **Who wins the human preference vote?** Full Elo rankings in LMArena Chatbot Arena Rankings 2026
---
Frequently Asked Questions
**What does "passing" HLE actually mean?**

Passing means scoring above 45% – the threshold set by the Center for AI Safety. For context, a random-guessing baseline scores approximately 2% (since questions are free-form, not multiple choice). Any score above 40% indicates expert-level generalist competence.
**Can I test my own model on HLE?**

Yes. HLE is publicly available as a benchmark dataset. However, running it requires significant compute – the full 3,000-question suite takes approximately 8 hours on a standard API setup with rate limits.
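For rough planning, run time is dominated by the provider's rate limit rather than model latency. A minimal harness sketch – `ask_model` is a placeholder, not HLE's official tooling, and the pacing numbers are illustrative:

```python
import time

def ask_model(question: str) -> str:
    # Placeholder for a real API call to the model under test.
    return "stub answer"

def run_suite(questions, requests_per_minute=10, sleep=time.sleep):
    interval = 60.0 / requests_per_minute  # seconds between calls
    answers = []
    for q in questions:
        answers.append(ask_model(q))
        sleep(interval)  # pace requests to stay under the rate limit
    return answers

# At 10 requests/minute, one pass over 3,000 questions takes
# 3000 / 10 = 300 minutes (5 hours); retries and long reasoning
# traces push real runs toward the ~8 hours quoted above.
```

Injecting `sleep` as a parameter also makes the harness testable without actually waiting.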
**Why did Kimi k2.5 beat Claude Opus 4.6?**

Kimi's **Agent Swarm** architecture allowed it to decompose multi-domain questions into specialized sub-problems – something Opus's single-model "Adaptive Thinking" couldn't replicate. For questions requiring expertise in two or more unrelated fields, the multi-agent approach was consistently superior.
**Is HLE a good predictor of real-world usefulness?**

Partially. HLE predicts performance on **novel, cross-domain problems** – the kind you encounter in academic research, legal analysis, and strategic consulting. However, for routine tasks like coding or data analysis, SWE-bench and LiveCodeBench are better predictors.
---
The Bigger Picture
In October 2025, experts predicted AI wouldn't pass HLE until 2027. It took **four months**.
The Center for AI Safety is already preparing "HLE v2" with adversarial questions generated by the AI models themselves. The arms race between benchmark creators and model developers is accelerating faster than anyone anticipated.
One thing is certain: the era of "this test will stump AI for years" is over.
**Compare all HLE-ranked models side by side on MangoMind.** Access Kimi k2.5, Claude Opus 4.6, Grok 4.2, and 400+ other models in one workspace. Pay with bKash or Nagad starting at ৳299/month.