LMArena Chatbot Arena Rankings 2026: The Complete Elo Leaderboard (Updated Feb)
2026-02-22 | Benchmarks
Benchmarks can be gamed. Companies cherry-pick results. Marketing teams spin numbers.
But there's one leaderboard that's nearly impossible to fake: LMArena (formerly LMSYS Chatbot Arena).
Here, real humans compare AI models in blind, head-to-head battles — without knowing which model is which. After millions of votes, the Elo rankings emerge. This is the closest thing AI has to a "ground truth" of quality.
Here's the complete February 2026 leaderboard.
---
🏆 The Overall Elo Leaderboard (February 2026)
| Rank | Model | Elo Score | Change vs Dec '25 | Tier |
| :---: | :--- | :---: | :---: | :--- |
| 🥇 1 | **GPT-5.2 Pro** (OpenAI) | **1545** | — | S |
| 🥈 2 | **Gemini 3 Pro** (Google) | **1520** | ↑ +38 | S |
| 🥉 3 | **Claude Opus 4.6** (Anthropic) | **1510** | ↑ +25 | S |
| 4 | GPT-5.2 (OpenAI) | 1495 | +5 | S |
| 5 | Claude Opus 4.5 (Anthropic) | 1488 | -12 | A+ |
| 6 | Grok 4.1 (xAI) | 1483 | +8 | A+ |
| 7 | Grok 4.2 (xAI) | ~1450* | NEW | A+ |
| 8 | Gemini 2.5 Pro (Google) | 1440 | -15 | A |
| 9 | Claude Sonnet 5 (Anthropic) | 1435 | NEW | A |
| 10 | DeepSeek R1 (DeepSeek) | ~1380 | +12 | A |
| 11 | Kimi k2.5 (Moonshot AI) | ~1370 | NEW | A |
| 12 | Mistral Large 3 (Mistral) | 1355 | +5 | A- |
| 13 | Llama 4 Maverick (Meta) | 1340 | NEW | A- |
| 14 | GLM 4.7 (Zhipu AI) | 1325 | +10 | B+ |
| 15 | Qwen3 Max (Alibaba) | 1310 | +8 | B+ |

\*Grok 4.2 Elo estimated from early voting data — full convergence pending, so it sits below Grok 4.1 (1483) until its rating stabilizes.
---
📊 Category Breakdown: Who Wins Where?
LMArena doesn't just rank overall quality. It breaks down into specialized categories where different models dominate:
Coding Arena
| Rank | Model | Coding Elo |
| :---: | :--- | :---: |
| 🥇 | **Claude Sonnet 5** | **1580** |
| 🥈 | Claude Opus 4.6 | 1545 |
| 🥉 | GPT-5.2 Pro | 1520 |
| 4 | DeepSeek R1 | 1465 |
| 5 | Grok 4.2 | 1440 |

Math & Reasoning Arena
| Rank | Model | Math Elo |
| :---: | :--- | :---: |
| 🥇 | **GPT-5.2 Pro** | **1560** |
| 🥈 | Claude Opus 4.6 | 1535 |
| 🥉 | DeepSeek R1 | 1510 |
| 4 | Gemini 3 Pro | 1505 |
| 5 | Kimi k2.5 | 1480 |

Creative Writing Arena
| Rank | Model | Creative Elo |
| :---: | :--- | :---: |
| 🥇 | **Claude Opus 4.6** | **1555** |
| 🥈 | GPT-5.2 | 1530 |
| 🥉 | Gemini 3 Pro | 1510 |
| 4 | Grok 4.2 | 1475 |
| 5 | Llama 4 Maverick | 1420 |

Multilingual Arena
| Rank | Model | Multilingual Elo |
| :---: | :--- | :---: |
| 🥇 | **Gemini 3 Pro** | **1550** |
| 🥈 | GPT-5.2 | 1520 |
| 🥉 | Qwen3 Max | 1495 |
| 4 | Claude Opus 4.6 | 1480 |
| 5 | GLM 4.7 | 1465 |
---
🔬 How LMArena Works (And Why It's Trustworthy)
Unlike corporate benchmarks, LMArena uses a simple system:
1. You submit a prompt to the arena
2. Two anonymous models generate responses side-by-side
3. You vote for the better response (or tie)
4. The system updates Elo ratings using the same algorithm as chess rankings
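For intuition, here's a minimal sketch of a single chess-style Elo update. The K-factor, starting ratings, and helper names are illustrative assumptions; LMArena's live leaderboard computes ratings statistically over the full battle history rather than one vote at a time, so treat this as a conceptual model, not its actual code.

```python
# Minimal chess-style Elo update for one blind battle.
# K_FACTOR and the ratings below are illustrative assumptions,
# not LMArena's actual parameters.

K_FACTOR = 32

def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability for model A implied by the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return updated ratings after one battle.

    score_a is 1.0 if A won the vote, 0.0 if A lost, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a 1520-rated model loses one vote to a 1545-rated model.
a, b = update_elo(1520.0, 1545.0, score_a=0.0)
print(f"{a:.1f} {b:.1f}")  # loser drops ~15 points, winner gains ~15
```

Because the expected score depends on the rating gap, an upset win moves ratings more than a win by the favorite, which is how the ladder self-corrects as votes pile up.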
Key safeguards:
* Blind testing — you never know which model is which until after voting
* Millions of votes — enough data to overcome individual bias
* Regular recalibration — new models are added and re-ranked continuously
* Anti-gaming — model providers can't optimize for specific arena prompts because they're user-generated
---
📈 The Biggest Movers: December 2025 → February 2026
| Model | Dec '25 Elo | Feb '26 Elo | Change | What Happened |
| :--- | :---: | :---: | :---: | :--- |
| **Gemini 3 Pro** | 1482 | 1520 | **+38** | Massive multimodal upgrade pushed human preference |
| **Claude Opus 4.6** | 1485 | 1510 | **+25** | Adaptive Thinking makes responses feel more natural |
| DeepSeek R1 | ~1368 | ~1380 | +12 | Post-training improvements to conversational style |
| **Claude Opus 4.5** | 1500 | 1488 | **-12** | Users migrating votes to Opus 4.6 |
| **Gemini 2.5 Pro** | 1455 | 1440 | **-15** | Outshined by its own successor |
The Story of February:
Gemini 3 Pro's surge (+38) is the headline. Google's multimodal integration — native video, audio, and image understanding — gives it a "wow factor" in arena battles that pure text models can't match. When a user asks "what's in this image and write a poem about it," Gemini delivers noticeably richer responses.
Claude Opus 4.6 (+25) benefits from Adaptive Thinking, which makes its responses feel more thoughtful. Voters consistently prefer the "pause and think" quality over instant but shallow responses.
---
💡 What This Means for Choosing Your AI
| If You Value | Choose | LMArena Ranking |
| :--- | :--- | :--- |
| Best overall quality | GPT-5.2 Pro | #1 (1545 Elo) |
| Best for coding | Claude Sonnet 5 | #1 Coding (1580 Elo) |
| Best for creative work | Claude Opus 4.6 | #1 Creative (1555 Elo) |
| Best multilingual | Gemini 3 Pro | #1 Multilingual (1550 Elo) |
| Best value | DeepSeek R1 | #10 Overall but 1/10th the cost |
---
🔗 Deep Dives by Model
Want to go deeper on specific models in the leaderboard?
* Grok 4.2 vs Claude: Head-to-head comparison with real-world tests
* DeepSeek R1 vs Grok 4.2: Price vs performance analysis
* Best for coding specifically: SWE-bench Champions Ranked
* Who's the smartest? Humanity's Last Exam Results
* Full benchmark deep dive: February 2026 AI Benchmarks
---
❓ Frequently Asked Questions
Is LMArena the same as LMSYS Chatbot Arena?
Yes. LMSYS Chatbot Arena was rebranded as LMArena (sometimes written "LM Arena") in 2025. The underlying system — blind head-to-head comparisons with Elo ratings — remains identical.
How many votes does a model need for a reliable Elo?
Approximately 10,000+ battles are needed for a model's Elo to converge within ±10 points. New models like Grok 4.2 are still accumulating votes, which is why their Elo is marked as estimated (~1450*).
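For a feel of where numbers like this come from, here is a toy simulation. It assumes one model with a true 60% win rate over a fixed reference pool (roughly a 70-point Elo gap) and recovers the gap by inverting the Elo win-probability formula; the real leaderboard fits hundreds of models jointly over mixed opponents, so the figures below are illustrative only.

```python
import math
import random
import statistics

# Toy convergence check: estimate one model's Elo gap over a fixed
# reference pool from n simulated battles. The 60% true win rate
# (~70 Elo points) is an assumption made for illustration.

def estimated_gap(n_battles: int, p_win: float = 0.60) -> float:
    wins = sum(random.random() < p_win for _ in range(n_battles))
    p_hat = (wins + 0.5) / (n_battles + 1)  # smoothed to avoid log(0)
    # Invert p = 1 / (1 + 10^(-gap/400)) to recover the Elo gap.
    return 400.0 * math.log10(p_hat / (1.0 - p_hat))

for n in (100, 1_000, 10_000):
    gaps = [estimated_gap(n) for _ in range(200)]
    print(f"{n:>6} battles: gap ≈ {statistics.mean(gaps):5.1f} "
          f"± {statistics.stdev(gaps):4.1f} Elo")
```

The uncertainty shrinks roughly with the square root of the battle count, so ten times more votes buys about a threefold tighter rating, and a live arena splitting votes across hundreds of models needs correspondingly more.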
Why is GPT-5.2 Pro #1 overall when Claude Sonnet 5 beats it in coding?
LMArena's overall Elo weighs all task types equally — creative writing, general knowledge, reasoning, coding, and more. GPT-5.2 Pro excels across all categories, while Sonnet 5 is the specialist champion in coding but weaker in creative and conversational tasks. The category-specific Elo tables reveal the fuller picture.
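To picture how one stream of votes can yield both an overall ladder and category ladders, here is a small sketch. The battles, names, and ratings are invented for illustration, and the update rule is the same simplified one sketched in the "How LMArena Works" section — not LMArena's production logic.

```python
from collections import defaultdict

def update_elo(ra: float, rb: float, score_a: float,
               k: float = 32) -> tuple[float, float]:
    """Single-battle Elo update (same simplified rule as above)."""
    e_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    return ra + k * (score_a - e_a), rb + k * (e_a - score_a)

overall = defaultdict(lambda: 1500.0)  # one shared ladder
per_category = defaultdict(lambda: defaultdict(lambda: 1500.0))

# (model_a, model_b, category, score_a) -- invented example battles.
battles = [
    ("GPT-5.2 Pro", "Claude Sonnet 5", "coding",   0.0),  # Sonnet wins
    ("GPT-5.2 Pro", "Claude Sonnet 5", "creative", 1.0),  # GPT-5.2 Pro wins
]

for model_a, model_b, category, score_a in battles:
    # Every battle moves the overall ladder...
    overall[model_a], overall[model_b] = update_elo(
        overall[model_a], overall[model_b], score_a
    )
    # ...and also the ladder for its own category.
    ladder = per_category[category]
    ladder[model_a], ladder[model_b] = update_elo(
        ladder[model_a], ladder[model_b], score_a
    )

print(per_category["coding"]["Claude Sonnet 5"])  # above 1500: category leader
print(overall["Claude Sonnet 5"])                 # near 1500: one win, one loss
```

A model that keeps winning coding battles while trading wins elsewhere climbs its category ladder but stays mid-pack overall — exactly the GPT-5.2 Pro vs Sonnet 5 pattern.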
Can I use LMArena to test models?
Yes! Visit lmarena.ai to submit your own prompts and vote. Your votes contribute to the live leaderboard.
Which model has the best Elo for the price?
DeepSeek R1, at ~1380 Elo and $7.20 per 1M output tokens, offers the best Elo-per-dollar ratio among the ranked models above. Grok 4.1 Fast at $0.50 per 1M output is even cheaper, but it carries a lower Elo than standard Grok 4.1 (~1420, still an estimate) and doesn't yet appear on the main leaderboard.
---
Try blind model comparisons yourself on MangoMind — access GPT-5.2, Gemini 3, Claude, Grok, and 400+ models in one workspace. Pay with bKash or Nagad starting at ৳299/month.