AI SimpleQA Leaderboard
Benchmark Descriptions
SimpleQA is a benchmark that measures the short-form factuality of an LLM: how often it correctly answers short, fact-seeking questions. I wrote this post because, while writing my upcoming AI Awesome List post, I realized there was no readily available webpage indexed by Google showing a SimpleQA leaderboard. The goal is that when someone searches “SimpleQA Leaderboard” on Google, this page shows up. Over time I will add more benchmarks.
Linkup had this to say about factuality:
Our evaluations show that, when it comes to factuality, internet connectivity is more important than model size.
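To make concrete what the SimpleQA column below measures, here is a minimal Python sketch of a SimpleQA-style evaluation loop. The real harness (OpenAI's simple-evals, linked in the references) grades free-form answers with an LLM judge as correct, incorrect, or not attempted; the exact-match `grade` function and the `QUESTIONS` list here are hypothetical stand-ins.

```python
# Minimal sketch of a SimpleQA-style factuality evaluation.
# The real benchmark has ~4,300 short fact-seeking questions and uses
# an LLM grader; the string comparison below is a hypothetical stand-in.

QUESTIONS = [
    {"question": "Who received the IEEE Frank Rosenblatt Award in 2010?",
     "answer": "Michio Sugeno"},
    # ... more question/answer pairs
]

def grade(predicted: str, target: str) -> str:
    """Return 'correct', 'incorrect', or 'not_attempted'."""
    if not predicted.strip():
        return "not_attempted"
    return "correct" if predicted.strip().lower() == target.lower() else "incorrect"

def simpleqa_score(answer_fn) -> float:
    """Percent of questions graded correct -- the number shown in the table below."""
    grades = [grade(answer_fn(q["question"]), q["answer"]) for q in QUESTIONS]
    return 100.0 * grades.count("correct") / len(grades)
```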
AIME'25 leaderboard: math (the 2025 American Invitational Mathematics Examination)
Chatbot Arena leaderboard: a leaderboard based on users blind-testing two models and voting for the better response; scores are Elo-style ratings (see the sketch after this list)
ArenaHard leaderboard: an automated benchmark that uses an LLM judge on hard prompts to approximate Chatbot Arena results
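Since the Chatbot Arena column below is an Elo-style rating, here is a minimal sketch of the rating update behind an Arena-style leaderboard. LMArena actually fits a Bradley-Terry model over all votes rather than updating ratings online, so treat this as an illustration of the idea, not their exact method.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Apply one online Elo update after a single blind A-vs-B vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b - k * (score_a - expected_a))

# Two models start equal; the winner of one vote gains rating:
# elo_update(1000.0, 1000.0, a_wins=True)  ->  (1016.0, 984.0)
```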
Model / Product | Company | Tier | SimpleQA | AIME'25 | ArenaHard | Chatbot Arena |
---|---|---|---|---|---|---|
Liner Pro Reasoning | Liner | I P | 95.30 | - | - | - |
Perplexity Deep Research | Perplexity | I P | 93.90 | - | - | - |
Liner Pro | Liner | I P | 93.70 | - | - | - |
Perplexity Pro | Perplexity | I P | 90.60 | - | - | - |
Linkup Web Search | Linkup | I P | 90.10 | - | - | - |
Exa | Exa | I P | 90.04 | - | - | - |
Perplexity Sonar Pro | Perplexity | I P | 85.80 | - | - | - |
Claude-4-Opus | Anthropic | I M | - | 75.5 | - | - |
ChatGPT-4.5 | OpenAI | I M | 62.50 | - | - | 1398 |
Gemini-2.5-Pro | Google | I M | 52.90 | 86.70 | - | 1439 |
Claude-3.7-Sonnet | Anthropic | I M | 50.00 | 59.8 | - | 1292 |
o3 | OpenAI | I M | 49.40 | 88.9 | 85.9 | 1418 |
Grok 3 | xAI | I M | 44.60 | 93.3 | - | 1402 |
o1 | OpenAI | I M | 42.60 | 79.20 | 61.0 | 1350 |
ChatGPT-4.1 | OpenAI | I M | 41.60 | 50.0 | - | 1363 |
ChatGPT-4o | OpenAI | I M | 39.00 | 14.00 | - | 1408 |
DeepSeek-R1 | DeepSeek | I M | 32.84 | 80.00 | 48.0 | 1358 |
Gemini-2.5-Flash | Google | IV | 29.70 | 78.00 | - | 1393 |
Claude-3.5-Sonnet | Anthropic | I M | 28.40 | 33.0 | - | - |
DeepSeek-V3 | DeepSeek | I M | 24.90 | - | - | 1381 |
o4-mini | OpenAI | I M | 20.20 | 92.70 | 79.1 | 1351 |
o3-mini | OpenAI | I M | 13.80 | 86.5 | 66.1 | 1325 |
Qwen3-235B-A22B | Qwen | I M | 11.00 | 81.5 | 95.6 | - |
Gemma 3 27B | Google | II | 10.00 | - | - | 1342 |
Gemma 2 27B | Google | II | 9.20 | - | - | 1220 |
Qwen3-32B (Dense) | Qwen | II | 8.00 | 72.9 | 93.8 | - |
Qwen3-30B-A3B (MoE) | Qwen | II | 8.00 | 70.9 | 91.0 | - |
EXAONE-Deep-32B | LG | II | - | 80.0 | - | - |
Qwen3-14B | Qwen | II | - | - | - | - |
EXAONE-Deep-7.8B | LG | II | - | 76.7 | - | - |
Qwen3-8B | Qwen | II | - | - | - | - |
EXAONE-Deep-2.4B | LG | II | - | 73.3 | - | - |
Apriel-Nemotron-15B-Thinker | NVIDIA / ServiceNow | II | - | 60.0 | - | - |
Gemma 3 12B | Google | II | 6.30 | - | - | - |
Gemma 3n | Google | II | - | - | - | 1283 |
Gemma 2 9B | Google | II | 5.30 | - | - | - |
Gemma 3 4B | Google | III | 4.00 | - | - | - |
Phi 4 Reasoning Plus | Microsoft | II | 3.00 | 78.00 | 79.00 | - |
Gemma 2 2B | Google | III | 2.80 | - | - | - |
Gemma 3 1B | Google | III | 2.20 | - | - | - |
Qwen3-4B | Qwen | III | 1.00 | 65.6 | 76.6 | - |
Notes
Missing Models
- Mistral Medium 3
- DeepSeek-V3-0324
- Gemini 2.5 Pro (May update)
- Llama 4 models (e.g., Llama 4 Behemoth)
No official SimpleQA score was available for Claude 4; however, someone tested Opus on a subset of the benchmark and it scored the highest.
Definition of Tier
Tier | Meaning |
---|---|
I M | Flagship model |
I P | Flagship product |
II | Runs on consumer hardware |
III | Runs on edge hardware |
IV | Optimized for speed |
More Benchmarks
https://www.swebench.com/#verified
References
https://x.com/scaling01/status/1926017718286782643/photo/1
https://openai.com/index/introducing-o3-and-o4-mini/
https://gr.inc/OpenAI/SimpleQA/
https://qwenlm.github.io/blog/qwen3/
https://x.com/nathanhabib1011/status/1917230699582751157/photo/1
https://github.com/openai/simple-evals/tree/main
https://livecodebench.github.io/leaderboard.html
https://deepmind.google/technologies/gemini/flash/
https://matharena.ai/
https://github.com/lmarena/arena-hard-auto?tab=readme-ov-file#leaderboard
https://www.vals.ai/benchmarks/aime-2025-03-11
https://openai.com/index/introducing-simpleqa/
https://huggingface.co/microsoft/phi-4
https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/