AI SimpleQA Leaderboard
Benchmark Descriptions
SimpleQA is a benchmark that grades the factuality of an LLM: it asks short, fact-seeking questions that each have a single indisputable answer, and checks whether the model answers correctly, answers incorrectly, or declines to attempt. I wrote this post because, while working on my upcoming AI Awesome List post, I realized there was no readily available, Google-indexed webpage showing a SimpleQA leaderboard. The goal of this post is that if someone searches “SimpleQA Leaderboard” on Google, this page will show up. Over time I will add more benchmarks.
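OpenAI’s grader labels each answer “correct”, “incorrect”, or “not attempted”, and the benchmark reports overall accuracy, accuracy on attempted questions, and an F-score that is the harmonic mean of the two (see the simple-evals repo in the references). Here is a minimal sketch of that aggregation; the grading itself is done by a prompted LLM, which is not shown:

```python
from collections import Counter

def simpleqa_metrics(grades):
    """Aggregate SimpleQA-style per-question grades.

    `grades` is a list of strings, each one of "correct",
    "incorrect", or "not_attempted" (the three labels the
    SimpleQA grader assigns to a model's answer).
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    overall_correct = counts["correct"] / total if total else 0.0
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0

    # F-score: harmonic mean of the two accuracy rates above.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

print(simpleqa_metrics(["correct", "correct", "incorrect", "not_attempted"]))
# {'overall_correct': 0.5, 'correct_given_attempted': 0.666..., 'f_score': 0.571...}
```

The leaderboard numbers below are the overall-correct percentages.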
Linkup had this to say about factuality:
> Our evaluations show that, when it comes to factuality, internet connectivity is more important than model size.
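That claim matches the table below: the top SimpleQA rows are all search-backed products rather than the largest models. As a rough illustration only (this is not Linkup’s actual pipeline; `web_search` and `llm_answer` are hypothetical stand-ins), here is the retrieve-then-read pattern those products share:

```python
# Sketch of the retrieve-then-read pattern behind search-backed products:
# instead of answering from parametric memory, retrieve first, then read.
# `web_search` and `llm_answer` are hypothetical stand-ins, not a real API.

def web_search(query: str) -> list[str]:
    """Hypothetical: return text snippets from a web search engine."""
    raise NotImplementedError("plug in a real search API here")

def llm_answer(prompt: str) -> str:
    """Hypothetical: return a completion from any chat model."""
    raise NotImplementedError("plug in a real model here")

def grounded_answer(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(snippets[:5])  # keep the prompt small
    return llm_answer(
        f"Answer using only the sources below; say 'unknown' if they don't help.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```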
AIME'25 measures competition math: problems from the 2025 American Invitational Mathematics Examination, each with an integer answer from 0 to 999.
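Because every AIME answer is an integer in 0–999, grading reduces to extracting the model’s final number and exact-matching it. The extraction heuristic below (take the last integer in the output) is my simplification; real harnesses usually require a boxed final answer:

```python
import re

def grade_aime(model_output: str, answer: int) -> bool:
    """Exact-match grading for AIME: every answer is an integer in 0-999.

    Uses a lenient heuristic: treat the last integer in the model's
    output as its final answer.
    """
    matches = re.findall(r"\d+", model_output)
    return bool(matches) and int(matches[-1]) == answer

print(grade_aime("... so the answer is 204.", 204))  # True
```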
I’ve removed Chatbot Arena and ArenaHard, as DeepSeek R1 05/28 is very good at following instructions and sets expectations fairly high on those benchmarks. In their place I’ve added Humanity’s Last Exam.
Model / Product | Company | Tier | SimpleQA | AIME'25 | Humanity’s Last Exam |
---|---|---|---|---|---|
Liner Pro Reasoning | Liner | I P | 95.30 | N/A | N/A |
Perplexity Deep Research | Perplexity | I P | 93.90 | N/A | N/A |
Liner Pro | Liner | I P | 93.70 | N/A | N/A |
Perplexity Pro | Perplexity | I P | 90.60 | N/A | N/A |
Linkup Web Search | Linkup | I P | 90.10 | N/A | N/A |
Exa | Exa | I P | 90.04 | N/A | N/A |
Perplexity Sonar Pro | Perplexity | I P | 85.80 | N/A | N/A |
Claude-4-Opus | Anthropic | I M | - | 75.5 | - |
GPT-4.5 | OpenAI | I M | 62.50 | - | - |
Gemini-2.5-Pro | Google | I M | 54.00 | 86.70 | - |
Claude-3.7-Sonnet | Anthropic | I M | 50.00 | - | 59.8 |
o3 | OpenAI | I M | 49.4 | 88.9 | 85.9 |
Grok 3 | xAI | I M | 44.60 | 93.3 | - |
o1 | OpenAI | I M | 42.60 | 79.20 | 61 |
GPT-4.1 | OpenAI | I M | 41.60 | - | 50 |
GPT-4o | OpenAI | I M | 39.00 | 14.00 | - |
DeepSeek-R1 (01/20) | DeepSeek | I M | 30.10 | 70.00 | 8.5 |
DeepSeek-R1 (05/28) | DeepSeek | I M | 27.80 | 87.50 | 17.7 |
DeepSeek-R1-0528-Qwen3-8B | DeepSeek | I M | - | 76.3 | - |
Gemini-2.5-Flash | Google | IV | 29.70 | 78.00 | - |
Claude-3.5-Sonnet | Anthropic | I M | 28.4 | - | 33 |
DeepSeek-V3 | DeepSeek | I M | 24.9 | - | - |
o4-mini | OpenAI | I M | 20.20 | 92.70 | 79.1 |
o3-mini | OpenAI | I M | 13.80 | 86.5 | 66.1 |
Qwen3-235B-A22B | Qwen | I M | 11.00 | 81.5 | - |
Gemma 3 27B | Google | II | 10.00 | - | - |
Gemma 2 27B | Google | II | 9.20 | - | - |
Qwen3-32B (Dense) | Qwen | II | 8.00 | 72.9 | - |
Qwen3-30B-A3B (MoE) | Qwen | II | 8.00 | 70.9 | - |
EXAONE-Deep-32B | LG | II | - | 80 | - |
Qwen3-14B | Qwen | II | - | - | - |
EXAONE-Deep-7.8B | LG | II | - | 76.7 | - |
Qwen3-8B | Qwen | II | - | - | - |
EXAONE-Deep-2.4B | LG | II | - | 73.3 | - |
Apriel-Nemotron-15B-Thinker | NVIDIA / ServiceNow | II | - | 60.0 | - |
Gemma 3 12B | Google | II | 6.30 | - | - |
Gemma 3n | Google | II | - | - | - |
Gemma 2 9B | Google | II | 5.30 | - | - |
Gemma 3 4B | Google | III | 4.00 | - | - |
Phi 4 Reasoning Plus | Microsoft | II | 3.00 | 78.00 | - |
Gemma 2 2B | Google | III | 2.80 | - | - |
Gemma 3 1B | Google | III | 2.20 | - | - |
Qwen3-4B | Qwen | III | 1.00 | 65.6 | - |
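For anyone who wants to re-sort or extend the table, here is a small sketch (my own structuring, not part of any benchmark tooling) of keeping the rows as data, with `None` standing in for the "-"/N/A cells:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    company: str
    tier: str                # "I M", "I P", "II", "III", or "IV"
    simpleqa: float | None   # None where the table shows "-" or N/A
    aime25: float | None = None
    hle: float | None = None

rows = [
    Entry("Liner Pro Reasoning", "Liner", "I P", 95.30),
    Entry("o3", "OpenAI", "I M", 49.4, 88.9, 85.9),
    Entry("Qwen3-8B", "Qwen", "II", None),
]

# Sort by SimpleQA descending, pushing unscored models to the bottom.
ranked = sorted(rows, key=lambda e: (e.simpleqa is None, -(e.simpleqa or 0)))
for e in ranked:
    print(f"{e.name:22s} {e.simpleqa if e.simpleqa is not None else '-'}")
```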
Notes
Missing Models
- Mistral Medium 3
- DeepSeek V3 03/24
- Google Gemini 2.5 Pro May update
- Llama 4 models (e.g. Llama 4 Behemoth)
For Claude 4 there was no SimpleQA score available; however, someone tested Opus on a subset and it scored the highest.
Definition of Tier
Tier | Name |
---|---|
I M | Flagship Model |
I P | Flagship Product |
II | Consumer hardware |
III | Edge hardware |
IV | Speed |
More Benchmarks
https://www.swebench.com/#verified (SWE-bench Verified)
References
https://x.com/scaling01/status/1926017718286782643/photo/1
https://openai.com/index/introducing-o3-and-o4-mini/
https://gr.inc/OpenAI/SimpleQA/
https://qwenlm.github.io/blog/qwen3/
https://x.com/nathanhabib1011/status/1917230699582751157/photo/1
https://github.com/openai/simple-evals/tree/main
https://livecodebench.github.io/leaderboard.html
https://deepmind.google/technologies/gemini/flash/
https://matharena.ai/
https://github.com/lmarena/arena-hard-auto?tab=readme-ov-file#leaderboard
https://www.vals.ai/benchmarks/aime-2025-03-11
https://openai.com/index/introducing-simpleqa/
https://huggingface.co/microsoft/phi-4
https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/