AI SimpleQA Leaderboard
Benchmark Descriptions
SimpleQA is a benchmark that grades an LLM’s factuality on short, fact-seeking questions. I wrote this post because, while putting together my upcoming AI Awesome List, I realized there was no readily available, Google-indexed webpage showing a SimpleQA leaderboard.
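As a quick illustration of what the numbers in the table below mean: in OpenAI’s setup, a grader model labels each short answer CORRECT, INCORRECT, or NOT_ATTEMPTED, and the headline score is the percentage of CORRECT answers. The sketch below shows only that final scoring step; the variable names are mine, not code from the simple-evals repo.

```python
from collections import Counter

# Hypothetical grader verdicts, one per SimpleQA question. In the real
# harness these come from an LLM grader comparing the model's short
# answer against the reference answer.
graded_answers = ["CORRECT", "INCORRECT", "CORRECT", "NOT_ATTEMPTED", "CORRECT"]

counts = Counter(graded_answers)
total = len(graded_answers)

# Headline SimpleQA accuracy: share of all questions answered correctly
# (questions the model declined to answer still count against it).
accuracy = 100 * counts["CORRECT"] / total

# Secondary metric OpenAI also reports: correctness among attempted answers.
attempted = counts["CORRECT"] + counts["INCORRECT"]
correct_given_attempted = 100 * counts["CORRECT"] / attempted if attempted else 0.0

print(f"accuracy: {accuracy:.1f}%  |  correct given attempted: {correct_given_attempted:.1f}%")
```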
It’s now September, and I am demanding a better benchmark. Personally, I think asking AI research-based questions with short answers is a good start. For example, when asked about the best housing policies, AI likes to shotgun an answer at you instead of succinctly stating that the best policies are upzoning and faster permitting. For the record, I disagree: I believe restoring foreign capital, cutting taxes such as HST/GST for all primary home buyers, and cutting developer charges are the most effective policies to implement today.
Linkup had this to say about factuality:
Our evaluations show that, when it comes to factuality, internet connectivity is more important than model size.
AIME'25 is a math benchmark.
I’ve removed Chatbot Arena and Arena-Hard, since DeepSeek R1 05/28 is very good at following instructions and sets expectations fairly high on those. In their place I’ve added Humanity’s Last Exam.
Before reading this table, please note OpenAI’s comment below: the grading rubric itself cannot handle very thorough research. My takeaway is that scores above 90% cannot be meaningfully compared with each other.
ChatGPT agent scores lower on SimpleQA accuracy than o3 did. Manual investigation revealed cases where ChatGPT agent’s more thorough approach to research surfaced potential flaws in our grading rubric that were not apparent to o3, such as instances in which Wikipedia may contain inaccurate information. We are considering updates to this evaluation.
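To put a rough number on that takeaway: SimpleQA has roughly 4,326 questions (my assumption of the published dataset size), so sampling noise alone puts neighbouring 90%+ scores within about half a point of each other, before counting rubric errors like the Wikipedia cases OpenAI describes. A back-of-the-envelope check:

```python
import math

N = 4326        # approximate published SimpleQA question count (assumption)
score = 0.953   # e.g. the top entry in the table below

# Standard error of a proportion and a ~95% confidence interval.
stderr = math.sqrt(score * (1 - score) / N)
margin = 1.96 * stderr

print(f"{score:.1%} +/- {margin:.2%}")  # roughly 95.3% +/- 0.63%
```

Grading mistakes shift every score by an additional, unknown amount, which is why the 90%+ rows below are best read as a rough tie.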
Model / Product | Company | Tier | SimpleQA (%) | AIME'25 (%) | Humanity’s Last Exam (%) |
---|---|---|---|---|---|
Liner Pro Reasoning | Liner | I P | 95.3 | - | - |
Exa Research Pro | Exa | I P | 94.9 | - | - |
Perplexity Deep Research | Perplexity | I P | 93.9 | - | - |
Liner Pro | Liner | I P | 93.7 | - | - |
Brave Multiple Searches | Brave | I P | 93.25 | - | - |
Exa Research | Exa | I P | 91.6 | - | - |
ChatGPT Agent (system card) | OpenAI | I P | 91.4 | - | - |
o3 with browsing | OpenAI | I P | 95.4 (see note above) | - | - |
Brave Single Search | Brave | I P | 90.78 | - | - |
Perplexity Pro | Perplexity | I P | 90.6 | - | - |
Brave Single Search + Reasoning | Brave | I P | 90.5 | - | - |
Linkup Web Search | Linkup | I P | 90.1 | - | - |
ODS-v2+DeepSeek-R1 | Open Deep Search | I M | 88.3 | - | - |
Perplexity Sonar Pro | Perplexity | I P | 85.8 | - | - |
Claude-4-Opus | Anthropic | I M | - | 75.5 | - |
ChatGPT-4.5 | OpenAI | I M | 62.5 | - | - |
ChatGPT-5-thinking | OpenAI | I M | 55 | - | 50 |
Gemini-2.5-Pro | Google | I M | 54.0 | 86.7 | - |
Claude-3.7-Sonnet | Anthropic | I M | 50.0 | - | 59.8 |
o3 | OpenAI | I M | 49.4 | 88.9 | 85.9 |
Grok 3 | xAI | I M | 44.6 | 93.3 | - |
o1 | OpenAI | I M | 42.6 | 79.2 | 61 |
ChatGPT-4.1 | OpenAI | I M | 41.6 | - | 50 |
ChatGPT-4o | OpenAI | I M | 39.0 | 14.0 | - |
Kimi K2 | Moonshot AI | I M | 31.0 | - | - |
DeepSeek-R1 (01/20) | DeepSeek | I M | 30.1 | 70.0 | 8.5 |
DeepSeek-R1 (05/28) | DeepSeek | I M | 27.8 | 87.5 | 17.7 |
DeepSeek-R1-0528-Qwen3-8B | DeepSeek | I M | - | 76.3 | - |
Gemini-2.5-Flash | Google | IV | 29.7 | 78.0 | - |
Claude-3.5-Sonnet | Anthropic | I M | 28.4 | - | 33 |
DeepSeek-V3 | DeepSeek | I M | 24.9 | - | - |
o4-mini | OpenAI | I M | 20.2 | 92.7 | 79.1 |
o3-mini | OpenAI | I M | 13.8 | 86.5 | 66.1 |
Qwen3-235B-A22B | Qwen | I M | 15.0 | 81.5 | 95.6 |
Gemma 3 27B | Google | II | 10.0 | - | - |
Gemma 2 27B | Google | II | 9.2 | - | - |
Qwen3-32B (Dense) | Qwen | II | 8.0 | 72.9 | - |
Qwen3-30B-A3B (MoE) | Qwen | II | 8.0 | 70.9 | - |
EXAONE-Deep-32B | LG | II | - | 80 | - |
Qwen3-14B | Qwen | II | - | - | - |
EXAONE-Deep-7.8B | LG | II | - | 76.7 | - |
Qwen3-8B | Qwen | II | - | - | - |
EXAONE-Deep-2.4B | LG | II | - | 73.3 | - |
Apriel-Nemotron-15B-Thinker | ServiceNow / NVIDIA | II | - | 60.0 | - |
Gemma 3 12B | Google | II | 6.3 | - | - |
Gemma 3n | Google | II | - | - | - |
Gemma 3 4B | Google | III | 4.0 | - | - |
Gemma 2 9B | Google | II | 5.3 | - | - |
Phi 4 Reasoning Plus | Microsoft | II | 3.0 | 78.0 | - |
Gemma 2 2B | Google | III | 2.8 | - | - |
Gemma 3 1B | Google | III | 2.2 | - | - |
Qwen3 4B | Qwen | III | 1.0 | 65.6 | - |
Notes
Missing Models
- Mistral Medium 3
- DeepSeek V3 03/24
- Google Gemini 2.5 Pro May update
- Llama 4 models (e.g., Llama 4 Behemoth)
For Claude 4, no official SimpleQA score was available; however, someone tested Claude 4 Opus on a subset and it scored the highest.
Definition of Tier
Tier | Name |
---|---|
I M | Flagship Model |
I P | Flagship Product |
II | Consumer hardware |
III | Edge hardware |
IV | Speed |
More Benchmarks
https://www.swebench.com/#verified
References
https://x.com/scaling01/status/1926017718286782643/photo/1
https://openai.com/index/introducing-o3-and-o4-mini/
https://gr.inc/OpenAI/SimpleQA/
https://qwenlm.github.io/blog/qwen3/
https://x.com/nathanhabib1011/status/1917230699582751157/photo/1
https://github.com/openai/simple-evals/tree/main
https://livecodebench.github.io/leaderboard.html
https://deepmind.google/technologies/gemini/flash/
https://matharena.ai/
https://github.com/lmarena/arena-hard-auto?tab=readme-ov-file#leaderboard
https://www.vals.ai/benchmarks/aime-2025-03-11
https://openai.com/index/introducing-simpleqa/
https://huggingface.co/microsoft/phi-4
https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/