
AI SimpleQA Leaderboard

Benchmark Descriptions

SimpleQA is a benchmark that grades the factuality of an LLM. I wrote this post because, while writing my upcoming AI Awesome List, I realized there was no readily available webpage indexed by Google that shows a SimpleQA leaderboard.
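For context on what the SimpleQA column measures: each short factual question gets a model answer, a judge grades it as correct, incorrect, or not attempted, and the headline number is the percentage graded correct. Below is a minimal sketch of that aggregation step, assuming answers have already been graded into those three buckets; the summarize() helper is hypothetical, and the derived metrics follow OpenAI's SimpleQA write-up.

```python
from collections import Counter

def summarize(grades: list[str]) -> dict[str, float]:
    """Aggregate SimpleQA-style grades ("correct", "incorrect", "not_attempted")."""
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    # Overall accuracy: the percentage usually shown on leaderboards.
    overall = correct / total if total else 0.0
    # Accuracy among attempted questions: rewards abstaining over guessing.
    given_attempted = correct / attempted if attempted else 0.0
    # Harmonic mean of the two, analogous to an F-score.
    f_score = (
        2 * overall * given_attempted / (overall + given_attempted)
        if (overall + given_attempted) else 0.0
    )
    return {
        "correct": overall,
        "correct_given_attempted": given_attempted,
        "f_score": f_score,
    }

# Example: 3 correct, 1 incorrect, 1 refusal.
print(summarize(["correct", "correct", "incorrect", "not_attempted", "correct"]))
```

The "correct given attempted" and F-score variants exist because plain accuracy can be inflated by guessing; they give credit for abstaining when unsure, which matters when comparing cautious models against aggressive ones.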

It’s now September, and I am demanding a better benchmark. Personally, I think asking AI research-based questions with shorter answers is a good start. For example, when you ask an AI about the best housing policies, it likes to shotgun an answer at you instead of succinctly stating that the best policies are upzoning and speeding up permitting. Of course, I disagree: personally, I believe that restoring foreign capital, cutting taxes such as HST/GST for all primary home buyers, and cutting developer charges are the most effective policies to implement today.

Linkup had this to say about factuality:

Our evaluations show that, when it comes to factuality, internet connectivity is more important than model size.

AIME'25 leaderboard: Math

I’ve removed Chatbot Arena and Arena-Hard, as DeepSeek R1 05/28 is really good at following instructions and sets expectations fairly high. Instead, I’ve added Humanity’s Last Exam.

Before reading this table, please note OpenAI’s comment below that its grading rubric cannot handle thorough exploration. My takeaway is that scores above 90% cannot be meaningfully compared with each other.

ChatGPT agent scores lower on SimpleQA accuracy than o3 did. Manual investigation revealed cases where ChatGPT agent’s more thorough approach to research surfaced potential flaws in our grading rubric that were not apparent to o3, such as instances in which Wikipedia may contain inaccurate information. We are considering updates to this evaluation.

| Model / Product | Company | Tier | SimpleQA (%) | AIME'25 (%) | Humanity’s Last Exam (%) |
|---|---|---|---|---|---|
| Liner Pro Reasoning | Liner | I P | 95.30 | N/A | N/A |
| Exa Research Pro | Exa | I P | 94.9 | N/A | N/A |
| Perplexity Deep Research | Perplexity | I P | 93.90 | N/A | N/A |
| Liner Pro | Liner | I P | 93.70 | N/A | N/A |
| Brave Multiple Searches | Brave | I P | 93.25 | N/A | N/A |
| Exa Research | Exa | I P | 91.6 | N/A | N/A |
| ChatGPT Agent (System Card) | OpenAI | I P | 91.4 | N/A | N/A |
| o3 with browsing (read note) | OpenAI | I P | 95.4 | N/A | N/A |
| Brave Single Search | Brave Search | I P | 90.78 | N/A | N/A |
| Perplexity Pro | Perplexity | I P | 90.60 | N/A | N/A |
| Brave Single Search + Reasoning | Brave Search | I P | 90.5 | N/A | N/A |
| Linkup Web Search | Linkup | I P | 90.10 | N/A | N/A |
| ODS-v2+DeepSeek-R1 | Open Deep Search | I M | 88.3 | N/A | N/A |
| Perplexity Sonar Pro | Perplexity | I P | 85.80 | N/A | N/A |
| Claude-4-Opus | Anthropic | I M | - | 75.5 | - |
| ChatGPT-4.5 | OpenAI | I M | 62.50 | - | - |
| ChatGPT-5-thinking | OpenAI | I M | 55 | - | 50 |
| Gemini-2.5-Pro | Google | I M | 54.00 | 86.70 | - |
| Claude-3.7-Sonnet | Anthropic | I M | 50.00 | - | 59.8 |
| o3 | OpenAI | I M | 49.4 | 88.9 | 85.9 |
| Grok 3 | xAI | I M | 44.60 | 93.3 | - |
| o1 | OpenAI | I M | 42.60 | 79.20 | 61 |
| ChatGPT-4.1 | OpenAI | I M | 41.60 | - | 50 |
| ChatGPT-4o | OpenAI | I M | 39.00 | 14.00 | - |
| Kimi K2 | Moonshot AI | I M | 31.0 | - | - |
| DeepSeek-R1 (01/20) | DeepSeek | I M | 30.10 | 70.00 | 8.5 |
| DeepSeek-R1 (05/28) | DeepSeek | I M | 27.80 | 87.50 | 17.7 |
| DeepSeek-R1-0528-Qwen3-8B | DeepSeek | I M | - | 76.3 | - |
| Gemini-2.5-Flash | Google | IV | 29.70 | 78.00 | - |
| Claude-3.5-Sonnet | Anthropic | I M | 28.4 | - | 33 |
| DeepSeek-V3 | DeepSeek | I M | 24.9 | - | - |
| o4-mini | OpenAI | I M | 20.20 | 92.70 | 79.1 |
| o3-mini | OpenAI | I M | 13.80 | 86.5 | 66.1 |
| Qwen3-235B-A22B | Qwen | I M | 15.00 | 81.5 | 95.6 |
| Gemma 3 27B | Google | II | 10.00 | - | - |
| Gemma 2 27B | Google | II | 9.20 | - | - |
| Qwen3-32B (Dense) | Qwen | II | 8.00 | 72.9 | - |
| Qwen3-30B-A3B (MoE) | Qwen | II | 8.00 | 70.9 | - |
| EXAONE-Deep-32B | LG | II | - | 80 | - |
| Qwen3-14B | Qwen | II | - | - | - |
| EXAONE-Deep-7.8B | LG | II | - | 76.7 | - |
| Qwen3-8B | Qwen | II | - | - | - |
| EXAONE-Deep-2.4B | LG | II | - | 73.3 | - |
| Apriel-Nemotron-15B-Thinker | NVIDIA / ServiceNow | II | - | 60.0 | - |
| Gemma 3 12B | Google | II | 6.30 | - | - |
| Gemma 3n | Google | II | - | - | - |
| Gemma 3 4B | Google | III | 4.00 | - | - |
| Gemma 2 9B | Google | II | 5.30 | - | - |
| Phi 4 Reasoning Plus | Microsoft | II | 3.00 | 78.00 | - |
| Gemma 2 2B | Google | III | 2.80 | - | - |
| Gemma 3 1B | Google | III | 2.20 | - | - |
| Qwen3 4B | Qwen | III | 1.00 | 65.6 | - |

Notes

Missing Models

  • Mistral Medium 3
  • DeepSeek V3 03/24
  • Google Gemini 2.5 Pro May update
  • Llama 4 models (e.g. Llama 4 Behemoth)

For Claude 4, no SimpleQA score was available; however, someone tested Opus on a subset, and it scored the highest.

Definition of Tier

| Tier | Name |
|---|---|
| I M | Flagship Model |
| I P | Flagship Product |
| II | Consumer hardware |
| III | Edge hardware |
| IV | Speed |

More Benchmarks

https://www.swebench.com/#verified

https://lastexam.ai/

References

https://x.com/scaling01/status/1926017718286782643/photo/1

https://openai.com/index/introducing-o3-and-o4-mini/

https://gr.inc/OpenAI/SimpleQA/

https://qwenlm.github.io/blog/qwen3/

https://x.com/nathanhabib1011/status/1917230699582751157/photo/1

https://github.com/openai/simple-evals/tree/main

https://livecodebench.github.io/leaderboard.html

https://deepmind.google/technologies/gemini/flash/

https://matharena.ai/

https://github.com/lmarena/arena-hard-auto?tab=readme-ov-file#leaderboard

https://www.vals.ai/benchmarks/aime-2025-03-11

https://openai.com/index/introducing-simpleqa/

https://huggingface.co/microsoft/phi-4

https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/

https://x.ai/news/grok-3

https://brave.com/blog/ai-grounding/