

AI SimpleQA Leaderboard

Benchmark Descriptions

SimpleQA is a benchmark that grades the factuality of an LLM using short, fact-seeking questions. I wrote this post because, while drafting my upcoming AI Awesome List, I realized there was no readily available webpage indexed by Google showing a SimpleQA leaderboard.
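
To make the metric concrete, here is a minimal sketch of how SimpleQA-style scoring works, loosely modeled on the openai/simple-evals repo linked in the references. It is not the official implementation: `ask_model` and `grade_answer` are hypothetical stand-ins for your model call and a grader-LLM call.

```python
# Hedged sketch of SimpleQA-style scoring, loosely following the
# conventions of openai/simple-evals. Not the official implementation.
from collections import Counter

def score_simpleqa(examples, ask_model, grade_answer):
    """examples: dicts with 'problem' and 'answer' keys.
    ask_model(question) -> the model's short answer (hypothetical callable).
    grade_answer(question, gold, predicted) -> one of
    'CORRECT', 'INCORRECT', 'NOT_ATTEMPTED' (typically a grader LLM)."""
    counts = Counter()
    for ex in examples:
        predicted = ask_model(ex["problem"])
        counts[grade_answer(ex["problem"], ex["answer"], predicted)] += 1

    total = sum(counts.values())
    attempted = counts["CORRECT"] + counts["INCORRECT"]
    return {
        # The headline accuracy that leaderboards (and the table below) report.
        "accuracy": counts["CORRECT"] / total,
        # Rewards calibrated abstention: how often an attempted answer is right.
        "correct_given_attempted": counts["CORRECT"] / attempted if attempted else 0.0,
    }
```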

It’s now September, and I am demanding a better benchmark. Personally, I think asking AI research-based questions with short answers is a good start. For example, when asked about the best housing policies, AI likes to shotgun an answer at you instead of succinctly stating that the best policies are upzoning and faster permitting. For the record, I disagree: I believe restoring foreign capital, cutting taxes such as HST/GST for all primary-home buyers, and cutting developer charges are the most effective policies to implement today.

Linkup had this to say about factuality:

Our evaluations show that, when it comes to factuality, internet connectivity is more important than model size.
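
That claim matches the shape of the table below: the highest SimpleQA scores come from search-grounded products, not from the largest bare models. Here is a minimal sketch of the idea, with `web_search` and `llm` as hypothetical stand-ins for whatever search API and model you use:

```python
# Sketch of a search-grounded answerer: the short factual answer is read
# out of retrieved sources rather than recalled from model weights.
# `web_search` and `llm` are hypothetical stand-ins, not a real API.

def grounded_answer(question: str, web_search, llm) -> str:
    snippets = web_search(question, top_k=5)  # fetch a few relevant pages
    context = "\n\n".join(snippets)
    prompt = (
        "Using only the sources below, answer with a single short fact.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)
```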

AIME'25 is a math benchmark.

I’ve removed Chatbot Arena and Arena-Hard, since DeepSeek R1 05/28 follows instructions so well that it sets expectations fairly high on both. In their place I’ve added Humanity’s Last Exam.

Before reading this table, please note OpenAI’s comment below: the grading rubric itself cannot handle thorough research. My takeaway is that scores above 90% cannot be meaningfully compared with each other.

ChatGPT agent scores lower on SimpleQA accuracy than o3 did. Manual investigation revealed cases where ChatGPT agent’s more thorough approach to research surfaced potential flaws in our grading rubric that were not apparent to o3, such as instances in which Wikipedia may contain inaccurate information. We are considering updates to this evaluation.

| Model / Product | Company | Tier | SimpleQA | AIME'25 | Humanity’s Last Exam |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-V3.2-Exp w/tool¹ (Sep-2025) | DeepSeek | OI | 97.1% | - | - |
| DeepSeek-V3.1-Terminus w/tool¹ (Sep-2025) | DeepSeek | OI | 96.8% | - | - |
| Liner Pro Reasoning | Liner | PI | 95.3% | N/A | N/A |
| Exa Research Pro | Exa | PI | 94.9% | N/A | N/A |
| Perplexity Deep Research | Perplexity | PI | 93.9% | N/A | N/A |
| Liner Pro | Liner | PI | 93.7% | N/A | N/A |
| Brave Multiple Searches | Brave | PI | 93.25% | N/A | N/A |
| Exa Research | Exa | PI | 91.6% | N/A | N/A |
| ChatGPT Agent (system card) | OpenAI | PI | 91.4% | N/A | N/A |
| o3 with browsing (read note) | OpenAI | PI | 95.4% | N/A | N/A |
| Brave Single Search | Brave Search | PI | 90.78% | N/A | N/A |
| Perplexity Pro | Perplexity | PI | 90.6% | N/A | N/A |
| Brave Single Search + Reasoning | Brave Search | PI | 90.5% | N/A | N/A |
| Linkup Web Search | Linkup | PI | 90.1% | N/A | N/A |
| ODS-v2+DeepSeek-R1 | Open Deep Search | MI | 88.3% | N/A | N/A |
| Perplexity Sonar Pro | Perplexity | PI | 85.8% | N/A | N/A |
| Claude-4-Opus | Anthropic | MI | - | 75.5% | - |
| ChatGPT-4.5 | OpenAI | MI | 62.5% | - | - |
| ChatGPT-5-thinking | OpenAI | MI | 55% | - | 50% |
| Gemini-2.5-Pro | Google | MI | 54.0% | 86.7% | - |
| Claude-3.7-Sonnet | Anthropic | MI | 50.0% | - | 59.8% |
| o3 | OpenAI | MI | 49.4% | 88.9% | 85.9% |
| Grok 3 | xAI | MI | 44.6% | 93.3% | - |
| o1 | OpenAI | MI | 42.6% | 79.2% | 61% |
| ChatGPT-4.1 | OpenAI | MI | 41.6% | - | 50% |
| ChatGPT-4o | OpenAI | MI | 39.0% | 14.0% | - |
| Kimi K2 | Moonshot AI | MI | 31.0% | - | - |
| DeepSeek-R1 (01/20) | DeepSeek | MI | 30.1% | 70.0% | 8.5% |
| DeepSeek-R1 (05/28) | DeepSeek | MI | 27.8% | 87.5% | 17.7% |
| DeepSeek-R1-0528-Qwen3-8B | DeepSeek | MI | - | 76.3% | - |
| Gemini-2.5-Flash | Google | IV | 29.9% | 78.0% | - |
| Claude-3.5-Sonnet | Anthropic | MI | 28.4% | - | 33% |
| DeepSeek-V3 | DeepSeek | MI | 24.9% | - | - |
| o4-mini | OpenAI | MI | 20.2% | 92.7% | 79.1% |
| o3-mini | OpenAI | MI | 13.8% | 86.5% | 66.1% |
| Qwen3-235B-A22B | Qwen | MI | 15.0% | 81.5% | 95.6% |
| Gemma 3 27B | Google | II | 10.0% | - | - |
| Gemma 2 27B | Google | II | 9.2% | - | - |
| Qwen3-32B (Dense) | Qwen | II | 8.0% | 72.9% | - |
| Qwen3-30B-A3B (MoE) | Qwen | II | 8.0% | 70.9% | - |
| EXAONE-Deep-32B | LG | II | - | 80% | - |
| Qwen3-14B | Qwen | II | - | - | - |
| EXAONE-Deep-7.8B | LG | II | - | 76.7% | - |
| Qwen3-8B | Qwen | II | - | - | - |
| EXAONE-Deep-2.4B | LG | II | - | 73.3% | - |
| Apriel-Nemotron-15B-Thinker | NVIDIA / ServiceNow | II | - | 60.0% | - |
| Gemma 3 12B | Google | II | 6.3% | - | - |
| Gemma 3n | Google | II | - | - | - |
| Gemma 3 4B | Google | III | 4.0% | - | - |
| Gemma 2 9B | Google | II | 5.3% | - | - |
| Phi 4 Reasoning Plus | Microsoft | II | 3.0% | 78.0% | - |
| Gemma 2 2B | Google | III | 2.8% | - | - |
| Gemma 3 1B | Google | III | 2.2% | - | - |
| Qwen3 4B | Qwen | III | 1.0% | 65.6% | - |

Notes

Missing Models

  • Mistral Medium 3
  • DeepSeek V3 03/24
  • Google Gemini 2.5 Pro May update
  • Llama 4 models (e.g., Llama 4 Behemoth)

For Claude 4, there was no score available for SimpleQA; however, someone tested Opus on a subset, and it scored the highest.

Tier Definitions

| Tier | Name |
| --- | --- |
| MI | Flagship Model |
| PI | Flagship Product |
| OI | Open-Weights |
| II | Consumer hardware |
| III | Edge hardware |
| IV | Speed |

More Benchmarks

https://www.swebench.com/#verified

https://lastexam.ai/

References

https://x.com/scaling01/status/1926017718286782643/photo/1

https://openai.com/index/introducing-o3-and-o4-mini/

https://gr.inc/OpenAI/SimpleQA/

https://qwenlm.github.io/blog/qwen3/

https://x.com/nathanhabib1011/status/1917230699582751157/photo/1

https://github.com/openai/simple-evals/tree/main

https://livecodebench.github.io/leaderboard.html

https://deepmind.google/technologies/gemini/flash/

https://matharena.ai/

https://github.com/lmarena/arena-hard-auto?tab=readme-ov-file#leaderboard

https://www.vals.ai/benchmarks/aime-2025-03-11

https://openai.com/index/introducing-simpleqa/

https://huggingface.co/microsoft/phi-4

https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/

https://x.ai/news/grok-3

https://brave.com/blog/ai-grounding/