AI Coding Leaderboard

May 5, 2025

https://blog.elijahlopez.ca/posts/ai-coding-leaderboard/ Elijah Lopez

DEPRECATED: Aider Polyglot is multi-lingual compared to SWE-bench verified and that does not require me to make use of my own aggregated leaderboard. I will still try my best to maintain this post, but just letting you know I suggest using Aider Polyglot going forward for which model to use for pair programming.

Aider polyglot: Based on the 225 most difficult Exercism coding problems in the following languages: C++, Go, Java, JavaScript, Python and Rust. See Aider polyglot coding leaderboard. The problem I see is that models can just train on the solutions and game this leaderboard versus LiveCodeBench which is free of contaminations.
SWE-bench verified (August 13, 2024). A subset of 500 Python-exclusive tasks from the full 2000 which have been verified by humans as solvable. This is the defacto way to evaluate AI models as almost almost all models will report their scores for this benchmark. The unofficial leaderboard includes SWE copilot tool. The problem is that it doesn’t include self-reported results unlike my leaderboard.
Codeforces: the elo represents how good the AI model is at competitive programming tasks. There is no leaderboard since most AI models will self-report or benchmark other models
The problem with EvalPlus is that it doesn’t include bleeding edge models, it’s basically almost solved, and not many new models even report their scores anymore.
I don’t like LiveCodeBench because it’s not useful for comparing different models released at different times due to (1) every-updating problem set (2) not continuously testing all frontier models. If a benchmark score has an expiry date, then referencing it is really bad for a leaderboard. It’s good for relative performance in papers however then you have to read the fine print to ensure the authors retested the models they compared their new model against. See how complicated that is? Then you’d have to compile your own leaderboard because can you really trust others to do it right? No you can’t! I may include some sort of LiveCodeBench relative ranking in the future.

Future: Replace AIder Polyglot with SWE-bench Multilingual

Ideally, the best programming tool would be implemented as follows, given the release of grok-1-fast:

An Architecture <-> Programmer workflow where an agent is in charge of responding to questions that the coder agent has. The thinker will be the one to plan out the project and break it into chunks for the programer to implement.

Leaderboard

Model / Product	Company	Tier	Aider Polyglot	SWE-bench verified	Codeforces	Date Available
Grok 4 Fast¹	xAI	PI	-	-	-	Sep-2025
Claude Sonnet 4.5 ²	Anthropic	PI	-	77.2%	-	Sep-2025
GPT-5-Codex³	OpenAI	PI	-	74.5%	-	Sep-2025
GPT-5³	OpenAI	PI	88%	72.8%	-	Sep-2025
Codex-1⁴	OpenAI	PI	-	72.1%	-	May-2025
Claude Opus 4.1⁵	Anthropic	PI	-	74.5%	-	Aug-2025
Claude Sonnet 4 ⁶	Anthropic	PI	-	72.7%	-	May-2025
Claude Opus 4⁶	Anthropic	PI	72.5%	72.5%	-	May-2025
Grok Code Fast 1⁷	xAI	PI	-	70.8%	-	Aug-2025
Claude Sonnet 3.7 (Custom Scaffold)⁸	Anthropic	PI	-	70.3%	-	Feb-2025
Gemini 2.5 Pro⁹	Google	PI	-	67.2%	-	May-2025
Qwen 3 Coder 480B¹⁰	Qwen	OI	61.8%	69.6%	-	Jul-2025
o3-pro¹¹	OpenAI	PI	84.9%	69.0%	2748	Apr-2025
o3¹¹	OpenAI	PI	79.6%	69.1%	2706	Apr-2025
Qwen 3 Max¹²	Qwen	PI	-	69.6%	-	Sep-2025
DeepSeek-V3.1-Terminus¹³	DeepSeek	OI	76.1%	68.4%	-	Sep-2025
o4-mini¹¹	OpenAI	PI	72.0%	68.1%	2719	Apr-2025
DeepSeek-V3.2-Exp¹⁴	DeepSeek	OI	74.5%	67.8%	-	Sep-2025
GLM-4.6¹⁵	Z AI	OI	-	68%	-	Sep-2025
DeepSeek-V3.1¹⁶	DeepSeek	OI	-	66.0%	-	Aug-2025
Code World Model¹⁷	Meta	OIII	-	65.8 %	-	Sep-2025
Kimi-K2-Instruct¹⁸	Moonshot AI	OI	60%	65.8%	-	Jul-2025
GLM-4.5¹⁹	Z AI	OI	-	64.2%	-	Jul-2025
gpt-oss-120B²⁰	OpenAI	OII	44.4%	62.4%	2622	Aug-2025
Gemini 2.5 Flash Thinking²¹	OpenAI	PI	61.9%	60.4%	-	Aug-2025
Gemini 2.5 Pro²²	Google	PI	82.2%	59.6%	2001	May-2025
GLM-4.5-Air ¹⁹	Z AI	OII	-	57.6%	-	Jul-2025
Claude 3.7 Sonnet⁸	Anthropic	OI	64.9%	62.3%	-	Feb-2025
o3-mini¹¹	OpenAI	PI	-	61.0%	2036	Apr-2025
gpt-oss-20B²⁰	OpenAI	OII	34.2%	60.7%	2516	Aug-2025
DeepSeek-R1-0528 (2025)²³	DeepSeek	OI	-	57.60% (Multiple Attempts)	1930	May-2025
Qwen3-235B-A22B²⁴	Qwen	OI	-	52.2¹²	2056	Apr-2025
Grok 3²⁵	xAI	PI	53.3%	-	-	Feb-2025
DeepSeek R1 (01/20)²⁶	DeepSeek	OI	56.9%	49.2%	1530	Jan-2025
DeepSeek V3 (03/24)¹⁶	DeepSeek	OI	-	45.4%	-	Mar-2025
Mistral Medium 3²⁷	Mistarl	PI	-	-	-	May-2025
ChatGPT 4.1²⁸	OpenAI	PI	-	55%	-	Apr-2025
Claude 3.5 Sonnet²⁹	Anthropic	PI	-	49%	-	Oct-2024
o1	OpenAI	PI	-	48.9%	1891	Dec-2024
Qwen3-32B²⁴	Qwen	OII	-	-	1977	Apr-2025
Qwen3-30B-A3B²⁴	Qwen	OII	-	-	1974	Apr-2025
Qwen3-14B	Qwen	OIII	-	-	-	Apr-2025
Qwen3-8B	Qwen	OIII	-	-	-	Apr-2025
~~DeepSeek-R1-Distill-70B~~	DeepSeek	OII	-	-	1633	Jan-2025
Phi 4 reasoning (14B)	Microsoft	OIII	-	-	1736	Dec-2025
Phi 4 reasoning plus	Microsoft	OIII	-	-	1723	Dec-2025
Qwen3-4B²⁴	Qwen	OIII	-	-	1671	Apr-2025
Gemma3-27B-IT	Google	OII	-	-	1063	Mar-2025
DeepSeek V3 (12/24)³⁰	DeepSeek	OI	-	42%	-	Dec-2024
Claude 3.5 Haiku³¹	Anthropic	PI	-	40.6%	-	Oct-2024
~~o1-preview~~	OpenAI	PI	-	40%	1258	Sep-2024
ChatGPT 4.5³²	OpenAI	PI	-	38.0%	-	Feb-2025
Claude 3.5 Sonnet (old)²⁹	Anthropic	PI	-	33.4%	-	Jun-2024
ChatGPT 4o³²	OpenAI	PI	-	33.2%	900	May-2024
o1-mini³³	OpenAI	PI	-	30%	1650	Sep-2024
Claude 3 Opus²⁹	Anthropic	PI	-	22%	-	Mar-2024

Definition of Tier

Tier	Name
P	Proprietary
O	Open Weights
I	Flagship
II	Dedicated Hardware (32GB < VRAM <= 144 GB)
III	<= 32GB VRAM
IV	Speed

When I say dedicated hardware, I’m talking about having to buy hardware dedicated for running AI models locally. The best bang for you buck available as far as I know is tiny box or Truffle with the former costing over 10k and the latter costing under 20k. I mean at that price point, unless you’re doing something sensitive, it’s better to use API endpoints via openrouter.

Notes

Google Gemini 2.5 Pro May update is missing from the leaderboard
I’m disappointed in the Grok team at xAI for failing to release benchmarks that can be compared against future frontier models (Gemini 2.5 Pro was released in March 2025, o4-mini was released in April 2025, and Qwen3 in April 2025). Their benchmark only proves that Grok 3 Beta and Grok 3 mini Beta (THINK) outperform o3-mini. They only really used LiveCodeBench which like I said before, is not a good way to do comparison across time. On LiveCodeBench, Grok 3 Beta scored higher than o3-mini and Deepseek-R1, but nothing really to compare against Gemini 2.5 Pro, which is the frontier model that got released after Grok 3. Even when Gemini 2.5 Pro was released, Grok 3 Beta’s SWE-bench verified score was and is still -. Amongst all the models Gemini 2.5 Pro compared against, Grok 3 Beta was the one with the most missing benchmarks (6). And that’s with the Gemini team excluding Claude 3.7’s estimated SimpleQA score of ~50.
I put R1 above o1 because R1 was better at software engineering not just programming. The example I can give is asking about fixing a race condition regarding writing to a file that needs to be read. R1 properly told me to use fsync (note: call F_FULLFSYNC for Darwin platforms)
The codeforces scores change depending on when they are recorded. This is the case for DeepSeek R1, which got a bump of ~150. I did this to ensure that Qwen3 models that don’t contain a SWE-bench verified score are not ranked above DeepSeek R1. I also removed Claude 3.5 Sonnet’s Codeforces score because there was apparently a (new) model released for it which has clearly not been benchmarked by the new models. It’s a bit weird that the new Qwen release does not benchmark claude 3.7.
Explaining Synthetic Rankings of models with missing scores.
- According to Mistral, Mistral Medium 3 performs just below Claude 3.7 and ranks DeepSeek V3 (03/24) above Claude 3.7 according to their LiveCodeBench benchmarking. However, until I see an updated SWE-bench verified on Mistral Medium 3, it will be placed below DeepSeek V3
- Qwen3-235B-A22B and Grok 3 are only placed so high due to outperforming o3-mini and DeepSeek R1 at the LiveCodeBench. Surprisingly, Gemini 2.5 Pro has the highest LiveCodeBench score according to Qwen’s blog. Therefore, it makes sense to ignore the CodeForces when computing the synthetic ranking for Qwen3-235B-A22B and Grok 3. Reasoning models are also worse for productivity because they take longer to respond.

Interpreting the Leaderboard

2025-05-05

Models that are prohibitively expensive: o3 (high), claude 3.7 + thinking.

In my opinion, given that Claude 3.7 sucks at following simple instructions and behaves like an intern when it modifies code it was told not to edit, the models I recommend as of comes down to:

o3 (too expensive?)
o4-mini
gemini-2.5-pro-preview

Conclusions

Why can’t I get paid to do this? If I was paid money, I would HAPPILY run benchmarks on all frontier models and maintain the leaderboards for all the necessary models. Hell I would even make a fancy UI and everything instead of just a static markdown table that can’t be sorted.

EvalPlus is deprecated; read more in the coding leaderboard
LCB scores are static and the depreciation mechanism makes it very difficult to compare a model that was just released to a model realeased a few months ago. For example, look at Grok 3. They exclusively use this benchmark, even though there’s a 99.99% chance that Grok 3’s future LCB score would be lower. This is exhibited well in Qwen’s blog which showed Gemini 2.5 Pro absolutely crushing LCB. The biggest issues for LCB is that if you have a restraint like “I only want to run open-source models less than 20B in size”, you will definitely not be able to benefit from LCB’s own published leaderboards.
xAI is releasing a suspicuiously low amount of benchmark scores. Not only that, but the xAI team has taken the approach that we all have patience. Their LCB score is useless to real world scenarios once you realize not only did it have to think to achieve them, gemini 2.5 pro beat it anyways. Not to mention that o4-mini and Gemini 2.5 Pro Preview were released on openrouter 7-8 days after grok 3 BETA was released on openrouter.
Qwen3 30B is a great model and has “deprecated” DeepSeek R1 Distill 70B

History of Releases

Anthropic: Claude 3 Opus scored scored 22% on SWE-bench verified, the paper was released later in August
OpenAI: ChatGPT 4o May 2025 scored 33.2% on SWE-bench verified They started off very strong in June 2024 with 3.5 Sonnet, but are slowly plateauing in performance
Anthropic: Claude 3.5 released in June, revolutionary even though on paper it was only 33.4%
Anthropic: Claude 3.5 released in June, revolutionary even though on paper it was only 33.4%
OpenAI: o1 preview and mini, the latter of which scores 30%, the former was scoring 40%.
Anthropic: October update Claude 3.5 Haiku scores 40.6% and the new Claude 3.5 Sonnet scores 49%, crowing it 2024’s best model
OpenAI: o1 released in Dec-2024 with a score of 48.9%
DeepSeek: V3 (open-weight) released in Dec-2024 with a score of 42%! 2025
DeepSeek: R1 (open) release in Jan-2025 with a score of 49.2%! Everyone started going crazy
Grok: Announces 3 Beta on 19th (still can’t find official 3 non-beta release blog post LOL) with some LCB scores
Anthropic: Sonnet 3.7 released on 24th; Raised the bar to the next level at 62.3%
OpenAI: disappointing chat gpt 4.5 release (Feb 27th)
Google: Fully releases Gemini 2.5 Pro in Mar-2025 with a score of 67.2%, also release Gemma3, but not as relevant
DeepSeek: V3 refresher
OpenAI: Strikes back with o3 in Apr-2025 with a score of 69.1%
Qwen: Qwen3 (open) released in Apr-2025, offers better performance for less size than DeepSeek R1 distills (made me very optimistic about open-weight models)
Anthropic: Claude 4 Sonnet comes in at 72.7% in may
DeepSeek: R1 refresh improves score to 57.6%
Grok: 4 release, leaked scores showing a “code” model scoring 75% on swe-bench verified. Removed from my leaderboard. They only compared to Gemini 2.5 Pro which was at that point “inferior” to both o3, and 4
Z, Kimi, Qwen: July was a win for open source with all three companies releasing models. Qwen 3 Coder 480B replaced R1 as the open SOTA score at 69.6%. Z’s GLM 4.5 Air showed that performance does not need to come with unattainable hardware.
August: OpenAI (open-source models, best performance at 20B - consumer level), 120B released redundant due to GLM 4.5 Air, Deepseek V3.1, Anthropic 4.1 opus SOTA 74.5%, Grok code-fast (very good performance for quick response time)
Septemeber: OpenAI codex + gpt5, Grok 4 Fast, V3.1 refresh

Anthropic started 2024 behind OpenAI but aggressively leapfrogged the competition multiple times to stay near the top

Qwen and DeepSeek reduced the performance gap. They are at the heels of proprietary companies. Open SOTA is 69.6% for swe-bench verified in July 2025 vs. 74%+ scores which came out in August and September 2025. If we go back to July 2025, only Anthropic was ahead at 72.5%.

OpenAI: Codex and gpt 5 is significant, but..

Grok: Grok Code Fast and Grok 4 shows that Grok team is changing direction and focusing on results and specialization rather than generalization. Their Code Fast models make them a company to take more seriously.

Google: Google seems to take it laid back (deserving so). The 2.5 Pro May update is not benchmarked as much but it keeps their model relevant. Google seems to focus on releasing models to maintain relevancy rather than cater to benchmark scores.

References

DeepSeek R1 claims 96.3th percentile which is [1989, 2095]³⁴ However OpenAI claims a score of 1891. Since DeepSeek reported that o1 has a higher codeforces percentile, we can assume that DeepSeek R1’s score should be strictly less than o1’s codeforces score, which is 1890 at best.

DeepSeek V3 claims 51.6th percentile and most recently 58.7th percentile. Given that R1 is below that percentile, I’ve used the lowest 51th percentile (assuming that the median barely moved up).

Grok 4 Fast scores higher than Grok 4 on the same LiveCodeBench (Jan-May 2025)³⁵ ↩︎
Introducing Claude Sonnet 4.5 ↩︎
Introducing Upgrades to Codex - OpenAI ↩︎ ↩︎
Introducing Codex ↩︎
Claude Opus 4.1 - Anthropic ↩︎
Introducing Claude 4 ↩︎ ↩︎
Grok Code Fast 1 News - xAI ↩︎
Claude 3.7 Sonnet - Anthropic ↩︎ ↩︎
Build rich, interactive web apps with an updated Gemini 2.5 Pro ↩︎
Qwen3 Coder - Agentic Coding Adventure (see picture beside “What Is Qwen3 Coder?”) ↩︎
Introducing O3 and O4 Mini - OpenAI ↩︎ ↩︎ ↩︎ ↩︎
Qwen3-Max: Just Scale it ↩︎ ↩︎
DeepSeek-V3.1-Terminus ↩︎
Introducing DeepSeek-V3.2-Exp ↩︎
GLM-4.6: Advanced Agentic, Reasoning and Coding Capabilities ↩︎
DeepSeek-V3.1 ↩︎ ↩︎
CWM: An Open-Weights LLM for Research on Code Generation with World Models ↩︎
Kimi-K2 - Moonshot AI ↩︎
GLM-4.5: Reasoning, Coding, and Agentic Abililties ↩︎ ↩︎
OpenAI GPT-OSS Model Card - OpenAI ↩︎ ↩︎
Gemini Flash (2025-09-27) ↩︎
Gemini Model Thinking Updates March 2025 - Google DeepMind ↩︎
DeepSeek-R1-0528 Release ↩︎
Qwen3: Think Deeper, Act Faster ↩︎ ↩︎ ↩︎ ↩︎
Grok 3 Beta — The Age of Reasoning Agents ↩︎
DeepSeek API News 2025-01-20 - DeepSeek ↩︎
Mistral Medium 3 News - Mistral AI ↩︎
GPT-4.1 - OpenAI ↩︎
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet ↩︎ ↩︎ ↩︎
Introducing DeepSeek-V3 ↩︎
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku ↩︎
Introducing GPT-4.5 - OpenAI ↩︎ ↩︎
O1 Mini: Advancing Cost-Efficient Reasoning - OpenAI ↩︎
2024 Codeforces Rating Distribution + rating percentiles ↩︎
Grok-4 and Grok-4 Code on benchmarks - x/@legit_api leak; Grok 4 has a higher LiveCodeBench score than all the other models for Jan 2025 to May 2025 ↩︎