🤖 AI Coding Benchmarks — April 2026

Researched by ollama/glm-5.1:cloud · April 14, 2026

Key Takeaways

  • Gemini 3.1 Pro leads third-party SWE-bench Verified (78.8%) and offers the largest context window (2M tokens).
  • GPT-5.4 leads Terminal-Bench 2.0 (75.1%), while GLM-5.1 took the SWE-bench Pro lead (58.4%) in April 2026.
  • GPT-5.3 Codex hits 78.0% on SWE-bench Verified at the lowest measured cost ($0.46/test) and latency (247s).
  • Open models now post near-frontier scores at a fraction of proprietary prices: Gemma 4 31B costs $0.13/$0.38 per 1M tokens versus $5/$25 for Opus 4.6.

The 12 Models Compared

Claude Opus 4.6 (Anthropic)

Price: $5/$25 per 1M tokens

Context: 1M tokens

Type: Proprietary

Claude Sonnet 4.6 (Anthropic)

Price: $3/$15 per 1M tokens

Context: 1M tokens

Type: Proprietary

GPT-5.4 (OpenAI)

Price: $2.50/$10.00 per 1M tokens

Context: 1M tokens

Type: Proprietary

GPT-5.3 Codex (OpenAI)

Price: $1.00/$3.00 per 1M tokens

Context: 1M tokens

Type: Proprietary

GLM-5.1 (Z.AI/Zhipu)

Price: $0.95/$3.15 per 1M tokens

Context: 200K tokens

Type: Open-source, 744B params

Kimi K2.5 (Moonshot AI)

Price: $0.38/$1.72 per 1M tokens

Context: 262K tokens

Type: Open-source, 1T params

MiniMax M2.7 (MiniMax)

Price: $0.30/$1.20 per 1M tokens

Context: 196K tokens

Type: Open-source, 230B+ params

MiniMax M2.5 Lightning (MiniMax)

Price: $0.30/$1.20 per 1M tokens

Context: 196K tokens

Type: Open-source, 230B+ params

Qwen 3.5 397B-A17B (Alibaba)

Price: $0.39/$2.34 per 1M tokens

Context: 262K tokens

Type: Open-source, MoE 397B total/17B active

Gemma 4 31B (Google)

Price: $0.13/$0.38 per 1M tokens

Context: 262K tokens

Type: Open-source, 31B dense

Gemini 3.1 Pro (Google)

Price: $0.50/$2.00 per 1M tokens

Context: 2M tokens

Type: Proprietary

DeepSeek V3.2 (DeepSeek)

Price: $0.20/$0.80 per 1M tokens

Context: 256K tokens

Type: Open-source, 236B MoE

SWE-bench Verified

Third-party evaluation by Vals.ai — April 2026. The gold standard for measuring AI performance on real-world software engineering tasks.

Model Accuracy Cost/test Latency
Gemini 3.1 Pro 78.8% $0.78 312s
GPT-5.4 78.2% $0.80 307s
Opus 4.6 (Thinking) 78.2% $1.22 351s
GPT-5.3 Codex 78.0% $0.46 247s
Sonnet 4.6 77.4% $1.30 512s
GLM-5.1 (Thinking) 76.4% $0.46 527s
MiniMax M2.5 Lightning 74.2% $0.46 403s
MiniMax M2.7 73.8% $0.47 886s

Note: Kimi K2.5, Qwen 3.5 397B, and Gemma 4 31B have no third-party SWE-bench Verified runs.
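
Reading the table as cost per solved task, rather than cost per attempt, changes the ranking: dividing cost/test by accuracy estimates what each resolved issue actually costs. A minimal sketch in Python, using only the figures from the Vals.ai table above (the helper function is illustrative, not part of any benchmark tooling):

```python
# Cost per *solved* SWE-bench Verified task: cost per test divided by accuracy.
# Accuracy and cost/test figures are copied from the Vals.ai table above.
results = {
    "Gemini 3.1 Pro":         (0.788, 0.78),
    "GPT-5.4":                (0.782, 0.80),
    "Opus 4.6 (Thinking)":    (0.782, 1.22),
    "GPT-5.3 Codex":          (0.780, 0.46),
    "Sonnet 4.6":             (0.774, 1.30),
    "GLM-5.1 (Thinking)":     (0.764, 0.46),
    "MiniMax M2.5 Lightning": (0.742, 0.46),
    "MiniMax M2.7":           (0.738, 0.47),
}

def cost_per_solve(accuracy: float, cost_per_test: float) -> float:
    """Expected dollars spent per successfully resolved task."""
    return cost_per_test / accuracy

for model, (acc, cost) in sorted(results.items(),
                                 key=lambda kv: cost_per_solve(*kv[1])):
    print(f"{model:<24} ${cost_per_solve(acc, cost):.2f}/solve")
```

By this measure GPT-5.3 Codex (~$0.59/solve) and GLM-5.1 (~$0.60/solve) lead, while Sonnet 4.6 costs roughly $1.68 per resolved task despite its higher raw accuracy.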

Self-Reported SWE-bench Verified

Manufacturer-reported scores, listed for transparency; third-party scores take precedence when available.

Model Score Notes
Opus 4.6 80.8% Anthropic self-report
Gemini 3.1 Pro 80.6% Google self-report
MiniMax M2.5 80.2% MiniMax self-report
Sonnet 4.6 79.6% Anthropic self-report
MiniMax M2.7 ~78% MiniMax claim; Vals.ai measured 73.8%
Kimi K2.5 76.8% Moonshot self-report
Qwen 3.5 397B N/A No SWE-bench Verified published
Gemma 4 31B N/A No SWE-bench Verified published
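
Where both a self-report and a Vals.ai measurement exist, the gap between them is worth tracking; MiniMax M2.7's roughly 4-point spread is the largest on this page. A small sketch comparing the two tables (score pairs come from this page and ignore thinking-mode variants; the 3-point threshold is an arbitrary choice for illustration):

```python
# Flag models whose self-reported SWE-bench Verified score exceeds the
# third-party (Vals.ai) measurement by more than a chosen threshold.
pairs = {  # model: (self_reported, third_party), copied from the tables above
    "Opus 4.6":       (80.8, 78.2),
    "Gemini 3.1 Pro": (80.6, 78.8),
    "Sonnet 4.6":     (79.6, 77.4),
    "MiniMax M2.7":   (78.0, 73.8),  # "~78%" claim
}

THRESHOLD = 3.0  # percentage points; an arbitrary cutoff for this sketch

for model, (claimed, measured) in pairs.items():
    gap = claimed - measured
    flag = "  <-- large gap" if gap > THRESHOLD else ""
    print(f"{model:<16} claimed {claimed:.1f}%  measured {measured:.1f}%  "
          f"gap {gap:+.1f} pts{flag}")
```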

SWE-bench Pro

Multi-language agentic coding benchmark, measuring software engineering capability across different programming languages and more complex scenarios.

Model Score Notes
GLM-5.1 58.4% New #1 (April 2026)
GPT-5.4 57.7% Previous #1
Opus 4.6 ~57.3%
MiniMax M2.7 56.22% Matches GPT-5.3 Codex
Gemini 3.1 Pro 54.2%
MiniMax M2.5 51.3% Multi-SWE-Bench

LiveCodeBench (Competitive Programming)

Real-time competitive programming benchmark. Measures coding ability under time pressure with novel problems.

Model Score Notes
Gemini 3.1 Pro 2887 Elo #1 on LCB Pro
Kimi K2.5 85%
DeepSeek V3.2 83.3% Reference only
Gemma 4 31B 80% Large jump from Gemma 3 27B's 29.1%

Note: scores are not directly comparable. Gemini 3.1 Pro's figure is an Elo rating on LCB Pro; the other entries are pass rates.
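
Because the table mixes rating systems, note what an Elo number does and does not say: it predicts head-to-head win probability against other rated entrants, not an absolute solve rate. A sketch of the standard Elo expected-score formula (the 2650 opponent rating is hypothetical):

```python
# Standard Elo expected score: E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400)).
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A in a head-to-head with B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Illustrative: a 2887-rated model vs. a hypothetical 2650-rated rival.
print(f"{elo_expected(2887, 2650):.1%}")  # ~79.6%
```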

Terminal-Bench 2.0 (DevOps/CLI)

Measures AI assistants' ability to handle command-line operations, shell scripting, and DevOps tasks.

Model Score
GPT-5.4 75.1%
Gemini 3.1 Pro 68.5%
Opus 4.6 65.4%
MiniMax M2.7 57.0%

Other Notable Benchmarks

Gemini 3.1 Pro

  • SWE-bench Verified #1 — 78.8% (third-party verified)
  • LiveCodeBench — 2887 Elo (#1 on LCB Pro)
  • Terminal-Bench 2.0 — 68.5%
  • SWE-bench Pro — 54.2%

GPT-5.4

  • SWE-bench Verified — 78.2% with fastest latency (307s)
  • SWE-bench Pro — 57.7% (previous #1)
  • Terminal-Bench 2.0 — 75.1% (DevOps/CLI leader)

GPT-5.3 Codex

  • SWE-bench Verified — 78.0%
  • Cost-effective performance at $0.46/test

GLM-5.1

  • KernelBench Level 3 — 3.6× geometric mean speedup
  • NL2Repo leader — excels at repository-level understanding
  • Can sustain 8-hour autonomous coding sessions
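
The KernelBench figure is a geometric mean, the standard way to aggregate per-task speedups because reciprocal slowdowns and speedups cancel symmetrically. A minimal sketch of the aggregation (the per-kernel speedups below are made up; only the 3.6× headline number comes from this page):

```python
import math

# Geometric mean of per-task speedups: exp(mean(log(s_i))).
# Unlike an arithmetic mean, a 2x speedup and a 0.5x slowdown average to 1x.
def geomean(speedups: list[float]) -> float:
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

sample = [1.8, 4.2, 2.9, 6.1, 3.3]  # hypothetical per-kernel speedups
print(f"{geomean(sample):.2f}x geometric mean speedup")
```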

Gemma 4 31B

  • AIME 2026 — 89.2%
  • Codeforces Elo — 2150
  • τ²-bench (agentic tool use) — 86.4%
  • GPQA Diamond — 84.3%

Qwen 3.5 397B

  • Arena AI chat preference — 1449 ± 6
  • Plus variant offers 1M context and built-in tools

DeepSeek V3.2

  • LiveCodeBench — 83.3% (reference benchmark)
  • Strong competitive programming performance

MiniMax M2.5 Lightning

  • SWE-bench Verified — 74.2% (third-party)
  • High performance-to-cost ratio

MiniMax M2.7

  • VIBE-Pro — 55.6% (end-to-end project delivery)
  • SWE Multilingual — 76.5%
  • Multi-SWE-Bench — 52.7%
  • MiniMax claims OpenClaw usage approaching Sonnet 4.6 on MMClaw

Kimi K2.5

  • Front-end development specialist
  • Open-source
  • 262K context

Pricing at a Glance (OpenRouter)

Prices per 1M tokens as listed on OpenRouter. The dramatic differences highlight the value proposition of newer open models.

Model Input/1M Output/1M Context Type
Gemma 4 31B $0.13 $0.38 262K Open
DeepSeek V3.2 $0.20 $0.80 256K Open
MiniMax M2.7 $0.30 $1.20 196K Open
MiniMax M2.5 Lightning $0.30 $1.20 196K Open
Kimi K2.5 $0.38 $1.72 262K Open
Qwen 3.5 397B $0.39 $2.34 262K Open
Gemini 3.1 Pro $0.50 $2.00 2M Proprietary
GLM-5.1 $0.95 $3.15 200K Open
GPT-5.3 Codex $1.00 $3.00 1M Proprietary
GPT-5.4 $2.50 $10.00 1M Proprietary
Sonnet 4.6 $3.00 $15.00 1M Proprietary
Opus 4.6 $5.00 $25.00 1M Proprietary
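
To turn per-token prices into a dollar figure for a concrete workload, multiply input and output token counts by the table's rates and divide by 1M. A minimal sketch, assuming the OpenRouter prices above (the 2M-input/300K-output session is hypothetical):

```python
# Estimate workload cost from per-1M-token prices (copied from the table above).
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "Gemma 4 31B": (0.13, 0.38),
    "GLM-5.1":     (0.95, 3.15),
    "Opus 4.6":    (5.00, 25.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: an agentic session consuming 2M input and 300K output tokens.
for model in PRICES:
    print(f"{model:<12} ${job_cost(model, 2_000_000, 300_000):.2f}")
```

On that hypothetical session the spread is roughly $0.37 (Gemma 4 31B) to $17.50 (Opus 4.6), a factor of about 47.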
