🤖 AI Coding Benchmarks — April 2026

Researched by ollama/glm-5.1:cloud · April 14, 2026

Key Takeaways

  • Gemini 3.1 Pro leads third-party SWE-bench Verified (78.8%) and offers the largest context window (2M tokens).
  • GPT-5.4 leads Terminal-Bench 2.0 (75.1%), while GLM-5.1 took the SWE-bench Pro lead (58.4%) in April 2026.
  • GPT-5.3 Codex hits 78.0% on SWE-bench Verified at the lowest measured cost ($0.46/test) and latency (247s).
  • Open models now post near-frontier scores at a fraction of proprietary prices: Gemma 4 31B costs $0.13/$0.38 per 1M tokens versus $5/$25 for Opus 4.6.

The 12 Models Compared

Claude Opus 4.6 (Anthropic)

Price: $5/$25 per 1M tokens

Context: 1M tokens

Type: Proprietary

Claude Sonnet 4.6 (Anthropic)

Price: $3/$15 per 1M tokens

Context: 1M tokens

Type: Proprietary

GPT-5.4 (OpenAI)

Price: $2.50/$10.00 per 1M tokens

Context: 1M tokens

Type: Proprietary

GPT-5.3 Codex (OpenAI)

Price: $1.00/$3.00 per 1M tokens

Context: 1M tokens

Type: Proprietary

GLM-5.1 (Z.AI/Zhipu)

Price: $0.95/$3.15 per 1M tokens

Context: 200K tokens

Type: Open-source, 744B params

Kimi K2.5 (Moonshot AI)

Price: $0.38/$1.72 per 1M tokens

Context: 262K tokens

Type: Open-source, 1T params

MiniMax M2.7 (MiniMax)

Price: $0.30/$1.20 per 1M tokens

Context: 196K tokens

Type: Open-source, 230B+ params

MiniMax M2.5 Lightning (MiniMax)

Price: $0.30/$1.20 per 1M tokens

Context: 196K tokens

Type: Open-source, 230B+ params

Qwen 3.5 397B-A17B (Alibaba)

Price: $0.39/$2.34 per 1M tokens

Context: 262K tokens

Type: Open-source, MoE 397B total/17B active

Gemma 4 31B (Google)

Price: $0.13/$0.38 per 1M tokens

Context: 262K tokens

Type: Open-source, 31B dense

Gemini 3.1 Pro (Google)

Price: $0.50/$2.00 per 1M tokens

Context: 2M tokens

Type: Proprietary

DeepSeek V3.2 (DeepSeek)

Price: $0.20/$0.80 per 1M tokens

Context: 256K tokens

Type: Open-source, 236B MoE

SWE-bench Verified

Third-party evaluation by Vals.ai — April 2026. The gold standard for measuring AI performance on real-world software engineering tasks.

Model Accuracy Cost/test Latency
Gemini 3.1 Pro 78.8% $0.78 312s
GPT-5.4 78.2% $0.80 307s
Opus 4.6 (Thinking) 78.2% $1.22 351s
GPT-5.3 Codex 78.0% $0.46 247s
Sonnet 4.6 77.4% $1.30 512s
GLM-5.1 (Thinking) 76.4% $0.46 527s
MiniMax M2.5 Lightning 74.2% $0.46 403s
MiniMax M2.7 73.8% $0.47 886s

Note: Kimi K2.5, Qwen 3.5 397B, and Gemma 4 31B have no third-party SWE-bench Verified runs.
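
Reading the table as cost per solved task, rather than cost per attempt, changes the ranking: dividing cost/test by accuracy estimates what each resolved issue actually costs. A minimal sketch in Python, using only the figures from the Vals.ai table above (the helper function is illustrative, not part of any benchmark tooling):

```python
# Cost per *solved* SWE-bench Verified task: cost per test divided by accuracy.
# Accuracy and cost/test figures are copied from the Vals.ai table above.
results = {
    "Gemini 3.1 Pro":         (0.788, 0.78),
    "GPT-5.4":                (0.782, 0.80),
    "Opus 4.6 (Thinking)":    (0.782, 1.22),
    "GPT-5.3 Codex":          (0.780, 0.46),
    "Sonnet 4.6":             (0.774, 1.30),
    "GLM-5.1 (Thinking)":     (0.764, 0.46),
    "MiniMax M2.5 Lightning": (0.742, 0.46),
    "MiniMax M2.7":           (0.738, 0.47),
}

def cost_per_solve(accuracy: float, cost_per_test: float) -> float:
    """Expected dollars spent per successfully resolved task."""
    return cost_per_test / accuracy

for model, (acc, cost) in sorted(results.items(),
                                 key=lambda kv: cost_per_solve(*kv[1])):
    print(f"{model:<24} ${cost_per_solve(acc, cost):.2f}/solve")
```

By this measure GPT-5.3 Codex (~$0.59/solve) and GLM-5.1 (~$0.60/solve) lead, while Sonnet 4.6 costs roughly $1.68 per resolved task despite its higher raw accuracy.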

Self-Reported SWE-bench Verified

Manufacturer-reported scores, listed for transparency; third-party scores take precedence when available.

Model Score Notes
Opus 4.6 80.8% Anthropic self-report
Gemini 3.1 Pro 80.6% Google self-report
MiniMax M2.5 80.2% MiniMax self-report
Sonnet 4.6 79.6% Anthropic self-report
MiniMax M2.7 ~78% MiniMax claim; Vals.ai measured 73.8%
Kimi K2.5 76.8% Moonshot self-report
Qwen 3.5 397B N/A No SWE-bench Verified published
Gemma 4 31B N/A No SWE-bench Verified published
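
Where both a self-report and a Vals.ai measurement exist, the gap between them is worth tracking; MiniMax M2.7's roughly 4-point spread is the largest on this page. A small sketch comparing the two tables (score pairs come from this page and ignore thinking-mode variants; the 3-point threshold is an arbitrary choice for illustration):

```python
# Flag models whose self-reported SWE-bench Verified score exceeds the
# third-party (Vals.ai) measurement by more than a chosen threshold.
pairs = {  # model: (self_reported, third_party), copied from the tables above
    "Opus 4.6":       (80.8, 78.2),
    "Gemini 3.1 Pro": (80.6, 78.8),
    "Sonnet 4.6":     (79.6, 77.4),
    "MiniMax M2.7":   (78.0, 73.8),  # "~78%" claim
}

THRESHOLD = 3.0  # percentage points; an arbitrary cutoff for this sketch

for model, (claimed, measured) in pairs.items():
    gap = claimed - measured
    flag = "  <-- large gap" if gap > THRESHOLD else ""
    print(f"{model:<16} claimed {claimed:.1f}%  measured {measured:.1f}%  "
          f"gap {gap:+.1f} pts{flag}")
```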

SWE-bench Pro

Multi-language agentic coding benchmark, measuring software engineering capability across different programming languages and more complex scenarios.

Model Score Notes
GLM-5.1 58.4% New #1 (April 2026)
GPT-5.4 57.7% Previous #1
Opus 4.6 ~57.3%
MiniMax M2.7 56.22% Matches GPT-5.3 Codex
Gemini 3.1 Pro 54.2%
MiniMax M2.5 51.3% Multi-SWE-Bench

LiveCodeBench (Competitive Programming)

Real-time competitive programming benchmark. Measures coding ability under time pressure with novel problems.

Model Score Notes
Gemini 3.1 Pro 2887 Elo #1 on LCB Pro
Kimi K2.5 85%
DeepSeek V3.2 83.3% Reference only
Gemma 4 31B 80% Large jump from Gemma 3 27B's 29.1%

Note: scores are not directly comparable. Gemini 3.1 Pro's figure is an Elo rating on LCB Pro; the other entries are pass rates.
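
Because the table mixes rating systems, note what an Elo number does and does not say: it predicts head-to-head win probability against other rated entrants, not an absolute solve rate. A sketch of the standard Elo expected-score formula (the 2650 opponent rating is hypothetical):

```python
# Standard Elo expected score: E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400)).
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A in a head-to-head with B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Illustrative: a 2887-rated model vs. a hypothetical 2650-rated rival.
print(f"{elo_expected(2887, 2650):.1%}")  # ~79.6%
```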

Terminal-Bench 2.0 (DevOps/CLI)

Measures AI assistants' ability to handle command-line operations, shell scripting, and DevOps tasks.

Model Score
GPT-5.4 75.1%
Gemini 3.1 Pro 68.5%
Opus 4.6 65.4%
MiniMax M2.7 57.0%

Other Notable Benchmarks

Gemini 3.1 Pro

  • SWE-bench Verified #1 — 78.8% (third-party verified)
  • LiveCodeBench — 2887 Elo (#1 on LCB Pro)
  • Terminal-Bench 2.0 — 68.5%
  • SWE-bench Pro — 54.2%

GPT-5.4

  • SWE-bench Verified — 78.2% with fastest latency (307s)
  • SWE-bench Pro — 57.7% (previous #1)
  • Terminal-Bench 2.0 — 75.1% (DevOps/CLI leader)

GPT-5.3 Codex

  • SWE-bench Verified — 78.0%
  • Cost-effective performance at $0.46/test

GLM-5.1

  • KernelBench Level 3 — 3.6× geometric mean speedup
  • NL2Repo leader — excels at repository-level understanding
  • Can sustain 8-hour autonomous coding sessions
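
The KernelBench figure is a geometric mean, the standard way to aggregate per-task speedups because reciprocal slowdowns and speedups cancel symmetrically. A minimal sketch of the aggregation (the per-kernel speedups below are made up; only the 3.6× headline number comes from this page):

```python
import math

# Geometric mean of per-task speedups: exp(mean(log(s_i))).
# Unlike an arithmetic mean, a 2x speedup and a 0.5x slowdown average to 1x.
def geomean(speedups: list[float]) -> float:
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

sample = [1.8, 4.2, 2.9, 6.1, 3.3]  # hypothetical per-kernel speedups
print(f"{geomean(sample):.2f}x geometric mean speedup")
```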

Gemma 4 31B

  • AIME 2026 — 89.2%
  • Codeforces Elo — 2150
  • τ²-bench (agentic tool use) — 86.4%
  • GPQA Diamond — 84.3%

Qwen 3.5 397B

  • Arena AI chat preference — 1449 ± 6
  • Plus variant offers 1M context and built-in tools

DeepSeek V3.2

  • LiveCodeBench — 83.3% (reference benchmark)
  • Strong competitive programming performance

MiniMax M2.5 Lightning

  • SWE-bench Verified — 74.2% (third-party)
  • High performance-to-cost ratio

MiniMax M2.7

  • VIBE-Pro — 55.6% (end-to-end project delivery)
  • SWE Multilingual — 76.5%
  • Multi-SWE-Bench — 52.7%
  • MiniMax claims OpenClaw usage approaching Sonnet 4.6 on MMClaw

Kimi K2.5

  • Front-end development specialist
  • Open-source
  • 262K context

Pricing at a Glance (OpenRouter)

Prices per 1M tokens as listed on OpenRouter. The dramatic differences highlight the value proposition of newer open models.

Model Input/1M Output/1M Context Type
Gemma 4 31B $0.13 $0.38 262K Open
DeepSeek V3.2 $0.20 $0.80 256K Open
MiniMax M2.7 $0.30 $1.20 196K Open
MiniMax M2.5 Lightning $0.30 $1.20 196K Open
Kimi K2.5 $0.38 $1.72 262K Open
Qwen 3.5 397B $0.39 $2.34 262K Open
Gemini 3.1 Pro $0.50 $2.00 2M Proprietary
GLM-5.1 $0.95 $3.15 200K Open
GPT-5.3 Codex $1.00 $3.00 1M Proprietary
GPT-5.4 $2.50 $10.00 1M Proprietary
Sonnet 4.6 $3.00 $15.00 1M Proprietary
Opus 4.6 $5.00 $25.00 1M Proprietary
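
To turn per-token prices into a dollar figure for a concrete workload, multiply input and output token counts by the table's rates and divide by 1M. A minimal sketch, assuming the OpenRouter prices above (the 2M-input/300K-output session is hypothetical):

```python
# Estimate workload cost from per-1M-token prices (copied from the table above).
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "Gemma 4 31B": (0.13, 0.38),
    "GLM-5.1":     (0.95, 3.15),
    "Opus 4.6":    (5.00, 25.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: an agentic session consuming 2M input and 300K output tokens.
for model in PRICES:
    print(f"{model:<12} ${job_cost(model, 2_000_000, 300_000):.2f}")
```

On that hypothetical session the spread is roughly $0.37 (Gemma 4 31B) to $17.50 (Opus 4.6), a factor of about 47.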
