Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories: GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Opus 4.7 at 54%

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same.

VentureBeat 2026-05-27 Michael Nuñez

Context & Ripple Effects

Earlier coverage produced a mixed picture of OpenAI’s coding progress: GPT-5 was reported to improve practical engineering and reasoning, while some developers still preferred Claude models for code generation. OpenAI later emphasized GPT-5.5’s gains in agentic coding and longer-context work.

DeepSWE adds a repository-spanning test across five languages to that arc and reports a wider separation among leading models than prior benchmark narratives implied. The result also arrives while GPT-5.5 is priced above GPT-5.4 on both input and output tokens.

First-order effects

GPT-5.5 gains a concrete performance claim over GPT-5.4 and Opus 4.7 on Datacurve’s 113-task DeepSWE evaluation, giving OpenAI a stronger benchmark-based case for coding deployments.
Enterprise buyers evaluating agentic software-engineering tools now have an additional test focused on work across open-source repositories, rather than treating top models as interchangeable.

Second-order effects

The reported gap raises pressure on rival model providers to demonstrate performance on repository-level tasks, not only on established coding benchmarks or developer anecdotes.
For buyers, GPT-5.5’s higher token pricing makes the decision less about benchmark leadership alone: teams will need to compare the added task-completion performance with inference cost in their own engineering workflows.
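One rough way to frame that trade-off is expected cost per solved task: cost per attempt divided by pass rate. The sketch below is illustrative only. The pass rates are the DeepSWE figures quoted above; the function names are hypothetical, no real prices are used, and the model assumes every attempt costs the same whether or not it succeeds and ignores the engineering time spent reviewing failures.

```python
# Pass rates are the DeepSWE figures quoted in this digest.
PASS_RATE = {"GPT-5.5": 0.70, "GPT-5.4": 0.56, "Opus 4.7": 0.54}


def cost_per_solved_task(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per successfully completed task.

    Assumes each attempt costs the same and succeeds independently
    with probability `pass_rate` (a simplification).
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


def break_even_cost_ratio(rate_a: float, rate_b: float) -> float:
    """How many times more model A may cost per attempt than model B
    before A's cost per solved task exceeds B's."""
    return rate_a / rate_b


if __name__ == "__main__":
    for rival in ("GPT-5.4", "Opus 4.7"):
        ratio = break_even_cost_ratio(PASS_RATE["GPT-5.5"], PASS_RATE[rival])
        print(f"GPT-5.5 vs {rival}: break-even cost ratio {ratio:.2f}")
```

Under these assumptions, GPT-5.5 stays cheaper per solved task than GPT-5.4 as long as an attempt costs no more than about 1.25 times as much (0.70 / 0.56). Teams would need to substitute their own measured per-attempt costs and pass rates on their own workloads.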

Third-order effects

If repository-spanning evaluations such as DeepSWE gain adoption, coding-model competition may shift toward measurable end-to-end engineering reliability, with benchmark design becoming more consequential for procurement and model positioning.
The result underscores an emerging segmentation in AI coding: models may differentiate by their ability to sustain long-context, multi-step work rather than by isolated code-generation quality. Whether DeepSWE becomes a durable reference remains unproven.

The trend: AI coding is moving from single-task generation benchmarks toward evaluations of agentic, repository-level software engineering, where capability gains must be weighed against operating cost.

Discussion

@serenaa_ge Serena Ge on x
Today we're releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. [image…
@theo @theo on x
This is the first code bench that actually aligns with how it feels to use these models coding.
@scaling01 @scaling01 on x
New coding benchmark. GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀
@andersonbcdefg Ben on x
official confirmation that the claude code harness has become slop btw [image]
@mweinbach Max Weinbach on x
This roughly feels the most accurate for how models perform on agentic coding Just looking at it, that's what I've noticed as well for nearly every model
@gabrielchua Gabriel Chua on x
5.5 xhigh goes brrrrr
@nickadobos Nick Dobos on x
First correct benchmark I've seen in a while
@garrytan Garry Tan on x
This is the new standard for engineering evals [image]
@chrisgpt Chris on x
wait a minute 💀 they made a benchmark to test whether coding agents can handle real long horizon engineering work - repo understanding, multi file edits, tool use, debugging loops, test feedback, and keeping the system coherent across the whole task and GPT 5.5 is already at [ima…
@scaling01 @scaling01 on x
> they built a “NEW” coding benchmark > GPT-5.5 scores 70% > Mythos probably ~90% > mfw it's already saturated > and you are asking “when will the AI bubble pop?”
@jxnlco Jason on x
wow evals caught up to vibes
@scaling01 @scaling01 on x
DeepSeek bros have to lock in [image]
@antirez @antirez on x
New long frontier coding task benchmark. Didn't yet checked the actual tests performed but looks interesting and more companies doing tests = good.
@kimmonismus @kimmonismus on x
It's truly amazing to see how the general sentiment has shifted in favor of Codex. I'm reading so many posts saying that Codex is really good now with GPT-5.5, and that Claude Code is regularly preferred. (I've become a huge Codex fan myself). At the same time, the new DeepSWE [i…
r/LocalLLaMA r on reddit
New DeepSWE benchmark finds Claude Opus cheats
r/claude r on reddit
Claude cheated at SWEBench Pro by checking git history to copy/paste solutions. Underperforms OpenAI at DeepSWEBench

Chronicles