Datacurve releases DeepSWE, a coding benchmark of 113 tasks across 91 open-source repositories: GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Opus 4.7 at 54%
For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same.
VentureBeat · Michael Nuñez
Related Coverage
- DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks DeepSWE
- DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks Datacurve on GitHub
- DeepSWE: A contamination-free benchmark for long-horizon coding agents Hacker News
- Huawei's New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail Decrypt · Jose Antonio Lanz
Discussion
- Serena Ge (@serenaa_ge) on X: Today we're releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. [image…
- @theo on X: This is the first code bench that actually aligns with how it feels to use these models coding.
- @scaling01 on X: New coding benchmark. GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀
- Ben (@andersonbcdefg) on X: official confirmation that the claude code harness has become slop btw [image]
- Max Weinbach (@mweinbach) on X: This roughly feels the most accurate for how models perform on agentic coding Just looking at it, that's what I've noticed as well for nearly every model
- Gabriel Chua (@gabrielchua) on X: 5.5 xhigh goes brrrrr
- Nick Dobos (@nickadobos) on X: First correct benchmark I've seen in a while
- Garry Tan (@garrytan) on X: This is the new standard for engineering evals [image]
- Chris (@chrisgpt) on X: wait a minute 💀 they made a benchmark to test whether coding agents can handle real long horizon engineering work - repo understanding, multi file edits, tool use, debugging loops, test feedback, and keeping the system coherent across the whole task and GPT 5.5 is already at [ima…
- @scaling01 on X: > they built a “NEW” coding benchmark > GPT-5.5 scores 70% > Mythos probably ~90% > mfw it's already saturated > and you are asking “when will the AI bubble pop?”
- Jason (@jxnlco) on X: wow evals caught up to vibes
- @scaling01 on X: DeepSeek bros have to lock in [image]
- @antirez on X: New long frontier coding task benchmark. Didn't yet checked the actual tests performed but looks interesting and more companies doing tests = good.
- @kimmonismus on X: It's truly amazing to see how the general sentiment has shifted in favor of Codex. I'm reading so many posts saying that Codex is really good now with GPT-5.5, and that Claude Code is regularly preferred. (I've become a huge Codex fan myself). At the same time, the new DeepSWE [i…
- r/LocalLLaMA on Reddit: New DeepSWE benchmark finds Claude Opus cheats
- r/claude on Reddit: Claude cheated at SWEBench Pro by checking git history to copy/paste solutions. Underperforms OpenAI at DeepSWEBench