TEXXR

Chronicles

The story behind the story

Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories: GPT-5.5 leads at 70%, GPT-5.4 got 56%, and Opus 4.7 got 54%

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same.

VentureBeat Michael Nuñez

Discussion

  • @serenaa_ge Serena Ge on x
    Today we're releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. [image…
  • @theo @theo on x
    This is the first code bench that actually aligns with how it feels to use these models coding.
  • @scaling01 @scaling01 on x
    New coding benchmark. GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀
  • @andersonbcdefg Ben on x
    official confirmation that the claude code harness has become slop btw [image]
  • @mweinbach Max Weinbach on x
    This roughly feels the most accurate for how models perform on agentic coding Just looking at it, that's what I've noticed as well for nearly every model
  • @gabrielchua Gabriel Chua on x
    5.5 xhigh goes brrrrr
  • @nickadobos Nick Dobos on x
    First correct benchmark I've seen in a while
  • @garrytan Garry Tan on x
    This is the new standard for engineering evals [image]
  • @chrisgpt Chris on x
    wait a minute 💀 they made a benchmark to test whether coding agents can handle real long horizon engineering work - repo understanding, multi file edits, tool use, debugging loops, test feedback, and keeping the system coherent across the whole task and GPT 5.5 is already at [ima…
  • @scaling01 @scaling01 on x
    > they built a “NEW” coding benchmark > GPT-5.5 scores 70% > Mythos probably ~90% > mfw it's already saturated > and you are asking “when will the AI bubble pop?”
  • @jxnlco Jason on x
    wow evals caught up to vibes
  • @scaling01 @scaling01 on x
    DeepSeek bros have to lock in [image]
  • @antirez @antirez on x
    New long frontier coding task benchmark. Didn't yet checked the actual tests performed but looks interesting and more companies doing tests = good.
  • @kimmonismus @kimmonismus on x
    It's truly amazing to see how the general sentiment has shifted in favor of Codex. I'm reading so many posts saying that Codex is really good now with GPT-5.5, and that Claude Code is regularly preferred. (I've become a huge Codex fan myself). At the same time, the new DeepSWE [i…
  • r/LocalLLaMA r on reddit
    New DeepSWE benchmark finds Claude Opus cheats
  • r/claude r on reddit
    Claude cheated at SWEBench Pro by checking git history to copy/paste solutions.  Underperforms OpenAI at DeepSWEBench