VOICE ARCHIVE

@arcprize
15 posts
2026-02-21
Gemini 3.1 Pro on ARC-AGI Semi-Private Eval @GoogleDeepMind
- ARC-AGI-1: 98%, $0.52/task
- ARC-AGI-2: 77%, $0.96/task
Gemini continues to push the Pareto Frontier of performance and efficiency [image]
View on X
9to5Google

Google rolls out Gemini 3.1 Pro, which it says is “a step forward in core reasoning”, for all users in the Gemini app; the .1 increment is a first for Google

In November, Google introduced Gemini 3 Pro in preview, with Gemini 3 Flash following a month later.

2026-02-18
Claude Sonnet 4.6 (120K Thinking) on ARC-AGI Semi-Private Eval @AnthropicAI
Max Effort:
- ARC-AGI-1: 86%, $1.45/task
- ARC-AGI-2: 58%, $2.72/task
[image]
View on X
Anthropic

Anthropic launches Claude Sonnet 4.6 with improvements in coding, computer use, instruction following, and more; it features a 1M token context window in beta

Claude Sonnet 4.6 is our most capable Sonnet model yet.  It's a full upgrade of the model's skills across coding, computer use …

2026-02-13
Gemini 3 Deep Think (2/26) Semi-Private Eval
- ARC-AGI-1: 96.0%, $7.17/task
- ARC-AGI-2: 84.6%, $13.62/task
New ARC-AGI SOTA model from @GoogleDeepMind [image]
View on X
The Keyword

Google updates Gemini 3 Deep Think to better solve modern science, research, and engineering challenges and expands it via the Gemini API to some researchers

Our most specialized reasoning mode is now updated to solve modern science, research and engineering challenges.

2025-11-18
Gemini 3 models from @Google @GoogleDeepMind have made a significant 2X SOTA jump on ARC-AGI-2 (Semi-Private Eval)
- Gemini 3 Pro: 31.11%, $0.81/task
- Gemini 3 Deep Think (Preview): 45.14%, $77.16/task
[image]
View on X
The Verge

Google unveils Gemini 3, its “most intelligent” and “factually accurate” model yet, with improvements across coding and reasoning, and offering less “flattery”

The flagship Gemini 3 Pro model is coming to the Gemini app and Search, with improvements across coding, reasoning, and less ‘flattery.’

2025-08-08
GPT-5 on ARC-AGI Semi-Private Eval
GPT-5
* ARC-AGI-1: 65.7%, $0.51/task
* ARC-AGI-2: 9.9%, $0.73/task
GPT-5 Mini
* ARC-AGI-1: 54.3%, $0.12/task
* ARC-AGI-2: 4.4%, $0.20/task
GPT-5 Nano
* ARC-AGI-1: 16.5%, $0.03/task
* ARC-AGI-2: 2.5%, $0.03/task
[image]
View on X
VentureBeat

OpenAI touts GPT-5's scores on math, coding, and health benchmarks: 94.6% on AIME 2025 without tools, 74.9% on SWE-bench Verified, and 46.2% on HealthBench Hard

After literally years of hype and speculation, OpenAI has officially launched a new lineup of large language models (LLMs) …

2025-07-10
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA. [image]
View on X
@artificialanlys

Artificial Analysis benchmarks: Grok 4 is now the leading AI model, a first for xAI; Grok 4's per-token pricing is more expensive than Gemini 2.5 Pro's and o3's

xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis...

Tom's Guide

xAI introduces Grok 4, trained on its Colossus supercomputer, with multimodal features, faster reasoning, Grok 4 Voice, Grok 4 Code, a new interface, and more

Deeper thinking and greater reasoning is promised  —  An hour after the live stream was supposed to start last night (July 9) …

2025-04-15
GPT-4.1 on ARC-AGI's Semi-Private Evaluation
GPT-4.1:
* ARC-AGI-1: 5.5% ($0.039/task)
* ARC-AGI-2: 0.0% ($0.069/task)
GPT-4.1-Mini:
* ARC-AGI-1: 3.5% ($0.0078/task)
* ARC-AGI-2: 0.0% ($0.0139/task)
GPT-4.1-Nano:
* ARC-AGI-1: 0.0% ($0.0021/task)
* ARC-AGI-2: 0.0% ($0.0036/task)
[image]
View on X
TechCrunch

OpenAI releases GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano, which excel at coding, instruction following, and long context understanding, available via its API

OpenAI on Monday launched a new family of models called GPT-4.1.  Yes, “4.1” — as if the company's nomenclature wasn't confusing enough already.

2025-03-26
Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems (same relative ease for humans).
Grand Prize: 85%, ~$0.42/task efficiency
Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%
[image]
View on X
TechCrunch

The Arc Prize Foundation says its new ARC-AGI-2 test stumps most AI models; humans get 60% of the questions right but GPT-4.5 and Claude 3.7 Sonnet score ~1%

[image] François Chollet / @fchollet: Unlike ARC-AGI-1, this new version is not easily brute-forced. Current top AI approaches score 0-4%. All base LLMs (GPT-4.5, Claude 3.7 Son...

2024-12-22
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4 [image]
View on X
TechCrunch

OpenAI unveils o3 and o3-mini, trained to “think” before responding via what OpenAI calls a “private chain of thought”, and plans to launch them in early 2025

12 Days of OpenAI: Day 12
Naomi Li Gan / Tech in Asia: OpenAI unveils AI model for advanced reasoning
Bojan Stojkovski / Interesting Engineering: OpenAI unveils o3 reasoning AI m...
