ARC-AGI V3 Explained: The New AI Benchmark That Breaks Every Agent
François Chollet launched ARC-AGI V3 — interactive video game environments where agents must learn goals and controls with zero instructions. Humans: 100%. GPT-5.4 + Opus 4.6: 0.3%. This is the benchmark that exposes the gap between trained intelligence and actual intelligence.

The Score That Should Have Everyone Worried
On March 28, 2026, the AI world got a number that should be printed on every AGI roadmap in a very large font: 0.3%.
That’s the score GPT-5.4 High and Claude Opus 4.6 Max — the two most capable AI systems on the planet — achieved on ARC-AGI V3. At a cost of $5,000 to $9,000 per task.
Humans? 100%.
Symbolica’s Agentica SDK? 36% — and a total bill of about $1,005 for 113 of 182 levels.
This isn’t a minor benchmark update. ARC-AGI V3 is the clearest signal yet that the AI industry has been solving the wrong problem — and has been celebrating its progress on a test that measures something fundamentally different from intelligence.
• Humans: 100% success rate
• Symbolica Agentica SDK: 36.08% (113/182 levels, $1,005 total)
• GPT-5.4 High: ~0.3% (at $5,000–9,000 per task)
• Claude Opus 4.6 Max: ~0.25–0.3% (similar cost profile)
What Is ARC-AGI, and Why Does It Keep Mattering?
The Abstraction and Reasoning Corpus (ARC) is the benchmark that François Chollet — Keras creator, Google DeepMind researcher, and arguably the AI field’s most credible skeptic — designed specifically to measure fluid intelligence rather than memorized knowledge.
The core insight: if a model has seen enough training examples, it can score well on almost any benchmark. ARC was designed from the start to resist this. Each puzzle is a visual grid challenge that requires genuine reasoning about novel rules, not retrieval of familiar patterns.
When ARC-AGI launched, it was near-impossible for AI. Then, over several years, frontier models crept upward. The race was on.
V1 → V2 → V3: Closing the Escape Hatches
ARC-AGI V1 (2019): The original benchmark. Static 2D grid puzzles. Given a few input-output examples, derive the transformation rule and apply it to a new input. Simple in presentation; brutally hard for AI.
ARC-AGI V2 (2025): Addressed the contamination problem. As V1 scores improved, critics argued models were pattern-matching against training data that included ARC-like examples. V2 introduced harder, more compositional puzzles with stricter novelty guarantees.
ARC-AGI V3 (2026): This is a category shift. V3 isn’t just harder static puzzles — it’s a completely different paradigm.
How ARC-AGI V3 Actually Works
V3 drops agents into interactive video game environments. Not static grids. Live mini-games.
Here’s what that means in practice:
- The agent is presented with a novel mini-game it has never seen before
- There are zero instructions — no explanation of the goal, no description of the controls, no hint about the rules
- The agent has a limited number of turns to figure everything out
- Success means: discover the goal, learn the controls, understand the rules, and complete the task — all from live experimentation
This is how humans learn to play new games. You pick up a controller, press buttons, observe what happens, form a model of the game’s logic, and iterate. A 10-year-old can master a new mobile game in minutes doing exactly this.
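The interaction contract above can be sketched in code. The real V3 interface is not specified in this article, so everything below — the class name, the `step` signature, the hidden rule — is a hypothetical stand-in for "opaque actions, no instructions, limited turns":

```python
import random

class UnknownGame:
    """Hypothetical stand-in for a V3-style level: the agent sees only an
    observation and a reward; goal, rules, and controls are never described."""
    def __init__(self, turn_budget=20):
        self._pos, self._goal = 0, 3            # hidden state and hidden goal
        self._turns, self._budget = 0, turn_budget

    def step(self, action):
        """Actions are opaque integers; their meaning must be discovered."""
        self._turns += 1
        if action == 1:                         # hidden rule: action 1 moves right
            self._pos += 1
        reward = 1 if self._pos == self._goal else 0
        done = reward == 1 or self._turns >= self._budget
        return {"frame": self._pos}, reward, done

# A blind baseline: press buttons, observe, repeat until the budget runs out.
env, done, solved = UnknownGame(), False, False
while not done:
    obs, reward, done = env.step(random.choice(range(4)))
    solved = solved or reward == 1
```

Anything smarter than `random.choice` has to build, from transitions like `obs` and `reward`, a model of what the buttons do — which is exactly the capability V3 measures.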
Current AI systems? They break.
The reason is fundamental: large language models are trained on vast amounts of text and data. They’ve absorbed enormous amounts of human knowledge. But they haven’t been trained to explore and discover in live, interactive environments — to form hypotheses, test them, observe results, and update their mental model in real time with no prior context.
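That hypothesize-test-update loop can be made concrete. A minimal sketch, assuming a toy world where each opaque action shifts a scalar state by some unknown amount:

```python
def prune(hypotheses, action, before, after):
    """Discard every candidate rule contradicted by one observed transition."""
    return [h for h in hypotheses
            if h["action"] != action or before + h["effect"] == after]

# Prior: each of 4 opaque actions shifts the state by -1, 0, or +1.
candidates = [{"action": a, "effect": e} for a in range(4) for e in (-1, 0, 1)]

# One experiment: pressing action 1 moved the state from 0 to 1.
candidates = prune(candidates, action=1, before=0, after=1)
# Only "action 1 has effect +1" survives for action 1; other actions untouched.
```

Each experiment eliminates the rules it contradicts, so the agent's model of the game sharpens with every turn — with no prior context required.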
That’s a different cognitive capability entirely. It’s what Chollet calls fluid intelligence: the ability to adapt to genuinely novel situations, not just apply learned patterns.
LLMs didn’t get smarter; they got better trained on verifiable domains like code. Move them to genuinely novel, non-verifiable tasks and the apparent intelligence evaporates. Fluid intelligence, the capacity to handle truly new situations, remains the missing piece.
François Chollet’s Vision — and Warning
Chollet launched V3 alongside a characteristically direct set of claims about the state of AI. His YC interview from March 2026 is worth watching in full — it’s one of the clearest articulations of why benchmark performance keeps diverging from actual capability.
His key arguments:
1. LLMs improved on measurable tasks, not on intelligence itself. Coding benchmarks improved because code is verifiable and the internet is full of it. Models got better at the specific thing they were trained to do. This isn’t the same as getting smarter — it’s more efficient specialization.
2. Move to unverifiable domains and progress stalls. Strategy, open-ended reasoning, genuine novelty — all the things that require flexible thinking rather than pattern completion — remain largely untouched by scale. More parameters didn’t solve fluid intelligence.
3. AGI timeline: 2030, but not via the current path. Chollet believes we’ll get there, but the architecture of genuinely intelligent systems will look different. His estimate: the core fluid intelligence engine will fit in fewer than 10,000 lines of code. It’s not about scale. It’s about the right computational principles.
4. The ARC Prize exists to move the field’s incentives. With a $600K prize pool for systems that can match human performance on ARC-AGI, Chollet is trying to redirect research away from “get better at the benchmarks we already perform on” and toward “solve the fundamental unsolved problem.”
The Results in Context: What 0.3% Actually Means
Let’s sit with that number for a moment.
GPT-5.4 High, running at $5,000–9,000 per task, scored 0.3% on an evaluation where humans score 100%. This is not a narrow gap. This is not “almost there.” This is a gap so large it calls into question what we mean when we say these systems are intelligent.
For contrast: Symbolica’s Agentica SDK achieved 36% at roughly $9 per level ($1,005 total across 113/182 levels). That’s still far from human-level, but it’s at least in the same conceptual neighborhood. And per unit of work, it costs roughly 500–1,000x less than the frontier models that scored near-zero.
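The cost ratio can be checked directly from the numbers in the article (the frontier per-task range is as reported; the rest is arithmetic):

```python
symbolica_total, levels = 1_005, 113          # $1,005 across 113 levels
per_level = symbolica_total / levels          # about $8.9 per level

frontier_low, frontier_high = 5_000, 9_000    # reported $ per task
ratio_low = frontier_low / per_level          # about 560x
ratio_high = frontier_high / per_level        # about 1,010x
```

So the often-quoted ~1,000x figure corresponds to the high end of the frontier cost range; the low end still works out to more than 500x.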
What’s Symbolica doing differently? Their Agentica SDK is built around program synthesis — the ability to construct explicit, compositional reasoning structures rather than relying on statistical pattern matching. This is closer to how fluid intelligence might actually work: building models, testing them, revising them based on evidence.
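To make “program synthesis” concrete: a toy version enumerates short compositions of explicit primitives until one explains every observed example. This is a generic illustration of the technique, not Symbolica’s actual system — the primitives and mini-DSL here are invented for the sketch:

```python
from itertools import product

# A tiny DSL of explicit grid transforms (grids are lists of lists).
primitives = {
    "identity":  lambda g: g,
    "flip_h":    lambda g: [row[::-1] for row in g],
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def run(names, grid):
    """Apply a composition of named primitives, left to right."""
    for n in names:
        grid = primitives[n](grid)
    return grid

def synthesize(examples, depth=2):
    """Enumerate compositions up to `depth` and return the first one that
    reproduces every (input, output) example — a model you can inspect."""
    for names in product(primitives, repeat=depth):
        if all(run(names, i) == o for i, o in examples):
            return names
    return None

# A rule outside the DSL's primitives: rotate the grid 90 degrees clockwise.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program = synthesize(examples)
```

The output is an explicit, compositional program — it can be inspected, tested on new inputs, and revised when evidence contradicts it, which is the contrast with opaque statistical pattern matching the paragraph above describes.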
The 36% vs 0.3% comparison isn’t just a performance difference. It’s a signal about architecture. Pure LLM scaling doesn’t help here. Something else does.
The ARC Prize is structured to incentivize exactly this kind of architectural breakthrough. It’s one of the few AI prizes that explicitly rewards generality rather than benchmark gaming — and it’s the field’s clearest public statement that fluid intelligence remains the unsolved problem.
Why This Exposes the Agent Industry’s Blind Spot
The AI agent industry has a fundamental marketing problem: it sells capabilities that sound impressive but break in the scenarios that actually matter.
Agents can write code, browse the web, draft emails, analyze documents. These are genuinely useful capabilities. But they’re all examples of applying learned patterns to familiar scenarios.
Put an agent in a genuinely novel environment — one where it must figure out the rules, discover the objectives, and operate without prior context — and it collapses. ARC-AGI V3 is just a clean, measurable version of that failure mode.
Consider what real-world tasks agents are supposed to handle:
- Navigating a new enterprise software interface they’ve never seen
- Adapting to a customer’s idiosyncratic workflow in real time
- Solving problems in domains where training data is sparse
- Responding intelligently when the situation doesn’t match any prior example
These are all V3-style problems. They require discovery, not retrieval. Current agents aren’t built for this.
The kicker: the agent industry’s go-to defense — “it gets better with more context and examples” — directly contradicts the V3 premise. You don’t get examples. You get to figure it out.
This is the gap between “benchmark performance” and “real-world adaptability.” The V3 results put numbers on it.
What Happens Next: The Right Lessons From V3
ARC-AGI V3 isn’t a funeral for AI agents. It’s a precise diagnosis of where the field is and where it needs to go.
The lesson for researchers: Scaling transformers may be approaching its limits on the dimensions that matter for general intelligence. Hybrid architectures — combining statistical pattern matching with explicit program synthesis, world models, or something else entirely — are more likely to crack V3-style problems.
The lesson for builders: Don’t oversell adaptability. The systems you’re deploying are excellent at familiar tasks and fragile on genuinely novel ones. That’s still valuable! But know the limitation and design for it.
The lesson for buyers: Benchmark performance is not the same as general capability. A system that aces HumanEval, GSM8K, and MMLU might still fail to navigate your specific, edge-case-laden operational environment.
The lesson for anyone watching the AGI timeline: Chollet’s estimate — 2030, but not via current methods — looks more credible in light of V3 results. We’re not in the final miles. We’re solving a different, harder problem than most roadmaps acknowledge.
The score is 100% humans, 36% Symbolica, 0.3% everything else. The gap is the map.
Social Signal
The AI research community noticed immediately. The X/Twitter reaction ranged from “this is the most important benchmark result in years” to defensive “but our agents handle real tasks” — which is, of course, exactly Chollet’s point.
References:
- Matthew Berman’s ARC-AGI V3 breakdown on YouTube
- François Chollet’s YC interview on ARC-AGI V3 and fluid intelligence
- Symbolica’s Agentica SDK performance report
- ARC Prize official site — arcprize.org
- arcprize.org on X: ARC-AGI V3 launch announcement
- François Chollet on X: V3 results thread
ARC-AGI V3 launched March 28, 2026. Results based on Day 1 benchmarking across 182 levels. Scores may update as more systems are evaluated against the benchmark.