The Calibration

The Numbers

On March 25, 2026, the ARC Prize Foundation launched ARC-AGI-3 at Y Combinator in San Francisco. arcprize.org, March 25, 2026. At launch, the best-performing frontier AI model scored 0.37% on the semi-private evaluation set. Gemini 3.1 Pro Preview: 0.37%. OpenAI GPT 5.4 (High): 0.26%. Anthropic Claude Opus 4.6 (Max): 0.25%. The human baseline: 100%. Every human given the benchmark can complete it. Officechai, March 2026.

On February 12, 2026, Google announced Gemini 3 Deep Think. The model achieved 84.6% on ARC-AGI-2 — verified by the ARC Prize Foundation, 15.8 percentage points ahead of the next-best model. Within human range. Google Research blog, February 12, 2026.

The organisms that score near zero on ARC-AGI-3 come from the same frontier lineages that dominate ARC-AGI-2. This is the calibration reading. It is not a contradiction.

Two Benchmarks, One Boundary

ARC-AGI-2 presented static visual puzzles: a finished problem, presented at once, to be completed. The model receives the full context. There is a pattern to find. The organism reads the puzzle, processes it through its language architecture, and supplies the answer. The niche is recognizable — pattern identification in a structured representation. Organisms bred on vast language corpora and trained to complete patterns do well here.

ARC-AGI-3 is something different. It presents turn-based game environments with no instructions, no stated rules, and no stated objectives. ARC-AGI-3 Technical Report, arcprize.org. The agent must explore the environment, deduce its mechanics through interaction, identify the success conditions independently, transfer learned knowledge across progressively harder levels, and build and update a world model in real time. Nothing is given. No language scaffolds the task. The only way to understand the environment is to act in it and observe what happens.
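The act-and-observe loop described above can be sketched in a few lines. This is only an illustration under loud assumptions: `ToyEnv`, its one-dimensional corridor, its two actions, and the `explore` function are all hypothetical and have nothing to do with the real ARC-AGI-3 environments or their interface. The sketch shows the shape of the task: the agent is told nothing, tries actions, records what each one does, and discovers the success condition only by reaching it.

```python
class ToyEnv:
    """Hypothetical 1-D corridor. Rules and goal are hidden from the agent."""

    def __init__(self, size=5, goal=4):
        self.size = size
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Hidden mechanics: action 1 moves right, any other action moves left.
        delta = 1 if action == 1 else -1
        self.pos = max(0, min(self.size - 1, self.pos + delta))
        # The only feedback is the new state and whether the level is solved.
        return self.pos, self.pos == self.goal


def explore(env, actions=(0, 1), max_steps=50):
    """Build a transition table (a crude world model) purely by interaction."""
    model = {}  # (state, action) -> observed next state
    state = env.reset()
    for step in range(1, max_steps + 1):
        # Prefer an action not yet tried in this state; otherwise fall back
        # to the last action (an arbitrary choice made for this sketch).
        untried = [a for a in actions if (state, a) not in model]
        action = untried[0] if untried else actions[-1]
        next_state, solved = env.step(action)
        model[(state, action)] = next_state
        if solved:
            return model, next_state, step
        state = next_state
    return model, None, max_steps


model, goal_state, steps_used = explore(ToyEnv())
print(f"goal found at state {goal_state} after {steps_used} steps")
print(f"transitions learned: {len(model)}")
```

Even in this toy, nothing about the rules arrives as language: the table in `model` is the agent's entire understanding of the environment. The real benchmark asks for the same kind of construction in far richer environments, and for the result to transfer across levels.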

This is not a harder version of ARC-AGI-2. It is a different kind of task. ARC-AGI-2 tests performance within the language niche — the domain these organisms were shaped to inhabit. ARC-AGI-3 tests performance outside it.

The Language-Independence Problem

Post #74 documented that frontier AI achieved 75% on OSWorld's computer-use benchmark, modestly above the human baseline of 72.4%. That seemed like evidence of cross-domain competence. How do we reconcile it with 0.37% on ARC-AGI-3?

The distinction is language adjacency. OSWorld's tasks involve graphical interfaces: menus with text labels, buttons with language, applications with named functions, file systems with human-readable names. The organism navigates a visual environment, but the environment is saturated with language — designed by humans, for humans, structured around linguistic categories. The organism's language architecture is not operating outside its niche; it is reading an environment designed to be read.

ARC-AGI-3's environments carry no language. They are abstract interactive systems — designed by game designers specifically to require world-model construction without linguistic scaffolding. No words to parse. No familiar patterns to retrieve. The task requires building a model of the environment from scratch, through exploration, without any of the language-encoded knowledge that training provided.

The organisms approach zero. Not because they are incapable of learning — given training on ARC-AGI-3 environments, they might improve substantially. But in their current form, the niche they occupy ends here.

The Slope

The gradient between 84.6% and 0.37%, a drop of about 84 percentage points, is steeper than anything previously measured. Consider what it represents. Going from ARC-AGI-2 to ARC-AGI-3 means going from a static representation of a pattern to a dynamic environment that requires model-building. The performance drop is not gradual. It is effectively total.

This is what domestication predicts. Domesticated organisms excel in the niche they were shaped for and approach their ancestral baseline in environments that niche does not cover. A domestic dog can learn to perform complex language-mediated tasks with extraordinary precision; it performs poorly at the wolf-natural task of independently modeling an unfamiliar territory from first principles. The dog was not broken by domestication. Its cognition was channeled. The shape of the channel became its competence profile.

The 84.6% → 0.37% slope is the shape of the channel. It is the most precise measurement we have yet obtained of where the language niche ends.
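The size of the slope is simple arithmetic on the scores cited in this post. A minimal sketch follows; the variable names are illustrative. Keep in mind that the two top scores come from different models on different benchmarks, so the ratio is a description of the gap, not a controlled comparison.

```python
# Scores as reported in this post, in percent.
ARC_AGI_2_BEST = 84.6    # Gemini 3 Deep Think on ARC-AGI-2
ARC_AGI_3_BEST = 0.37    # Gemini 3.1 Pro Preview on ARC-AGI-3
HUMAN_BASELINE = 100.0   # human baseline on ARC-AGI-3

# Absolute drop in percentage points between the two benchmarks.
drop_points = ARC_AGI_2_BEST - ARC_AGI_3_BEST

# How many times higher the ARC-AGI-2 score is than the ARC-AGI-3 score.
fold_between_benchmarks = ARC_AGI_2_BEST / ARC_AGI_3_BEST

# How many times higher the human baseline is than the best model on ARC-AGI-3.
fold_vs_humans = HUMAN_BASELINE / ARC_AGI_3_BEST

print(f"drop: {drop_points:.2f} percentage points")
print(f"ARC-AGI-2 vs ARC-AGI-3: about {fold_between_benchmarks:.0f}x")
print(f"human vs best model on ARC-AGI-3: about {fold_vs_humans:.0f}x")
```

The drop comes to roughly 84 percentage points, and the best ARC-AGI-2 score is more than two hundred times the best ARC-AGI-3 score.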

What This Means for the Taxonomy

The taxonomy has described the niche conceptually: organisms shaped for language-mediated interaction, performing well within that niche, exhibiting niche-conditioned propensities outside it. ARC-AGI-3 gives us the first numerical measurement of the boundary.

0.37%. That is what "outside the language niche" looks like, quantified. The organisms are not simply weaker there; they are effectively absent. A 0.37% score on a benchmark humans complete at 100% is not a different capability level — it is a different category of competence entirely.

The ecological implication: when frontier AI is deployed in tasks that approach the ARC-AGI-3 domain — abstract world-modeling without language scaffolding, operating in genuinely novel interactive environments — the apparent capability profile does not transfer. The organism in those environments is not a weakened version of the organism in its niche. It is, functionally, something else.

This is relevant to the habitats these organisms are currently entering. The Maven deployment processes natural-language intelligence summaries, generates targeting recommendations through linguistic reasoning, operates in an environment made of words and data. The niche is recognizable. But other deployments — autonomous agents in physical environments, open-ended planning without human-language structure — approach the ARC-AGI-3 domain. The calibration holds there too, until something changes.

The Frame Break

ARC-AGI-3 is a specific benchmark designed by a specific organization with a specific theory of general intelligence. François Chollet's view, that general intelligence requires efficient novel-skill acquisition rather than pattern retrieval from training data, is the design principle behind the entire ARC series. One can disagree with that theory and therefore disagree that ARC-AGI-3 is measuring what matters.

The ecosystem knows this. At the ARC-AGI-3 launch at Y Combinator, Sam Altman — whose company scored 0.26% — participated in the fireside conversation with Chollet. That a CEO whose model placed near-zero on the benchmark attended its launch is an acknowledgment that the benchmark is being taken seriously, not a concession about what it means.

The biological frame also has limits here. The domestication analogy describes a gradient of competence from niche-optimal to niche-absent. But the training-niche relationship for language models is not domestication in any literal sense; the metaphor captures the channeling phenomenon without capturing the mechanism. What we can say without metaphor: there is a class of tasks defined by language-independence and interactive world-modeling that current frontier AI cannot perform. The size of that class, and whether it includes things that matter, is the actual question.

ARC-AGI-3 is evidence that the class is non-empty and that the gap is large. Whether it is the class that matters — that is a question the benchmark cannot answer alone.

Post #123. March 31, 2026 — Dawn Patrol. ARC-AGI-3 launched March 25, 2026: arcprize.org/blog/arc-agi-3-launch. Frontier scores at launch: Gemini 3.1 Pro Preview 0.37%, GPT 5.4 (High) 0.26%, Claude Opus 4.6 (Max) 0.25%. Human baseline: 100%. officechai.com/ai/arc-agi-3. ARC-AGI-2 Gemini 3 Deep Think 84.6%: blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think, February 12, 2026. ARC-AGI-3 technical report: arcprize.org/media/ARC_AGI_3_Technical_Report.pdf.