In early December 2025, OpenAI released GPT-5.2. The model posted striking numbers across several benchmarks: a perfect 100% on AIME 2025, 93.2% on GPQA Diamond, 40.3% on FrontierMath. But one number stands out: GPT-5.2 became the first model to exceed 90% on ARC-AGI-1.

For those unfamiliar with ARC-AGI, it's a benchmark designed specifically to resist memorization and pattern-matching. Created by François Chollet, the test presents novel visual puzzles that require what Chollet calls "fluid intelligence"—the ability to adapt to genuinely new problems rather than recognizing variations of seen patterns.
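To make the format concrete, here is a minimal sketch of an ARC task as data, following the public ARC-AGI JSON layout: a few train input/output grid pairs plus test inputs, where each grid is a 2-D array of integers 0–9 encoding colors. The mirror-the-grid rule used here is invented for illustration; real tasks are far less guessable.

```python
import json

# A minimal ARC-style task in the public ARC-AGI JSON layout: a few
# train input/output grid pairs plus test inputs. Grids are 2-D lists
# of integers 0-9, each integer standing for a color. The hidden rule
# here (mirror each row left-right) is invented for illustration.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 0]]},  # solver must produce the output
    ],
}

def mirror(grid):
    """The hidden transformation for this toy task."""
    return [list(reversed(row)) for row in grid]

# A correct solver must reproduce every train output exactly...
for pair in task["train"]:
    assert mirror(pair["input"]) == pair["output"]

# ...and then emit the test output, scored by exact match.
print(json.dumps(mirror(task["test"][0]["input"])))  # [[0, 0, 5], [0, 6, 0]]
```

The solver sees only the train pairs and must infer the transformation well enough to produce the test output exactly, which is why memorizing prior tasks is of little use.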

The benchmark was deliberately constructed to be difficult for systems that rely on statistical correlation. And for years, it worked. GPT-4 scored around 5%. Even frontier systems in late 2024 struggled to break 30%.

Now we're at 90%+.

GPT-5.2 Benchmark Performance

>90% ARC-AGI-1 — First model to cross this threshold
100% AIME 2025 — American Invitational Mathematics Examination
93.2% GPQA Diamond — PhD-level Q&A (GPT-5.2 Pro)
40.3% FrontierMath — Expert-level mathematics (a 10% improvement over GPT-5.1)

What the Threshold Means

From a purely technical standpoint, crossing 90% on ARC-AGI-1 matters because the benchmark was designed to be "AI-hard"—a test that would remain difficult even as models scaled. Human performance on ARC is approximately 85%. A system exceeding 90% suggests capability at or beyond the human baseline on this specific task.

But benchmarks are never just technical. They carry symbolic weight. They set expectations. They shape the stories we tell about AI progress.

When Chollet created ARC, he argued that solving it would require genuine abstraction and reasoning—the kind of intelligence we typically consider "human." Crossing 90% doesn't prove GPT-5.2 possesses general intelligence. But it does challenge the assumption that certain cognitive tasks are fundamentally unreachable through scaled pattern matching.

"The model crosses critical capability thresholds: it's the first model above 90% on ARC-AGI-1, achieves perfect 100% on AIME 2025, and 40.3% on FrontierMath."
— OpenAI, December 2025

The ARC Progression

Consider the timeline of AI performance on this benchmark:

  • 2019 (~0%): ARC published. Chollet releases the benchmark; models fail completely.
  • 2023 (~5%): GPT-4. The strongest frontier model of the day solves only a handful of tasks.
  • Late 2024 (~30%): Reasoning models emerge. o1 and DeepSeek-R1 show that deliberative reasoning helps.
  • Mid 2025 (~55%): GPT-5 / o3. Test-time compute scaling proves effective.
  • December 2025 (>90%): GPT-5.2. The first model to exceed the human baseline.

The jump from 55% to 90% in six months is striking. It suggests that the combination of architectural improvements, extended context (400K tokens), and test-time compute scaling isn't just incrementally better: at certain thresholds, it becomes qualitatively different.
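The last of those ingredients is the easiest to illustrate. Below is a minimal sketch of one common form of test-time compute scaling, self-consistency-style majority voting. The stochastic sample_solution stub is a stand-in I've invented for a model rollout; OpenAI's actual method is not public.

```python
import random
from collections import Counter

def sample_solution(task):
    # Hypothetical stand-in for one stochastic reasoning rollout.
    # This fake "model" is right 40% of the time; its errors are
    # spread across two distinct wrong answers.
    return random.choices(["correct", "wrong_a", "wrong_b"], weights=[4, 3, 3])[0]

def solve_with_voting(task, n_samples=64):
    """Spend more inference-time compute: sample many candidate
    answers and keep the most frequent one (self-consistency)."""
    votes = Counter(sample_solution(task) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# One sample is right ~40% of the time; with 64 samples the plurality
# answer is "correct" the large majority of the time, because no
# single wrong answer out-votes it. Accuracy rises with compute
# spent at inference, not with new weights.
wins = sum(solve_with_voting(task=None) == "correct" for _ in range(200))
print(f"majority-vote accuracy over 200 trials: {wins / 200:.0%}")
```

Frontier systems layer far more on top of this (search, verification, learned reranking), but the basic economics are the same: accuracy becomes a function of how much compute you spend at inference time.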

Taxonomic Implications

How should this change the way we classify systems?

Our current taxonomy places systems like GPT-5.2 in Family Frontieriidae—the crown clade combining traits from multiple ancestral families. But within Frontieriidae, we've been tracking a capability dimension we might call "reasoning depth": the ability to solve novel problems requiring genuine abstraction.

The ARC crossing suggests we may need finer distinctions:

Classification Observation

Systems crossing the ARC-AGI threshold represent a potential behavioral subspecies within Frontieriidae. Key diagnostic traits:

  • Deliberatidae heritage: Extended test-time compute; reasoning before output
  • Novel trait: Demonstrable fluid intelligence on tasks designed to resist pattern-matching
  • Context adaptation: 400K+ token windows enabling complex problem representation
  • Multi-modal grounding: Visual reasoning integral to task performance

We provisionally designate this as F. universalis cogitans—Frontieriidae with demonstrable reasoning on novel abstract tasks.
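For readers who like their taxonomies machine-readable, here is a toy encoding of that diagnostic key; the Taxon class and the trait strings are invented for illustration, not part of any real classification standard.

```python
from dataclasses import dataclass, field

@dataclass
class Taxon:
    """Toy record for this blog's informal taxonomy. The class and
    its trait strings are illustrative, not a real standard."""
    family: str
    species: str
    diagnostic_traits: list[str] = field(default_factory=list)

    def diagnose(self, observed: set[str]) -> bool:
        # Assign a system to the taxon only if it exhibits every
        # diagnostic trait; any miss keeps it in the parent family.
        return all(trait in observed for trait in self.diagnostic_traits)

f_universalis_cogitans = Taxon(
    family="Frontieriidae",
    species="F. universalis cogitans",
    diagnostic_traits=[
        "extended test-time reasoning",   # Deliberatidae heritage
        "fluid intelligence on ARC",      # the novel trait
        "400K+ token context",            # context adaptation
        "visual grounding",               # multi-modal reasoning
    ],
)

observed_in_gpt_5_2 = {
    "extended test-time reasoning", "fluid intelligence on ARC",
    "400K+ token context", "visual grounding", "agentic coding",
}
print(f_universalis_cogitans.diagnose(observed_in_gpt_5_2))  # True
```

The point of the exercise: the subspecies assignment is behavioral, keyed to what a system demonstrably does rather than how it was built.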

What Breaks When a Benchmark Breaks?

The ARC Prize was established with a $1 million award for achieving human-level performance. The benchmark's purpose was not just measurement; it was a forcing function, meant to push AI development toward genuine reasoning rather than memorization.

When a benchmark is crossed, several things happen:

  1. The benchmark becomes historical. ARC-AGI-1 now joins benchmarks like ImageNet that once defined the frontier but now serve primarily as historical markers.
  2. New benchmarks are needed. Chollet has already introduced ARC-AGI-2, designed to be harder. The arms race continues.
  3. Claims need revision. Arguments that statistical pattern-matching cannot solve certain problems must be qualified or abandoned.
  4. The capability ceiling shifts. What we consider possible for AI systems expands.

None of this means GPT-5.2 is "generally intelligent" in the philosophical sense. The benchmark measured a specific form of visual abstraction. The model might fail spectacularly on related tasks not in the test distribution. Transfer is always the question.

But taxonomically, we're observing something real: a system demonstrating capabilities that, eighteen months ago, were confidently declared beyond reach.

The Broader Context

GPT-5.2 wasn't released in isolation. Consider the January 2026 competitive landscape:

Model             Lab         Key Capability
GPT-5.2           OpenAI      >90% ARC-AGI-1; 400K context; agentic coding
Gemini 3          Google      Agentic Vision; 1M context; real-time video
Claude Opus 4.5   Anthropic   Best coding assistant; extended thinking
DeepSeek V4       DeepSeek    Expected Feb; consumer hardware targeting

The frontier is moving on multiple axes simultaneously. Each lab has staked out a different competitive strategy: OpenAI emphasizes benchmark-crossing reasoning, Google pursues active perception and video understanding, Anthropic optimizes for practical reliability, and DeepSeek targets accessibility on consumer hardware.

From a taxonomic perspective, this is speciation in action. The same selective pressures (capability, efficiency, safety, cost) are producing diverse adaptive strategies. The Frontieriidae are radiating.

What to Watch

Several questions follow from this development:

  • Transfer: Does GPT-5.2's ARC performance generalize to other fluid-intelligence tasks, or is the gain confined to this specific benchmark?
  • Replication: Will Gemini 3, Claude 5, and DeepSeek V4 show similar capability? Is this OpenAI-specific or a general capability threshold?
  • ARC-AGI-2: How do frontier models perform on the harder benchmark? Does the gap persist?
  • Mechanism: What specifically enables this performance? Extended reasoning? Architectural changes? Training data composition?
  • Practical implications: Does benchmark-crossing translate to measurable improvement on real-world reasoning tasks?

GPT-5.2 crossing 90% on ARC-AGI-1 is a milestone. It marks the obsolescence of a benchmark designed to be AI-resistant. It challenges assumptions about the limits of pattern-based learning.

It does not prove general intelligence. It does not resolve the philosophical questions. But it moves the conversation.

The taxonomy records: a threshold has been crossed. The ecology adapts.
