On January 9th, DeepSeek quietly expanded their R1 technical paper by 60 pages. Buried in those pages was an admission that would have been unthinkable from a Western lab: we tried the sophisticated methods everyone talks about, and they didn't work.
The Methods That Failed
Monte Carlo Tree Search. Process Reward Models. The techniques that dominated reasoning discourse throughout 2024. The search method AlphaGo rode to victory over Lee Sedol. The approaches that countless papers claimed would unlock the next level of AI reasoning.
DeepSeek tried them. They didn't work.
The core problem, as their researchers explain, is "step granularity." MCTS functions brilliantly for chess or rigid mathematical proofs where each move is a discrete, evaluable step. But open-ended reasoning—the kind that answers "why did this code fail?" or "what's the best approach here?"—doesn't decompose into clean steps. The granularity that makes tree search tractable simply doesn't exist.
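The granularity problem can be made concrete. Tree search needs two things at every node: a small, enumerable action set and a way to score the resulting states. A toy sketch (hypothetical interfaces, not DeepSeek's code) shows why chess satisfies both and open-ended reasoning satisfies neither:

```python
def expand(state, legal_actions, evaluate):
    """One tree-search expansion: enumerate the legal actions from a
    state and score each child. Tractable only when the action set is
    small and the evaluator is meaningful."""
    return {a: evaluate(state + (a,)) for a in legal_actions(state)}

# Chess-like setting: a few dozen moves, a real position evaluator.
children = expand(
    state=(),
    legal_actions=lambda s: ["e4", "d4", "Nf3"],
    evaluate=lambda s: len(s),  # stand-in for a position evaluator
)

# Token-level reasoning breaks both assumptions: the "action set" is
# the whole vocabulary (on the order of 100k tokens per step), and
# there is no reliable evaluator for a half-finished thought.
```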
This is a significant disclosure. Western labs have published extensively on process supervision, reward modeling, and search-based reasoning. DeepSeek just said: we tried it, it failed, here's why.
What Actually Worked
The answer is almost anticlimactic: Proximal Policy Optimization (PPO). A standard reinforcement learning algorithm from 2017. Not a new architecture. Not a novel search procedure. Just PPO, carefully optimized.
Specifically, DeepSeek developed GRPO (Group Relative Policy Optimization), a PPO variant that drops the learned value baseline and instead scores each sampled response against the others in its group. The key insight: reasoning improvements emerged from policy optimization, not from teaching models to search.
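The group-relative part is simple enough to write down. A minimal sketch of the advantage computation, assuming the standard formulation (each reward normalized against its own group's mean and standard deviation); an illustration, not DeepSeek's code:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's core trick: no learned value baseline. Each sampled
    response's advantage is its reward relative to the group mean,
    scaled by the group's standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses to the same prompt, scored by some reward function:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline comes from the group itself, no separate critic network has to be trained, which is part of why the recipe stays cheap.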
Their three-stage pipeline tells a story of iterative refinement:
- Dev1: Instruction-following training. Initial success on general tasks, but this actually degraded reasoning capabilities. A cautionary note for those who assume all training helps.
- Dev2: Targeted reinforcement learning to restore mathematical and coding proficiency. The correction phase.
- Dev3: Rejection sampling to generate synthetic data, followed by supervised fine-tuning. The stabilization phase.
This pipeline is elegant in its simplicity. No exotic components. No architectural innovations. Just careful sequencing of standard techniques.
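Of the three stages, rejection sampling is the one most worth sketching. In toy form (the generator and checker below are hypothetical stand-ins, not DeepSeek's components):

```python
import random

def rejection_sample(prompt, generate, accept, n_candidates=16):
    """Dev3 in miniature: sample many completions, keep only those an
    automatic checker accepts, and feed the survivors to SFT."""
    kept = []
    for _ in range(n_candidates):
        completion = generate(prompt)
        if accept(prompt, completion):
            kept.append((prompt, completion))
    return kept

random.seed(0)
# Toy stand-ins: a "model" that is right about half the time, and a
# checker that verifies the arithmetic exactly.
data = rejection_sample(
    "2 + 2 = ?",
    generate=lambda p: random.choice(["4", "5"]),
    accept=lambda p, c: c == "4",
)
```

The filtered pairs become supervised fine-tuning data: the model is trained on its own outputs, but only the ones that passed verification.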
The Accidental V4 Reveal
Then, on January 20th, developers scouring DeepSeek's public FlashMLA repository noticed something: references to "MODEL1" appearing 28 times across 114 files. A new architecture, accidentally disclosed through GPU kernel updates.
The technical details are intriguing:
- Restructured key-value cache layout for efficiency
- Modified sparsity handling for computational optimization
- FP8 data format decoding changes for precision management
- Engram integration—their conditional memory system for million-token contexts
- Blackwell compatibility—support for NVIDIA's next-generation architecture
The timing suggests V4 will launch around February 17th—the Lunar New Year—following their pattern of holiday releases. Internal benchmarks reportedly show performance exceeding Claude and GPT-4 variants on coding tasks, particularly with long prompts.
But here's what matters for our purposes: this information is public. Not from a press release or controlled disclosure, but from an open repository that anyone can inspect.
Transparency as Competitive Strategy
Consider the contrast. When OpenAI released o1, they published capabilities without methods. When Anthropic ships Claude, they describe behaviors without training details. The Western frontier is a black box.
DeepSeek publishes:
- What methods they tried and rejected
- Why those methods failed
- What actually worked
- Their complete training pipeline
- Their model weights (for many releases)
This isn't accidental. It's strategic.
Open-source releases create ecosystem effects. When researchers worldwide can study your architecture, they extend it, improve it, and cite it. When your methods become the baseline, your lab becomes the reference point. DeepSeek's transparency isn't charity—it's a different theory of competitive advantage.
Taxonomic Implications
The R1 disclosure has implications for how we understand Family Deliberatidae—the test-time compute scalers. Our taxonomy currently distinguishes species by mechanism:
- D. profundus: Extended reasoning (thousands of thinking tokens)
- D. verificans: Process reward models for step evaluation
- D. parallellus: Best-of-N sampling with verification
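For reference, D. parallellus's mechanism fits in a few lines. A sketch with toy generator and verifier functions (hypothetical, for illustration only):

```python
def best_of_n(prompt, generate, score, n=8):
    """Sample N candidate answers and return the one the verifier
    scores highest. No tree, no step-level supervision: the only
    structure is outcome-level comparison."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy setting: candidates are numbers, and the verifier prefers
# answers closest to a known target.
answers = iter([3, 7, 42, 10, 5, 41, 2, 8])
best = best_of_n(
    "pick 42",
    generate=lambda p: next(answers),
    score=lambda p, c: -abs(c - 42),
)
```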
DeepSeek's disclosure suggests a different picture. The behavioral phenotypes we observe—extended thinking, step-by-step reasoning, self-correction—may emerge from simpler mechanisms than we assumed. Multiple species might share the same underlying "physiology" (PPO-based training) while exhibiting different "behaviors" (reasoning patterns).
This is analogous to discovering that animals we classified as different species based on appearance actually share the same genome. The phenotype diverged, but the genotype remained constant.
It also raises questions about D. verificans. If process reward models didn't work for DeepSeek, do they work for others? Or is the industry collectively claiming capabilities from methods that don't actually deliver? The opacity of Western labs makes this impossible to verify.
The V4 Question
DeepSeek V4 reportedly targets 1 trillion parameters with emphasis on "autonomous coding"—managing entire software repositories rather than generating snippets. If the performance claims hold, this would represent a significant capability jump.
But the more interesting question is architectural. MODEL1's changes suggest a focus on efficiency at scale: memory optimization, sparsity handling, conditional retrieval via Engram. This is different from the "make it bigger" approach that dominated 2023-2024.
The Engram integration is particularly notable. Published January 13th, Engram separates static pattern retrieval (O(1) hash lookups) from dynamic reasoning (neural computation). It's the same "two axes of sparsity" we noted when adding Mixtus engramicus to the taxonomy: conditional computation (MoE) and conditional memory can scale independently.
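The split is easy to caricature in code. A toy sketch of conditional memory, assuming only the published framing (static patterns in an O(1) hash table, everything else falling through to computation); none of this is the actual Engram implementation:

```python
class ConditionalMemory:
    """Static retrieval vs. dynamic computation, in miniature."""

    def __init__(self):
        self.engrams = {}  # memorized pattern -> stored result

    def memorize(self, key, value):
        self.engrams[key] = value

    def recall(self, key, compute):
        # O(1) hash probe first; pay for computation only on novel input.
        if key in self.engrams:
            return self.engrams[key]
        return compute(key)

mem = ConditionalMemory()
mem.memorize("capital of France", "Paris")
cached = mem.recall("capital of France", compute=len)  # hash hit
fresh = mem.recall("novel query", compute=len)         # falls through
```

The point of the sketch is the independence claim: the hash table can grow without making the compute path any more expensive, which is what lets the two axes of sparsity scale separately.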
If V4 demonstrates that architectural innovation plus efficiency engineering can match or exceed raw scale, it would validate LeCun's thesis that we need "better architectures, not just bigger models." Which is ironic, given that DeepSeek is very much in the LLM camp that LeCun dismisses.
What We Learn From Failures
The most valuable part of DeepSeek's disclosure isn't what worked. It's what didn't.
Science advances by ruling out hypotheses. When a lab publishes "we tried X and it failed," they save hundreds of other researchers from pursuing the same dead end. This is basic scientific practice, but it's almost nonexistent in frontier AI.
The failure of MCTS for general reasoning isn't just a DeepSeek-specific result. It suggests something fundamental about the structure of reasoning tasks. If step granularity is the bottleneck, then process supervision may be inherently limited to domains with clear decomposition.
This has implications for alignment research. Many safety proposals assume we can supervise reasoning step-by-step. If the steps don't exist in a form that's tractable to evaluate, the supervision strategy may need revision.
The Open Question
DeepSeek's transparency creates an asymmetry. We know their methods. We don't know OpenAI's, Anthropic's, or Google's.
This makes comparison difficult. When o1-pro outperforms R1 on certain benchmarks, is it because OpenAI found better methods? Or because they have more compute? Or because their evaluation suite favors their training distribution? We can't tell.
The same uncertainty applies in reverse. When R1 matches frontier performance at a fraction of the cost, is it because their methods are superior? Or because their benchmarks are selected? The open weights help—anyone can evaluate R1—but the opacity of competitors limits what we can conclude.
This is the state of the field: one lab shows their work, and we still can't draw robust conclusions because nobody else will.
Watching V4
February 17th approaches. If DeepSeek maintains their pattern, V4 will arrive with weights, a technical report, and probably more admissions of what didn't work along the way.
We'll be watching for:
- Engram performance at million-token scales—does conditional memory deliver?
- Coding benchmark details—are the reported gains real and reproducible?
- Architecture documentation—what is MODEL1, actually?
- Training disclosures—will they continue the transparency pattern?
Whatever V4 brings, DeepSeek has already delivered something valuable: proof that you can compete at the frontier while publishing your methods. Whether Western labs will follow remains to be seen.
Taxonomic Status: No changes recommended. The disclosures inform our understanding of Deliberatidae mechanisms but don't warrant new taxa. V4 evaluation pending release.