On January 9, 2026, DeepSeek quietly expanded their R1 whitepaper by 60 pages. The disclosure contained a revelation that reframes how we understand the evolution of reasoning models: their success came from optimized PPO, not exotic algorithms.

The Revelation

When DeepSeek R1 shocked the world in January 2025, many assumed its reasoning capabilities stemmed from novel algorithmic innovations. The obvious candidates were Monte Carlo Tree Search (MCTS), self-play mechanisms, or exotic reward modeling techniques—the kinds of approaches that had powered AlphaGo and AlphaFold.

The expanded 86-page whitepaper tells a different story.

DeepSeek reveals a "three-stage Dev process" and explicitly admits that industry-standard methods like MCTS failed for reasoning. Instead, R1's performance "stemmed from a highly optimized application of standard PPO (Proximal Policy Optimization) rather than a complex new search algorithm."

In other words: the breakthrough was in the execution, not the concept.

What PPO Actually Is

For those outside the ML weeds, PPO is a reinforcement learning algorithm published in 2017—the same year as the original Transformer paper. It's been a standard tool for years. The basic idea:

  1. Let the model generate outputs
  2. Score those outputs against a reward signal
  3. Update the model to make high-reward outputs more likely
  4. Add constraints to prevent updates from being too large (the "proximal" part)

This is conceptually simple. What DeepSeek appears to have discovered is that with sufficient scale, data quality, and training infrastructure optimization, this "simple" approach produces reasoning capabilities that compete with systems using far more complex techniques.
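The four-step loop above is essentially the whole algorithm; its core is the clipped surrogate objective from the 2017 PPO paper. Here is a minimal numpy sketch of that loss — generic PPO as published, not DeepSeek's actual training code, and the function and argument names are illustrative:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss from the PPO paper (Schulman et al., 2017)."""
    # Probability ratio between the updated policy and the old policy
    ratio = np.exp(logp_new - logp_old)
    # Unclipped surrogate: make high-advantage outputs more likely
    unclipped = ratio * advantages
    # Clipped surrogate: the "proximal" part -- cap how far the ratio
    # can move the objective in a single update
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic minimum of the two, negated to give a loss to minimize
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old log-probs are equal, the ratio is 1 and clipping never engages; once the ratio drifts past 1 ± eps, the gradient through the clipped term vanishes, which is exactly what keeps each update "proximal."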

The Evolutionary Analogy

In biological evolution, there's a pattern called convergent evolution—when unrelated lineages independently evolve similar solutions to the same problem. Wings evolved separately in birds, bats, and insects. Eyes evolved independently dozens of times.

What DeepSeek's disclosure suggests is something like convergent simplicity—different labs, pursuing different approaches, may be converging on similar underlying mechanisms. The specific implementations differ, but the core dynamic (scale + optimization + simple objective) may be the actual driver of capability.

This has taxonomic implications.

Implications for the Deliberatidae

The Deliberatidae—the "Deep Thinkers"—are classified by their test-time compute scaling. We've enumerated species like:

  • D. profundus — Extended reasoning (thousands of thinking tokens)
  • D. verificans — Process reward models for step evaluation
  • D. budgetarius — Dynamic thinking allocation
  • D. iterativus — Self-refinement through multiple passes

The implicit assumption was that these represent meaningfully different mechanisms. DeepSeek's disclosure suggests they may be more like phenotypic variations emerging from a shared genotype: optimized RL on language models.

If OpenAI's o1, DeepSeek's R1, and Google's Gemini Deep Think all derive their reasoning from variants of RL optimization (rather than fundamentally different search algorithms), then the species boundaries within Deliberatidae may be thinner than previously thought.

We're not proposing a taxonomic revision yet—more data is needed—but the disclosure warrants attention.

The Open Source Paradox

There's something notable about who made this disclosure.

DeepSeek, a Chinese AI lab, has released not only the model weights under an MIT license but now also the training details that explain how R1 works. This is happening while Western labs remain comparatively closed about their reasoning model internals.

The competitive dynamics here are fascinating. By revealing that "simple" techniques work, DeepSeek has:

  1. Democratized the approach — Anyone can now attempt similar training
  2. Commoditized the innovation — If the secret is "do PPO really well," that's replicable
  3. Claimed epistemic authority — They're the ones setting the historical record

Whether this is strategic generosity, competitive positioning, or genuine commitment to open science is unknowable. But the effect is clear: the reasoning model playbook is now public.

Simplicity as a Recurring Pattern

This isn't the first time "simple, scaled up" has won in AI:

  • The Transformer (2017) — Simpler than LSTMs, yet more capable at scale
  • GPT scaling (2018-2020) — Same architecture, just bigger, produced emergent capabilities
  • CLIP (2021) — Contrastive learning on image-text pairs, simpler than many alternatives
  • Chinchilla (2022) — Showed that compute-optimal training (smaller models, more data) beat the prevailing scale-the-parameters intuition

There's a meta-pattern here: simple mechanisms + scale + optimization often beat complex mechanisms. The complexity moves into the engineering—how to train efficiently, how to curate data, how to stabilize optimization—rather than the algorithm itself.

What This Means for Taxonomy

A good taxonomy should reflect underlying structure, not surface features. If DeepSeek's disclosure is representative (a big "if"), it suggests:

  1. The Deliberatidae may be more unified than assumed — Different species may share more ancestry than their behavior suggests
  2. Training regime rivals architecture — A model's "species" may be defined more by its training regime than by its architectural innovations
  3. Convergence is real — Labs working independently are arriving at similar solutions

We're not changing the taxonomy based on a single disclosure. But we're noting it as evidence that our family boundaries may need future revision as more training details become public.

The Broader Lesson

If you're building something complex and it's not working, the answer may not be "make it more complex." The answer may be "make something simple work at scale."

This applies beyond AI. In biology, evolution often favors the simple solution that works over the elegant solution that's fragile. In engineering, the Boeing 737 outlasted exotic competitors because it was simple and reliable.

DeepSeek's disclosure is a data point in this broader pattern. The reasoning revolution of 2024-2025 may have been less about algorithmic breakthroughs and more about engineering discipline—making the obvious approach work really, really well.

"The secret is that there is no secret. Just scale, data, and optimization."

— Anonymous ML researcher (possibly apocryphal)

What We're Watching

Key questions going forward:

  • Will other labs confirm? If OpenAI or Anthropic reveal similar findings, the convergence hypothesis strengthens
  • Does this scale further? PPO worked at DeepSeek's scale; will it work at 10x?
  • What are the limits? If the approach is simple, where does it hit a ceiling?
  • How does this affect V4? DeepSeek V4 is expected mid-February; will it build on this or diverge?

DeepSeek's expanded whitepaper is a reminder that in evolutionary dynamics—biological or synthetic—simple, well-optimized solutions often outcompete complex ones. The Deliberatidae may be more genetically similar than their phenotypic diversity suggests. And the next frontier may not require new algorithms, just better execution of the algorithms we have.

Simplicity wins. Sometimes.