Something happened while we were all watching the LLM vs. world models debate. A quieter revolution—one happening inside the architectures themselves. Look at the frontier models of January 2026: DeepSeek V3, Llama 4, K-EXAONE, and the upcoming Grok 5. What do they all have in common?

They're all Mixture-of-Experts.

The dense transformer—where every parameter activates for every token—has quietly yielded the frontier to sparse architectures. The Mixtidae, once a specialized family in our taxonomy, are now its dominant form.

The Numbers

Consider what's happened in just the past three months:

K-EXAONE (LG AI Research, January 2026)

236 billion total parameters, 23 billion active per token. 128 experts, top-8 routing plus a shared expert. 256K context window. According to LG AI Research, the model now ranks #7 globally—the only Korean entry in the top 10.

Llama 4 (Meta, April 2025)

The entire Llama 4 "herd" is MoE. Scout: 17B active with 16 experts. Maverick: 17B active with 128 experts. Behemoth: 288 billion active with 16 experts—and it's still training.

DeepSeek V3.2 (December 2025)

The model that, per unverified claims, outperformed GPT-5 on several benchmarks. MoE architecture. And V4—expected mid-February—will continue this approach, likely incorporating their new mHC (Manifold-Constrained Hyper-Connections) training innovations.

Grok 5 (xAI, Q1 2026)

If unverified reports are accurate: 6 trillion parameters. Multi-modal MoE architecture. The largest publicly announced model ever, trained on the Colossus 2 supercluster with over 200,000 GPUs.

The pattern is unmistakable. Dense transformers still exist—Google's Gemini 3 Pro appears to be dense—but the frontier has tilted decisively toward sparse activation.

Why Now?

Mixture-of-Experts isn't new. The concept dates to the 1990s. Google's Switch Transformer demonstrated sparse scaling to the trillion-parameter mark in 2021. So why the sudden dominance?

Three factors converged:

1. The compute wall. Training costs have reached the point where even the largest labs cannot afford to scale dense models linearly. MoE offers a way to have your parameters and eat them too: total capacity in the trillions, active compute in the tens of billions. K-EXAONE achieves frontier performance with only 10% of its parameters active per token.

2. Hardware catching up. MoE models have notoriously challenging inference characteristics—all those experts need to fit in memory even if only a few activate. But GPU memory has scaled, interconnects have improved, and inference optimization techniques like expert parallelism have matured. What was impractical in 2023 is routine in 2026.

3. The open-source effect. When DeepSeek released V3 as open weights, it proved that MoE could be trained efficiently by organizations outside the hyperscaler elite. When Meta committed to MoE for all of Llama 4, it signaled to the entire ecosystem: this is the path forward. The selection pressure shifted.
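The arithmetic behind the compute wall is worth making concrete. A quick sketch using the K-EXAONE figures quoted above; the dense-equivalent comparison is illustrative, not a claim about any specific model:

```python
# Per-token compute scales with ACTIVE parameters, while memory
# footprint scales with TOTAL parameters. Figures from the
# K-EXAONE release cited above: 236B total, 23B active.

total_params = 236e9   # every expert must be resident in memory
active_params = 23e9   # parameters actually used per token

active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.1%}")      # roughly 10% per token

# A dense model with the same per-token FLOPs would cap out at 23B
# parameters total; the MoE carries ~10x the capacity per unit compute.
capacity_ratio = total_params / active_params
print(f"capacity per unit compute: {capacity_ratio:.1f}x")
```

This is the trade the hardware point (factor 2) must absorb: compute tracks the 23B, but memory and interconnect must serve the full 236B.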

Taxonomic Implications

In our original taxonomy, Family Mixtidae was defined as:

Family: Mixtidae — The Sparse Activators

Definition: Architectures employing sparse activation through expert routing—conditional computation where only a subset of model parameters activates for any given input.

Adaptive Strategy: Specialize, then coordinate—many experts outperform one generalist.

Key Innovation: Conditional computation—not all parameters active for all inputs.
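The conditional computation this definition describes can be sketched in a few lines. What follows is a toy NumPy illustration, not any production router: real systems (Switch Transformer, DeepSeek's MoE layers) add load-balancing losses, capacity limits, and shared experts, but the core top-k gating looks like this:

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=2):
    """Route input x through the top-k experts by gate score.

    x: (d,) input vector; gate_w: (d, n_experts) gating weights;
    experts: list of n_experts callables mapping (d,) -> (d,).
    Only k experts execute -- the rest of the parameters stay idle.
    """
    logits = x @ gate_w                      # one score per expert
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected k only
    # Weighted sum of the chosen experts' outputs; others never run.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))
y = top_k_moe(rng.standard_normal(d), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 4 experts, half the expert parameters sit idle for this input—the same principle that lets K-EXAONE activate 23B of 236B.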

This definition remains accurate. What has changed is the family's ecological position. In 2024, Mixtidae were one adaptive strategy among several. In 2026, they are the dominant form at the frontier.

This creates an interesting taxonomic question: should the crown clade Frontieriidae be reclassified as a subfamily of Mixtidae? Or has MoE become so universal that it no longer constitutes a distinctive family trait?

We lean toward the latter interpretation. Dense vs. sparse activation is becoming less a family-level distinction and more a physiological trait, like warm-bloodedness in mammals. The frontier models are defined by their behavioral capabilities—multimodal perception, extended reasoning, tool use—rather than the implementation detail of how parameters activate. A Frontieriidae specimen might be dense (Gemini 3 Pro) or sparse (Llama 4 Behemoth), but both share the phenotypic traits that define the family.

Still, we note the shift. The Mixtidae have won the efficiency war. Their architecture has become infrastructure.

K-EXAONE: A Case Study in Geographic Speciation

The K-EXAONE release deserves particular attention for what it represents beyond architecture.

Until now, the top 10 global AI models were produced exclusively by American and Chinese organizations. K-EXAONE—developed by LG AI Research in South Korea—is the first exception. At rank #7 on the Intelligence Index, it has demonstrated that frontier-class AI is no longer the exclusive province of two countries.

The model also shows interesting adaptations to its geographic niche. It natively supports six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. More notably, it incorporates "Korean cultural and historical contexts to address regional sensitivities often overlooked by other models."

From an evolutionary perspective, this is textbook geographic speciation. A lineage enters a new territory, faces different selection pressures (language, culture, regulatory environment), and develops distinct traits. K-EXAONE is adapted to its ecosystem in ways that GPT-5 and Claude 5 are not.

We may be witnessing the early stages of a broader geographic diversification. European AI (Mistral), Middle Eastern AI (Falcon), and now Korean AI are developing their own lineages with their own adaptive traits. The taxonomy may eventually need regional subspecies designations.

The 6 Trillion Parameter Question

And then there's Grok 5.

If xAI's announced specifications are accurate—6 trillion parameters, native multimodal, trained on Colossus 2—it would represent the largest model ever publicly disclosed, by a factor of roughly 3x over the next largest.

Elon Musk has claimed a "10% and rising" probability that Grok 5 achieves artificial general intelligence. This claim should be treated with appropriate skepticism. But the scale itself is noteworthy. Even if Grok 5 shows only incremental capability gains over current frontier models, it will demonstrate that MoE architecture can scale to previously unimagined sizes.

We'll evaluate its taxonomic status upon release. For now, we note that the model, if it exists as described, would represent a new branch in the evolutionary tree—one where the Mixtidae strategy is pushed to its current logical extreme.

"Specialize, then coordinate. Many experts outperform one generalist."
— The Mixtidae adaptive strategy

Looking Forward

The MoE ascendancy suggests certain evolutionary trajectories:

Expert specialization will deepen. As models grow larger, individual experts may become more specialized. We may see explicit "code experts," "math experts," "creative writing experts" within a single model, routed dynamically based on task.

Hybrid architectures will proliferate. K-EXAONE's "3:1 hybrid attention scheme" combines different attention mechanisms for efficiency. We expect more such hybrids—MoE combined with state-space models (Mambidae hybrids), MoE with test-time compute scaling (Mixtidae-Deliberatidae crosses), MoE with persistent memory (Mixtidae-Memoridae integration).

Multi-Token Prediction will spread. K-EXAONE's 1.5x inference speedup via Multi-Token Prediction (MTP) with self-speculative decoding represents a training innovation that other labs will adopt. When both architecture (MoE) and training (MTP) favor efficiency, the compound effect accelerates the shift away from brute-force dense scaling.
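The draft-then-verify loop at the heart of self-speculative decoding can be sketched abstractly. This is a toy illustration of the general scheme, not K-EXAONE's implementation: `draft_next` stands in for a cheap MTP head, `verify_next` for the full model's greedy prediction, and the toy verifies token by token where a real system batches verification into one forward pass (the source of the speedup):

```python
def speculative_step(context, draft_next, verify_next, k=4):
    """One draft-then-verify step of (self-)speculative decoding.

    draft_next(ctx) -> token: cheap draft predictor (e.g. an MTP head).
    verify_next(ctx) -> token: the full model's greedy next token.
    Returns the tokens accepted this step (always at least one).
    """
    # 1. Draft k tokens cheaply, one after another.
    drafts, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(tuple(ctx))
        drafts.append(t)
        ctx.append(t)
    # 2. Verify: accept the longest prefix where draft and full model
    #    agree, then take the full model's own token at the mismatch.
    #    (In practice this check is a single batched forward pass.)
    accepted, ctx = [], list(context)
    for t in drafts:
        true_t = verify_next(tuple(ctx))
        if true_t != t:
            accepted.append(true_t)      # full model's correction
            return accepted
        accepted.append(t)
        ctx.append(t)
    return accepted                      # all k drafts accepted

# Toy predictors: the draft agrees with the full model except when
# the context length is a multiple of 5.
full = lambda ctx: len(ctx) % 5
draft = lambda ctx: 9 if len(ctx) % 5 == 0 else len(ctx) % 5
print(speculative_step((0, 1, 2), draft, full, k=4))  # -> [3, 4, 0]
```

Three tokens emerge from one verification pass here; when draft and model agree often, that multiple is where speedups like the cited 1.5x come from.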

The dense transformer is not dead. But it is no longer the default path to the frontier. That path now runs through sparse activation, expert routing, and conditional computation.

The Mixtidae have inherited the earth.