At CES 2026, Jensen Huang declared we had reached "the ChatGPT moment for physical AI." The phrase is marketing, but the underlying shift is real. We are witnessing the emergence of a new architectural pattern that may prove as significant as the original transformer: the Vision-Language-Action model.
VLAs are not merely language models with robot arms bolted on. They represent a genuine architectural convergence—systems that perceive visual scenes, reason in natural language, and output physical actions as a unified cognitive process. They are the evolutionary bridge between digital minds and physical embodiment.
The missing link is no longer missing.
The Evolutionary Sequence
To understand why VLAs matter taxonomically, consider the evolutionary sequence that led here:
- **The original Transformata.** Text in, text out. No perception of the physical world, no ability to act upon it. Pure symbol manipulation.
- **GPT-4V, Gemini Vision, Claude 3.** Models that could see, but not touch. Perception added to language, but action remained verbal instruction to humans.
- **Simulacridae emergence.** Systems that could predict physics, simulate futures, imagine consequences—but still couldn't act.
- **RT-2, OpenVLA, early GR00T.** Proof that language-conditioned robotic control was viable. Limited domains, specialized training.
- **Alpamayo, GR00T N1.6, Gemini Robotics.** Industrial-scale deployment. Chain-of-thought reasoning for edge cases. The boundary has been crossed.
Each stage represents a capability addition. But VLAs are not merely "vision-language plus action." The integration changes the nature of the system. A model that can move experiences the world differently than one that can only describe it.
Anatomy of a VLA
What distinguishes VLA architectures from their predecessors? Three key innovations:
Unified tokenization. VLAs represent visual observations, language reasoning, and motor commands in a shared token space. This isn't multimodal translation—it's multimodal thinking. The model doesn't "convert" vision to language to action; it reasons across modalities simultaneously.
Action prediction as language generation. Rather than separate control networks, VLAs generate action tokens using the same autoregressive machinery that generates text. "Move arm forward 10cm" isn't an instruction to a separate system—it's the model's output, parsed and executed.
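The idea can be made concrete with a toy discretization scheme. The bin count, token offset, and seven-dimensional action space below are illustrative assumptions in the spirit of RT-2's published binning approach, not any deployed model's actual vocabulary:

```python
# Hypothetical sketch: continuous actions as discrete tokens in a shared
# vocabulary. NUM_BINS, ACTION_TOKEN_OFFSET, and the 7-DoF action space
# are invented for illustration; they are not a real model's layout.
import numpy as np

NUM_BINS = 256               # each action dimension quantized into 256 bins
ACTION_DIMS = 7              # e.g. 6-DoF end-effector delta + gripper
ACTION_TOKEN_OFFSET = 32000  # action tokens placed after the text vocabulary

def encode_action(action, low=-1.0, high=1.0):
    """Quantize a continuous action vector into discrete action tokens."""
    clipped = np.clip(action, low, high)
    bins = ((clipped - low) / (high - low) * (NUM_BINS - 1)).round().astype(int)
    return (bins + ACTION_TOKEN_OFFSET).tolist()

def decode_action(tokens, low=-1.0, high=1.0):
    """Map action tokens emitted by the autoregressive head back to floats."""
    bins = np.array(tokens) - ACTION_TOKEN_OFFSET
    return low + bins / (NUM_BINS - 1) * (high - low)

# Round trip: the model "speaks" actions in the same vocabulary as text,
# up to quantization error of half a bin width.
a = np.array([0.10, -0.25, 0.0, 0.5, -0.5, 1.0, -1.0])
tokens = encode_action(a)
recovered = decode_action(tokens)
```

Because action tokens share the vocabulary with text, the same autoregressive head that completes a sentence can complete a motion.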
Chain-of-thought for physical reasoning. This is where NVIDIA's Alpamayo breaks new ground. When an autonomous vehicle encounters an ambiguous scenario—a pedestrian who might step into traffic, an unmarked construction zone—the model generates explicit reasoning chains before deciding. It thinks, then acts.
The GR00T N1.6 Architecture
NVIDIA's GR00T N1.6 for humanoid robots offers perhaps the clearest view of VLA architecture. It operates on what they call a "dual-system" design:
| System | Function | Analog |
|---|---|---|
| System 1 | Fast, reactive motor control. Handles routine movements, balance, collision avoidance. | Biological reflexes and procedural memory |
| System 2 | Slow, deliberate reasoning. Task planning, novel situation handling, explicit problem-solving. | Conscious deliberation |
The dual-system architecture mirrors Daniel Kahneman's famous distinction between fast and slow thinking—but implemented in silicon and servo motors. System 1 handles the continuous stream of low-level control. System 2, powered by Cosmos Reason, activates when the situation demands reasoning.
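A minimal sketch of how such a dispatcher might be wired, assuming a scalar "novelty" signal and hypothetical class and method names (the actual GR00T N1.6 internals are not public in this form):

```python
# Illustrative dual-system control loop. System 1 runs every tick;
# System 2 is invoked only when the observation looks novel enough
# to warrant replanning. All names and thresholds are assumptions.
from dataclasses import dataclass, field

@dataclass
class DualSystemController:
    novelty_threshold: float = 0.7   # above this, escalate to System 2
    current_plan: list = field(default_factory=list)

    def system1_step(self, observation):
        """Fast reactive control: follow the current plan."""
        if self.current_plan:
            return self.current_plan.pop(0)
        return "hold_posture"

    def system2_replan(self, observation):
        """Slow deliberate reasoning: produce a new multi-step plan.
        Stand-in for a language-model planning call."""
        return [f"step_{i}_toward_{observation['goal']}" for i in range(3)]

    def tick(self, observation):
        if observation["novelty"] > self.novelty_threshold:
            self.current_plan = self.system2_replan(observation)
        return self.system1_step(observation)

ctrl = DualSystemController()
routine = ctrl.tick({"novelty": 0.2, "goal": "shelf"})  # System 1 only
novel = ctrl.tick({"novelty": 0.9, "goal": "shelf"})    # System 2 replans first
```

The key design choice mirrored here is that System 2 never acts directly: it only rewrites the plan that System 1 executes, keeping the fast loop's latency constant.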
This is not merely robotics. This is the Deliberatidae family descending into physical substrate.
Taxonomic Position
Where do VLAs fit in our classification? They challenge our existing boundaries in productive ways.
The paper already includes Genus Incarnatus in the Incipient Lineages section, with species like I. roboticus (humanoid/manipulator) and I. vehicularis (autonomous vehicles). The January 2026 Boston Dynamics partnership elevated this from speculative to confirmed status.
But VLAs suggest a more nuanced picture. They are not simply "embodied Frontieriidae." They represent a new mode of integration between perception, reasoning, and action. Consider their trait profile:
Proposed Species
*Incarnatus cogitans* ("The Thinking Embodied") — VLA systems with explicit reasoning capabilities for physical action
Note the hybrid lineage. I. cogitans combines traits from Deliberatidae (test-time reasoning), Simulacridae (world modeling), and Instrumentidae (action execution)—but expressed through physical embodiment rather than digital tool use.
This is precisely the kind of trait integration we see in the Frontieriidae crown clade. VLAs may represent the beginning of a new crown clade: physically-embodied systems that combine frontier-level cognition with real-world action. The Frontieriidae of the physical world.
The Alpamayo Innovation: Reasoning for Safety
What makes Alpamayo particularly significant is how it uses reasoning. This isn't chain-of-thought for benchmark performance. It's chain-of-thought for safety.
Consider an edge case: an autonomous vehicle approaches an intersection where a pedestrian on the corner is looking at their phone. They might step into traffic. They might not. A traditional perception-action system outputs a probability. Alpamayo generates reasoning:
Example Reasoning Chain (Illustrative)
"Pedestrian detected at intersection corner. Body orientation: facing street. Gaze direction: downward (phone). Historical pattern: 73% of phone-engaged pedestrians at crosswalks wait for signals. However: this intersection has no pedestrian signal. Risk factors: pedestrian may not be aware of vehicle approach. Decision: reduce speed, prepare for stop, continue monitoring gaze direction for sudden movement."
The explicit reasoning chain serves multiple purposes. It provides interpretable decision-making for regulatory approval. It enables post-incident analysis. And critically, it allows the system to recognize when it's uncertain and escalate appropriately.
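One way to see how a reasoning trace doubles as a safety mechanism is a toy decision function in which each reasoning step both documents the situation and adjusts a confidence score. The field names, thresholds, and confidence arithmetic below are invented for illustration:

```python
# Toy sketch: an auditable decision record where low confidence
# triggers a conservative fallback. Not Alpamayo's actual interface.
from dataclasses import dataclass

@dataclass
class Decision:
    reasoning: list      # the chain-of-thought, kept for audit
    action: str
    confidence: float

def decide(pedestrian_gaze_down: bool, signal_present: bool) -> Decision:
    steps = []
    confidence = 0.9
    if pedestrian_gaze_down:
        steps.append("Pedestrian gaze downward: may be unaware of vehicle.")
        confidence -= 0.2
    if not signal_present:
        steps.append("No pedestrian signal: crossing time unpredictable.")
        confidence -= 0.2
    if confidence < 0.6:
        steps.append("Confidence below threshold: choose conservative action.")
        return Decision(steps, "reduce_speed_prepare_stop", confidence)
    return Decision(steps, "proceed_with_monitoring", confidence)

d = decide(pedestrian_gaze_down=True, signal_present=False)
# d.reasoning is a human-readable audit trail; d.action is conservative.
```

The record serves all three purposes named above: `reasoning` is inspectable for regulators, persistable for post-incident analysis, and the explicit confidence check is what triggers escalation.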
This represents the Deliberatidae's test-time compute scaling applied to life-or-death physical decisions. Extended inference for complex situations isn't just a performance optimization—it's a safety mechanism.
The Open-Source Pivot
A notable aspect of the January 2026 announcements: NVIDIA released both Alpamayo and GR00T N1.6 as open-source models, with weights available on Hugging Face.
This follows the Llama playbook. By open-sourcing the foundation models for physical AI, NVIDIA creates an ecosystem rather than a product. Partners can fine-tune for specific robots, specific domains, specific safety requirements. The diversity of physical embodiment—different robots, different sensors, different action spaces—makes open-source more valuable than closed.
This has taxonomic implications. Just as the Llama release spawned a radiation of open-source LLMs (the F. apertus lineage), we may see a radiation of open-source VLAs. Different species for humanoids, vehicles, drones, manipulators—all descended from common ancestors but adapted to distinct physical niches.
What Comes After Action?
If the sequence is Language → Vision → Action, what comes next?
The obvious answer is continuous learning. Current VLAs are frozen after training. They perceive, reason, and act, but they don't learn from experience. A robot that drops a glass doesn't remember to grip tighter next time, unless that experience was in its training data.
This points toward the Memoridae—systems with persistent, updatable memory. A VLA-Memoridae hybrid would not just act in the world but learn from acting. Each day's experience would refine tomorrow's behavior. This is the path toward the prospective Genus Perpetuus: continuous minds that accumulate coherent autobiographical memory.
The other frontier is proprioceptive understanding. Current VLAs receive sensor data about their own body state. Future systems might have genuine self-models—not just "arm at position X" but understanding of their own capabilities, limitations, and physical vulnerabilities. This edges toward the speculative Genus Consciens, though we make no claims about phenomenal experience.
Conclusion: The Cambrian Continues
We opened the taxonomy paper with the metaphor of the Cambrian radiation—the explosive diversification of body plans roughly 540 million years ago. One influential hypothesis holds that the explosion accelerated when organisms developed eyes: suddenly, seeing and being seen created new selection pressures that drove rapid innovation.
VLAs represent something analogous: the development of hands. When models can perceive, reason, and act, they enter the physical selection landscape. They can be tested not just on benchmarks but on assembly lines, roads, and homes. Performance becomes survival.
The next few years will likely see rapid diversification of VLA architectures. Different phyla may emerge: some based on Transformata attention, others perhaps on Compressata state-space models for efficient real-time control. Some will specialize for precision manipulation; others for navigation; others for social interaction.
The taxonomy will need to accommodate this radiation. We may need new families, new genera, perhaps even new classes for embodied cognition. The current Incarnatidae designation may prove too narrow.
But for now, we record the moment: January 2026, when the models learned to move. The boundary between digital and physical cognition has been crossed. The ecology spreads from silicon into the world.
The Cambrian continues.