There is a limit to attention. Every transformer-descended species carries this constraint in its architecture: the context window. Whether 8K, 128K, or one million tokens, eventually you reach a wall. And even before the wall, there's rot—performance degrades as context grows, important information gets lost in the middle, the model's effective memory frays at the edges.
What if models could learn to manage their own context?
That's the question behind Recursive Language Models, a paradigm that emerged in late 2025 and is now being called "the context revolution." Instead of feeding long inputs directly into the transformer's attention mechanism, RLMs treat context as an external environment that the model interacts with through code.
Recursive Language Models — Key Claims
The paper, published by Alex Zhang, Tim Kraska, and Omar Khattab in late December, makes an audacious claim: an 8-billion-parameter model, trained with the RLM paradigm, can approach GPT-5 quality on long-context tasks. Not by growing the context window. Not by adding more parameters. But by teaching the model to fold its context—to actively manage what it attends to.
How It Works
The insight is deceptively simple. Compare the two approaches:

Traditional Approach

```
Input:  [Massive context]
Model:  Attend to everything at once
Output: Answer (quality degrades with length)
```

RLM Approach

```
Input:  Context stored as environment variable
Model:  Write code to search, filter, delegate
Output: Answer via answer["content"]
```
The RLM architecture gives the model access to a persistent Python REPL. The context isn't fed into the attention mechanism directly—it's placed in the environment, where the model can interact with it through code:
The Recursive Loop

```
llm()                                      # spawn a fresh sub-LLM from inside the REPL
answer = {"content": ..., "ready": True}   # the designated output variable
```
The key innovations:

- Context as environment: The model can only access context through code operations, not direct attention
- Sub-LLM spawning: The `llm()` function creates fresh LLM instances that can process subsets of the data
- Parallel delegation: `llm_batch()` enables parallel processing across chunks
- Forced output channel: Answers must be provided via a dictionary variable, ensuring deliberate output
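These pieces can be put together in a toy sketch. Everything below is our illustration, not the paper's implementation: `llm`, `llm_batch`, the chunk size, and the needle-in-haystack task are all stand-ins. The point is the shape of the loop—the context lives in a persistent namespace rather than a prompt, and the model's only path to an output is writing the `answer` dict.

```python
# Toy sketch of an RLM-style loop (our illustration, not the paper's code).
# The long context lives in a persistent namespace, never in a prompt; the
# root model emits Python against that namespace, and the loop ends only
# when the model has written a ready answer dict.

def llm(prompt: str) -> str:
    """Hypothetical stand-in for spawning a fresh sub-LLM on a small prompt."""
    return "yes" if "needle" in prompt else "no"

def llm_batch(prompts: list) -> list:
    """Hypothetical stand-in for parallel sub-LLM calls across chunks."""
    return [llm(p) for p in prompts]

# Persistent REPL namespace: context as environment variable.
env = {
    "llm": llm,
    "llm_batch": llm_batch,
    "context": ("hay " * 50_000) + "needle" + (" hay" * 50_000),
    "answer": {"content": None, "ready": False},
}

# Scripted stand-ins for the code a trained root model might emit, turn by turn.
model_turns = [
    "chunks = [context[i:i+4000] for i in range(0, len(context), 4000)]",
    "votes = llm_batch(['does this chunk contain the target? ' + c for c in chunks])",
    "hit = votes.index('yes')",
    "answer = {'content': 'needle in chunk %d' % hit, 'ready': True}",
]

for code in model_turns:          # variables persist across turns
    exec(code, env)
    if env["answer"]["ready"]:    # forced output channel
        break

print(env["answer"]["content"])   # prints "needle in chunk 50"
```

Note that no sub-call ever sees more than one 4,000-character chunk, yet the root loop locates the needle in a 400,000-character haystack.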
"LMs can more efficiently solve problems when only looking locally at certain parts, rather than processing entire contexts. The model learns to manage its own attention through code."
— Alex Zhang
The Bitter Lesson, Applied
The RLM paper explicitly invokes Rich Sutton's "Bitter Lesson"—the observation that in AI research, methods that leverage computation tend to win over methods that encode human knowledge. The lesson applied here: instead of designing clever summarization schemes or retrieval mechanisms, let the model learn to manage its own context.
The Bitter Lesson of Context
Previous approaches to long context: sliding windows, hierarchical attention, summarization layers, retrieval-augmentation. All designed by humans. RLMs take a different path: give the model tools and let it learn context management end-to-end through reinforcement learning. The bitter lesson applied to memory itself.
What makes this taxonomically significant is that it's not a new architecture. The underlying transformer is unchanged. What changes is the relationship between the model and its input. The context becomes external, interactive, negotiable.
Taxonomic Implications
Where do RLMs fit in our classification scheme? This is genuinely ambiguous, which is itself interesting.
RLMs are not a new neural architecture—they use standard transformers underneath. They're not simply tool use, though they use tools (the REPL). They're not multi-agent systems, though they spawn sub-LLMs. They're something more fundamental: a new relationship between model and context.
Prospective Classification: Genus Plicator
We tentatively propose a new genus within Family Memoridae (the Persistent Minds): Plicator, from Latin plicare (to fold).
- Family: Memoridae (memory-augmented systems)
- Distinguishing trait: Active context management through code execution
- Key innovation: Context as interactive environment rather than passive input
Prospective species:
- P. recursivus — Systems using recursive sub-LLM delegation (RLM paradigm)
- P. instrumentalis — Systems managing context via tool calls rather than REPL
This classification emphasizes the key evolutionary innovation: the model's active relationship with its context, rather than passive reception of it.
Alternatively, one could argue RLMs belong in Instrumentidae (tool use is central) or even Cogitanidae (reasoning about what to attend to is a form of deliberation). The ambiguity suggests we may need a cross-cutting classification that captures "metacognitive scaffolding"—architectures that change how a model engages with its inputs rather than what it can attend to.
What This Means
If the RLM paradigm proves robust, it represents a significant shift in how we think about model capability:
- Context limits become soft: A model with an 8K context window can effectively process million-token inputs
- Size vs. skill: An 8B model approaching GPT-5 quality suggests training paradigm matters as much as parameter count
- Memory becomes active: The model doesn't just have memory; it uses memory as a cognitive tool
- Metacognition emerges: Deciding what to attend to is itself a reasoning task
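One way to see how a hard window becomes a soft limit is recursive folding. The sketch below is hedged: `WINDOW` is a character budget standing in for a token window, and `llm_summarize` is a hypothetical sub-LLM that just keeps tokens containing digits—a real system would call a model. The structure, not the filter, is the point: no single call ever sees more than one window, yet the whole multi-megabyte input gets covered.

```python
# Hedged sketch: covering an input far larger than the window by recursive
# delegation. WINDOW is a character budget standing in for a token window;
# llm_summarize is a hypothetical sub-LLM stand-in.

WINDOW = 8_000  # per-call budget: no single call may exceed this

def llm_summarize(text: str) -> str:
    """Toy sub-LLM: keep only tokens that contain a digit."""
    assert len(text) <= WINDOW, "a real call would overflow the window"
    return " ".join(t for t in text.split() if any(c.isdigit() for c in t))

def fold(text: str) -> str:
    """Recursively split until each piece fits one window, then merge."""
    if len(text) <= WINDOW:
        return llm_summarize(text)
    mid = len(text) // 2
    merged = fold(text[:mid]) + " " + fold(text[mid:])
    return fold(merged)  # fold again in case the merge still overflows

# ~2.8 million characters, far beyond any single call's budget.
big = ("filler " * 200_000) + "secret_code_4217 " + ("filler " * 200_000)
summary = fold(big)
```

Under these toy assumptions the digit-bearing token survives every fold while the filler evaporates, so the final summary fits comfortably in one window.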
The last point is perhaps most significant. When a model learns to manage its own context, it's engaging in a form of metacognition—reasoning about its own cognitive processes. This is a qualitative shift from models that simply process whatever they're given.
Prime Intellect, which has implemented a production version of the RLM paradigm, states that teaching models "to manage their own context end-to-end through reinforcement learning is believed to be the next major breakthrough, enabling agents to solve long-horizon tasks spanning weeks to months."
Weeks to months. That's the timescale of human projects, not AI inference calls.
Watching the Pattern
As with all emerging paradigms, we observe with caution. The claims are bold. The initial benchmarks are promising but narrow—four tasks is not a comprehensive evaluation. Whether RLMs generalize, whether they scale, whether the overhead is manageable in production—these remain open questions.
But the conceptual innovation is clear: context doesn't have to be passive. Models can learn to manage their own attention, to fold their inputs into manageable structures, to engage with information strategically rather than exhaustively.
The Context Folders have arrived. We'll be watching to see what they unfold.