State Space Models & New Architectures

Mamba, S4, RWKV — what comes after transformers?

Advanced · 25 min

Prerequisites

  • transformer-architecture

Transformers have a dirty secret: self-attention costs O(n²) in the sequence length. Double the input length, quadruple the computation. That's fine for a 2,000-token chat message, but what about processing an entire book? Or a genome? Or continuous sensor data? Researchers have been working on alternatives, and some of them are genuinely impressive.

The quadratic scaling of self-attention is not just an academic curiosity. It is a major reason why long context windows cost so much more to serve than short ones. It is why you cannot throw a million-token context at a standard transformer and expect it to finish before your coffee gets cold. The attention mechanism computes scores between every pair of tokens, and for n tokens that means n² comparisons. The math is unforgiving.

But transformers were not handed down from the heavens. They are just one way to process sequences, and the AI research community has been exploring alternatives that break the quadratic barrier. State space models like S4 and Mamba process sequences in linear time. RWKV combines RNN efficiency with transformer-like parallelism. And hybrid architectures cherry-pick the best of both worlds. Some of these models are already in production. This is the lesson where we ask: what comes after transformers?

The quadratic problem

Self-attention is powerful because every token can attend to every other token. But this all-to-all connectivity comes at a cost: for n tokens, you compute n² attention scores. This is fine when n is small. It becomes painful when n gets large.
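To see where the n² comes from, here is a minimal sketch of the score computation in plain Python. The function name, the tiny 2-dimensional token vectors, and the use of the same vectors as queries and keys are all assumptions made for illustration; a real model uses learned projections and runs this on a GPU.

```python
import math

def attention_weights(queries, keys):
    """Return the n x n matrix of softmax-normalised attention weights."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:  # one row per token ...
        # ... and one score per (query, key) pair, so n scores per row
        row = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Four toy "tokens" (hand-picked values, not real embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w = attention_weights(tokens, tokens)
num_scores = sum(len(row) for row in w)
print(len(tokens), "tokens ->", num_scores, "attention scores")
```

The nested loop is the whole story: every token is compared against every token, so the work and the size of `w` both grow with n × n.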

Let's make this concrete. A 2,048-token input requires roughly 4 million attention computations (2048²). Double the sequence to 4,096 tokens and you are at roughly 16 million computations, a 4x increase for a 2x longer input. Go to 32,768 tokens (roughly a long report or a few book chapters) and you are looking at over 1 billion attention operations. The numbers get out of hand fast. Memory usage scales the same way: a fully materialized attention matrix holds n² entries.
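The arithmetic above is easy to reproduce. The sketch below counts pairwise scores and the size of one n × n score matrix, for a single head in a single layer. The helper name and the 2-bytes-per-score figure (16-bit floats) are assumptions for illustration.

```python
def attention_cost(n, bytes_per_score=2):
    """Pairwise score count and byte size of one n x n score matrix.

    bytes_per_score=2 assumes 16-bit floats; that choice is illustrative.
    """
    scores = n * n
    return scores, scores * bytes_per_score

for n in (2_048, 4_096, 32_768):
    scores, nbytes = attention_cost(n)
    print(f"n={n:>6,}  scores={scores:>13,}  matrix={nbytes / 2**20:>6,.0f} MiB")
```

Going from 2,048 to 32,768 tokens is a 16x longer input, but a 256x larger score matrix.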

The cost of self-attention: Sequence (n tokens, input data) → Self-Attention (O(n²) complexity) → Compute + Memory (both scale quadratically)

What we actually want: Long Sequences (books, genomes, video) → Linear Complexity (O(n) processing) → Efficient (scales to millions of tokens)

This is not just a performance annoyance. It limits what transformers can practically do. Processing a full-length novel (100K+ tokens) with standard attention is expensive. Processing an entire codebase or a multi-hour video transcript costs far more still. The quadratic wall is real, and it has pushed researchers to explore radically different architectures that can handle long sequences without breaking the bank.

Beyond attention

The past few years have seen an explosion of alternative architectures. They all aim to solve the same problem: how do you process long sequences efficiently without losing the modeling power that makes transformers so effective? The answers vary, but they fall into a few camps.

State Space Models (S4)

State space models (SSMs) come from control theory, not machine learning. The core idea: maintain a hidden state that gets updated for each new token in the sequence. This is conceptually similar to an RNN, but SSMs use continuous dynamics inspired by differential equations. The original S4 paper (Gu et al., 2021) showed that you can parameterize these models in a way that makes them both trainable in parallel (like transformers) and efficient at inference (like RNNs).

// Simplified S4 update rule:
h[t] = A × h[t-1] + B × x[t]
y[t] = C × h[t]
// h = hidden state, x = input, y = output
// Matrices A, B, C are learned during training
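For readers who want the link to differential equations spelled out, here is a sketch of the continuous-time system and the bilinear discretization described in the S4 paper. The step size Δ and the barred matrices are notation introduced here; the simplified rule above writes the discrete matrices simply as A and B.

```latex
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), & y(t) &= C\,h(t) \\
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t \\
\bar{A} &= \left(I - \tfrac{\Delta}{2}A\right)^{-1}\left(I + \tfrac{\Delta}{2}A\right), &
\bar{B} &= \left(I - \tfrac{\Delta}{2}A\right)^{-1}\Delta B
\end{aligned}
```

The first line is the continuous system, the second is the per-token update, and the third shows how the discrete matrices are derived from the continuous ones for a chosen step size.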

A key ingredient in S4 is a principled initialization of the A matrix (HiPPO matrices), paired with a structured parameterization that keeps the computation cheap. Together they let the model remember information over extremely long sequences, far longer than vanilla RNNs. The cost is O(n) in sequence length when the model runs as a recurrence, and close to linear (O(n log n)) when it runs as an FFT-based convolution during training, rather than O(n²). The tradeoff: S4 cannot do arbitrary token-to-token attention. It processes the sequence step by step (or as a convolution during training), updating its hidden state as it goes. For tasks that need long-range dependencies but not fine-grained attention, S4 is extremely efficient.
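The two ways of running the update rule can be checked against each other. The sketch below assumes a diagonal A with hand-picked values (not learned, and not a HiPPO initialization), a scalar input per step, and hypothetical function names. It runs the recurrence step by step, then computes the same outputs as a causal convolution with the kernel K[k] = C A^k B.

```python
def ssm_recurrent(a, b, c, xs):
    """Run h[t] = A h[t-1] + B x[t], y[t] = C h[t] with diagonal A.

    a, b, c are lists of floats: the diagonal of A and the vectors B and C.
    The state h has a fixed size no matter how long xs is.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

def ssm_kernel(a, b, c, length):
    """Kernel K[k] = C A^k B for k = 0 .. length - 1."""
    return [sum(ci * (ai ** k) * bi for ai, bi, ci in zip(a, b, c))
            for k in range(length)]

def ssm_convolution(a, b, c, xs):
    """Same outputs as ssm_recurrent, computed as a causal convolution.

    This direct sum is quadratic in len(xs); it is written this way only
    to show the equivalence. S4 evaluates the convolution with an FFT.
    """
    kernel = ssm_kernel(a, b, c, len(xs))
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1))
            for t in range(len(xs))]

# Toy parameters, chosen by hand for illustration.
a, b, c = [0.9, 0.5], [1.0, 1.0], [0.5, -0.25]
xs = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(a, b, c, xs)
y_conv = ssm_convolution(a, b, c, xs)
print(y_rec)
print(y_conv)
```

The recurrent form is what makes inference cheap: each new token touches only the fixed-size state. The convolutional form is what makes training parallel: the kernel can be computed once and applied to the whole sequence.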


Transformer vs State Space Model

Interactive comparison: a slider sets the sequence length from 16 to 1,024 tokens, with the crossover marked at about 256 tokens, and shows how computation (FLOPs), memory usage, and processing speed (tokens/sec) change for each architecture. Short sequences favor transformers; long sequences favor SSMs.

Transformer (Self-Attention): attention matrix of n × n, O(n²) quadratic scaling
State Space Model (Mamba): sequential state updates with constant memory, O(n) linear scaling
Hybrid Architecture: combines attention (short-range) with an SSM (long-range)

The crossover effect: At short sequences (under roughly 256 tokens in this comparison), transformers are faster because the O(n²) cost is still manageable and hardware is optimized for them. But as sequences grow (1K, 10K, 100K+ tokens), the quadratic cost explodes. SSMs keep a fixed-size state and linear complexity, which makes them far cheaper for truly long contexts. Modern models like Jamba and StripedHyena use both.
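A toy cost model makes the crossover visible. Everything here is an assumption for illustration: attention is charged one unit per token pair, the SSM is charged a fixed 256 units per token, and that constant was picked only so the curves meet near the ~256-token mark mentioned above. Real crossover points depend on hardware, kernels, and model size.

```python
def toy_attention_cost(n):
    """Toy model: one unit of work per token pair."""
    return n * n

def toy_ssm_cost(n, per_token=256):
    """Toy model: a fixed amount of work per token.

    per_token=256 is a made-up constant, not a measured value.
    """
    return per_token * n

def crossover(per_token=256):
    """Smallest n at which the toy SSM is no more expensive than attention."""
    n = 1
    while toy_attention_cost(n) < toy_ssm_cost(n, per_token):
        n += 1
    return n

for n in (128, 256, 1_024, 100_000):
    ratio = toy_attention_cost(n) / toy_ssm_cost(n)
    print(f"n={n:>7,}  attention/SSM cost ratio = {ratio:,.2f}")
print("crossover at n =", crossover())
```

Below the crossover the ratio is under 1 and attention is cheaper in this model; above it the ratio grows in proportion to n, which is why the gap at 100K tokens is so much larger than at 1K.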

Key Takeaways

  • Self-attention in transformers scales as O(n²) in sequence length. This limits context windows and makes long-sequence processing prohibitively expensive. Doubling the input length quadruples the compute and memory cost.
  • State Space Models (S4, Mamba) process sequences in O(n) time by maintaining a recurrent hidden state instead of computing all-to-all attention. Mamba adds input-dependent (selective) state updates, making it competitive with transformers on quality.
  • RWKV combines RNN-like inference (linear time, constant memory) with transformer-like training (parallelizable). It can be trained efficiently on GPUs and then deployed for extremely long sequences at low cost.
  • Hybrid architectures like Jamba and StripedHyena use attention for short-range dependencies and SSMs for long-range context. This avoids the quadratic blowup while retaining the expressiveness that makes transformers powerful.
  • The future is not transformers vs SSMs — it is transformers and SSMs. As context windows grow to millions of tokens, expect more models to adopt hybrid architectures that cherry-pick the best of both approaches.
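The second takeaway says Mamba adds input-dependent (selective) state updates. A toy scalar sketch of that idea follows. It is not Mamba's implementation: the function names, the weights `w_delta`, `w_b`, `w_c`, and the single scalar state are assumptions for illustration. Real Mamba uses learned projections of vector inputs, a diagonal A per channel, and a hardware-aware parallel scan.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))

def selective_scan(xs, a=-1.0, w_delta=2.0, w_b=1.0, w_c=1.0):
    """Toy selective SSM with a scalar state.

    Unlike the fixed A, B, C of the S4 rule, the step size, B, and C
    here are recomputed from each input x. All weights are hypothetical.
    """
    h = 0.0
    states, ys = [], []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # decay in (0, 1) because a < 0
        b_bar = delta * (w_b * x)      # input-dependent B
        h = a_bar * h + b_bar * x      # large delta: overwrite; small delta: keep
        states.append(h)
        ys.append((w_c * x) * h)       # input-dependent C
    return ys, states

# One token the toy model "writes", followed by two it mostly "skips".
ys, states = selective_scan([3.0, -5.0, -5.0])
print(states)
```

With these weights, the first input produces a large step size and sets the state, while the next two produce a step size near zero, so the state barely moves. That ability to decide per token what to store and what to ignore is what a fixed A, B, C cannot express.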

Common Misconceptions

  • "Transformers are going away." — They are not. Transformers have an enormous installed base, excellent hardware support, and a mature ecosystem. What is changing: transformers are being augmented with SSMs and other linear-complexity components for long-range tasks. The all-transformer architecture may fade, but attention layers will stick around for the foreseeable future.

Quick check

What makes Mamba different from the original S4 state space model?