Inference & Serving

Training a model costs millions. Serving it costs... well, also a lot. Every time someone asks ChatGPT a question, a GPU cluster somewhere performs billions of multiply-add operations for every token it generates. The challenge isn't just making models smart; it's making them fast and cheap enough to actually use.

That's what inference optimization is all about. A naive implementation might generate 5 tokens per second and use 80GB of GPU memory for a 70B model. An optimized one? 80 tokens per second with the same hardware. That's a 16x improvement — the difference between a system that barely works and one that scales to millions of users.

The tricks aren't magic. They're engineering: KV-cache, Flash Attention, paged attention, continuous batching. Each one targets a specific bottleneck. Together, they transform LLMs from research demos into production-grade systems. Let's break down how they work.

The inference bottleneck

LLMs generate text one token at a time, sequentially. You can't parallelize generation — each token depends on all the tokens before it. For a 100-token response, that's 100 separate forward passes through the entire model. Every single pass involves billions of operations.
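A minimal sketch of that loop, assuming a hypothetical `toy_forward` function that stands in for a real model (it only hashes the context into a token id). The point is the control flow, not the model: one forward pass per generated token, each one consuming the output of the previous pass.

```python
def toy_forward(tokens):
    """Hypothetical stand-in for a model forward pass: context in, next-token id out."""
    return (sum(tokens) * 31 + len(tokens)) % 1000


def generate(prompt, max_new_tokens):
    """Autoregressive decoding: every new token depends on all tokens before it."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward(tokens)  # needs the full context generated so far
        forward_passes += 1
        tokens.append(next_token)         # the output becomes part of the next input
    return tokens, forward_passes


tokens, passes = generate(prompt=[1, 2, 3], max_new_tokens=100)
print(passes)  # 100
```

Because step t cannot start until step t-1 has produced its token, the loop body cannot be spread across the tokens of a single response. Parallelism has to come from elsewhere, such as serving several requests at once.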

Inference pipeline — one token at a time

User Request → Queue → GPU Forward Pass (repeat × N tokens) → Response → User

The problem compounds when you have multiple users. If you process requests sequentially, user 2 waits for user 1's entire 100-token response to finish before their own first token is generated. Meanwhile the GPU is badly underused: decoding a single sequence leaves most of its compute capacity idle, so you're paying for expensive hardware that spends much of its time doing nothing.

Worse, a naive forward pass reprocesses every previous token. Generating token 50 means recomputing keys and values for tokens 1-49; generating token 51 means recomputing them for tokens 1-50. Summed over an n-token response, that is 1 + 2 + ... + n token computations, which grows as O(n²), and it's one of the biggest performance killers in naive transformer inference.

Inference optimization is about eliminating redundant work (KV-cache), making the necessary work faster (Flash Attention), managing memory efficiently (paged attention), and maximizing hardware utilization (continuous batching). Each technique attacks a different part of the bottleneck. Let's see how.

Making inference fast

Modern inference engines like vLLM and TensorRT-LLM didn't just make things faster by accident. They systematically addressed each bottleneck in the inference pipeline with specific algorithmic and systems-level tricks. Here are the four big ones that matter most.

KV-Cache — trade memory for speed

When generating token 50, the model computes attention over tokens 1-49. When generating token 51, it does the same computation again, plus token 50. This is pure waste — the attention keys and values for tokens 1-49 haven't changed.

Without KV-cache: Recompute attention for all previous tokens every step.

Token 1: compute K₁, V₁
Token 2: compute K₁, V₁, K₂, V₂
Token 3: compute K₁, V₁, K₂, V₂, K₃, V₃
...
Total key/value computations over n tokens: O(n²)

With KV-cache: Store previous keys/values, only compute the new token.

Token 1: compute K₁, V₁ → cache
Token 2: compute K₂, V₂ → append to cache
Token 3: compute K₃, V₃ → append to cache
...
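Total key/value computations over n tokens: O(n). Each step still attends over the whole cache, so the attention itself remains O(n) per token.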
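The two listings above can be sketched in plain Python. This is a toy single-head attention under loud assumptions: `project` is a hypothetical stand-in for the learned key/value projections, the query is just the raw embedding, and the head dimension is 4. It is meant only to show that caching gives the same outputs while doing far fewer key/value computations.

```python
import math

D = 4  # toy head dimension (assumption for illustration)


def project(x, scale):
    """Hypothetical stand-in for a learned linear projection (W_k or W_v)."""
    return [scale * v + 0.1 * i for i, v in enumerate(x)]


def attend(q, keys, values):
    """Softmax attention of one query over all cached keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]


def decode_no_cache(embeddings):
    outputs, kv_projections = [], 0
    for t in range(1, len(embeddings) + 1):
        context = embeddings[:t]
        keys = [project(x, 0.5) for x in context]    # recomputed at every step
        values = [project(x, 2.0) for x in context]  # recomputed at every step
        kv_projections += t
        outputs.append(attend(context[-1], keys, values))
    return outputs, kv_projections


def decode_with_cache(embeddings):
    outputs, kv_projections = [], 0
    key_cache, value_cache = [], []
    for x in embeddings:
        key_cache.append(project(x, 0.5))    # only the new token is projected
        value_cache.append(project(x, 2.0))
        kv_projections += 1
        outputs.append(attend(x, key_cache, value_cache))
    return outputs, kv_projections


embeddings = [[math.sin(t + j) for j in range(D)] for t in range(50)]
out_a, work_a = decode_no_cache(embeddings)
out_b, work_b = decode_with_cache(embeddings)
print(work_a, work_b)  # 1275 50
```

For 50 tokens the uncached loop performs 1 + 2 + ... + 50 = 1275 key/value projections against 50 with the cache, a ratio of (n + 1) / 2 = 25.5. Real speedups depend on much more than this count, so treat the ratio as an illustration of the trend rather than a benchmark.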

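The trade-off: KV-cache grows linearly with sequence length and with the number of sequences being served. For a 70B-class model using full multi-head attention in fp16 (assuming 80 layers and a hidden size of 8192), that is about 2.5 MB per token, or roughly 5GB for a single 2048-token context; with a handful of concurrent sequences the cache takes 20-40GB of GPU memory. But the speedup is massive, often 10-50x faster than recomputing everything.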
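A back-of-envelope sizing sketch for that trade-off. The model shape used here (80 layers, 64 heads of dimension 128, fp16 values, full multi-head attention) is an assumption chosen to resemble a 70B-class model, not the configuration of any specific one, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and every token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
# Assumed 70B-class shape: 80 layers, 64 KV heads x 128 dims (hidden size 8192), fp16.
one_seq = kv_cache_bytes(80, 64, 128, seq_len=2048)
batch_of_8 = kv_cache_bytes(80, 64, 128, seq_len=2048, batch_size=8)
print(one_seq / GIB, batch_of_8 / GIB)  # 5.0 40.0
```

Under these assumptions a single 2048-token sequence needs about 5 GiB, and the tens-of-gigabytes range appears once several sequences are cached at the same time. Models that share key/value heads across query heads (grouped-query attention) shrink `num_kv_heads`, and the cache shrinks proportionally.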

Now try it yourself

Time to see the optimizations in action. The simulation below has two parts: First, toggle between no KV-cache and with KV-cache to see the speed difference and memory trade-off. Second, compare sequential processing vs continuous batching to see how batching maximizes throughput. Watch the stats — the numbers tell the real story.

Inference Optimization Visualizer

Part 1: KV-Cache — Memory vs Speed

[Interactive panel: sequence length 50 tokens; a growing KV-cache with a progress counter (0 / 50 tokens); readouts for time complexity (O(n), linear), speedup vs no cache (25.5x), KV-cache memory usage (MB), and throughput (tokens/second).]

Key insight: KV-cache stores past computations. Each step processes only the newest token instead of the whole sequence, trading memory for massive speed gains.

Part 2: Continuous Batching — Maximizing GPU Utilization

[Interactive panel: four requests of 30, 40, 25, and 35 tokens, all initially waiting; a mode toggle (shown in Sequential mode); readouts for total time and throughput.]

Key insight: Sequential processing wastes GPU cycles. Only one request runs at a time. The GPU sits idle between requests.
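The sequential-versus-batched comparison can be approximated with a step counter. This sketch uses the four request lengths from the panel above (30, 40, 25, and 35 tokens) and makes one strong simplifying assumption: a decode step costs the same whether it serves one request or several. `simulate` is a hypothetical helper, not part of any serving engine.

```python
from collections import deque


def simulate(request_lengths, max_batch):
    """Count decode steps when up to `max_batch` requests share each step."""
    waiting = deque(request_lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Continuous batching: refill any free slot before every step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [30, 40, 25, 35]
sequential = simulate(lengths, max_batch=1)
batched = simulate(lengths, max_batch=4)
print(sequential, batched)  # 130 40
```

With one slot the GPU needs 130 steps for 130 tokens; with four slots the same tokens finish in 40 steps, bounded by the longest request. In practice a larger batch makes each step somewhat slower, so the real gain is smaller than this idealized count suggests.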

Production inference engines like vLLM and TensorRT-LLM combine both techniques: KV-cache eliminates redundant computation, and continuous batching maximizes hardware utilization. Together, they achieve 10-20x throughput gains over naive implementations. Flash Attention optimizes the memory access patterns within attention itself, making each forward pass even faster.
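The idea behind Flash Attention's memory savings can be illustrated without a GPU. The sketch below is not the Flash Attention kernel, which is a fused GPU implementation; it only shows the running (online) softmax that lets attention be computed one block of keys and values at a time, without ever holding all scores at once. Function names, sizes, and inputs are illustrative assumptions.

```python
import math


def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def attention_full(q, keys, values):
    """Reference: materialize every score, then apply softmax."""
    scores = [score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def attention_tiled(q, keys, values, block_size):
    """Process keys/values block by block, rescaling running statistics."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j) for j in range(4)] for i in range(10)]
full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=3)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # True
```

The tiled version only ever holds `block_size` scores plus a few running values, which is the property that lets a real kernel keep its working set in fast on-chip memory.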

Key Takeaways

LLMs generate one token at a time sequentially, requiring a forward pass for each token. For a 100-token response, that's 100 forward passes through the entire model — the core inference bottleneck.
KV-cache eliminates redundant computation by storing previous attention keys and values. This trades memory (cache grows linearly with sequence length) for speed (10-50x faster generation).
Flash Attention rewrites attention to never materialize the full N×N matrix. It processes attention in tiles that fit in fast SRAM, reducing memory bandwidth bottlenecks by 2-4x.
Paged Attention (vLLM) manages KV-cache like virtual memory — allocating pages on-demand instead of contiguous blocks. This reduces memory waste by 60-80%, fitting 2-3x more requests per GPU.
Continuous batching processes multiple requests together and immediately fills slots when requests finish. This keeps GPU utilization at 95%+ vs 60-70% for static batching, often doubling throughput.
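The paged-attention takeaway is easiest to see as a data structure. Below is a toy block allocator, assuming a fixed pool of equally sized blocks and a per-sequence block table. The class and method names are hypothetical and are not vLLM's API; the sketch tracks only bookkeeping, not actual key/value tensors.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots across all sequences."""
        return sum(len(table) * self.block_size - self.lengths[seq_id]
                   for seq_id, table in self.block_tables.items())


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(30):
    cache.append_token("a")
for _ in range(40):
    cache.append_token("b")
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), cache.wasted_slots())  # 2 3 10
```

Waste is limited to the unfilled tail of each sequence's last block (10 slots here), whereas reserving one contiguous region per sequence sized for the maximum context would leave most of that region empty for short responses. Freed blocks go straight back to the pool, which is what lets a newly admitted request reuse them immediately.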

Common Misconceptions

"Inference is just running the model forward, there's not much to optimize." — Wrong. Naive inference wastes 90% of compute on redundant work and idle time. Optimized engines are 10-20x faster with the same hardware.
"KV-cache always makes things faster." — Almost, but not always. For very short sequences (under 10 tokens), the memory overhead of managing the cache can outweigh the savings. Most production systems still use it because average sequences are long enough to benefit.
"Flash Attention is only for training." — Nope. It's just as critical for inference. The memory bandwidth savings matter even more at inference time when you're bottlenecked by single-token generation latency.
"You need multiple GPUs to serve LLMs fast." — Not necessarily. A single GPU with vLLM + Flash Attention + paged attention can serve a 13B model at 50-100 tokens/sec to dozens of concurrent users. Multi-GPU helps for larger models or higher scale, but one GPU is plenty for many use cases.

Quick check

What does KV-cache trade for faster inference?
