The LLM is always the bottleneck. In every production system I’ve built, the same pattern emerges. Your backend sits idle. Your database purrs at 5% CPU. And your language model - the brilliant, expensive centerpiece - grinds through requests at 2 seconds apiece while users expect 200 milliseconds. Costs spiral. Latency kills user experience.
The problem isn’t configuration or model selection. It’s the architecture itself. Transformers were designed for parallel training but forced into sequential generation. Stateless by design but requiring growing state. Optimized for throughput but deployed for latency. Every optimization exists to bridge these gaps. To make LLMs fast, you first need to understand why they’re slow, and that starts with how they actually work under the hood.
Part I: Understanding the Machine
History and Evolution of GPT-style LLMs
Attention is All You Need (2017)
The transformer wasn’t designed to be slow. It was designed to be trainable: massively parallel, highly efficient at learning from enormous datasets.1 That’s the original sin. An architecture optimized for training was deployed for inference, where different constraints matter.
Before transformers, language models relied on recurrence. LSTMs and RNNs processed text sequentially, one token at a time, maintaining a fixed-size hidden state. They were slow to train (each token had to wait for the previous one) but fast to generate. The bottleneck was in the wrong place.
The 2017 “Attention is All You Need” paper flipped this.1 Replace recurrence with self-attention, and suddenly you can train on entire sequences in parallel. Every token can look at every other token simultaneously. Training time collapsed. The architecture scaled beautifully with GPUs designed for parallel matrix operations.
But that parallelism comes with a cost that only becomes apparent during generation. The transformer is technically stateless: each forward pass is independent. Yet to generate a sequence, it must maintain increasingly rich context about everything that came before. Not through a fixed hidden state like RNNs, but through accumulated attention to a growing history.
Understanding the architecture at different scales:
At the highest level, a conversation consists of multiple turns: each message from you, each response from the model. Every time the model generates a response, it’s running a sequence of forward passes. One pass to generate the first token, another for the second, and so on. Each token depends on all previous tokens, forcing this sequential generation.
Within each forward pass, the model processes through multiple transformer layers (typically 32-96 layers in modern LLMs). Each layer contains the same set of components repeated:
- Multi-head self-attention: The current position attends to all previous positions
- Position-wise feed-forward network: Dense layers applied to each position independently
- Layer normalization and residual connections: Stabilize and preserve information
Before any of this happens, the input goes through:
- Tokenizer and vocabulary: Text becomes integer indices (32K-200K possible tokens)
- Positional encodings: Position information added since attention has no inherent sequence notion
The original transformer used an encoder-decoder architecture for translation. But during training, you can process an entire sequence at once: all positions, all layers, in parallel. During generation, you’re stuck in a loop. Generate one token, run the entire model, generate the next token, run the entire model again.
Each forward pass goes through dozens of layers. Each layer performs attention over the entire growing context. The sequence length increases with every token generated. The computation grows quadratically with sequence length in the attention mechanism.
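To make that loop concrete, here’s a minimal sketch in Python. The `model` function is hypothetical, standing in for the full stack of layers; the point is the shape of the loop, not any real API. Every iteration reruns the model over the entire, growing sequence, and only the newest position’s prediction is used.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Greedy autoregressive decoding sketch. `model` is a hypothetical function
    that maps a list of token ids to per-position score lists (logits)."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)            # full forward pass over everything so far
        last = logits[-1]                 # only the newest position's scores matter
        next_id = last.index(max(last))   # greedy: take the highest-scoring token
        tokens.append(next_id)            # the context grows by one
        if next_id == eos_id:
            break
    return tokens
```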
The transformer solved the training problem brilliantly. It created the inference problem we’re still solving.
OpenAI & the Generative Revolution (2018)
Google’s transformer was built for translation: encode the source sentence once, decode the target sentence. Two passes, fixed scope. The encoder processes the entire input in parallel, the decoder generates the output while attending to that encoded representation. Inference was still sequential in the decoder, but at least the encoding was done.
OpenAI’s GPT threw away the encoder entirely.2
Decoder-only architecture, trained to predict the next token given all previous tokens. Simpler to train, easier to scale. But the implications for inference: you’re always in the expensive sequential loop. Every token generation is decoding. Every token attends to every previous token, and that context never stops growing.
This is autoregressive generation. It’s the reason your LLM is the bottleneck. It’s also the reason LLMs can debug code, hold conversations, and reason through problems step by step. The same property that makes them slow makes them capable.
When you ask GPT a question, the model reads your prompt in a single forward pass. This part is still parallel, still fast. Then generation begins. First token: full forward pass through all layers. Second token: another full forward pass, now attending to prompt + first token. Third token: full forward pass, attending to everything so far. The model must see its own output to produce the next piece.
You can’t generate tokens in parallel because each depends on the one before. You can’t precompute the attention because the context grows with every step. But this is also why the model can maintain coherent reasoning across thousands of tokens, building arguments piece by piece. The sequential generation isn’t just a limitation. It’s the architecture thinking.
The transformer is mathematically stateless. Each forward pass is independent: feed it the same input, get the same output, always. No hidden state persists between function calls. Yet during generation, the model clearly maintains rich internal understanding about the conversation, the task, the patterns it’s following.
Where does this state live? In the growing sequence itself. Not in the architecture, but in the context the model must repeatedly process. An LSTM had explicit state that was fixed-size (cheap to maintain but limited in capacity). A transformer has no explicit state, but its implicit state (the full context) grows without bound.
The architecture is stateless. The computation is stateful.
You can’t compress the context without losing information. You can’t cache it cheaply because attention patterns change as context grows. You can’t parallelize generation because the state is the sequence itself.
This is why “generative” changed everything. The original transformer had expensive generation too, but at least encoding was a one-time cost and the scope was bounded: translate this sentence, then you’re done. GPT made generation the entire task, and that task has no predetermined end. Every response is hundreds or thousands of sequential forward passes through billions of parameters, each pass attending to a longer and longer history.
That gap between stateless architecture and stateful computation? That’s where all our performance problems live, and where all the capability comes from.
The Dual Highway Architecture
How Information Really Flows
If transformers are slow, we need to see where the time actually goes. Not all computation scales the same way.
There are two distinct flows of information, and they have completely different performance characteristics.3
The residual stream flows vertically. At each position, information passes up through layers: Layer 1 processes the token, adds its contribution, passes to Layer 2. Layer 2 adds more, passes to Layer 3. Up through 32, 60, or 96 layers. The final output is the sum of contributions from every layer. Nothing gets replaced, everything accumulates.
This is inherently serial. You can’t compute layer 32 until layer 31 is done. That’s the depth cost: linear in the number of layers.
The K/V stream flows horizontally. At each layer, each position needs information from all previous positions in the sequence. Not just adjacent tokens but all of them. The model at position 1000 can reach back to position 1 if needed.
This is the real cost. The attention mechanism that enables this horizontal flow scales O(n²) with sequence length. Double your context, quadruple the computation.
Picture a grid: layers stacked vertically, token positions running horizontally. Each cell is a computation node that needs information both vertically (from the previous layer at the same position) and horizontally (from all previous positions at the current layer).
You can’t skip the vertical flow: every layer potentially adds critical information. You can’t skip the horizontal flow: that’s how the model sees context. During generation, both highways are active for every single token produced. The vertical flow is expensive but manageable. The horizontal flow is where costs explode.
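As a rough sketch (not any particular model’s code), one layer’s contribution at one position looks like this: the residual stream `x` only ever gets added to, while the attention call is the part that has to read every earlier position. The `attention`, `mlp`, and `norm` arguments are placeholders for the layer’s learned sub-blocks.

```python
def decoder_layer(x, cached_keys, cached_values, attention, mlp, norm):
    """One simplified decoder layer at one position (placeholder sub-blocks)."""
    # Horizontal highway: read every previous position's cached K/V at this layer.
    # This is the part whose cost grows with sequence length.
    x = x + attention(norm(x), cached_keys, cached_values)
    # Position-wise work: same cost for every token, regardless of context length.
    x = x + mlp(norm(x))
    return x  # passed up the vertical highway to the next layer
```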
When you generate a 1000-token response, the final token attends to the prompt plus 999 previously generated positions, at every layer, through the entire vertical stack. Meanwhile, the residual stream just passes through 32 or 96 layers (same as always).
This is why sequence length is the constraint that matters most. Not model size, not parameter count. Length.
The Attention Mechanism Decoded
At each node in that grid, the residual stream arrives carrying accumulated information. Three transformations occur simultaneously: Query (Q), Key (K), and Value (V). Each is a learned linear projection of the current state.
Their semantic roles:3
- Key: “This is what I’m about, how future positions should find me relevant”
- Value: “This is what I’ll contribute when someone attends to me”
- Query: “What information from the past do I need right now?”
The Key and Value are properties of this position at this layer. Once computed, they’re done. They describe what this position is and what it offers.
The Query is different. It’s a question about what’s needed now, from the perspective of the current token being generated. It changes with every new token.
The Query compares against every Key in the cache (every previous position at this layer). Dot products measure relevance. Softmax turns scores into attention weights. Most positions get near-zero weight. A few dominate.
Those weights multiply the Values and sum. That’s the information retrieved from context: what the model “decided” mattered for this position.
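In code, that retrieval step for a single new position looks roughly like this numpy sketch (one attention head, with the usual 1/√d scaling):

```python
import numpy as np

def attend(q, cached_K, cached_V):
    """One query against one layer's cache (single head).
    q:        (d,)   query for the current position
    cached_K: (n, d) keys for all previous positions
    cached_V: (n, d) values for all previous positions
    """
    scores = cached_K @ q / np.sqrt(q.shape[0])   # dot-product relevance
    weights = np.exp(scores - scores.max())       # softmax: most weights land near zero,
    weights /= weights.sum()                      # a few positions dominate
    return weights @ cached_V                     # weighted sum of values = retrieved context
```

For a 1000-token context, that’s a 1000-row matrix-vector product, repeated at every layer, for every new token generated.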
Then the result goes through the MLP (two dense layers with a nonlinearity) and is added back into the residual stream. The MLP holds about 2/3 of the model’s parameters but scales linearly with sequence length. Not the bottleneck.
The bottleneck is that attention computation: Query against every Key, for every new token, at every layer. You can’t precompute it because the Query depends on the current generation state. You can’t skip positions because the model needs to dynamically decide what matters.
But notice the asymmetry: Keys and Values for past tokens are static. Once “The cat sat on the” is processed at layer 15, those K/V pairs never change. Only the Query from new tokens changes, asking “what from that history do I need?”
Cache what’s static. Recompute what’s dynamic. That asymmetry is the only reason LLM generation is feasible at all. Without it, you’d recompute everything from scratch for every token.
With it, you just have a cache that grows without bound, consuming memory at a rate proportional to sequence_length × num_layers × hidden_dimension. Still expensive. Still the bottleneck. But at least it’s tractable.
Where the Parameters Live
When people talk about a “70 billion parameter model,” where are those parameters actually sitting?
About two-thirds are in the MLPs: those feed-forward networks at each layer. Dense matrices that transform information after attention has retrieved it. Billions of weights that need to be loaded from memory, multiplied, and stored. These are the bulk of the model.
The remaining third is in the attention mechanism: the projection matrices that generate Q, K, and V from the residual stream.
This distribution is backwards from what you’d expect given the performance profile. The MLPs hold most of the parameters but scale linearly with sequence length. Process 1000 tokens, do 1000 MLP operations. Process 2000 tokens, do 2000 MLP operations. Expensive, yes (billions of parameters to multiply) but predictable.
Attention holds fewer parameters but scales quadratically. Process 1000 tokens, compute 1 million attention operations (1000²). Process 2000 tokens, compute 4 million operations (2000²).
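The per-layer arithmetic, assuming a standard block where the MLP expands to 4× the hidden dimension (the exact ratio varies by model; biases and normalization parameters are ignored):

```python
def params_per_layer(d_model, mlp_ratio=4):
    """Approximate parameter counts for one decoder layer."""
    attention = 4 * d_model * d_model         # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model   # up-projection and down-projection
    return attention, mlp

attn, mlp = params_per_layer(4096)
print(f"attention: {attn / 1e6:.0f}M, MLP: {mlp / 1e6:.0f}M, MLP share: {mlp / (attn + mlp):.0%}")
# attention: 67M, MLP: 134M, MLP share: 67%
```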
The majority of parameters aren’t creating the majority of the bottleneck.
This matters for optimization strategy. You can quantize the MLPs aggressively (reduce precision from 16-bit floats to 8-bit or even 4-bit representations) and often the model barely notices.4 Those parameters matter for what the model knows, but they’re more tolerant to compression.
The attention mechanism is more sensitive. Compress those K/V pairs too aggressively and the model loses its ability to recall and relate information across context. The horizontal highway degrades.
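What “reduce precision” means in practice, as a minimal sketch of symmetric 8-bit quantization. Real schemes (QLoRA’s 4-bit NormalFloat, per-channel scales, outlier handling) are more sophisticated, but the core trade is the same: fewer bytes per weight, a small reconstruction error.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: one float scale plus 1-byte integers."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights when a matmul needs them."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)                        # 4 bytes/weight -> 1 byte/weight
print(np.abs(w - dequantize(q, scale)).max())      # small, bounded reconstruction error
```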
During generation, you load all those billions of MLP parameters from GPU memory, do the math, get the result. That’s compute-bound: you’re doing real work. But modern GPUs have teraflops of compute capacity. They can multiply matrices fast.
The attention computation is different. For each new token, you need to load the entire KV cache from memory: all those cached Keys and Values from previous positions, across all layers. Compute the attention weights. Load the Values again, do the weighted sum. This is memory-bound.5 You’re spending more time moving data than computing.
Modern GPUs: fast at math, relatively slow at memory access. The bottleneck isn’t that attention is hard to compute. It’s that attention requires touching memory that grows with every generated token.
So you have a model where most parameters aren’t the problem, and the part that is the problem doesn’t even have that many parameters. It just requires increasingly expensive memory access patterns.
Can we make this faster? Only if we can either reduce memory movement, compress the KV cache without losing information, or accept that we don’t need the full quadratic attention. The parameters themselves aren’t really the constraint. It’s what we have to do with them, over and over, for every token generated.
The Generation Reality
Prefill and Decode: Two Different Worlds
When you send a prompt to an LLM, two completely different operations happen. Understanding why they’re different explains where the performance bottleneck actually lives.5
Prefill is reading your prompt. The model processes all input tokens in a single forward pass: all positions simultaneously, in parallel. You give it 500 tokens of context, it evaluates them all at once. Computes K/V pairs for each position, stores them in the cache, and we’re done. This phase is compute-bound: the GPU is doing real work, crunching through those billions of parameters.
Prefill is fast. Or at least, it’s as fast as it can be. You’re using the transformer the way it was designed: parallel processing of a complete sequence.
Decode is generation. One token at a time. Sequential. The model produces token 1, which gets added to context. Then it generates token 2, using token 1 as additional context. Then token 3, using tokens 1 and 2. Each new token requires running the entire model again: all layers, all parameters, full forward pass.
You can’t parallelize this. Token N+1 literally depends on token N existing. There’s no way around it. The architecture is fundamentally autoregressive.
Worse: decode is memory-bound.5 The GPU isn’t struggling with computation. It’s struggling with memory access. For each token generated, the model must load the growing KV cache from memory, compute attention against it, and store the result. Load, compute, store. Repeat. The cache gets bigger with every token. Memory bandwidth becomes the constraint.
During prefill, you’re exercising the full parallel capacity of your hardware. During decode, you’re bottlenecked by how fast you can shuttle data back and forth from memory, one token at a time.
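The generation loop from the earlier sketch, restructured into those two phases. Again, `model` is hypothetical: here it’s assumed to accept and return a KV cache, so decode only ever feeds in the single newest token.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Two-phase generation sketch (hypothetical `model` returning (logits, kv_cache))."""
    # Prefill: one parallel pass over the whole prompt, filling the cache. Compute-bound.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    tokens = list(prompt_ids)
    # Decode: one token per pass, reading and growing the cache each time. Memory-bound.
    for _ in range(max_new_tokens):
        last = logits[-1]
        next_id = last.index(max(last))    # greedy pick at the newest position
        tokens.append(next_id)
        if next_id == eos_id:
            break
        logits, kv_cache = model([next_id], kv_cache=kv_cache)
    return tokens
```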
In a typical request, prefill might take 200ms for a 500-token prompt. Then decode takes 2000ms to generate a 200-token response. You spend 90% of your time in decode, generating tokens sequentially, constrained by memory bandwidth.6
Can you make decode faster? Only if you can reduce memory movement, compress what you’re storing, or somehow avoid loading the full cache for every token. The sequential generation itself is unavoidable: that’s what makes the model work. But the memory access pattern during that sequential process? That’s where optimization lives.
The State That Shouldn’t Exist
Transformers are stateless. Each forward pass is independent: feed the same input, get the same output, always. No hidden state carries over between function calls. That’s core to the architecture.
Yet during generation, the model clearly maintains rich understanding about everything that’s happened in the conversation. It builds on previous context, maintains coherent reasoning, refers back to earlier points. Where is that state?
In the KV cache. It’s not part of the architecture. It’s a necessity of generation.
Compare this to RNNs, the models transformers replaced. An RNN had explicit state: a fixed-size hidden vector that got updated with each token. Process 10 tokens or 10,000 tokens, the state was always the same size. Constant memory cost, regardless of sequence length. Cheap to maintain, limited in capacity.
A transformer has no explicit state. But to generate token 500, it needs access to the Keys and Values from tokens 1 through 499, across all layers. That’s the horizontal information highway we saw earlier. It must be preserved. You can’t compress it into a fixed-size vector without losing the ability to attend to specific earlier positions.
The cache grows linearly with sequence length: 1000 tokens, 1000 cached positions. 2000 tokens, 2000 cached positions. Each position stores K and V vectors at each layer. In a typical 32-layer model with 4096 hidden dimensions, that’s 32 × 4096 × 2 = 262,144 floating-point numbers per token. Per request.7
At 2 bytes per number (16-bit precision), a 2000-token context needs about 1GB of GPU memory just for its KV cache. Run 10 concurrent requests, that’s 10GB. Twenty requests, 20GB. Your GPU has 40GB or 80GB total. The cache crowds out everything else.
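The same arithmetic, written out. This assumes full multi-head attention and 16-bit cache entries; models that use grouped-query attention cache proportionally less.

```python
def kv_cache_bytes(seq_len, n_layers=32, d_model=4096, bytes_per_value=2):
    """KV cache size for one request: a K vector and a V vector per position, per layer."""
    per_token = n_layers * d_model * 2 * bytes_per_value   # ~512 KB per token here
    return seq_len * per_token

one_request = kv_cache_bytes(2000)
print(f"{one_request / 2**30:.1f} GiB for one 2000-token request")       # ~1.0 GiB
print(f"{10 * one_request / 2**30:.0f} GiB for 10 concurrent requests")  # ~10 GiB
```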
This is why batch size matters so much for throughput. You want to process multiple requests simultaneously to maximize GPU utilization, but each request’s KV cache competes for the same memory. The longer the sequences, the fewer concurrent requests you can handle.
The stateless architecture, forced into stateful generation, consumes memory that grows without bound. This isn’t a bug. The model needs that information to generate coherent, contextual responses. But it’s also why your GPU memory fills up, why batch sizes stay small, why serving costs scale with conversation length.
Can you reduce this? Only if you’re willing to lose some of that horizontal information flow: forget earlier context, approximate the attention, or find clever ways to compress what you’re storing. The state that shouldn’t exist is the state you can’t eliminate.
The Memory Wall
Now we can see the complete picture of why LLMs are slow.
Modern GPUs deliver teraflops of compute: trillions of floating-point operations per second. They can multiply massive matrices almost as fast as you can feed them data. The hardware is absurdly powerful for computation.
But memory bandwidth hasn’t kept pace. Moving data between GPU memory and compute units is orders of magnitude slower than the actual math. This gap (between compute capacity and memory bandwidth) is the memory wall.8
During prefill, you’re compute-bound. The model is doing real work with all those parameters, processing the entire prompt in parallel. The GPU’s compute power gets used.
During decode, you’re memory-bound. Each token generation requires loading the entire KV cache from memory, computing attention weights, loading Values again, doing the weighted sum. The actual computation is trivial: dot products and additions. But loading gigabytes of cached state from memory, over and over, for every single token? That’s where the time goes.
The cache grows with every token generated. More positions to load, more memory to traverse. Attention is O(n²) in total work: every new token attends to every previous one, so doubling the sequence length quadruples both the computation and the cumulative memory traffic.
Then multiply by batch size. You want high throughput, so you serve multiple requests simultaneously. Each request has its own KV cache competing for the same memory bandwidth, the same GPU capacity. The longer the conversations, the larger the caches, the fewer requests you can handle at once.
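A back-of-the-envelope roofline makes the wall visible. The numbers below are illustrative, not measured: a 70B-parameter model in 16-bit precision, a 1 GB KV cache, and roughly 2 TB/s of memory bandwidth (A100-class), ignoring that a model this size would in practice be sharded across several GPUs.

```python
def decode_tokens_per_second(param_count, kv_cache_bytes, bandwidth_bytes_per_s,
                             bytes_per_param=2):
    """Upper bound on decode speed at batch size 1, if memory traffic were the only cost:
    every generated token must stream all weights plus the full KV cache from memory."""
    bytes_per_token = param_count * bytes_per_param + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_token

print(decode_tokens_per_second(70e9, 1e9, 2e12))   # ~14 tokens/second, at best
```

Batching amortizes the weight traffic across requests, which is why throughput-oriented serving leans so hard on batch size. The KV cache traffic, though, is per request, and it only grows.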
This is the fundamental constraint. Not parameter count, not model architecture, not even the quadratic attention complexity itself. Memory bandwidth during sequential token generation.
Every architectural decision we traced (decoder-only design, stateless computation requiring growing state, the dual highway of information flow) converges here. The transformer wasn’t built to be slow. It was built to be trainable. The performance characteristics we’re dealing with are byproducts of that design goal, now exposed under generation workloads the architecture was never optimized for.
Can we fix this?
That’s the question that drives everything that follows. The architecture isn’t changing: these models exist, they work, they’re deployed at scale. But understanding where the bottleneck lives reveals where intervention is possible. Some optimizations target memory movement.9 Some compress what’s being stored.4 Some accept trade-offs in capability for gains in speed. None of them are free.
The performance problem is architectural. The solutions have to work within those constraints.
What’s Next
We’ve seen why transformers are slow by design: the dual-highway architecture that makes them powerful also creates fundamental bottlenecks. The stateless architecture forced into stateful generation. The memory wall that dominates inference performance.
But understanding the problem is just the beginning.
Part 2 of this series will explore the optimization landscape: quantization techniques that compress billions of parameters into smaller representations, caching strategies that exploit computational structure, and the hidden costs these optimizations impose on model capabilities. We’ll see how FlashAttention revolutionized memory access patterns, how PagedAttention applies operating system principles to KV cache management, and why not all speed improvements are created equal.
Part 3 will examine system-level optimizations: batching strategies that balance latency against throughput, parallelism techniques that distribute computation across devices, and the emerging co-design of hardware and software. We’ll look at where current research is headed and what questions remain open.
The architecture creates the constraints. The optimizations work within them. Understanding both is what separates configuration from engineering.
References
1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
2. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
3. repligate. (2024). How Information Flows Through Transformers. Twitter/X. https://x.com/repligate/status/1967420298350805414
4. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314. https://arxiv.org/abs/2305.14314
5. NVIDIA. (2023, November 17). Mastering LLM Techniques: Inference Optimization. NVIDIA Technical Blog. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
6. Lienhart, P. (2024, May 12). LLM Inference Series: 3. KV caching explained. Medium. https://medium.com/@plienhar/llm-inference-series-3-kv-caching-unveiled-048152e461c8
7. Lages, J. (2025, April 6). Transformers KV Caching Explained. Medium. https://medium.com/@joaolages/kv-caching-explained-276520203249
8. Lienhart, P. (2024, March 8). LLM Inference Series: 5. Dissecting model performance. Medium. https://medium.com/@plienhar/llm-inference-series-5-dissecting-model-performance-6144aa93168f
9. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35, 16344-16359. https://arxiv.org/abs/2205.14135