Short Definition
Decoder-Only Transformers are Transformer architectures that generate output autoregressively using a single stack of masked self-attention layers. Each token is predicted only from the tokens that precede it.
This architecture is the foundation of modern large language models such as GPT-style systems.
Definition
A decoder-only Transformer models the probability of a sequence by factorizing it autoregressively:
[
P(x_1, x_2, …, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, …, x_{t-1})
]
The model predicts the next token given the previous context.
Unlike encoder–decoder architectures, decoder-only models use a single Transformer stack that processes both prompts and generated tokens.
Future tokens are masked during attention using causal masking, ensuring that predictions depend only on past tokens.
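As a small numeric check of this factorization (the probabilities below are made up for illustration), the joint probability of a sequence is simply the product of the per-step conditionals:

```python
import math

# Hypothetical conditional next-token probabilities for a 3-token sequence:
# P(x1), P(x2 | x1), P(x3 | x1, x2)
conditionals = [0.5, 0.4, 0.9]

# The chain rule: the joint probability is the product of the conditionals.
joint = math.prod(conditionals)
print(round(joint, 4))  # 0.18
```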
Core Idea
Decoder-only Transformers generate sequences step-by-step.
Conceptually:
Prompt → next token → next token → next token
The model repeatedly predicts the next token until the sequence is complete.
Minimal Conceptual Illustration
Example text generation:
Prompt: The cat sat
Prediction sequence:
The cat sat → on
The cat sat on → the
The cat sat on the → mat
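The prediction sequence above can be sketched as a greedy decoding loop. Here the "model" is just a hypothetical lookup table standing in for a trained network; the structure of the loop (predict, append, repeat) is what a real decoder-only model follows:

```python
# Toy stand-in for a language model: maps a context to its most likely
# next token (entries are hypothetical, for illustration only).
NEXT_TOKEN = {
    ("The", "cat", "sat"): "on",
    ("The", "cat", "sat", "on"): "the",
    ("The", "cat", "sat", "on", "the"): "mat",
}

def generate(prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: append one predicted token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_tok = NEXT_TOKEN.get(tuple(tokens))
        if next_tok is None:  # no known continuation: stop generating
            break
        tokens.append(next_tok)
    return tokens

print(generate(["The", "cat", "sat"]))
# ['The', 'cat', 'sat', 'on', 'the', 'mat']
```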
Layer Structure
Each decoder layer consists of two sublayers, each wrapped with a residual connection and layer normalization:
Masked Self-Attention
↓
Feedforward Network
Key property:
Tokens attend only to previous tokens
This preserves the autoregressive generation process.
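The layer components above can be sketched as a single pre-norm decoder block in NumPy. This is a minimal illustration, not a production implementation: single-head attention, a ReLU feedforward network, and randomly initialized weights (all assumptions for the sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoder_block(x, Wq, Wk, Wv, W1, W2):
    """One pre-norm decoder block: masked self-attention, then a
    feedforward network, each with a residual connection."""
    n, d = x.shape
    # --- masked self-attention sublayer ---
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    scores[future] = -np.inf                            # causal masking
    x = x + softmax(scores) @ v                         # residual add
    # --- feedforward sublayer (ReLU) ---
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2                  # residual add
    return x

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
out = decoder_block(x, *(rng.normal(size=(d, d)) for _ in range(5)))
print(out.shape)  # (4, 8)
```

The causal mask is what distinguishes this block from an encoder layer: without it, each position could attend to every other position, breaking the autoregressive factorization.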
Attention Masking
Decoder-only models rely on causal masking to prevent access to future tokens.
Example attention matrix:
Token1 → Token1
Token2 → Token1, Token2
Token3 → Token1, Token2, Token3
Token4 → Token1, Token2, Token3, Token4
Future tokens remain hidden.
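The visibility pattern in the table above is just a lower-triangular boolean matrix, which can be built and printed directly:

```python
# mask[i][j] is True when token i may attend to token j (only j <= i).
n = 4
mask = [[j <= i for j in range(n)] for i in range(n)]

for i, row in enumerate(mask):
    visible = [f"Token{j + 1}" for j, ok in enumerate(row) if ok]
    print(f"Token{i + 1} -> {', '.join(visible)}")
# Token1 -> Token1
# Token2 -> Token1, Token2
# Token3 -> Token1, Token2, Token3
# Token4 -> Token1, Token2, Token3, Token4
```

In practice the same pattern is applied inside attention by setting the scores of masked (future) positions to negative infinity before the softmax.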
Training Objective
Decoder-only models are typically trained using next-token prediction.
Loss function:
[
\mathcal{L}(\theta) =
-\sum_{t=1}^{n} \log P(x_t \mid x_1,…,x_{t-1})
]
This objective teaches the model to generate coherent sequences.
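The loss above can be computed by hand for a short sequence. The probabilities here are hypothetical values a model might assign to each correct next token:

```python
import math

# Hypothetical probabilities the model assigns to each correct next token.
p_correct = [0.5, 0.4, 0.9]

# Negative log-likelihood: sum over positions of -log P(x_t | x_1..x_{t-1}).
loss = -sum(math.log(p) for p in p_correct)
print(round(loss, 4))  # 1.7148
```

Minimizing this loss pushes each conditional probability of the correct token toward 1, which is exactly next-token prediction.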
Examples of Decoder-Only Models
Many modern AI systems use this architecture.
Examples include:
- GPT models
- LLaMA
- Claude-style architectures
- PaLM (decoder-style language modeling)
These models power most modern generative AI systems.
Advantages
Decoder-only architectures offer several benefits:
- simple architecture
- scalable training
- flexible prompt-based interaction
- strong generative capabilities
They can perform many tasks through prompt conditioning.
Limitations
Decoder-only models also have limitations.
Sequential Generation
Tokens must be generated one at a time, which limits inference speed.
Quadratic Attention Cost
Self-attention cost grows quadratically with the sequence length n:
[
O(n^2)
]
This can become expensive for long contexts.
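To make the quadratic growth concrete, this sketch counts the attention-score entries a single layer computes at a few context lengths; doubling the context quadruples the count:

```python
# n tokens each attending to n tokens gives n * n attention scores per layer.
costs = {n: n * n for n in (1_024, 2_048, 4_096)}

for n, c in costs.items():
    print(f"n = {n:>5}: {c:>10,} attention scores")

# Doubling the context length quadruples the attention cost.
assert costs[2_048] == 4 * costs[1_024]
```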
Role in Modern AI
Decoder-only Transformers dominate modern language modeling because they:
- support general-purpose generation
- adapt easily through prompts
- scale effectively with data and compute
They are the foundation of most large language models.
Summary
Decoder-only Transformers are autoregressive architectures that generate sequences using masked self-attention. By predicting each token based on previous context, they provide a flexible and scalable framework for language modeling and generative AI systems.
Related Concepts
- Transformer Architecture
- Autoregressive Models
- Causal Masking
- Self-Attention
- Encoder-Only vs Decoder-Only Transformers
- Prompt Conditioning