Short Definition
Decoder-Only and Encoder–Decoder Transformers represent two major design strategies for sequence modeling. Decoder-Only models generate outputs autoregressively from a single stack, while Encoder–Decoder models separate input understanding (encoder) from output generation (decoder), leading to different trade-offs in flexibility, efficiency, and performance across tasks.
Definition
Transformer architectures can be broadly categorized into two widely used families:
- Decoder-Only architectures
- Encoder–Decoder architectures
Both rely on attention mechanisms but differ in how information flows between input and output sequences.
Decoder-Only Models
A decoder-only architecture models the probability of a sequence autoregressively:
\[
P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})
\]
The model uses causal masking so that tokens only attend to previous tokens.
These models typically process prompts and generated tokens in a single shared representation space.
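The causal mask and its effect on attention weights can be sketched with NumPy (a minimal illustration; `causal_mask` and `masked_attention_weights` are toy helpers, not taken from any framework):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position t may attend only to positions <= t."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores, mask):
    """Set disallowed (future) positions to -inf before the softmax."""
    scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))          # uniform raw scores, for illustration only
w = masked_attention_weights(scores, causal_mask(n))
# Row t assigns zero weight to future tokens and uniform weight to tokens 0..t,
# e.g. row 1 is [0.5, 0.5, 0.0, 0.0].
print(np.round(w, 2))
```

With uniform scores, each row spreads its weight evenly over the visible prefix, which is exactly the behavior causal masking enforces during training and generation.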
Encoder–Decoder Models
Encoder–Decoder models separate the process into two stages:
- Encoder processes the input sequence
- Decoder generates output conditioned on encoder representations
The decoder predicts tokens as:
\[
P(y_t \mid y_1, \dots, y_{t-1}, x)
\]
where:
- \(x\) = the encoded input sequence
- \(y\) = the generated output sequence
This architecture explicitly separates input comprehension and output generation.
Core Architectural Difference
Decoder-Only Architecture
Prompt tokens → Transformer stack → next token prediction
All tokens share the same stack.
x1 → x2 → x3 → x4
Causal masking ensures sequential prediction.
Encoder–Decoder Architecture
Input → Encoder → encoded representation
↓
Decoder → generated output
The decoder uses cross-attention to access encoder outputs.
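A minimal NumPy sketch of cross-attention, assuming toy shapes and random projection matrices: queries come from the decoder states, while keys and values come from the encoder output:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Decoder queries attend over encoder keys/values. No causal mask is
    needed here: the whole input was encoded before decoding started."""
    Q = decoder_states @ Wq                              # (t_dec, d)
    K = encoder_states @ Wk                              # (t_enc, d)
    V = encoder_states @ Wv                              # (t_enc, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (t_dec, t_enc)
    return weights @ V           # each decoder position mixes encoder features

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(5, d))    # 5 encoded input tokens
dec = rng.normal(size=(3, d))    # 3 decoder positions
out = cross_attention(dec, enc, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)                 # one mixed vector per decoder position
```

The key structural point is the weight matrix shape `(t_dec, t_enc)`: every decoder position reads from every encoder position, which is the mechanism a decoder-only model lacks.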
Minimal Conceptual Illustration
Decoder-Only Generation
Prompt: Translate English to French:
“The cat sat”
Model predicts sequentially:
→ “Le”
→ “chat”
→ “s’est”
→ “assis”
The prompt and output are processed together in one sequence.
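This single-sequence loop can be sketched as follows (`demo_model` is a stand-in for a real decoder-only model, used only to show how the prompt and the output share one growing sequence):

```python
def generate(model, prompt_tokens, n_new):
    """Decoder-only generation: the model re-reads prompt + output as one sequence."""
    seq = list(prompt_tokens)
    for _ in range(n_new):
        nxt = model(seq)          # predicts the next token from the full sequence
        seq.append(nxt)           # generated token joins the same sequence
    return seq

# Stand-in "model": echoes the last token uppercased (illustrative only).
demo_model = lambda seq: seq[-1].upper()
print(generate(demo_model, ["the", "cat"], 2))   # ['the', 'cat', 'CAT', 'CAT']
```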
Encoder–Decoder Generation
The input is encoded once; the decoder then generates each output token autoregressively while cross-attending to the encoded input representation.
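By contrast, an encoder–decoder loop encodes the input a single time and reuses that representation at every decoding step (again with stand-in components, not real networks):

```python
def seq2seq_generate(encode, decode_step, input_tokens, n_new):
    """Encoder-decoder generation: the input is encoded once, then the decoder
    extends only the output sequence, conditioning on `memory` at each step."""
    memory = encode(input_tokens)          # run the encoder exactly once
    output = []
    for _ in range(n_new):
        output.append(decode_step(memory, output))
    return output

# Stand-in components (illustrative only): "encoding" uppercases tokens,
# and each decode step reads one position of the encoder memory.
encode = lambda toks: [t.upper() for t in toks]
decode_step = lambda memory, out: memory[len(out) % len(memory)]
print(seq2seq_generate(encode, decode_step, ["the", "cat"], 3))  # ['THE', 'CAT', 'THE']
```

Note the structural contrast with the decoder-only loop: the input tokens never re-enter the decoder's sequence; they are visible only through the fixed encoder memory.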
Architectural Components
| Component | Decoder-Only | Encoder–Decoder |
|---|---|---|
| Self-attention | Yes | Yes |
| Cross-attention | No | Yes |
| Causal masking | Yes | Decoder only |
| Input encoding stage | No | Yes |
Cross-attention allows the decoder to directly reference encoded input features.
Performance Trade-offs
Decoder-Only Advantages
- simpler architecture
- scalable training
- unified interface for many tasks
- strong performance in generative AI
These models dominate large language model deployments.
Examples include:
- GPT models
- LLaMA
- Claude-style architectures
Encoder–Decoder Advantages
- explicit separation of input and output
- efficient conditional generation
- strong performance on structured sequence tasks
These architectures perform well on:
- translation
- summarization
- structured generation
Examples include:
- T5
- BART
- the original Transformer model
Computational Trade-offs
| Property | Decoder-Only | Encoder–Decoder |
|---|---|---|
| Architecture complexity | lower | higher |
| Memory usage | lower | higher |
| Conditional modeling | indirect | direct |
| Prompt flexibility | very high | moderate |
Decoder-only models often scale better for large general-purpose systems.
Modern Design Trends
Recent large-scale language models favor decoder-only architectures because they:
- simplify training pipelines
- support prompt-based interfaces
- scale efficiently with data and compute
However, encoder–decoder architectures remain strong in tasks where input–output structure matters.
Summary
Decoder-only and encoder–decoder Transformers represent two major approaches to sequence modeling. Decoder-only architectures generate outputs autoregressively using a single stack, while encoder–decoder architectures separate input processing and output generation through cross-attention. The choice between them depends on the trade-off between architectural simplicity, scalability, and task-specific performance.
Related Concepts
- Transformer Architecture
- Encoder-Only vs Decoder-Only Transformers
- Cross-Attention
- Self-Attention
- Autoregressive Models
- Prompt Conditioning
- Sequence-to-Sequence Models