Short Definition
Encoder-Only and Decoder-Only Transformers are two architectural variants of the Transformer model that differ in how they process sequences and generate outputs.
Encoder-Only models produce contextual representations of input sequences, while Decoder-Only models generate sequences autoregressively.
Definition
Transformer architectures can be divided into three structural types:
- Encoder-Only
- Decoder-Only
- Encoder-Decoder
This lexicon entry compares the first two.
The distinction arises from how attention masking and information flow are implemented.
Encoder-Only Transformers
Encoder-Only models process an entire input sequence simultaneously and produce contextual embeddings for each token.
They model:
\[
(h_1, …, h_n) = f_\theta(x_1, x_2, …, x_n)
\]
where the representation of each token depends on all other tokens in the sequence.
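The bidirectional dependence can be sketched with a toy self-attention layer. This is a minimal illustration, not a real encoder: random vectors stand in for learned embeddings, and a single unparameterized attention step stands in for a full Transformer block.

```python
import numpy as np

# Toy bidirectional (encoder-style) self-attention.
# Random vectors stand in for learned token embeddings.
rng = np.random.default_rng(0)
n, d = 4, 8                       # 4 tokens, embedding dimension 8
x = rng.normal(size=(n, d))       # embeddings x_1 .. x_n

scores = x @ x.T / np.sqrt(d)     # every token scored against every token
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over ALL positions

h = weights @ x                   # h_i mixes information from every x_j
print(h.shape)                    # (4, 8)
```

Because no mask is applied, every row of `weights` is strictly positive: each output `h_i` draws on the entire sequence, which is exactly the bidirectional context described above.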
Decoder-Only Transformers
Decoder-Only models generate tokens sequentially using autoregressive prediction.
They model the joint probability of a sequence as:
\[
P(x_1, …, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, …, x_{t-1})
\]
Future tokens are masked so that the model only attends to previous tokens.
Core Architectural Difference
Encoder-Only Attention
All tokens attend to all other tokens.
Token1 ↔ Token2 ↔ Token3 ↔ Token4
This allows full bidirectional context.
Decoder-Only Attention
Tokens can only attend to previous tokens.
Token1 → Token2 → Token3 → Token4
This preserves causal sequence generation.
Minimal Conceptual Illustration
Encoder-Only Processing
Input: The cat sat on the mat
Process entire sequence at once
Output contextual embeddings
Used for understanding tasks.
Decoder-Only Generation
Prompt: The cat
Predict: sat
Predict: on
Predict: the
Predict: mat
Used for generation tasks.
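The generation loop above can be sketched as follows. The lookup table `next_token` is a hypothetical stub standing in for a trained decoder-only model; a real model would attend over the entire prompt-plus-generated context at every step.

```python
# Greedy autoregressive decoding sketch.
# The table is a toy stub for a trained decoder-only model.
table = {"cat": "sat", "sat": "on", "on": "the", "the": "mat", "mat": "<eos>"}

def next_token(context):
    # A real model attends over the whole context; this stub
    # only inspects the last token.
    return table[context[-1]]

def generate(prompt, max_new=10):
    tokens = prompt.split()
    for _ in range(max_new):
        tok = next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)        # each prediction is fed back as context
    return " ".join(tokens)

print(generate("The cat"))        # The cat sat on the mat
```

The key structural point is the feedback loop: each predicted token is appended to the context before the next prediction, which is what "autoregressive" means in practice.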
Typical Model Examples
Encoder-Only Transformers
Examples include:
- BERT
- RoBERTa
- DeBERTa
Typical applications:
- classification
- question answering
- information retrieval
- sentence embeddings
Decoder-Only Transformers
Examples include:
- GPT models
- LLaMA
- Claude-style architectures
Typical applications:
- text generation
- chat systems
- code generation
- reasoning tasks
Attention Masking
The difference between the two architectures is implemented through attention masking.
Encoder-Only
No causal masking is applied (padding masks for variable-length batches aside).
Every token can attend to every other token.
Decoder-Only
Causal masking is applied:
\[
\text{Mask}(i,j) =
\begin{cases}
0 & j \le i \\
-\infty & j > i
\end{cases}
\]
This prevents tokens from seeing the future.
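The piecewise definition translates directly into a lower-triangular mask added to the attention scores before the softmax. The sketch below uses uniform scores purely for illustration.

```python
import numpy as np

# Causal mask: 0 where j <= i, -inf where j > i.
n = 4
mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)

scores = np.zeros((n, n))            # uniform scores, for illustration only
w = np.exp(scores + mask)
w /= w.sum(axis=1, keepdims=True)    # row-wise softmax

print(w)
# Row i spreads weight uniformly over positions 0..i;
# exp(-inf) = 0, so future positions receive exactly zero weight.
```

Adding −∞ before the softmax (rather than zeroing weights afterwards) keeps each row a valid probability distribution over the visible positions.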
Computational Characteristics
| Property | Encoder-Only | Decoder-Only |
|---|---|---|
| Context | Bidirectional | Causal |
| Generation | No | Yes |
| Typical Use | Understanding | Generation |
| Attention Mask | None | Causal mask |
Architectural Trade-offs
Encoder-Only Advantages
- Strong contextual understanding
- Efficient for classification tasks
- Parallel computation over tokens
Decoder-Only Advantages
- Natural sequence generation
- Flexible prompt-based interaction
- Unified modeling of language tasks

Modern Trends
Recent AI systems heavily favor decoder-only architectures because they support unified generative interfaces.
However, encoder-only models remain important for:
- embeddings
- retrieval systems
- ranking models
- semantic search
Many modern pipelines combine both types.
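A combined pipeline can be sketched as retrieve-then-generate. Both components here are hypothetical stubs: `embed` (bag-of-words counts) stands in for an encoder-only embedding model, and `generate` stands in for a decoder-only model conditioned on the retrieved context.

```python
import math
from collections import Counter

docs = ["cats sit on mats", "dogs chase balls"]

def embed(text):
    # Stub for an encoder-only embedding model (e.g. a BERT-style encoder).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def retrieve(query):
    # Encoder side: rank documents by embedding similarity to the query.
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

def generate(prompt):
    # Stub for a decoder-only model conditioned on the prompt.
    return f"Answer based on: {prompt}"

query = "where do cats sit"
context = retrieve(query)
print(generate(f"{context}\n{query}"))
```

This division of labor mirrors the table above: the encoder side handles understanding (ranking), while the decoder side handles generation.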
Summary
Encoder-Only and Decoder-Only Transformers represent two major architectural variants of the Transformer model.
Encoder-Only models are optimized for contextual representation and understanding tasks, while Decoder-Only models generate sequences autoregressively and dominate modern generative AI systems.
Related Concepts
- Transformer Architecture
- Self-Attention
- Causal Masking
- Autoregressive Models
- Encoder–Decoder Models
- Prompt Conditioning