
Short Definition
Encoder–Decoder Models are neural architectures designed for sequence-to-sequence tasks in which an encoder processes an input sequence into a latent representation, and a decoder generates an output sequence based on that representation.
They are widely used in machine translation, summarization, and other structured generation tasks.
Definition
Sequence-to-sequence problems require mapping an input sequence to an output sequence.
Examples include:
- translating a sentence from one language to another
- summarizing a document
- converting speech to text
Encoder–Decoder models solve this by splitting the architecture into two parts.
Encoder
The encoder reads the input sequence
x_1, x_2, …, x_n
and transforms it into a representation:
h = Encoder(x_1, …, x_n)
This representation captures the meaning of the input.
Decoder
The decoder generates the output sequence step by step:
y_1, y_2, …, y_m
Each output token is generated from:
- the encoded representation
- the previously generated tokens
y_t = Decoder(h, y_{<t})
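This two-step contract can be sketched in plain Python. The encoder and decoder below are hypothetical toy stand-ins, not learned networks; the point is the interface: encode the input once, then generate the output token by token.

```python
# Toy sketch of the encoder-decoder contract (hypothetical stand-in
# functions, not a trained model).

def encoder(tokens):
    # Toy representation h: here simply the reversed input sequence.
    return list(reversed(tokens))

def decoder(h, generated):
    # y_t = Decoder(h, y_{<t}): emit the next element of h,
    # or None once the sequence is exhausted (end of sequence).
    t = len(generated)
    return h[t] if t < len(h) else None

def seq2seq(tokens, max_len=10):
    h = encoder(tokens)            # encode the whole input once
    output = []
    for _ in range(max_len):
        y_t = decoder(h, output)   # condition on h and previous tokens
        if y_t is None:
            break
        output.append(y_t)
    return output

print(seq2seq(["a", "b", "c"]))    # toy "reverse the sequence" task
# → ['c', 'b', 'a']
```

A real model replaces both toy functions with neural networks and trains them end to end, but the encode-once, decode-stepwise loop is the same.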
Minimal Conceptual Illustration
Input sequence
↓
Encoder
↓
Latent representation
↓
Decoder
↓
Output sequence
Classic sequence-to-sequence models implemented both parts with recurrent neural networks (RNNs):
Input tokens → RNN Encoder → hidden state
hidden state → RNN Decoder → output tokens
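A minimal scalar illustration of the recurrent encoder, with made-up fixed weights rather than learned ones:

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.9):
    # One recurrent update: the new hidden state mixes the current input
    # with the previous hidden state (toy scalar weights, not learned).
    return math.tanh(w_x * x + w_h * h)

def rnn_encode(xs):
    # The encoder compresses the entire sequence into one final hidden
    # state -- the fixed-size representation handed to the decoder.
    h = 0.0
    for x in xs:
        h = rnn_step(x, h)
    return h
```

Note that the representation has the same size no matter how long the input is: a ten-token and a thousand-token sequence are both squeezed into a single state.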
However, these models struggled with long sequences due to information bottlenecks.
Attention in Encoder–Decoder Models
Attention mechanisms improved the architecture by allowing the decoder to access all encoder states.
Instead of relying on a single vector h, the decoder attends over all encoder outputs:
h_1, h_2, …, h_n
This allows the model to focus on relevant parts of the input during generation.
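A minimal sketch of dot-product attention, with plain Python lists standing in for vectors (real models use matrix operations and learned projections):

```python
import math

def attend(query, encoder_states):
    # Score each encoder state against the decoder's query vector.
    scores = [sum(q * k for q, k in zip(query, h)) for h in encoder_states]
    # Softmax turns scores into weights that sum to 1
    # (stabilized by subtracting the max score).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of all encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# A query aligned with the second encoder state puts most weight there.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w, c = attend([0.0, 5.0], states)
```

The decoder recomputes these weights at every generation step, so different output tokens can focus on different parts of the input.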
Transformer Encoder–Decoder
Modern encoder–decoder models are based on the Transformer architecture.
The encoder uses self-attention to build contextual representations of the input.
The decoder uses:
- masked self-attention
- cross-attention with encoder outputs
Structure:
Encoder: self-attention → feedforward, repeated across stacked layers
Decoder: masked self-attention → cross-attention to encoder outputs → feedforward, also stacked
Cross-attention connects encoder and decoder.
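The effect of the masking step can be shown with a simple causal mask (1 = attention allowed):

```python
def causal_mask(n):
    # Masked self-attention: position i may attend only to positions
    # j <= i, so generation cannot peek at future tokens.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# → [1, 0, 0, 0]
#   [1, 1, 0, 0]
#   [1, 1, 1, 0]
#   [1, 1, 1, 1]
```

Cross-attention, by contrast, is unmasked: every decoder position may attend to every encoder output.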
Examples of Encoder–Decoder Models
Well-known models include:
- T5
- BART
- mT5
- MarianMT
These architectures perform both understanding and generation.
Comparison with Decoder-Only Models
| Model Type | Structure | Typical Use |
|---|---|---|
| Encoder–Decoder | input encoder + output decoder | translation, summarization |
| Decoder-Only | autoregressive transformer | language modeling |
| Encoder-Only | input encoder only | classification, retrieval |
Encoder–decoder models explicitly model the transformation between sequences.
Advantages
Encoder–decoder architectures provide:
- flexible sequence transformation
- strong contextual encoding
- effective translation between modalities
- structured generative capabilities
They are particularly well suited to tasks where the input and output sequences differ in length, structure, or modality.
Limitations
Potential drawbacks include:
- higher computational cost
- more complex architecture
- less efficient scaling compared to decoder-only models in some tasks
Large-scale language models often prefer decoder-only designs.
Role in Modern AI
Encoder–decoder models remain essential in tasks such as:
- machine translation
- document summarization
- code transformation
- speech recognition
- multimodal processing
They are particularly effective when the input structure strongly influences the output.
Summary
Encoder–Decoder models process an input sequence using an encoder and generate an output sequence using a decoder.
By separating representation learning from generation, they provide a flexible architecture for complex sequence transformation tasks.
Modern implementations use Transformer-based attention mechanisms to achieve high performance across many domains.
Related Concepts
- Sequence-to-Sequence Models
- Transformer Architecture
- Cross-Attention
- Self-Attention
- Decoder-Only Transformers
- Recurrent Neural Networks (RNN)
- Attention Mechanism