
Short Definition
Encoder–Decoder Models are neural architectures designed for sequence-to-sequence tasks in which an encoder processes an input sequence into a latent representation, and a decoder generates an output sequence based on that representation.
They are widely used in machine translation, summarization, and other structured generation tasks.
Definition
Sequence-to-sequence problems require mapping an input sequence to an output sequence.
Examples include:
- translating a sentence from one language to another
- summarizing a document
- converting speech to text
Encoder–Decoder models solve this by splitting the architecture into two parts.
Encoder
The encoder reads the input sequence
x_1, x_2, …, x_n
and transforms it into a representation:
h = Encoder(x_1, …, x_n)
This representation captures the meaning of the input.
Decoder
The decoder generates the output sequence step by step:
y_1, y_2, …, y_m
Each output token is generated from:
- the encoded representation
- the previously generated tokens
y_t = Decoder(h, y_{<t})
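This two-step contract can be sketched in plain Python. The encoder and decoder below are hypothetical toy stand-ins, not learned networks; the point is the interface: encode the input once, then generate the output token by token.

```python
# Toy sketch of the encoder-decoder contract (hypothetical stand-in
# functions, not a trained model).

def encoder(tokens):
    # Toy representation h: here simply the reversed input sequence.
    return list(reversed(tokens))

def decoder(h, generated):
    # y_t = Decoder(h, y_{<t}): emit the next element of h,
    # or None once the sequence is exhausted (end of sequence).
    t = len(generated)
    return h[t] if t < len(h) else None

def seq2seq(tokens, max_len=10):
    h = encoder(tokens)            # encode the whole input once
    output = []
    for _ in range(max_len):
        y_t = decoder(h, output)   # condition on h and previous tokens
        if y_t is None:
            break
        output.append(y_t)
    return output

print(seq2seq(["a", "b", "c"]))    # toy "reverse the sequence" task
# → ['c', 'b', 'a']
```

A real model replaces both toy functions with neural networks and trains them end to end, but the encode-once, decode-stepwise loop is the same.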
Minimal Conceptual Illustration
Input sequence
↓
Encoder
↓
Latent representation
↓
Decoder
↓
Output sequence
Classic sequence-to-sequence models implemented both parts with recurrent neural networks (RNNs):
Input tokens → RNN Encoder → hidden state
hidden state → RNN Decoder → output tokens
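A minimal scalar illustration of the recurrent encoder, with made-up fixed weights rather than learned ones:

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.9):
    # One recurrent update: the new hidden state mixes the current input
    # with the previous hidden state (toy scalar weights, not learned).
    return math.tanh(w_x * x + w_h * h)

def rnn_encode(xs):
    # The encoder compresses the entire sequence into one final hidden
    # state -- the fixed-size representation handed to the decoder.
    h = 0.0
    for x in xs:
        h = rnn_step(x, h)
    return h
```

Note that the representation has the same size no matter how long the input is: a ten-token and a thousand-token sequence are both squeezed into a single state.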
However, these models struggled with long sequences due to information bottlenecks.
Attention in Encoder–Decoder Models
Attention mechanisms improved the architecture by allowing the decoder to access all encoder states.
Instead of relying on a single vector h, the decoder attends over all encoder outputs:
h_1, h_2, …, h_n
This allows the model to focus on relevant parts of the input during generation.
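A minimal sketch of dot-product attention, with plain Python lists standing in for vectors (real models use matrix operations and learned projections):

```python
import math

def attend(query, encoder_states):
    # Score each encoder state against the decoder's query vector.
    scores = [sum(q * k for q, k in zip(query, h)) for h in encoder_states]
    # Softmax turns scores into weights that sum to 1
    # (stabilized by subtracting the max score).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of all encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# A query aligned with the second encoder state puts most weight there.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w, c = attend([0.0, 5.0], states)
```

The decoder recomputes these weights at every generation step, so different output tokens can focus on different parts of the input.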
Transformer Encoder–Decoder
Modern encoder–decoder models are based on the Transformer architecture.
The encoder uses self-attention to build contextual representations of the input.
The decoder uses:
- masked self-attention
- cross-attention with encoder outputs
Structure:
Encoder: self-attention → feedforward, repeated across stacked layers
Decoder: masked self-attention → cross-attention to encoder outputs → feedforward, also stacked
Cross-attention connects encoder and decoder.
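The effect of the masking step can be shown with a simple causal mask (1 = attention allowed):

```python
def causal_mask(n):
    # Masked self-attention: position i may attend only to positions
    # j <= i, so generation cannot peek at future tokens.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# → [1, 0, 0, 0]
#   [1, 1, 0, 0]
#   [1, 1, 1, 0]
#   [1, 1, 1, 1]
```

Cross-attention, by contrast, is unmasked: every decoder position may attend to every encoder output.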
Examples of Encoder–Decoder Models
Well-known models include:
- T5
- BART
- mT5
- MarianMT
These architectures perform both understanding and generation.
Comparison with Decoder-Only Models
| Model Type | Structure | Typical Use |
|---|---|---|
| Encoder–Decoder | input encoder + output decoder | translation, summarization |
| Decoder-Only | autoregressive transformer | language modeling |
| Encoder-Only | input encoder only | classification, retrieval |
Encoder–decoder models explicitly model the transformation between sequences.
Advantages
Encoder–decoder architectures provide:
- flexible sequence transformation
- strong contextual encoding
- effective translation between modalities
- structured generative capabilities
They are particularly well suited to tasks where the input and output sequences differ in length, structure, or modality.
Limitations
Potential drawbacks include:
- higher computational cost
- more complex architecture
- less efficient scaling compared to decoder-only models in some tasks
Large-scale language models often prefer decoder-only designs.
Role in Modern AI
Encoder–decoder models remain essential in tasks such as:
- machine translation
- document summarization
- code transformation
- speech recognition
- multimodal processing
They are particularly effective when the input structure strongly influences the output.
Summary
Encoder–Decoder models process an input sequence using an encoder and generate an output sequence using a decoder.
By separating representation learning from generation, they provide a flexible architecture for complex sequence transformation tasks.
Modern implementations use Transformer-based attention mechanisms to achieve high performance across many domains.
Related Concepts
- Sequence-to-Sequence Models
- Transformer Architecture
- Cross-Attention
- Self-Attention
- Decoder-Only Transformers
- Recurrent Neural Networks (RNN)
- Attention Mechanism