Encoder–Decoder Models

Encoder–decoder models infographic – Neural Networks Lexicon

Short Definition

Encoder–Decoder Models are neural architectures designed for sequence-to-sequence tasks in which an encoder processes an input sequence into a latent representation, and a decoder generates an output sequence based on that representation.

They are widely used in machine translation, summarization, and other structured generation tasks.

Definition

Sequence-to-sequence problems require mapping an input sequence to an output sequence.

Examples include:

  • translating a sentence from one language to another
  • summarizing a document
  • converting speech to text

Encoder–Decoder models solve this by splitting the architecture into two parts.

Encoder

The encoder reads the input sequence:

x_1, x_2, …, x_n

and transforms it into a representation:

h = Encoder(x_1, …, x_n)

This representation captures the meaning of the input.
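As a toy illustration (not any particular architecture), the encoder can be sketched as a function that maps a token sequence to a single fixed-size vector. Here it simply mean-pools randomly initialized embeddings; the vocabulary and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding table.
vocab = {"the": 0, "cat": 1, "sat": 2}
embeddings = rng.normal(size=(len(vocab), 4))  # 4-dimensional embeddings

def encode(tokens):
    """Map an input sequence x_1..x_n to a single vector h.

    This toy encoder mean-pools the token embeddings; real encoders
    use RNNs or Transformer layers instead.
    """
    vectors = np.stack([embeddings[vocab[t]] for t in tokens])
    return vectors.mean(axis=0)

h = encode(["the", "cat", "sat"])
print(h.shape)  # a fixed-size representation regardless of input length
```

Whatever the internal architecture, the defining property is the same: any input length n is reduced to a representation the decoder can condition on.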

Decoder

The decoder generates the output sequence step by step:

y_1, y_2, …, y_m

Each output token is generated based on:

  • the encoded representation
  • previously generated tokens

y_t = Decoder(h, y_{<t})
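The step-by-step generation rule above can be sketched as a greedy decoding loop. `decoder_step` is a hypothetical stand-in for a trained decoder; here it returns a canned translation purely to show the control flow:

```python
def decoder_step(h, prev_tokens):
    """Hypothetical decoder step: returns the next token given the
    encoded input h and the previously generated tokens y_{<t}.
    It echoes a canned output here, standing in for a trained model."""
    canned = ["le", "chat", "<eos>"]
    return canned[len(prev_tokens)]

def generate(h, max_len=10):
    """Greedy autoregressive generation: y_t = Decoder(h, y_{<t})."""
    output = []
    for _ in range(max_len):
        y_t = decoder_step(h, output)
        if y_t == "<eos>":   # stop once the end-of-sequence token appears
            break
        output.append(y_t)
    return output

print(generate(h="<encoded input>"))  # ['le', 'chat']
```

Note that the output length m is not fixed in advance: generation continues until the decoder emits an end-of-sequence token or a length limit is reached.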

Minimal Conceptual Illustration

Input sequence → Encoder → Latent representation → Decoder → Output sequence

Early RNN-Based Models

Early encoder–decoder models used recurrent networks:

Input tokens → RNN Encoder → hidden state
hidden state → RNN Decoder → output tokens

However, these models struggled with long sequences: the entire input had to be compressed into a single fixed-size hidden state, creating an information bottleneck.
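The RNN pipeline above can be sketched with a minimal, untrained Elman-style recurrence. All weights are random and the names are illustrative; the point is that the encoder's final hidden state is the only channel into the decoder:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # hidden/embedding size

# Random, untrained parameters for an Elman-style RNN cell.
W_in, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def rnn_encoder(inputs):
    """Compress the whole input into one final hidden state:
    this single vector is the information bottleneck."""
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(W_in @ x + W_h @ h)
    return h

def rnn_decoder(h, steps):
    """Unroll from the encoder's final state, feeding each output
    back in as the next input."""
    outputs, x = [], np.zeros(d)
    for _ in range(steps):
        h = np.tanh(W_in @ x + W_h @ h)
        outputs.append(h)
        x = h  # feed the previous output back in
    return outputs

inputs = [rng.normal(size=d) for _ in range(3)]
hidden = rnn_encoder(inputs)            # input tokens -> hidden state
outputs = rnn_decoder(hidden, steps=2)  # hidden state -> output tokens
print(len(outputs), outputs[0].shape)
```

However long the input, `hidden` stays d-dimensional, which is exactly why long sequences degrade without attention.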

Attention in Encoder–Decoder Models

Attention mechanisms improved the architecture by allowing the decoder to access all encoder states.

Instead of relying on a single vector:

h

the decoder attends over all encoder outputs:

h_1, h_2, …, h_n

This allows the model to focus on relevant parts of the input during generation.
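A minimal sketch of this attention step, assuming simple dot-product scores over the encoder states h_1, …, h_n (the projections used by real models are omitted):

```python
import numpy as np

def attend(query, encoder_states):
    """Dot-product attention: weight each encoder state h_i by its
    relevance to the decoder's current query, then average."""
    scores = encoder_states @ query          # one score per h_i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over positions
    return weights @ encoder_states, weights

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(5, 4))  # h_1..h_5, each 4-dimensional
query = rng.normal(size=4)                # decoder state at step t

context, weights = attend(query, encoder_states)
print(weights.round(2))  # the decoder "focuses" where weights are large
```

The context vector is recomputed at every decoding step, so different output tokens can draw on different parts of the input.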

Transformer Encoder–Decoder

Modern encoder–decoder models are based on the Transformer architecture.

The encoder uses self-attention to build contextual representations of the input.

The decoder uses:

  • masked self-attention
  • cross-attention with encoder outputs

Structure:

Encoder:
self-attention → feedforward → stacked layers

Decoder:
masked self-attention
cross-attention to encoder
feedforward

Cross-attention connects encoder and decoder.
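Cross-attention can be sketched as standard scaled dot-product attention in which the queries come from the decoder and the keys and values come from the encoder. The projection matrices below are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

# Random stand-ins for the learned projection matrices.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def cross_attention(decoder_states, encoder_states):
    """Queries from the decoder, keys/values from the encoder --
    the connection between the two halves of the model."""
    Q = decoder_states @ W_q
    K = encoder_states @ W_k
    V = encoder_states @ W_v
    scores = Q @ K.T / np.sqrt(d)                    # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

enc = rng.normal(size=(6, d))  # 6 encoder positions
dec = rng.normal(size=(2, d))  # 2 decoder positions so far
out = cross_attention(dec, enc)
print(out.shape)  # one attended vector per decoder position
```

In a full Transformer this runs inside every decoder layer, between the masked self-attention and feedforward sublayers.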

Examples of Encoder–Decoder Models

Well-known models include:

  • T5
  • BART
  • mT5
  • MarianMT

These architectures perform both understanding and generation.

Comparison with Decoder-Only Models

Model Type        Structure                        Typical Use
Encoder–Decoder   input encoder + output decoder   translation, summarization
Decoder-Only      autoregressive Transformer       language modeling
Encoder-Only      representation learning          classification, retrieval

Encoder–decoder models explicitly model the transformation between sequences.

Advantages

Encoder–decoder architectures provide:

  • flexible sequence transformation
  • strong contextual encoding
  • effective translation between modalities
  • structured generative capabilities

They are particularly well suited for tasks where input and output differ.

Limitations

Potential drawbacks include:

  • higher computational cost
  • more complex architecture
  • less efficient scaling compared to decoder-only models in some tasks

Large-scale language models, however, often adopt decoder-only designs.

Role in Modern AI

Encoder–decoder models remain essential in tasks such as:

  • machine translation
  • document summarization
  • code transformation
  • speech recognition
  • multimodal processing

They are particularly effective when the input structure strongly influences the output.

Summary

Encoder–Decoder models process an input sequence using an encoder and generate an output sequence using a decoder.

By separating representation learning from generation, they provide a flexible architecture for complex sequence transformation tasks.

Modern implementations use Transformer-based attention mechanisms to achieve high performance across many domains.

Related Concepts

  • Sequence-to-Sequence Models
  • Transformer Architecture
  • Cross-Attention
  • Self-Attention
  • Decoder-Only Transformers
  • Recurrent Neural Networks (RNN)
  • Attention Mechanism