Short Definition
Feedforward Networks in Transformers (often called FFN blocks) are position-wise neural networks applied independently to each token representation after the attention layer. They introduce nonlinearity and expand the model’s capacity to transform token embeddings.
Definition
In a Transformer layer, the architecture typically consists of two main components:
– Self-attention mechanism
– Feedforward network
The feedforward network is applied to each token representation independently.
Formally, the FFN computes:
\[
FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2
\]
Where:
- \(x\) = token representation
- \(W_1, W_2\) = learned weight matrices
- \(b_1, b_2\) = bias terms
- \(\max(0, \cdot)\) = ReLU activation (or similar)
This transformation allows the model to learn richer representations beyond attention interactions.
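The formula above can be sketched directly in NumPy. This is a minimal illustration with toy dimensions and randomly initialized weights, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 32  # toy sizes; real models use e.g. 768 and 3072

# Learned parameters (randomly initialized here for illustration)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, with a ReLU nonlinearity."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=d_model)  # a single token representation
y = ffn(x)
print(y.shape)  # (8,) -- the output has the same dimensionality as the input
```

Note that the output dimension matches the input dimension, which is what allows the result to be added back through the residual connection.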
Core Idea
Attention layers mix information across tokens, while feedforward layers transform information within each token representation.
Transformer block structure:
Input
↓
Self-Attention
↓
Feedforward Network
↓
Output
The FFN increases the model’s expressive capacity.
Minimal Conceptual Illustration
Consider a token embedding:
Token representation (vector)
↓
Linear expansion
↓
Activation
↓
Linear projection
or visually:
x → Linear → Activation → Linear → x’
Each token undergoes the same transformation.
Position-Wise Operation
A key property of the feedforward network is that it operates **independently for each token**.
If the input sequence is:
\[
X = [x_1, x_2, \dots, x_n]
\]
then the FFN applies:
\[
FFN(x_i)
\]
to each token separately.
This contrasts with attention layers, which mix information between tokens.
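The position-wise property can be verified directly: applying the FFN to the whole sequence matrix gives exactly the same result as applying it to each token vector in isolation. A small NumPy check (toy sizes, random weights, biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_ff = 5, 8, 32  # sequence length and toy dimensions

W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def ffn(x):
    return np.maximum(0.0, x @ W1) @ W2

X = rng.normal(size=(n, d_model))  # a sequence of n token vectors

# Applying the FFN to the whole sequence at once...
batched = ffn(X)
# ...matches applying it to each token separately:
per_token = np.stack([ffn(X[i]) for i in range(n)])

print(np.allclose(batched, per_token))  # True: no mixing across positions
```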
Dimensional Expansion
Typically, the feedforward layer expands the hidden dimension before projecting it back.
Example:
hidden size = 768
FFN inner dimension = 3072 (a 4× expansion)
The transformation becomes:
768 → 3072 → 768
This expansion provides additional representational capacity.
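The shape bookkeeping for the 768 → 3072 → 768 example can be made concrete. Zero-valued weights are used here because only the shapes matter:

```python
import numpy as np

d_model, d_ff = 768, 3072  # the 4x expansion from the example above

W1 = np.zeros((d_model, d_ff))  # expansion matrix
W2 = np.zeros((d_ff, d_model))  # projection matrix

x = np.zeros(d_model)
h = np.maximum(0.0, x @ W1)  # expanded: shape (3072,)
y = h @ W2                   # projected back: shape (768,)
print(h.shape, y.shape)  # (3072,) (768,)
```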
Common Activation Functions
Different Transformer variants use different nonlinearities.
Common choices include:
– ReLU
– GELU
– SwiGLU
– GeGLU
Recent models often use **gated feedforward networks** for improved performance.
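As one concrete gated variant, a SwiGLU feedforward block replaces the single activated branch with a Swish-activated gate multiplied elementwise by a second linear branch. A minimal NumPy sketch, with illustrative dimensions and random initialization:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 8, 16  # toy sizes for illustration

W, V = rng.normal(scale=0.1, size=(2, d_model, d_ff))  # gate and value projections
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))       # output projection

def swish(z):
    return z / (1.0 + np.exp(-z))  # Swish/SiLU: z * sigmoid(z)

def swiglu_ffn(x):
    # The Swish-activated branch gates the plain linear branch elementwise
    return (swish(x @ W) * (x @ V)) @ W2

x = rng.normal(size=d_model)
print(swiglu_ffn(x).shape)  # (8,)
```

The extra projection means a SwiGLU block has three weight matrices instead of two, so the inner dimension is often reduced to keep the parameter count comparable.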
Role in Transformer Performance
Feedforward networks play a critical role in:
– feature transformation
– representation refinement
– nonlinear computation
While attention determines **which tokens interact**, the FFN determines **how representations are transformed**.
Transformer Layer Structure
A full Transformer layer typically follows this structure:
Input
↓
LayerNorm
↓
Self-Attention
↓
Residual Connection
↓
LayerNorm
↓
Feedforward Network
↓
Residual Connection
Both attention and feedforward components are essential.
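The diagram above corresponds to a pre-LayerNorm layer: normalize, apply the sublayer, then add the residual. A structural sketch in NumPy, with the attention and FFN sublayers passed in as callables (identity functions in the usage example, since only the wiring is being shown):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(x, attention, ffn):
    # Pre-LN layout matching the diagram: LayerNorm -> sublayer -> residual add
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

# Usage with stand-in sublayers (identity), just to show the structure:
X = np.ones((3, 4))  # 3 tokens, hidden size 4
out = transformer_layer(X, attention=lambda z: z, ffn=lambda z: z)
print(out.shape)  # (3, 4)
```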
Computational Cost
Although attention often receives the most focus, FFN layers account for a large portion of the model’s parameters.
In many Transformer models:
– **FFN parameters exceed attention parameters**
This makes FFN design an important factor in scaling models.
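A quick back-of-the-envelope count illustrates this, assuming BERT-base-like sizes (hidden size 768, FFN inner dimension 3072) and ignoring biases:

```python
d_model, d_ff = 768, 3072  # BERT-base-like sizes (assumed for illustration)

# Attention: Q, K, V, and output projections, each d_model x d_model
attn_params = 4 * d_model * d_model

# FFN: two weight matrices, d_model x d_ff and d_ff x d_model
ffn_params = 2 * d_model * d_ff

print(attn_params)               # 2359296
print(ffn_params)                # 4718592
print(ffn_params / attn_params)  # 2.0 -- the FFN holds twice the parameters
```

With the standard 4× expansion, the FFN contributes twice as many parameters per layer as the attention projections.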
Modern Variations
Several improvements have been proposed for Transformer feedforward layers.
Examples include:
– gated linear units (GLU variants)
– mixture-of-experts feedforward blocks
– sparse feedforward layers
These modifications improve efficiency and capacity.
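As an illustration of the mixture-of-experts idea, here is a minimal top-1 MoE feedforward sketch in which a learned router sends each token to a single expert's FFN. All names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff, n_experts = 8, 16, 4  # toy sizes

# One (W1, W2) pair per expert, plus a router that scores experts per token
experts_W1 = rng.normal(scale=0.1, size=(n_experts, d_model, d_ff))
experts_W2 = rng.normal(scale=0.1, size=(n_experts, d_ff, d_model))
router_W = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_ffn(x):
    """Top-1 mixture-of-experts FFN: each token is routed to one expert."""
    e = int(np.argmax(x @ router_W))         # pick the highest-scoring expert
    h = np.maximum(0.0, x @ experts_W1[e])   # that expert's ordinary FFN
    return h @ experts_W2[e]

x = rng.normal(size=d_model)
print(moe_ffn(x).shape)  # (8,)
```

Only one expert's weights are used per token, so the parameter count grows with the number of experts while the per-token compute stays roughly constant.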
Summary
Feedforward Networks in Transformers are position-wise neural networks applied after attention layers. They transform token representations independently, introduce nonlinear computation, and significantly contribute to the expressive capacity of Transformer models.
Related Concepts
– Transformer Architecture
– Self-Attention
– Multi-Head Attention
– Layer Normalization
– Residual Connections
– Mixture of Experts