Short Definition
Feedforward Networks in Transformers (often called FFN blocks) are position-wise neural networks applied independently to each token representation after the attention layer. They introduce nonlinearity and expand the model’s capacity to transform token embeddings.
Definition
In a Transformer layer, the architecture typically consists of two main components:
– Self-attention mechanism
– Feedforward network
The feedforward network is applied to each token representation independently.
Formally, the FFN computes:
\[
FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2
\]
Where:
- \(x\) = token representation
- \(W_1, W_2\) = learned weight matrices
- \(b_1, b_2\) = bias terms
- \(\max(0, \cdot)\) = ReLU activation (or similar)
This transformation allows the model to learn richer representations beyond attention interactions.
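The formula above can be sketched directly in NumPy. This is a minimal illustration with toy dimensions and randomly initialized weights, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 32  # toy sizes; real models use e.g. 768 and 3072

# Learned parameters (randomly initialized here for illustration)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, with a ReLU nonlinearity."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=d_model)  # a single token representation
y = ffn(x)
print(y.shape)  # (8,) -- the output has the same dimensionality as the input
```

Note that the output dimension matches the input dimension, which is what allows the result to be added back through the residual connection.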
Core Idea
Attention layers mix information across tokens, while feedforward layers transform information within each token representation.
Transformer block structure:
Input
↓
Self-Attention
↓
Feedforward Network
↓
Output
The FFN increases the model’s expressive capacity.
Minimal Conceptual Illustration
Consider a token embedding:
Token representation (vector)
↓
Linear expansion
↓
Activation
↓
Linear projection
or visually:
x → Linear → Activation → Linear → x’
Each token undergoes the same transformation.
Position-Wise Operation
A key property of the feedforward network is that it operates **independently for each token**.
If the input sequence is:
\[
X = [x_1, x_2, \dots, x_n]
\]
then the FFN applies:
\[
FFN(x_i)
\]
to each token separately.
This contrasts with attention layers, which mix information between tokens.
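The position-wise property can be verified directly: applying the FFN to the whole sequence matrix gives exactly the same result as applying it to each token vector in isolation. A small NumPy check (toy sizes, random weights, biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_ff = 5, 8, 32  # sequence length and toy dimensions

W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def ffn(x):
    return np.maximum(0.0, x @ W1) @ W2

X = rng.normal(size=(n, d_model))  # a sequence of n token vectors

# Applying the FFN to the whole sequence at once...
batched = ffn(X)
# ...matches applying it to each token separately:
per_token = np.stack([ffn(X[i]) for i in range(n)])

print(np.allclose(batched, per_token))  # True: no mixing across positions
```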
Dimensional Expansion
Typically, the feedforward layer expands the hidden dimension before projecting it back.
Example:
hidden size = 768
FFN inner dimension = 3072 (a 4× expansion)
The transformation becomes:
768 → 3072 → 768
This expansion provides additional representational capacity.
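The shape bookkeeping for the 768 → 3072 → 768 example can be made concrete. Zero-valued weights are used here because only the shapes matter:

```python
import numpy as np

d_model, d_ff = 768, 3072  # the 4x expansion from the example above

W1 = np.zeros((d_model, d_ff))  # expansion matrix
W2 = np.zeros((d_ff, d_model))  # projection matrix

x = np.zeros(d_model)
h = np.maximum(0.0, x @ W1)  # expanded: shape (3072,)
y = h @ W2                   # projected back: shape (768,)
print(h.shape, y.shape)  # (3072,) (768,)
```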
Common Activation Functions
Different Transformer variants use different nonlinearities.
Common choices include:
– ReLU
– GELU
– SwiGLU
– GeGLU
Recent models often use **gated feedforward networks** for improved performance.
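As one concrete gated variant, a SwiGLU feedforward block replaces the single activated branch with a Swish-activated gate multiplied elementwise by a second linear branch. A minimal NumPy sketch, with illustrative dimensions and random initialization:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 8, 16  # toy sizes for illustration

W, V = rng.normal(scale=0.1, size=(2, d_model, d_ff))  # gate and value projections
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))       # output projection

def swish(z):
    return z / (1.0 + np.exp(-z))  # Swish/SiLU: z * sigmoid(z)

def swiglu_ffn(x):
    # The Swish-activated branch gates the plain linear branch elementwise
    return (swish(x @ W) * (x @ V)) @ W2

x = rng.normal(size=d_model)
print(swiglu_ffn(x).shape)  # (8,)
```

The extra projection means a SwiGLU block has three weight matrices instead of two, so the inner dimension is often reduced to keep the parameter count comparable.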
Role in Transformer Performance
Feedforward networks play a critical role in:
– feature transformation
– representation refinement
– nonlinear computation
While attention determines **which tokens interact**, the FFN determines **how representations are transformed**.
Transformer Layer Structure
A full Transformer layer typically follows this structure:
Input
↓
LayerNorm
↓
Self-Attention
↓
Residual Connection
↓
LayerNorm
↓
Feedforward Network
↓
Residual Connection
Both attention and feedforward components are essential.
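The diagram above corresponds to a pre-LayerNorm layer: normalize, apply the sublayer, then add the residual. A structural sketch in NumPy, with the attention and FFN sublayers passed in as callables (identity functions in the usage example, since only the wiring is being shown):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(x, attention, ffn):
    # Pre-LN layout matching the diagram: LayerNorm -> sublayer -> residual add
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

# Usage with stand-in sublayers (identity), just to show the structure:
X = np.ones((3, 4))  # 3 tokens, hidden size 4
out = transformer_layer(X, attention=lambda z: z, ffn=lambda z: z)
print(out.shape)  # (3, 4)
```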
Computational Cost
Although attention often receives the most focus, FFN layers account for a large portion of the model’s parameters.
In many Transformer models:
– **FFN parameters exceed attention parameters**
This makes FFN design an important factor in scaling models.
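A quick back-of-the-envelope count illustrates this, assuming BERT-base-like sizes (hidden size 768, FFN inner dimension 3072) and ignoring biases:

```python
d_model, d_ff = 768, 3072  # BERT-base-like sizes (assumed for illustration)

# Attention: Q, K, V, and output projections, each d_model x d_model
attn_params = 4 * d_model * d_model

# FFN: two weight matrices, d_model x d_ff and d_ff x d_model
ffn_params = 2 * d_model * d_ff

print(attn_params)               # 2359296
print(ffn_params)                # 4718592
print(ffn_params / attn_params)  # 2.0 -- the FFN holds twice the parameters
```

With the standard 4× expansion, the FFN contributes twice as many parameters per layer as the attention projections.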
Modern Variations
Several improvements have been proposed for Transformer feedforward layers.
Examples include:
– gated linear units (GLU variants)
– mixture-of-experts feedforward blocks
– sparse feedforward layers
These modifications improve efficiency and capacity.
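As an illustration of the mixture-of-experts idea, here is a minimal top-1 MoE feedforward sketch in which a learned router sends each token to a single expert's FFN. All names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff, n_experts = 8, 16, 4  # toy sizes

# One (W1, W2) pair per expert, plus a router that scores experts per token
experts_W1 = rng.normal(scale=0.1, size=(n_experts, d_model, d_ff))
experts_W2 = rng.normal(scale=0.1, size=(n_experts, d_ff, d_model))
router_W = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_ffn(x):
    """Top-1 mixture-of-experts FFN: each token is routed to one expert."""
    e = int(np.argmax(x @ router_W))         # pick the highest-scoring expert
    h = np.maximum(0.0, x @ experts_W1[e])   # that expert's ordinary FFN
    return h @ experts_W2[e]

x = rng.normal(size=d_model)
print(moe_ffn(x).shape)  # (8,)
```

Only one expert's weights are used per token, so the parameter count grows with the number of experts while the per-token compute stays roughly constant.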
Summary
Feedforward Networks in Transformers are position-wise neural networks applied after attention layers. They transform token representations independently, introduce nonlinear computation, and significantly contribute to the expressive capacity of Transformer models.
Related Concepts
– Transformer Architecture
– Self-Attention
– Multi-Head Attention
– Layer Normalization
– Residual Connections
– Mixture of Experts