Short Definition
Layer normalization normalizes activations across features within a single sample.
Definition
Layer normalization is a normalization technique that standardizes activations by computing statistics (mean and variance) across the feature dimension of each individual data sample. Unlike batch normalization, it does not depend on batch-level statistics and behaves identically during training and inference.
Layer normalization is sample-wise, not batch-wise.
Why It Matters
Many modern architectures—especially transformers and sequence models—operate with small or variable batch sizes where batch normalization becomes unstable or impractical. Layer normalization provides consistent normalization regardless of batch size, enabling stable optimization in these settings.
It is essential for batch-independent architectures.
How Layer Normalization Works
For a given input vector x with features x_1, …, x_d:
- Compute mean and variance across features
- Normalize each feature using these statistics
- Apply learned scale and shift parameters
The normalization is performed independently for each sample.
Minimal Conceptual Formula
LN(x) = γ · (x − mean(x)) / sqrt(var(x) + ε) + β
where statistics are computed over the feature dimension.
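The formula above can be sketched directly in numpy; this is a minimal illustration (not a production implementation), with gamma and beta passed in as plain arrays:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature dimension (the last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learned scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# each row of `out` has near-zero mean and near-unit variance
```

Note that the statistics are computed per row (per sample); the batch dimension never enters the calculation.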
Layer Normalization vs Batch Normalization
- Layer Normalization
- statistics per sample
- batch-size independent
- same behavior during training and inference
- common in transformers and RNNs
- Batch Normalization
- statistics across batch
- sensitive to batch size
- different training vs inference behavior
- common in CNNs
They solve similar problems in different regimes.
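For a 2-D activation matrix, the contrast above reduces to which axis the statistics are computed over; a small numpy sketch with an assumed (batch, features) layout:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 4))  # (batch, features)

# Layer norm: statistics per sample -> reduce over features (axis=-1)
ln_mean = x.mean(axis=-1, keepdims=True)  # shape (8, 1): one mean per sample

# Batch norm: statistics per feature -> reduce over the batch (axis=0)
bn_mean = x.mean(axis=0, keepdims=True)   # shape (1, 4): one mean per feature
```

The shapes make the dependence explicit: the batch-norm statistics change whenever the batch changes, while the layer-norm statistics do not.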
Where Layer Normalization Is Used
Layer normalization is a core component of:
- transformer architectures
- large language models
- attention mechanisms
- recurrent neural networks
- reinforcement learning agents
Most transformer blocks assume its presence.
Relationship to Optimization Stability
Layer normalization stabilizes gradients by preventing activation scale drift across layers. This allows:
- higher learning rates
- deeper networks
- more predictable optimization behavior
It works synergistically with residual connections.
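One way to see the stabilizing effect is that layer normalization is (up to the small epsilon) invariant to rescaling and shifting of its input, so scale drift in one layer cannot compound into the next. A quick numerical check, using an unparameterized layer norm (gamma=1, beta=0) for simplicity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(1).normal(size=(2, 8))
# Scaling and shifting the input leaves the output (almost) unchanged:
assert np.allclose(layer_norm(x), layer_norm(10.0 * x + 3.0), atol=1e-4)
```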
Interaction with Residual Connections
In transformer-style architectures, layer normalization is placed either:
- inside the residual branch, before each sublayer (pre-norm)
- after the residual addition (post-norm)
Pre-norm designs improve gradient flow in very deep networks.
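The two placements can be sketched as follows, where `sublayer` is a stand-in (hypothetical) for an attention or feed-forward block and `layer_norm` is an unparameterized layer norm:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Layer norm over the feature axis (gamma=1, beta=0 omitted for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sublayer input; the residual path itself
    # is left untouched, which helps gradients flow through deep stacks.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # Post-norm: normalize after the residual addition
    # (the ordering used in the original Transformer).
    return layer_norm(x + sublayer(x))
```

In the pre-norm variant the identity path from input to output is never normalized, which is the usual explanation for its better gradient flow at depth.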
Effects on Generalization
Layer normalization primarily improves optimization stability. Its impact on generalization is indirect and architecture-dependent, often mediated through better convergence and reduced sensitivity to initialization.
It is not a regularizer by design.
Computational Characteristics
- per-sample cost linear in the feature dimension
- no dependency on batch statistics
- minimal memory overhead
- deterministic behavior across runs
These properties make it well-suited for large-scale training.
Common Pitfalls
- assuming layer norm replaces proper initialization
- mixing batch norm and layer norm inconsistently
- ignoring norm placement (pre vs post)
- applying layer norm blindly to all architectures
- overlooking interactions with learning rate schedules
Normalization choice is architectural.
Relationship to Other Normalization Methods
Layer normalization contrasts with:
- batch normalization (batch-wise)
- instance normalization (per sample, per channel)
- group normalization (channel groups)
- RMS normalization (rescales by root mean square; no mean centering)
Each addresses different constraints.
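As one concrete contrast, RMS normalization drops the mean subtraction and divides only by the root mean square; a minimal sketch under the same conventions as above:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm: rescale by the root mean square of the features.
    # Unlike layer norm, there is no mean centering and no beta shift.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[3.0, 4.0]])
out = rms_norm(x, gamma=np.ones(2))
# each row of `out` has near-unit mean square, but not necessarily zero mean
```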
Related Concepts
- Architecture & Representation
- Normalization Layers
- Residual Connections
- Optimization Stability
- Learning Rate Warmup
- Transformers
- Batch Normalization