Short Definition
Weight initialization sets the starting values of model parameters before training begins.
Definition
Weight initialization is the process of assigning initial values to a neural network’s weights prior to optimization. These initial values strongly influence gradient flow, optimization stability, convergence speed, and the ability of deep networks to learn meaningful representations.
Initialization defines the starting point of learning.
Why It Matters
Poor initialization can cause vanishing or exploding gradients, slow convergence, or complete training failure. Good initialization preserves signal magnitude across layers, enabling stable backpropagation and efficient optimization—especially in deep networks.
Initialization is foundational to trainability.
What Weight Initialization Affects
Initialization impacts:
- gradient magnitude and variance
- optimization stability
- convergence speed
- sensitivity to learning rate
- effectiveness of normalization layers
- depth scalability
Many training pathologies originate at initialization.
Common Weight Initialization Strategies
Widely used strategies include:
- Random initialization: small random values drawn from a fixed-scale uniform or normal distribution (the historical baseline)
- Xavier (Glorot) initialization: variance-scaled for tanh/sigmoid
- He initialization: variance-scaled for ReLU and variants
- Orthogonal initialization: preserves norm across layers
- Zero initialization: generally invalid for weights, since identical starting values fail to break symmetry and all units in a layer learn the same function
Modern initializations are variance-aware.
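In code, these strategies typically map onto a framework's built-in initializers. The source does not name a framework, so the following is a minimal PyTorch sketch using `torch.nn.init`; other frameworks expose equivalent utilities. The schemes are applied in sequence purely for illustration — in practice you would pick one per layer:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)  # weight shape: (out_features, in_features)

# Xavier/Glorot: variance scaled by fan_in + fan_out; pairs with tanh/sigmoid
nn.init.xavier_uniform_(layer.weight)

# He/Kaiming: variance scaled by fan_in; pairs with ReLU and its variants
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# Orthogonal: weight rows form an orthonormal set, preserving norms
nn.init.orthogonal_(layer.weight)

# Zeros are standard for biases, but invalid for weights (no symmetry breaking)
nn.init.zeros_(layer.bias)
```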
Minimal Conceptual Example
# conceptual variance-scaled initialization
weights ~ Normal(0, sqrt(2 / fan_in))
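A runnable NumPy version of the same idea; the helper name `he_init` is illustrative, not from a library:

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He-style initialization: zero-mean Gaussian with variance 2 / fan_in."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_init(512, 256)
print(W.std())  # roughly sqrt(2 / 512) = 0.0625
```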
Xavier vs He Initialization
- Xavier initialization
  - balances variance for the forward and backward passes
  - suited for tanh and sigmoid activations
- He initialization
  - preserves variance for ReLU-like activations
  - allows deeper networks with non-saturating activations
Activation choice determines initialization choice.
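The two rules differ only in how the weight variance is scaled. A small sketch of the standard deviations each prescribes (helper names are illustrative):

```python
import numpy as np

def xavier_std(fan_in, fan_out):
    # Glorot & Bengio: Var(W) = 2 / (fan_in + fan_out)
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He et al.: Var(W) = 2 / fan_in; the extra factor of 2 compensates
    # for ReLU zeroing half of the pre-activations on average
    return np.sqrt(2.0 / fan_in)

print(xavier_std(512, 512))  # ~0.0442
print(he_std(512))           # ~0.0625
```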
Relationship to Vanishing and Exploding Gradients
Initialization directly controls gradient propagation:
- too small → vanishing gradients
- too large → exploding gradients
Variance-scaled initialization aims to keep gradients in a stable range.
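This can be shown directly: propagating a signal through a deep ReLU stack with weights that are too small, variance-scaled, or too large exhibits all three regimes. An illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
x = rng.normal(size=width)

for scale in (0.01, np.sqrt(2.0 / width), 1.0):
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, scale, size=(width, width))
        h = np.maximum(W @ h, 0.0)  # linear layer followed by ReLU
    print(f"scale={scale:.4f}  ||h||={np.linalg.norm(h):.3e}")
# scale too small -> activations (and hence gradients) collapse toward zero
# He-scaled       -> norms stay in a stable range across all 50 layers
# scale too large -> activations blow up exponentially with depth
```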
Interaction with Normalization Layers
Normalization layers reduce sensitivity to initialization but do not eliminate its importance. Extremely poor initialization can still destabilize training even with normalization.
Initialization and normalization are complementary.
Interaction with Residual Connections
Residual connections mitigate—but do not remove—the need for proper initialization. Stable residual learning still relies on reasonable initial parameter scales.
Residuals enable depth; initialization enables learning.
Relationship to Optimization Stability
Good initialization improves optimization stability by:
- reducing gradient variance
- enabling larger learning rates
- reducing early training instability
- lowering reliance on clipping or warmup
Initialization shapes early training dynamics.
Effects on Generalization
Weight initialization primarily affects optimization rather than generalization directly. However, stable and efficient optimization can lead to better representations that generalize more reliably.
Generalization effects are indirect.
Common Pitfalls
- using zero initialization for weights
- mismatching initialization to activation functions
- assuming normalization makes initialization irrelevant
- reusing initialization defaults blindly
- ignoring initialization when diagnosing instability
Initialization mistakes are subtle but costly.
Relationship to Reproducibility
Initialization introduces randomness into training. Reproducible experiments must control initialization seeds and document initialization schemes explicitly.
Initialization is part of the experimental setup.
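A minimal sketch of seed control, assuming a PyTorch/NumPy setup; the helper `seed_everything` is illustrative, not a library function:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Pin the RNGs that drive weight initialization (illustrative helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds CUDA generators in current PyTorch

seed_everything(42)
model = torch.nn.Linear(16, 8)  # same initial weights on every run
```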