Weight Initialization

Short Definition

Weight initialization sets the starting values of model parameters before training begins.

Definition

Weight initialization is the process of assigning initial values to a neural network’s weights prior to optimization. These initial values strongly influence gradient flow, optimization stability, convergence speed, and the ability of deep networks to learn meaningful representations.

Initialization defines the starting point of learning.

Why It Matters

Poor initialization can cause vanishing or exploding gradients, slow convergence, or complete training failure. Good initialization preserves signal magnitude across layers, enabling stable backpropagation and efficient optimization—especially in deep networks.

Initialization is foundational to trainability.

What Weight Initialization Affects

Initialization impacts:

  • gradient magnitude and variance
  • optimization stability
  • convergence speed
  • sensitivity to learning rate
  • effectiveness of normalization layers
  • depth scalability

Many training pathologies originate at initialization.

Common Weight Initialization Strategies

Widely used strategies include:

  • Random initialization: small random values (baseline)
  • Xavier (Glorot) initialization: weight variance 2 / (fan_in + fan_out), suited to tanh/sigmoid
  • He initialization: weight variance 2 / fan_in, suited to ReLU and variants
  • Orthogonal initialization: preserves norm across layers
  • Zero initialization: generally invalid for weights (fails to break symmetry between units)

Modern initializations are variance-aware.
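
As a rough illustration of the strategies listed above, the following NumPy sketch draws one weight matrix under each of the Xavier, He, and orthogonal schemes. The function names, the (fan_in, fan_out) shape convention, and the layer sizes are assumptions made for this example and do not mirror any particular library's API.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Xavier/Glorot: Var(w) = 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He: Var(w) = 2 / fan_in
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def orthogonal(fan_in, fan_out, rng):
    # QR-decompose a Gaussian matrix; Q has orthonormal columns.
    rows, cols = max(fan_in, fan_out), min(fan_in, fan_out)
    q, r = np.linalg.qr(rng.normal(size=(rows, cols)))
    q = q * np.sign(np.diag(r))  # fix column signs so the result is not biased by QR conventions
    return q if fan_in >= fan_out else q.T

rng = np.random.default_rng(0)
w_xavier = xavier_normal(512, 256, rng)
w_he = he_normal(512, 256, rng)
w_orth = orthogonal(512, 256, rng)

print("xavier std:", w_xavier.std())
print("he std:", w_he.std())
print("orthonormal columns:", np.allclose(w_orth.T @ w_orth, np.eye(256)))
```

The empirical standard deviations should sit close to the target values, and the orthogonal matrix leaves the norm of any vector it multiplies from the right unchanged.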

Minimal Conceptual Example

# conceptual variance-scaled (He-style) initialization
# the second argument is the standard deviation, so the variance is 2 / fan_in
weights ~ Normal(mean=0, std=sqrt(2 / fan_in))

Xavier vs He Initialization

  • Xavier initialization
    • balances variance for forward and backward passes
    • suited for tanh and sigmoid activations
  • He initialization
    • preserves variance for ReLU-like activations
    • allows deeper networks with non-saturating activations

The activation function guides the choice of initialization.
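
The difference can be seen in a small experiment. The sketch below pushes a random batch through a stack of ReLU layers and reports the root-mean-square activation at the end, once with Xavier-scaled weights and once with He-scaled weights. The helper name final_rms and the depth, width, and batch size are assumptions chosen for illustration; biases are omitted for simplicity.

```python
import numpy as np

def final_rms(weight_std, depth=20, width=256, batch=512, seed=0):
    """RMS activation after `depth` bias-free ReLU layers with Gaussian weights."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))  # input with RMS close to 1
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(width, width))
        x = np.maximum(x @ w, 0.0)  # linear layer followed by ReLU
    return float(np.sqrt(np.mean(x ** 2)))

width = 256
rms_xavier = final_rms(np.sqrt(2.0 / (width + width)))  # Var(w) = 1 / width
rms_he = final_rms(np.sqrt(2.0 / width))                # Var(w) = 2 / width

print("Xavier + ReLU, final RMS:", rms_xavier)
print("He + ReLU, final RMS:", rms_he)
```

With square layers, ReLU discards roughly half of the signal's mean square at each step. The He scaling compensates with a factor of 2 in the weight variance, while the Xavier scaling does not, so the Xavier run should end several orders of magnitude smaller at this depth.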

Relationship to Vanishing and Exploding Gradients

Initialization directly controls gradient propagation:

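  • weights too small → activations and gradients shrink layer by layer (vanishing gradients)
  • weights too large → activations and gradients grow layer by layer (exploding gradients)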

Variance-scaled initialization aims to keep gradients in a stable range.
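
A minimal sketch of this effect, under the simplifying assumption of purely linear layers (no activations, no biases): a unit-norm gradient vector is backpropagated through a stack of random weight matrices whose standard deviation is gain / sqrt(width). The helper name backprop_norm and the gain, depth, and width values are hypothetical choices for the example.

```python
import numpy as np

def backprop_norm(gain, depth=30, width=128, seed=0):
    """Norm of a unit gradient vector after backpropagation through `depth` linear layers."""
    rng = np.random.default_rng(seed)
    grad = rng.normal(size=width)
    grad /= np.linalg.norm(grad)  # start from a unit-norm gradient
    for _ in range(depth):
        w = rng.normal(0.0, gain / np.sqrt(width), size=(width, width))
        grad = w.T @ grad  # backward pass of a linear layer
    return float(np.linalg.norm(grad))

norm_small = backprop_norm(0.5)   # weights too small
norm_scaled = backprop_norm(1.0)  # variance-scaled: Var(w) = 1 / width
norm_large = backprop_norm(2.0)   # weights too large

print("gain 0.5:", norm_small)
print("gain 1.0:", norm_scaled)
print("gain 2.0:", norm_large)
```

Each layer multiplies the expected gradient norm by roughly the gain, so a gain below 1 shrinks the gradient geometrically with depth and a gain above 1 grows it geometrically, while a gain of 1 keeps it near its starting magnitude.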

Interaction with Normalization Layers

Normalization layers reduce sensitivity to initialization but do not eliminate its importance. Extremely poor initialization can still destabilize training even with normalization.

Initialization and normalization are complementary.

Interaction with Residual Connections

Residual connections mitigate—but do not remove—the need for proper initialization. Stable residual learning still relies on reasonable initial parameter scales.

Residuals enable depth; initialization enables learning.

Relationship to Optimization Stability

Good initialization improves optimization stability by:

  • keeping gradient magnitudes comparable across layers
  • enabling larger learning rates
  • reducing early training instability
  • lowering reliance on clipping or warmup

Initialization shapes early training dynamics.

Effects on Generalization

Weight initialization primarily affects optimization rather than generalization directly. However, stable and efficient optimization can lead to better representations that generalize more reliably.

Generalization effects are indirect.

Common Pitfalls

  • using zero initialization for weights
  • mismatching initialization to activation functions
  • assuming normalization makes initialization irrelevant
  • reusing initialization defaults blindly
  • ignoring initialization when diagnosing instability

Initialization mistakes are subtle but costly.
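
The first pitfall can be made concrete with a tiny two-layer network. The sketch below computes the first-layer gradient by hand for all-zero, constant, and random weights. The network shape, the tanh activation, the squared-error loss, and the helper name first_layer_grad are assumptions chosen for the illustration.

```python
import numpy as np

def first_layer_grad(w1, w2, x, y):
    """Gradient of 0.5 * mean squared error w.r.t. w1 for y_hat = tanh(x @ w1) @ w2."""
    hidden = np.tanh(x @ w1)
    d_out = (hidden @ w2 - y) / len(x)            # dLoss/dy_hat
    d_pre = (d_out @ w2.T) * (1.0 - hidden ** 2)  # backprop through tanh
    return x.T @ d_pre                            # one column per hidden unit

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = rng.normal(size=(32, 1))

grad_zero = first_layer_grad(np.zeros((4, 3)), np.zeros((3, 1)), x, y)
grad_const = first_layer_grad(np.full((4, 3), 0.5), np.full((3, 1), 0.5), x, y)
grad_rand = first_layer_grad(rng.normal(0.0, 0.5, (4, 3)), rng.normal(0.0, 0.5, (3, 1)), x, y)

print("zero init, gradient all zero:", bool(np.all(grad_zero == 0)))
print("constant init, identical columns:", np.allclose(grad_const, grad_const[:, :1]))
print("random init, identical columns:", np.allclose(grad_rand, grad_rand[:, :1]))
```

In this setup, all-zero weights receive no gradient at all, and constant weights give every hidden unit the same update, so the units can never become different from one another. Random initialization is what breaks this symmetry.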

Relationship to Reproducibility

Initialization introduces randomness into training. Reproducible experiments must control initialization seeds and document initialization schemes explicitly.

Initialization is part of the experimental setup.
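
One way to make this concrete is to record the scheme and seed in the experiment configuration and derive the initial weights from a dedicated random generator. The config keys, the init_weights helper, and the He-style scaling below are assumptions for the sketch, not a prescribed setup.

```python
import numpy as np

def init_weights(seed, fan_in=64, fan_out=32):
    # A dedicated generator keeps initialization independent of other random draws.
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Hypothetical experiment record: document both the scheme and the seed.
config = {"init_scheme": "he_normal", "init_seed": 1234}

w_a = init_weights(config["init_seed"])
w_b = init_weights(config["init_seed"])
w_c = init_weights(config["init_seed"] + 1)

print("same seed, same weights:", np.array_equal(w_a, w_b))
print("different seed, same weights:", np.array_equal(w_a, w_c))
```

Reusing the recorded seed reproduces the starting weights exactly, while changing it gives a different starting point drawn from the same distribution.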

Related Concepts