Short Definition
Exploding gradients occur when gradient magnitudes grow uncontrollably during backpropagation.
Definition
Exploding gradients refer to a training instability in which gradient values increase exponentially as they are propagated backward through a network. This causes excessively large parameter updates that can destabilize optimization, lead to numerical overflow, or cause the model to diverge.
Learning becomes unstable when gradients explode.
Why It Matters
When gradients explode, training becomes erratic or fails entirely. Loss values may spike, parameters may become NaN or infinite, and optimization can break even with small learning rates. Exploding gradients are especially common in deep or recurrent architectures.
They are a primary cause of sudden training failure.
Why Exploding Gradients Occur
Common causes include:
- deep network depth
- repeated multiplication by large derivatives
- poor weight initialization
- long unrolled sequences in recurrent models
- high learning rates
- hard example mining or noisy data
The effect compounds exponentially.
Minimal Conceptual Illustration
large derivative × large derivative × … → infinity
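This compounding can be made concrete with a minimal numeric sketch (plain Python, with illustrative constants): a gradient repeatedly multiplied by a constant per-layer derivative factor grows exponentially when that factor exceeds 1, and decays toward zero when it is below 1.

```python
def backprop_magnitude(layer_factor: float, depth: int) -> float:
    """Magnitude of a gradient signal after `depth` backward
    multiplications by a constant per-layer factor (illustrative)."""
    grad = 1.0
    for _ in range(depth):
        grad *= layer_factor
    return grad

# Factor > 1 compounds into explosion; factor < 1 compounds into vanishing.
print(backprop_magnitude(1.5, 50))  # on the order of 1e8: explosion
print(backprop_magnitude(0.5, 50))  # on the order of 1e-15: vanishing
```

Real networks have varying per-layer factors rather than a constant, but the exponential dependence on depth is the same.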
Exploding vs Vanishing Gradients
- Exploding gradients: gradient magnitudes grow uncontrollably, producing oversized parameter updates
- Vanishing gradients: gradient magnitudes shrink toward zero, stalling learning in early layers
Both arise from unstable gradient propagation.
Effects on Training
Exploding gradients can lead to:
- unstable or oscillating loss
- numerical overflow (NaN/Inf values)
- sensitivity to learning rate
- abrupt divergence after apparent progress
Training may fail suddenly.
Common Mitigation Strategies
Mitigation techniques include:
- gradient clipping
- careful weight initialization
- normalization layers
- smaller learning rates
- architectural changes (e.g., residual connections)
- truncated backpropagation in recurrent models
Gradient clipping is the most direct defense.
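Gradient clipping by global norm can be sketched in a few lines of plain Python; the rescaling rule below mirrors the behavior of common framework utilities (such as PyTorch's `clip_grad_norm_`, named here only for orientation), with an illustrative threshold.

```python
import math

def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale a gradient vector so its L2 norm is at most `max_norm`.
    Gradients already within the limit are returned unchanged."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]

# Norm of [30, 40] is 50; clipping to 5 rescales by 0.1.
print(clip_by_global_norm([30.0, 40.0], max_norm=5.0))  # [3.0, 4.0]
```

Because the whole vector is scaled by one factor, clipping preserves the gradient's direction and only limits its magnitude.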
Relationship to Gradient Variance
High gradient variance increases the likelihood of extreme gradient values. Exploding gradients represent the upper tail of this distribution.
Variance management reduces explosion risk.
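The tail effect can be illustrated with a small simulation (assumed Gaussian gradient noise and illustrative thresholds): raising the variance sharply increases how often a gradient exceeds a fixed magnitude.

```python
import random

random.seed(0)

def tail_fraction(sigma: float, threshold: float, n: int = 100_000) -> float:
    """Fraction of simulated zero-mean Gaussian gradient values whose
    magnitude exceeds `threshold`."""
    return sum(abs(random.gauss(0.0, sigma)) > threshold for _ in range(n)) / n

low_var = tail_fraction(sigma=1.0, threshold=5.0)
high_var = tail_fraction(sigma=3.0, threshold=5.0)
print(low_var, high_var)  # the high-variance tail is far heavier
```

Real gradient distributions are often heavier-tailed than a Gaussian, which makes the effect even more pronounced.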
Relationship to Optimization Stability
Exploding gradients directly threaten optimization stability. Most stability techniques aim to prevent or limit gradient explosions.
Stability is a systems-level property.
Relationship to Generalization
Exploding gradients primarily affect optimization rather than generalization. However, unstable training can prevent convergence to useful solutions, indirectly harming generalization.
Common Pitfalls
- relying solely on learning rate reduction
- ignoring gradient diagnostics
- combining hard example mining with high learning rates
- assuming adaptive optimizers eliminate explosion
- failing to monitor gradient norms
Explosions are often predictable.
Monitoring and Detection
Useful indicators include:
- sudden spikes in loss
- gradient norm blow-ups
- NaN or Inf parameters
- frequent clipping activation
Monitoring enables early intervention.
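A simple gradient-norm monitor covering two of these indicators can be sketched as follows; the spike factor and running-mean decay are illustrative choices, not standard values.

```python
import math

def gradient_health(grad_norms: list[float],
                    spike_factor: float = 10.0) -> list[str]:
    """Flag suspicious steps in a sequence of per-step gradient norms:
    non-finite values, or a norm more than `spike_factor` times an
    exponential running mean of earlier norms."""
    warnings = []
    running_mean = None
    for step, norm in enumerate(grad_norms):
        if math.isnan(norm) or math.isinf(norm):
            warnings.append(f"step {step}: non-finite gradient norm")
            continue
        if running_mean is not None and norm > spike_factor * running_mean:
            warnings.append(f"step {step}: norm spike ({norm:.1f})")
        running_mean = (norm if running_mean is None
                        else 0.9 * running_mean + 0.1 * norm)
    return warnings

# Flags the spike at step 3 and the non-finite norm at step 4.
print(gradient_health([1.0, 1.1, 0.9, 55.0, float("inf")]))
```

Logging the same running statistic alongside the training loss makes it easy to see whether clipping is firing frequently, which is itself an early warning sign.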
Related Concepts
- Training & Optimization
- Vanishing Gradients
- Gradient Clipping
- Gradient Variance
- Optimization Stability
- Weight Initialization
- Recurrent Neural Networks