Exploding Gradients

Short Definition

Exploding gradients occur when gradient magnitudes grow uncontrollably during backpropagation.

Definition

Exploding gradients refer to a training instability in which gradient values increase exponentially as they are propagated backward through a network. This causes excessively large parameter updates that can destabilize optimization, lead to numerical overflow, or cause the model to diverge.

Learning becomes unstable when gradients explode.

Why It Matters

When gradients explode, training becomes erratic or fails entirely. Loss values may spike, parameters may become NaN or infinite, and optimization can break even with small learning rates. Exploding gradients are especially common in deep or recurrent architectures.

They are a primary cause of sudden training failure.

Why Exploding Gradients Occur

Common causes include:

  • large network depth
  • repeated multiplication by large derivatives
  • poor weight initialization
  • long unrolled sequences in recurrent models
  • high learning rates
  • hard example mining or noisy data

The effect compounds exponentially.

Minimal Conceptual Illustration

large derivative × large derivative × … → infinity
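The compounding can be sketched numerically. This is an illustrative toy, not a real backward pass: it treats the backpropagated gradient as a product of one scalar per-layer derivative (a stand-in for layer Jacobians).

```python
# Toy sketch: if each layer contributes a derivative factor > 1,
# the backpropagated gradient grows exponentially with depth.
def backprop_magnitude(per_layer_derivative: float, depth: int) -> float:
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer_derivative  # one multiplication per layer
    return grad

print(backprop_magnitude(1.5, 50))  # 1.5**50, astronomically large
```

Even a modest per-layer factor of 1.5 produces a gradient magnitude above 10^8 after 50 layers.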

Exploding vs Vanishing Gradients

  • Exploding gradients: updates grow uncontrollably
  • Vanishing gradients: updates shrink toward zero

Both arise from unstable gradient propagation.
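The two regimes can be contrasted with the same toy model: a per-layer factor above 1 explodes, a factor below 1 vanishes. The factors and depth here are arbitrary illustrative constants.

```python
# Same product-of-derivatives sketch, two regimes side by side.
depth = 100
exploding = 1.2 ** depth  # factor > 1: grows without bound with depth
vanishing = 0.8 ** depth  # factor < 1: shrinks toward zero with depth

print(f"exploding factor after {depth} layers: {exploding:.3e}")
print(f"vanishing factor after {depth} layers: {vanishing:.3e}")
```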

Effects on Training

Exploding gradients can lead to:

  • unstable or oscillating loss
  • numerical overflow (NaN/Inf values)
  • sensitivity to learning rate
  • abrupt divergence after apparent progress

Training may fail suddenly.

Common Mitigation Strategies

Mitigation techniques include:

  • gradient clipping
  • careful weight initialization
  • normalization layers
  • smaller learning rates
  • architectural changes (e.g., residual connections)
  • truncated backpropagation in recurrent models

Gradient clipping is the most direct defense.
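A minimal sketch of clipping by global norm, in pure Python for clarity (deep learning frameworks ship equivalents, e.g. `torch.nn.utils.clip_grad_norm_`): if the overall gradient norm exceeds a threshold, all gradients are rescaled by the same factor so the norm equals the threshold, preserving the update direction.

```python
import math

def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    # Compute the Euclidean norm over all gradient values.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads  # within budget: leave gradients untouched
    # Rescale uniformly so the clipped norm equals max_norm.
    scale = max_norm / total_norm
    return [g * scale for g in grads]

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))  # norm 5.0 -> 1.0
```

Because all components are scaled together, clipping bounds the update size without changing its direction.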

Relationship to Gradient Variance

High gradient variance increases the likelihood of extreme gradient values. Exploding gradients represent the upper tail of this distribution.

Variance management reduces explosion risk.
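The tail relationship can be illustrated with a quick simulation. This is a hedged toy, assuming Gaussian gradient noise and an arbitrary "extreme" threshold of 5:

```python
import random

def tail_fraction(sigma: float, threshold: float = 5.0, n: int = 100_000) -> float:
    # Fraction of simulated gradient values whose magnitude exceeds
    # the threshold; higher sigma means a fatter extreme tail.
    rng = random.Random(0)  # fixed seed for reproducibility
    return sum(abs(rng.gauss(0.0, sigma)) > threshold for _ in range(n)) / n

print(tail_fraction(sigma=1.0))  # essentially zero extreme values
print(tail_fraction(sigma=3.0))  # a noticeable fraction in the tail
```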

Relationship to Optimization Stability

Exploding gradients directly threaten optimization stability. Most stability techniques aim to prevent or limit gradient explosions.

Stability is a systems-level property.

Relationship to Generalization

Exploding gradients primarily affect optimization rather than generalization. However, unstable training can prevent convergence to useful solutions, indirectly harming generalization.

Common Pitfalls

  • relying solely on learning rate reduction
  • ignoring gradient diagnostics
  • combining hard example mining with high learning rates
  • assuming adaptive optimizers eliminate explosion
  • failing to monitor gradient norms

Explosions are often predictable.

Monitoring and Detection

Useful indicators include:

  • sudden spikes in loss
  • gradient norm blow-ups
  • NaN or Inf parameters
  • frequent clipping activation

Monitoring enables early intervention.
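The indicators above can be wired into a simple detector. A hypothetical sketch: flag any step whose gradient norm is NaN/Inf or jumps far above an exponential moving average of recent norms (the spike factor is an assumed tunable, not a standard value).

```python
import math

def gradient_alarm(grad_norms: list[float], spike_factor: float = 10.0) -> list[int]:
    alarms = []
    running = None  # exponential moving average of recent gradient norms
    for step, norm in enumerate(grad_norms):
        if math.isnan(norm) or math.isinf(norm):
            alarms.append(step)  # numerical overflow: always flag
            continue
        if running is not None and norm > spike_factor * running:
            alarms.append(step)  # sudden blow-up relative to recent history
        running = norm if running is None else 0.9 * running + 0.1 * norm
    return alarms

print(gradient_alarm([1.0, 1.1, 0.9, 50.0, float("inf")]))  # -> [3, 4]
```

Logging the gradient norm every step costs almost nothing and makes the failure mode visible before parameters become NaN.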

Related Concepts

  • Training & Optimization
  • Vanishing Gradients
  • Gradient Clipping
  • Gradient Variance
  • Optimization Stability
  • Weight Initialization
  • Recurrent Neural Networks