Warmup Schedules

Short Definition

Warmup schedules gradually increase the learning rate at the start of training.

Definition

Warmup schedules are learning rate strategies in which the learning rate starts at a small value and increases over an initial training phase, typically reaching the base learning rate before the main schedule takes over. Warmup helps stabilize early training, especially for large models or large batch sizes.

Warmup is commonly used when training transformer-based architectures, which are particularly sensitive to large early updates.

Why It Matters

At the start of training:

  • parameters are randomly initialized and far from useful values
  • gradients can be large and noisy
  • large updates may push parameters into poorly conditioned regions

By keeping early updates small, warmup reduces this instability and often improves convergence.

How It Works (Conceptually)

  • Training begins with a small learning rate (often zero or a small fraction of the base rate)
  • The learning rate increases, usually linearly, over a fixed number of steps or epochs
  • The main schedule (e.g., constant, cosine, or step decay) takes over after warmup

Warmup moderates early gradient updates.

Minimal Python Example

Python
# Linear warmup: ramp toward base_lr, then hold once warmup ends
lr = base_lr * min(1.0, step / warmup_steps)
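
As a fuller sketch, here is the same logic as a self-contained function (the names warmup_lr, base_lr, and warmup_steps are illustrative, not from any particular library):

Python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp from 0 toward base_lr over warmup_steps,
    then hold at base_lr once warmup is complete."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # +1 avoids lr == 0 at step 0
    return base_lr

# Example: base_lr = 1e-3 with 1,000 warmup steps
for step in (0, 500, 999, 1000, 5000):
    print(step, warmup_lr(step, base_lr=1e-3, warmup_steps=1000))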

Common Pitfalls

  • Using warmup unnecessarily for small models or small batch sizes
  • Warmup periods that are too long, which waste training time at low learning rates
  • Forgetting to hand off to the main schedule once warmup ends (see the sketch after this list)
  • Assuming warmup fixes all instability; it cannot compensate for a base learning rate that is simply too high
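
The hand-off pitfall is easiest to avoid by computing warmup and the main schedule in a single function. A minimal sketch combining linear warmup with cosine decay (the choice of cosine decay is an assumption; any main schedule can follow warmup):

Python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay toward 0 over the
    remaining steps. One function, so the hand-off cannot be missed."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))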

Related Concepts

  • Learning Rate Schedules
  • Training Instability
  • Optimization
  • Large-Batch Training
  • Convergence