Short Definition
Warmup schedules gradually increase the learning rate at the start of training.
Definition
Warmup schedules are learning-rate strategies in which the learning rate starts at a small value and increases over an initial phase of training, typically until it reaches the base (peak) learning rate. Warmup helps stabilize early training, especially for large models or large batch sizes.
Warmup is commonly used when training transformer-based architectures.
Why It Matters
At the start of training:
- parameters are freshly initialized and far from useful values
- gradients can be large and noisy
- large updates may destabilize learning
Warmup reduces early instability and improves convergence.
How It Works (Conceptually)
- Training begins with a small learning rate
- The learning rate increases, often linearly, over a fixed number of warmup steps or epochs
- The main schedule (constant, step decay, cosine decay, etc.) takes over once warmup ends
Warmup moderates early gradient updates; the sketch below shows one common combination.
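A minimal sketch, assuming linear warmup followed by cosine decay as the main schedule (the names base_lr, warmup_steps, and total_steps are illustrative, not tied to any library):
Python
import math

def lr_at_step(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000):
    # Warmup phase: scale the learning rate linearly from near zero up to base_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Main schedule: cosine decay from base_lr toward zero (one common choice).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))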
Minimal Python Example
Python
# Linear warmup: the factor grows with step and is clamped at 1.0 once warmup ends
lr = base_lr * min(1.0, step / warmup_steps)
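In practice the warmup factor is usually applied through a framework scheduler. A sketch using PyTorch's LambdaLR, which multiplies the optimizer's base learning rate by the returned factor at each step (the model, optimizer, and warmup_steps values are placeholders):
Python
import torch

model = torch.nn.Linear(16, 1)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(5000):
    # ... forward pass, loss.backward() elided ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()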
Common Pitfalls
- Using warmup unnecessarily for small models
- Warmup periods that are too long, which waste training time at low learning rates
- Forgetting to hand off to the main schedule once warmup ends (see the sketch after this list)
- Assuming warmup fixes all instability issues
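One way to avoid the hand-off pitfall is to chain the warmup and main schedules explicitly. A sketch using PyTorch's SequentialLR (the scheduler choices and step counts are illustrative):
Python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(16, 1)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000
warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
main = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
# SequentialLR switches from `warmup` to `main` at the milestone step,
# so the main schedule actually takes over when warmup ends.
scheduler = SequentialLR(optimizer, schedulers=[warmup, main], milestones=[warmup_steps])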
Related Concepts
- Learning Rate Schedules
- Training Instability
- Optimization
- Large-Batch Training
- Convergence