Short Definition
Teacher forcing is a training technique for sequence models in which the true previous output is used as input at the next time step instead of the model’s own prediction.
Definition
Teacher forcing is a supervised training strategy used in recurrent neural networks (RNNs), LSTMs, GRUs, and sequence-to-sequence models. During training, instead of feeding the model’s predicted output back into the next time step, the ground-truth target is used. This stabilizes training and accelerates convergence.
The model is guided by the correct answer during learning.
Why It Matters
Sequence generation models must learn to predict the next token based on previous outputs. Without teacher forcing:
- early prediction errors compound
- training becomes unstable
- convergence slows dramatically
Teacher forcing reduces cascading error during training.
Core Mechanism
At time step t:
Without teacher forcing:
Input_t = Model_Output_{t-1}
With teacher forcing:
Input_t = Ground_Truth_{t-1}
The model learns under idealized conditions.
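The input-selection rule above can be sketched in a few lines. The "model" here is a hypothetical fixed lookup table with one deliberate error (on 'b'), so the effect of feeding back its own outputs is visible; real models are learned, not tables.

```python
def toy_model(prev_token):
    # Stand-in for a trained sequence model: maps the previous token to a
    # predicted next token, with a deliberate error at 'b'.
    table = {"a": "b", "b": "x", "c": "d"}
    return table.get(prev_token, "?")

targets = ["a", "b", "c", "d"]  # ground-truth sequence

def generate(teacher_forcing):
    prev = targets[0]  # both modes start from the true first token
    preds = []
    for t in range(1, len(targets)):
        pred = toy_model(prev)
        preds.append(pred)
        # Input_t = Ground_Truth_{t-1}  vs.  Input_t = Model_Output_{t-1}
        prev = targets[t] if teacher_forcing else pred
    return preds

print(generate(teacher_forcing=True))   # ['b', 'x', 'd'] — the error stays isolated
print(generate(teacher_forcing=False))  # ['b', 'x', '?'] — the error compounds
```

Under teacher forcing the single wrong prediction does not contaminate later steps, because the next input is always the ground truth.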
Minimal Conceptual Illustration
Training:
y₁ (true) → Model → y₂_pred
y₂ (true) → Model → y₃_pred

Inference:
y₁_pred → Model → y₂_pred
y₂_pred → Model → y₃_pred
Training and inference differ.
Benefits
Teacher forcing:
- accelerates training convergence
- stabilizes gradient flow
- reduces early-stage noise
- improves short-term prediction learning
It simplifies the optimization problem.
Exposure Bias
A key limitation is exposure bias:
- During training: model sees correct previous tokens
- During inference: model sees its own predictions
This mismatch can cause performance degradation at test time.
The model is never trained on its own mistakes.
Scheduled Sampling
To mitigate exposure bias, scheduled sampling gradually replaces ground-truth inputs with model predictions during training.
Example:
With probability p → use ground truth
With probability (1 − p) → use model output
Training gradually matches inference conditions.
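A minimal sketch of this rule: a per-step coin flip plus one common choice of schedule, a linear decay of p. The function names and the decay shape are illustrative assumptions, not a fixed API; other schedules (exponential, inverse-sigmoid) are also used.

```python
import random

def choose_input(ground_truth_prev, model_output_prev, p):
    # Per-step coin flip: with probability p use the ground truth,
    # otherwise feed back the model's own previous output.
    return ground_truth_prev if random.random() < p else model_output_prev

def linear_decay(step, total_steps, p_start=1.0, p_end=0.0):
    # Illustrative schedule: p decays linearly from p_start to p_end,
    # so training conditions gradually approach inference conditions.
    frac = min(step / total_steps, 1.0)
    return p_start + (p_end - p_start) * frac

print(linear_decay(0, 1000))     # 1.0 — early training is fully teacher-forced
print(linear_decay(500, 1000))   # 0.5
print(linear_decay(1000, 1000))  # 0.0 — late training is fully free-running
```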
Relationship to BPTT
Teacher forcing does not replace BPTT; it modifies the input sequence during forward passes. Gradients are still computed using Backpropagation Through Time.
It affects data flow, not gradient computation.
Role in Sequence-to-Sequence Models
Teacher forcing is especially important in:
- machine translation
- text generation
- speech recognition
- time-series forecasting
It was foundational in early Seq2Seq systems.
Practical Considerations
When using teacher forcing:
- consider partial or scheduled sampling
- monitor divergence between training and validation
- evaluate performance under autoregressive inference
- avoid relying solely on teacher-forced validation metrics
Inference must be tested realistically.
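The last two points can be made concrete. In this hedged sketch, a hypothetical toy model (a fixed lookup with one error) scores visibly better on next-token accuracy when evaluated teacher-forced than under autoregressive rollout, which is exactly the gap a teacher-forced validation metric hides.

```python
def toy_model(prev_token):
    # Hypothetical stand-in for a trained model, with one error at 'b'.
    table = {"a": "b", "b": "x", "c": "d"}
    return table.get(prev_token, "?")

targets = ["a", "b", "c", "d"]

def next_token_accuracy(teacher_forced):
    correct, prev = 0, targets[0]
    for t in range(1, len(targets)):
        pred = toy_model(prev)
        correct += int(pred == targets[t])
        # Teacher-forced evaluation resets to the ground truth each step;
        # autoregressive evaluation carries the model's own output forward.
        prev = targets[t] if teacher_forced else pred
    return correct / (len(targets) - 1)

print(next_token_accuracy(teacher_forced=True))   # 2/3: the error stays isolated
print(next_token_accuracy(teacher_forced=False))  # 1/3: one error poisons later steps
```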
Common Pitfalls
- assuming teacher forcing guarantees good inference performance
- ignoring exposure bias
- evaluating models only under teacher-forced conditions
- not testing autoregressive rollout
Training convenience can mask inference weakness.
Teacher Forcing vs Autoregressive Training
| Aspect | Teacher Forcing | Pure Autoregressive |
|---|---|---|
| Stability | High | Lower |
| Convergence speed | Faster | Slower |
| Exposure bias risk | Yes | No |
| Training realism | Lower | Higher |
Trade-off between stability and realism.
Summary Characteristics
| Aspect | Teacher Forcing |
|---|---|
| Purpose | Stabilize sequence training |
| Risk | Exposure bias |
| Common use | Seq2Seq models |
| Training–Inference gap | Present |
| Mitigation | Scheduled sampling |
Related Concepts
- Training & Optimization
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Backpropagation Through Time (BPTT)
- Sequence-to-Sequence Models (Seq2Seq)
- Scheduled Sampling
- Exposure Bias