Short Definition
Teacher forcing is a training technique for sequence models in which the true previous output is used as input at the next time step instead of the model’s own prediction.
Definition
Teacher forcing is a supervised training strategy used in recurrent neural networks (RNNs), LSTMs, GRUs, and sequence-to-sequence models. During training, instead of feeding the model’s predicted output back into the next time step, the ground-truth target is used. This stabilizes training and accelerates convergence.
The model is guided by the correct answer during learning.
Why It Matters
Sequence generation models must learn to predict the next token based on previous outputs. Without teacher forcing:
- early prediction errors compound
- training becomes unstable
- convergence slows dramatically
Teacher forcing reduces cascading error during training.
Core Mechanism
At time step t:
Without teacher forcing:
Input_t = Model_Output_{t-1}
With teacher forcing:
Input_t = Ground_Truth_{t-1}
The model learns under idealized conditions.
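The input-selection rule above can be sketched in a few lines. The "model" here is a hypothetical fixed lookup table with one deliberate error (on 'b'), so the effect of feeding back its own outputs is visible; real models are learned, not tables.

```python
def toy_model(prev_token):
    # Stand-in for a trained sequence model: maps the previous token to a
    # predicted next token, with a deliberate error at 'b'.
    table = {"a": "b", "b": "x", "c": "d"}
    return table.get(prev_token, "?")

targets = ["a", "b", "c", "d"]  # ground-truth sequence

def generate(teacher_forcing):
    prev = targets[0]  # both modes start from the true first token
    preds = []
    for t in range(1, len(targets)):
        pred = toy_model(prev)
        preds.append(pred)
        # Input_t = Ground_Truth_{t-1}  vs.  Input_t = Model_Output_{t-1}
        prev = targets[t] if teacher_forcing else pred
    return preds

print(generate(teacher_forcing=True))   # ['b', 'x', 'd'] — the error stays isolated
print(generate(teacher_forcing=False))  # ['b', 'x', '?'] — the error compounds
```

Under teacher forcing the single wrong prediction does not contaminate later steps, because the next input is always the ground truth.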
Minimal Conceptual Illustration
Training:
y₁ (true) → Model → y₂_pred
y₂ (true) → Model → y₃_pred

Inference:
y₁_pred → Model → y₂_pred
y₂_pred → Model → y₃_pred
Training and inference differ.
Benefits
Teacher forcing:
- accelerates training convergence
- stabilizes gradient flow
- reduces early-stage noise
- improves short-term prediction learning
It simplifies the optimization problem.
Exposure Bias
A key limitation is exposure bias:
- During training: model sees correct previous tokens
- During inference: model sees its own predictions
This mismatch can cause performance degradation at test time.
The model is never trained on its own mistakes.
Scheduled Sampling
To mitigate exposure bias, scheduled sampling gradually replaces ground-truth inputs with model predictions during training.
Example:
With probability p → use ground truth
With probability (1 − p) → use model output
Training gradually matches inference conditions.
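A minimal sketch of this rule: a per-step coin flip plus one common choice of schedule, a linear decay of p. The function names and the decay shape are illustrative assumptions, not a fixed API; other schedules (exponential, inverse-sigmoid) are also used.

```python
import random

def choose_input(ground_truth_prev, model_output_prev, p):
    # Per-step coin flip: with probability p use the ground truth,
    # otherwise feed back the model's own previous output.
    return ground_truth_prev if random.random() < p else model_output_prev

def linear_decay(step, total_steps, p_start=1.0, p_end=0.0):
    # Illustrative schedule: p decays linearly from p_start to p_end,
    # so training conditions gradually approach inference conditions.
    frac = min(step / total_steps, 1.0)
    return p_start + (p_end - p_start) * frac

print(linear_decay(0, 1000))     # 1.0 — early training is fully teacher-forced
print(linear_decay(500, 1000))   # 0.5
print(linear_decay(1000, 1000))  # 0.0 — late training is fully free-running
```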
Relationship to BPTT
Teacher forcing does not replace BPTT; it modifies the input sequence during forward passes. Gradients are still computed using Backpropagation Through Time.
It affects data flow, not gradient computation.
Role in Sequence-to-Sequence Models
Teacher forcing is especially important in:
- machine translation
- text generation
- speech recognition
- time-series forecasting
It was foundational in early Seq2Seq systems.
Practical Considerations
When using teacher forcing:
- consider partial or scheduled sampling
- monitor divergence between training and validation
- evaluate performance under autoregressive inference
- avoid relying solely on teacher-forced validation metrics
Inference must be tested realistically.
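The last two points can be made concrete. In this hedged sketch, a hypothetical toy model (a fixed lookup with one error) scores visibly better on next-token accuracy when evaluated teacher-forced than under autoregressive rollout, which is exactly the gap a teacher-forced validation metric hides.

```python
def toy_model(prev_token):
    # Hypothetical stand-in for a trained model, with one error at 'b'.
    table = {"a": "b", "b": "x", "c": "d"}
    return table.get(prev_token, "?")

targets = ["a", "b", "c", "d"]

def next_token_accuracy(teacher_forced):
    correct, prev = 0, targets[0]
    for t in range(1, len(targets)):
        pred = toy_model(prev)
        correct += int(pred == targets[t])
        # Teacher-forced evaluation resets to the ground truth each step;
        # autoregressive evaluation carries the model's own output forward.
        prev = targets[t] if teacher_forced else pred
    return correct / (len(targets) - 1)

print(next_token_accuracy(teacher_forced=True))   # 2/3: the error stays isolated
print(next_token_accuracy(teacher_forced=False))  # 1/3: one error poisons later steps
```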
Common Pitfalls
- assuming teacher forcing guarantees good inference performance
- ignoring exposure bias
- evaluating models only under teacher-forced conditions
- not testing autoregressive rollout
Training convenience can mask inference weakness.
Teacher Forcing vs Autoregressive Training
| Aspect | Teacher Forcing | Pure Autoregressive |
|---|---|---|
| Stability | High | Lower |
| Convergence speed | Faster | Slower |
| Exposure bias risk | Yes | No |
| Training realism | Lower | Higher |
Trade-off between stability and realism.
Summary Characteristics
| Aspect | Teacher Forcing |
|---|---|
| Purpose | Stabilize sequence training |
| Risk | Exposure bias |
| Common use | Seq2Seq models |
| Training–Inference gap | Present |
| Mitigation | Scheduled sampling |
Related Concepts
- Training & Optimization
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Backpropagation Through Time (BPTT)
- Sequence-to-Sequence Models (Seq2Seq)
- Scheduled Sampling
- Exposure Bias