Teacher Forcing

Short Definition

Teacher forcing is a training technique for sequence models in which the true previous output is used as input at the next time step instead of the model’s own prediction.

Definition

Teacher forcing is a supervised training strategy used in recurrent neural networks (RNNs), LSTMs, GRUs, and sequence-to-sequence models. During training, instead of feeding the model’s predicted output back into the next time step, the ground-truth target is used. This stabilizes training and accelerates convergence.

The model is guided by the correct answer during learning.

Why It Matters

Sequence generation models must learn to predict the next token based on previous outputs. Without teacher forcing:

  • early prediction errors compound
  • training becomes unstable
  • convergence slows dramatically

Teacher forcing reduces cascading error during training.

Core Mechanism

At time step t:

Without teacher forcing:

Input_t = Model_Output_{t-1}

With teacher forcing:

Input_t = Ground_Truth_{t-1}

The model learns under idealized conditions.

Minimal Conceptual Illustration

Training:
y₁ (true) → Model → y₂_pred
y₂ (true) → Model → y₃_pred

Inference:
y₁_pred → Model → y₂_pred
y₂_pred → Model → y₃_pred

Training and inference differ.
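The contrast above can be sketched in a few lines of Python. The `step` function below is a hypothetical toy stand-in for one decoder step (it is not a real model); it lets the data-flow difference, and the resulting error compounding, show up concretely.

```python
# Toy sketch of teacher forcing vs autoregressive decoding.
# `step` is a hypothetical stand-in for one decoder step: it maps
# the previous token to a predicted next token. It is deliberately
# imperfect (off by one) so the two regimes diverge.

def step(prev_token):
    return prev_token + 1  # toy "model" prediction

ground_truth = [1, 3, 5, 7]

# Teacher forcing: each step consumes the TRUE previous token,
# so each prediction's error stays local (here, always off by 1).
teacher_forced = [step(y_prev) for y_prev in ground_truth[:-1]]

# Autoregressive inference: each step consumes the model's OWN
# previous output, so errors compound (off by 1, then 2, then 3).
autoregressive = []
prev = ground_truth[0]  # only the first token is given
for _ in range(len(ground_truth) - 1):
    prev = step(prev)
    autoregressive.append(prev)

print(teacher_forced)   # [2, 4, 6] — errors vs [3, 5, 7]: 1, 1, 1
print(autoregressive)   # [2, 3, 4] — errors vs [3, 5, 7]: 1, 2, 3
```

Under teacher forcing the per-step error never grows, while the autoregressive rollout drifts further from the target at every step; this is exactly the cascading-error problem described above.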

Benefits

Teacher forcing:

  • accelerates training convergence
  • stabilizes gradient flow
  • reduces early-stage noise
  • improves short-term prediction learning

It simplifies the optimization problem.

Exposure Bias

A key limitation is exposure bias:

  • During training: model sees correct previous tokens
  • During inference: model sees its own predictions

This mismatch can cause performance degradation at test time.

The model is never trained on its own mistakes.

Scheduled Sampling

To mitigate exposure bias, scheduled sampling gradually replaces ground-truth inputs with model predictions during training.

Example:

With probability p → use ground truth
With probability (1 − p) → use model output

Here p typically starts near 1 and decays toward 0 as training progresses.

Training gradually matches inference conditions.
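A minimal sketch of this sampling rule is shown below. The linear decay schedule in `teacher_forcing_ratio` is one common but hypothetical choice; inverse-sigmoid and exponential decays are also used in practice.

```python
import random

def scheduled_input(y_true_prev, y_pred_prev, p):
    """Pick the next-step input: the ground-truth previous token
    with probability p, the model's own prediction otherwise."""
    return y_true_prev if random.random() < p else y_pred_prev

def teacher_forcing_ratio(epoch, total_epochs):
    """A hypothetical linear decay: p goes from 1 toward 0 so that
    training gradually shifts to inference-like conditions."""
    return max(0.0, 1.0 - epoch / total_epochs)

# Early training leans on ground truth; late training on predictions.
print(teacher_forcing_ratio(0, 10))   # 1.0
print(teacher_forcing_ratio(9, 10))   # close to 0.1

# At the extremes the choice is deterministic:
print(scheduled_input("true", "pred", 1.0))  # "true"
print(scheduled_input("true", "pred", 0.0))  # "pred"
```

The decay schedule is a tunable design choice: decaying too fast reintroduces the instability teacher forcing was meant to avoid, while decaying too slowly leaves exposure bias largely unaddressed.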

Relationship to BPTT

Teacher forcing does not replace BPTT; it modifies the input sequence during forward passes. Gradients are still computed using Backpropagation Through Time.

It affects data flow, not gradient computation.

Role in Sequence-to-Sequence Models

Teacher forcing is especially important in:

  • machine translation
  • text generation
  • speech recognition
  • time-series forecasting

It was foundational in early Seq2Seq systems.

Practical Considerations

When using teacher forcing:

  • consider partial or scheduled sampling
  • monitor divergence between training and validation
  • evaluate performance under autoregressive inference
  • avoid relying solely on teacher-forced validation metrics

Inference must be tested realistically.
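The last two points can be made concrete with a toy sketch. The `step` function below is a hypothetical one-step "model" with a single learned error; comparing per-token accuracy under teacher-forced inputs versus a free-running rollout shows how teacher-forced metrics can overstate real inference quality.

```python
# Toy comparison: teacher-forced evaluation vs autoregressive rollout.
# `step` is a hypothetical stand-in model that is correct except for
# one learned error (it maps 3 to 4 instead of 5).

def step(prev):
    if prev == 3:
        return 4       # the model's single mistake
    return prev + 2

truth = [1, 3, 5, 7, 9]

def accuracy(preds, targets):
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)

# Teacher-forced evaluation: every input is the true previous token,
# so the single mistake costs exactly one token.
tf_preds = [step(y) for y in truth[:-1]]

# Autoregressive rollout: inputs are the model's own outputs, so the
# single mistake derails every subsequent prediction.
ar_preds, prev = [], truth[0]
for _ in truth[1:]:
    prev = step(prev)
    ar_preds.append(prev)

print(accuracy(tf_preds, truth[1:]))   # 0.75 under teacher forcing
print(accuracy(ar_preds, truth[1:]))   # 0.25 under realistic rollout
```

The same model scores far better under teacher-forced evaluation than under autoregressive rollout, which is why validation metrics computed only with teacher forcing should not be trusted on their own.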

Common Pitfalls

  • assuming teacher forcing guarantees good inference performance
  • ignoring exposure bias
  • evaluating models only under teacher-forced conditions
  • not testing autoregressive rollout

Training convenience can mask inference weakness.

Teacher Forcing vs Autoregressive Training

Aspect              | Teacher Forcing | Pure Autoregressive
Stability           | High            | Lower
Convergence speed   | Faster          | Slower
Exposure bias risk  | Yes             | No
Training realism    | Lower           | Higher

Trade-off between stability and realism.

Summary Characteristics

Aspect                 | Teacher Forcing
Purpose                | Stabilize sequence training
Risk                   | Exposure bias
Common use             | Seq2Seq models
Training–Inference gap | Present
Mitigation             | Scheduled sampling

Related Concepts

  • Training & Optimization
  • Recurrent Neural Network (RNN)
  • Long Short-Term Memory (LSTM)
  • Gated Recurrent Unit (GRU)
  • Backpropagation Through Time (BPTT)
  • Sequence-to-Sequence Models (Seq2Seq)
  • Scheduled Sampling
  • Exposure Bias