Short Definition
Scheduled sampling is a training strategy that gradually replaces ground-truth inputs with model-generated predictions to reduce exposure bias in sequence models.
Definition
Scheduled sampling is a curriculum-based training technique designed to mitigate exposure bias in autoregressive models. Instead of always feeding the true previous token during training (as in teacher forcing), the model is probabilistically fed its own predictions. The probability of using model predictions increases over time according to a predefined schedule.
Training progressively matches inference conditions.
Why It Matters
Teacher forcing stabilizes training but creates a mismatch between training and inference. During inference, the model conditions on its own predictions, which may contain errors. Scheduled sampling reduces this discrepancy by exposing the model to imperfect contexts during training.
The model learns to recover from its own mistakes.
Core Mechanism
At each time step during training:
With probability ε → use model prediction
With probability (1 − ε) → use ground truth
The parameter ε increases over training.
Early training:
Mostly ground truth
Late training:
Mostly model predictions
Guidance gradually decreases.
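The per-step choice above can be sketched minimally (`choose_input` and its argument names are illustrative, not from any particular library):

```python
import random

def choose_input(ground_truth_token, predicted_token, epsilon):
    """Return the next decoder input: the model's own prediction with
    probability epsilon, the ground-truth token otherwise."""
    if random.random() < epsilon:
        return predicted_token    # model prediction (inference-like)
    return ground_truth_token     # teacher forcing

```

At ε = 0 this reduces to pure teacher forcing; at ε = 1 it matches free-running inference.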
Minimal Conceptual Illustration
Step 1: Ground truth → Model
Step 2: Mixed input (GT or Pred) → Model
Step 3: Mostly predicted → Model
Control shifts from teacher to student.
Scheduling Strategies
Common schedules for ε (here, the probability of sampling the model's own prediction at step t; k is a tuning constant) include:
Linear Increase
ε_t = min(1, k * t)
Exponential Saturation
ε_t = 1 - exp(-k * t)
Inverse Sigmoid
ε_t = 1 - k / (k + exp(t/k)), for k ≥ 1
(The original scheduled-sampling paper states these as decay schedules for the teacher-forcing probability 1 − ε.)
Choice of schedule influences stability.
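Written as plain functions, the three schedules are (k is a tuning constant; the inverse-sigmoid form is stated here for the increasing quantity ε, i.e. the complement of the classic decay k / (k + exp(t/k)) applied to the teacher-forcing probability):

```python
import math

def linear_increase(t, k):
    # eps_t = min(1, k * t): linear ramp, then saturation at 1
    return min(1.0, k * t)

def exponential_saturation(t, k):
    # eps_t = 1 - exp(-k * t): fast early growth, asymptotic approach to 1
    return 1.0 - math.exp(-k * t)

def inverse_sigmoid(t, k):
    # eps_t = 1 - k / (k + exp(t / k)): slow start, steep middle, saturation
    return 1.0 - k / (k + math.exp(t / k))
```

All three start near 0 and approach 1; they differ mainly in how quickly guidance is withdrawn mid-training.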
Relationship to Exposure Bias
Exposure bias arises because:
P_train(context) ≠ P_inference(context)
Scheduled sampling reduces this gap by aligning training distribution with inference distribution.
Distribution alignment improves robustness.
Trade-offs
| Benefit | Cost |
|---|---|
| Reduces exposure bias | Training becomes noisier |
| Improves inference realism | Optimization becomes harder |
| Encourages recovery behavior | May destabilize early training |
Stability and realism must be balanced.
Theoretical Caveat
Scheduled sampling introduces bias into the training objective: model-sampled inputs are treated as if they were ground truth, gradients are typically not propagated through the sampling decision, and the resulting objective no longer corresponds to maximum likelihood under the data distribution.
It is a heuristic, not a fully principled solution.
Alternatives
- Professor Forcing
- Reinforcement learning fine-tuning
- Sequence-level objectives
- Data noising techniques
- Self-critical training
Each addresses training–inference mismatch differently.
Practical Considerations
When applying scheduled sampling:
- Start with low ε to preserve stability.
- Increase gradually.
- Monitor validation under full autoregressive rollout.
- Avoid aggressive early switching.
Too much noise too early harms convergence.
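The guidelines above can be combined into a training-loop skeleton (a sketch only — `predict`, `learn`, and `max_eps` are assumed names for illustration, not a specific framework's API):

```python
import random

def train_with_scheduled_sampling(sequences, predict, learn,
                                  total_steps, max_eps=0.5):
    """Skeleton loop: learn(prev, target) performs one update against the
    ground-truth target; predict(prev) returns the model's next-token guess.
    Epsilon ramps linearly and is capped at max_eps so early training
    stays mostly teacher-forced."""
    step = 0
    for seq in sequences:
        eps = max_eps * min(1.0, step / total_steps)  # low eps early
        prev = seq[0]                                 # start-of-sequence token
        for target in seq[1:]:
            learn(prev, target)                       # loss always vs. ground truth
            # Next input: model prediction with probability eps
            prev = predict(prev) if random.random() < eps else target
        step += 1
```

Validation should then run the model in full autoregressive rollout (feeding predictions back in), not under teacher forcing.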
Common Pitfalls
- Switching too quickly from teacher forcing
- Evaluating under teacher forcing instead of full rollout
- Assuming scheduled sampling eliminates exposure bias entirely
- Ignoring instability in long sequences
Mitigation does not mean elimination.
Scheduled Sampling vs Teacher Forcing
| Aspect | Teacher Forcing | Scheduled Sampling |
|---|---|---|
| Training stability | High | Moderate |
| Inference realism | Low | Higher |
| Exposure bias | Present | Reduced |
| Complexity | Simple | More complex |
Scheduled sampling introduces controlled realism.
Summary Characteristics
| Aspect | Scheduled Sampling |
|---|---|
| Purpose | Reduce exposure bias |
| Strategy | Gradual replacement of ground truth |
| Risk | Optimization instability |
| Benefit | Better inference robustness |
| Domain | Autoregressive models |
Related Concepts
- Training & Optimization
- Teacher Forcing
- Exposure Bias
- Sequence-to-Sequence Models (Seq2Seq)
- Autoregressive Models
- Reinforcement Learning Fine-Tuning
- Evaluation Protocols