Scheduled Sampling (Deep Dive)

Short Definition

Scheduled sampling is a training strategy that gradually replaces ground-truth inputs with model-generated predictions to reduce exposure bias in sequence models.

Definition

Scheduled sampling is a curriculum-based training technique designed to mitigate exposure bias in autoregressive models. Instead of always feeding the true previous token during training (as in teacher forcing), the model is probabilistically fed its own predictions. The probability of using model predictions increases over time according to a predefined schedule.

Training progressively matches inference conditions.

Why It Matters

Teacher forcing stabilizes training but creates a mismatch between training and inference. During inference, the model conditions on its own predictions, which may contain errors. Scheduled sampling reduces this discrepancy by exposing the model to imperfect contexts during training.

The model learns to recover from its own mistakes.

Core Mechanism

At each time step during training:


  • With probability ε → use the model's prediction
  • With probability (1 − ε) → use the ground truth

The parameter ε increases over training.

Early training:

Mostly ground truth

Late training:

Mostly model predictions

Guidance gradually decreases.
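The per-step mixing above can be sketched in a few lines. This is a minimal illustration, not a full training loop: `scheduled_input` and `train_step_inputs` are hypothetical helper names, and the sequences are assumed to be already shifted so that index t holds the candidate input for step t.

```python
import random

def scheduled_input(ground_truth_token, predicted_token, epsilon, rng):
    """With probability epsilon, feed the model's own prediction;
    otherwise feed the ground-truth token (teacher forcing)."""
    return predicted_token if rng.random() < epsilon else ground_truth_token

def train_step_inputs(ground_truth, predictions, epsilon, rng):
    """Build the decoder inputs for one training pass: the start token
    is always ground truth; later positions are mixed per time step."""
    inputs = [ground_truth[0]]
    for t in range(1, len(ground_truth)):
        inputs.append(scheduled_input(ground_truth[t], predictions[t],
                                      epsilon, rng))
    return inputs
```

With ε = 0 this reduces exactly to teacher forcing; with ε = 1 the model conditions entirely on its own predictions after the start token.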

Minimal Conceptual Illustration

Step 1: GroundTruth → Model
Step 2: Mixed input (GT or Pred) → Model
Step 3: Mostly Predicted → Model

Control shifts from teacher to student.

Scheduling Strategies

Common schedules for ε include:

Linear Increase

ε_t = min(1, k * t)

Exponential (decay of the ground-truth probability)

ε_t = 1 - exp(-k * t)

Inverse Sigmoid

ε_t = 1 - k / (k + exp(t/k)),  k ≥ 1

Choice of schedule influences stability.
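The three schedules can be written directly. Here ε is the probability of feeding the model's own prediction, so every schedule increases toward 1 as training step t grows; the function names and default constants are illustrative choices, not standard values.

```python
import math

def linear_increase(t, k=1e-3):
    # epsilon_t = min(1, k * t): reaches 1 after 1/k steps
    return min(1.0, k * t)

def exponential(t, k=1e-3):
    # the ground-truth probability decays as exp(-k * t),
    # so epsilon_t = 1 - exp(-k * t)
    return 1.0 - math.exp(-k * t)

def inverse_sigmoid(t, k=10.0):
    # epsilon_t = 1 - k / (k + exp(t/k)); larger k delays
    # and smooths the transition away from teacher forcing
    return 1.0 - k / (k + math.exp(t / k))
```

The inverse sigmoid keeps ε near its starting value for roughly the first k·ln(k) steps, then transitions smoothly, which is why it is often preferred for preserving early-training stability.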

Relationship to Exposure Bias

Exposure bias arises because:

P_train(context) ≠ P_inference(context)

Scheduled sampling reduces this gap by aligning training distribution with inference distribution.

Distribution alignment improves robustness.

Trade-offs

| Benefit | Cost |
|---|---|
| Reduces exposure bias | Training becomes noisier |
| Improves inference realism | Optimization becomes harder |
| Encourages recovery behavior | May destabilize early training |

Stability and realism must be balanced.

Theoretical Caveat

Scheduled sampling biases gradient estimation: the inputs fed to the model depend on its own predictions, yet gradients are typically not propagated through the sampling decisions. The resulting objective can be inconsistent, meaning its optimum need not recover the true data distribution.

It is a heuristic, not a fully principled solution.

Alternatives

  • Professor Forcing
  • Reinforcement learning fine-tuning
  • Sequence-level objectives
  • Data noising techniques
  • Self-critical training

Each addresses training–inference mismatch differently.

Practical Considerations

When applying scheduled sampling:

  • Start with low ε to preserve stability.
  • Increase gradually.
  • Monitor validation under full autoregressive rollout.
  • Avoid aggressive early switching.

Too much noise too early harms convergence.
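Monitoring validation under full autoregressive rollout, rather than under teacher forcing, can be sketched as follows; `model_step` stands in for an arbitrary next-token function and the helpers are hypothetical names.

```python
def rollout(model_step, start_token, length):
    """Full autoregressive rollout: every step conditions on the
    model's own previous output, matching inference conditions."""
    seq = [start_token]
    while len(seq) < length:
        seq.append(model_step(seq[-1]))
    return seq

def teacher_forced(model_step, ground_truth):
    """Teacher-forced pass: every step conditions on ground truth.
    Scoring only this pass hides compounding errors."""
    return [ground_truth[0]] + [model_step(tok) for tok in ground_truth[:-1]]
```

A model can look accurate under `teacher_forced` evaluation yet drift badly under `rollout`; the rollout metric is the one scheduled sampling is meant to improve.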

Common Pitfalls

  • Switching too quickly from teacher forcing
  • Evaluating under teacher forcing instead of full rollout
  • Assuming scheduled sampling eliminates exposure bias entirely
  • Ignoring instability in long sequences

Mitigation does not mean elimination.

Scheduled Sampling vs Teacher Forcing

| Aspect | Teacher Forcing | Scheduled Sampling |
|---|---|---|
| Training stability | High | Moderate |
| Inference realism | Low | Higher |
| Exposure bias | Present | Reduced |
| Complexity | Simple | More complex |

Scheduled sampling introduces controlled realism.

Summary Characteristics

| Aspect | Scheduled Sampling |
|---|---|
| Purpose | Reduce exposure bias |
| Strategy | Gradual replacement of ground truth |
| Risk | Optimization instability |
| Benefit | Better inference robustness |
| Domain | Autoregressive models |

Related Concepts

  • Training & Optimization
  • Teacher Forcing
  • Exposure Bias
  • Sequence-to-Sequence Models (Seq2Seq)
  • Autoregressive Models
  • Reinforcement Learning Fine-Tuning
  • Evaluation Protocols