Scheduled Sampling (Deep Dive)

Short Definition

Scheduled sampling is a training strategy that gradually replaces ground-truth inputs with model-generated predictions to reduce exposure bias in sequence models.

Definition

Scheduled sampling is a curriculum-based training technique designed to mitigate exposure bias in autoregressive models. Instead of always feeding the true previous token during training (as in teacher forcing), the model is probabilistically fed its own predictions. The probability of using model predictions increases over time according to a predefined schedule.

Training progressively matches inference conditions.

Why It Matters

Teacher forcing stabilizes training but creates a mismatch between training and inference. During inference, the model conditions on its own predictions, which may contain errors. Scheduled sampling reduces this discrepancy by exposing the model to imperfect contexts during training.

The model learns to recover from its own mistakes.

Core Mechanism

At each time step during training:


  • With probability ε → use the model's prediction
  • With probability (1 − ε) → use the ground truth

The parameter ε increases over training.

Early training:

Mostly ground truth

Late training:

Mostly model predictions

Guidance gradually decreases.
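The per-step mixing above can be sketched in a few lines. This is a minimal illustration, not a full training loop: `scheduled_input` and `train_step_inputs` are hypothetical helper names, and the sequences are assumed to be already shifted so that index t holds the candidate input for step t.

```python
import random

def scheduled_input(ground_truth_token, predicted_token, epsilon, rng):
    """With probability epsilon, feed the model's own prediction;
    otherwise feed the ground-truth token (teacher forcing)."""
    return predicted_token if rng.random() < epsilon else ground_truth_token

def train_step_inputs(ground_truth, predictions, epsilon, rng):
    """Build the decoder inputs for one training pass: the start token
    is always ground truth; later positions are mixed per time step."""
    inputs = [ground_truth[0]]
    for t in range(1, len(ground_truth)):
        inputs.append(scheduled_input(ground_truth[t], predictions[t],
                                      epsilon, rng))
    return inputs
```

With ε = 0 this reduces exactly to teacher forcing; with ε = 1 the model conditions entirely on its own predictions after the start token.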

Minimal Conceptual Illustration

Step 1: GroundTruth → Model
Step 2: Mixed input (GT or Pred) → Model
Step 3: Mostly Predicted → Model

Control shifts from teacher to student.

Scheduling Strategies

Common schedules for ε include:

Linear Increase

ε_t = min(1, k * t)

Exponential (decay of the ground-truth probability)

ε_t = 1 - exp(-k * t)

Inverse Sigmoid

ε_t = 1 - k / (k + exp(t/k)),  k ≥ 1

Choice of schedule influences stability.
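The three schedules can be written directly. Here ε is the probability of feeding the model's own prediction, so every schedule increases toward 1 as training step t grows; the function names and default constants are illustrative choices, not standard values.

```python
import math

def linear_increase(t, k=1e-3):
    # epsilon_t = min(1, k * t): reaches 1 after 1/k steps
    return min(1.0, k * t)

def exponential(t, k=1e-3):
    # the ground-truth probability decays as exp(-k * t),
    # so epsilon_t = 1 - exp(-k * t)
    return 1.0 - math.exp(-k * t)

def inverse_sigmoid(t, k=10.0):
    # epsilon_t = 1 - k / (k + exp(t/k)); larger k delays
    # and smooths the transition away from teacher forcing
    return 1.0 - k / (k + math.exp(t / k))
```

The inverse sigmoid keeps ε near its starting value for roughly the first k·ln(k) steps, then transitions smoothly, which is why it is often preferred for preserving early-training stability.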

Relationship to Exposure Bias

Exposure bias arises because:

P_train(context) ≠ P_inference(context)

Scheduled sampling reduces this gap by aligning training distribution with inference distribution.

Distribution alignment improves robustness.

Trade-offs

| Benefit | Cost |
|---|---|
| Reduces exposure bias | Training becomes noisier |
| Improves inference realism | Optimization becomes harder |
| Encourages recovery behavior | May destabilize early training |

Stability and realism must be balanced.

Theoretical Caveat

Scheduled sampling biases gradient estimation: the inputs fed to the model depend on its own predictions, yet gradients are typically not propagated through the sampling decisions. The resulting objective can be inconsistent, meaning its optimum need not recover the true data distribution.

It is a heuristic, not a fully principled solution.

Alternatives

  • Professor Forcing
  • Reinforcement learning fine-tuning
  • Sequence-level objectives
  • Data noising techniques
  • Self-critical training

Each addresses training–inference mismatch differently.

Practical Considerations

When applying scheduled sampling:

  • Start with low ε to preserve stability.
  • Increase gradually.
  • Monitor validation under full autoregressive rollout.
  • Avoid aggressive early switching.

Too much noise too early harms convergence.
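Monitoring validation under full autoregressive rollout, rather than under teacher forcing, can be sketched as follows; `model_step` stands in for an arbitrary next-token function and the helpers are hypothetical names.

```python
def rollout(model_step, start_token, length):
    """Full autoregressive rollout: every step conditions on the
    model's own previous output, matching inference conditions."""
    seq = [start_token]
    while len(seq) < length:
        seq.append(model_step(seq[-1]))
    return seq

def teacher_forced(model_step, ground_truth):
    """Teacher-forced pass: every step conditions on ground truth.
    Scoring only this pass hides compounding errors."""
    return [ground_truth[0]] + [model_step(tok) for tok in ground_truth[:-1]]
```

A model can look accurate under `teacher_forced` evaluation yet drift badly under `rollout`; the rollout metric is the one scheduled sampling is meant to improve.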

Common Pitfalls

  • Switching too quickly from teacher forcing
  • Evaluating under teacher forcing instead of full rollout
  • Assuming scheduled sampling eliminates exposure bias entirely
  • Ignoring instability in long sequences

Mitigation does not mean elimination.

Scheduled Sampling vs Teacher Forcing

| Aspect | Teacher Forcing | Scheduled Sampling |
|---|---|---|
| Training stability | High | Moderate |
| Inference realism | Low | Higher |
| Exposure bias | Present | Reduced |
| Complexity | Simple | More complex |

Scheduled sampling introduces controlled realism.

Summary Characteristics

| Aspect | Scheduled Sampling |
|---|---|
| Purpose | Reduce exposure bias |
| Strategy | Gradual replacement of ground truth |
| Risk | Optimization instability |
| Benefit | Better inference robustness |
| Domain | Autoregressive models |

Related Concepts

  • Training & Optimization
  • Teacher Forcing
  • Exposure Bias
  • Sequence-to-Sequence Models (Seq2Seq)
  • Autoregressive Models
  • Reinforcement Learning Fine-Tuning
  • Evaluation Protocols