Reward Model Collapse

Short Definition

Reward Model Collapse occurs when a learned reward model in RLHF degenerates into assigning uniformly high (or low) rewards, losing meaningful discrimination between outputs.

It destroys the reward signal and destabilizes policy optimization.

Definition

In RLHF, a reward model ( R_\phi(x, y) ) is trained to predict human preferences between outputs.

Ideally:

[
R_\phi(x, y_{\text{preferred}}) > R_\phi(x, y_{\text{rejected}})
]
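Reward models are typically trained to satisfy this ordering with a pairwise (Bradley-Terry style) loss. A minimal sketch in Python, with illustrative scores taken from the example later in this article:

```python
import math

def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_preferred - r_rejected).
    The loss is small only when the preferred output scores clearly higher."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A healthy reward model separates the pair: large margin, small loss.
print(pairwise_loss(2.1, 0.8))
# A collapsed reward model barely separates them: the loss stays near log(2),
# as if the model were guessing at random.
print(pairwise_loss(9.99, 9.98))
```

The key observation: once scores saturate, the margin vanishes and the training signal flattens, which is exactly the degeneration described below.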

Reward Model Collapse occurs when:

  • Reward scores saturate.
  • Differences between outputs shrink or explode.
  • The model assigns near-constant values.
  • Ranking signal degrades.

As a result, policy optimization loses meaningful gradient guidance.

Core Failure Mode

Collapse typically manifests as:

  1. Reward Saturation
  • All outputs receive very high scores.
  • Little differentiation.
  2. Reward Flattening
  • Reward variance approaches zero.
  • Model cannot rank outputs effectively.
  3. Reward Explosion
  • Magnitudes grow uncontrollably.
  • Numerical instability occurs.

In all cases, reward ceases to be informative.
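These failure modes are detectable from reward scores on a held-out preference set. A hedged sketch (the pairs and thresholds are illustrative, not standard values):

```python
from statistics import pvariance

def collapse_diagnostics(pairs):
    """pairs: list of (score_preferred, score_rejected) on held-out preferences.
    Returns ranking accuracy and overall score variance; accuracy drifts
    toward chance (0.5) and variance toward 0 as the reward model collapses."""
    accuracy = sum(p > r for p, r in pairs) / len(pairs)
    scores = [s for pair in pairs for s in pair]
    return accuracy, pvariance(scores)

healthy = [(2.1, 0.8), (1.5, -0.2), (0.9, -1.1)]
collapsed = [(9.99, 9.98), (9.98, 9.99), (9.99, 9.99)]

print(collapse_diagnostics(healthy))    # high accuracy, nonzero variance
print(collapse_diagnostics(collapsed))  # near-chance accuracy, near-zero variance
```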

Minimal Conceptual Illustration


Healthy reward:
Output A = 2.1
Output B = 0.8

Collapsed reward:
Output A = 9.99
Output B = 9.98

No meaningful separation.

Causes

  1. Over-optimization of the reward model.
  2. Poor regularization.
  3. Dataset imbalance.
  4. Feedback loops with the policy.
  5. Excessive training epochs.
  6. Narrow distribution of prompts.

Collapse often emerges in later training stages.

Feedback Loop Dynamics

Policy optimization and reward model training may create feedback loops:

  1. Policy produces outputs.
  2. Reward model scores them.
  3. Policy optimizes toward reward.
  4. Reward model retrained on policy outputs.

If unchecked:

  • Distribution narrows.
  • Reward diversity decreases.
  • Collapse becomes more likely.

This is a systemic instability.
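The narrowing dynamic above can be sketched with a toy simulation (pure Python; outputs are scalars, the reward simply prefers larger values, and every constant is illustrative):

```python
import random
random.seed(0)

# Toy stand-in for the reward model: prefers larger scalar "outputs".
reward = lambda x: x

population = [random.gauss(0, 1) for _ in range(1000)]
spreads = []
for step in range(5):
    # 1-2. Policy produces outputs; reward model scores them.
    scored = sorted(population, key=reward, reverse=True)
    # 3. Policy optimizes toward reward: keep only the top 20% ...
    survivors = scored[:200]
    # ... and resample tightly around them, so the distribution narrows.
    population = [random.gauss(random.choice(survivors), 0.1) for _ in range(1000)]
    # 4. The reward model's next "retraining data" is this narrowed slice.
    rewards = [reward(x) for x in population]
    spreads.append(max(rewards) - min(rewards))
    print(f"step {step}: reward spread in training data = {spreads[-1]:.3f}")
```

Each iteration, the range of rewards the reward model would be retrained on shrinks: the diversity needed to maintain a discriminative reward signal disappears.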

Interaction with PPO

If reward collapses:

  • Advantage estimates degrade.
  • Gradients become noisy or meaningless.
  • Policy training becomes unstable.
  • KL penalty may dominate.

PPO assumes informative reward gradients.

Collapse breaks that assumption.
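The effect on advantages can be seen with a deliberately simplified estimator (batch-mean baseline rather than GAE, which PPO implementations typically use):

```python
def advantages(rewards):
    """Simplified advantage estimate: reward minus the batch-mean baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Informative rewards yield advantages PPO can learn from.
print(advantages([2.1, 0.8, -0.5]))
# Collapsed rewards yield near-zero advantages: any remaining
# gradient is dominated by noise (or by the KL penalty term).
print(advantages([9.99, 9.98, 9.99]))
```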


Distinction from Reward Model Overfitting

Overfitting:

  • Memorizes preference dataset.
  • Fails to generalize.

Collapse:

  • Loses ranking resolution entirely.
  • Produces degenerate reward outputs.

Both are severe, but collapse is more catastrophic.

Scaling Context

Large models:

  • More powerful policy optimization.
  • Greater ability to exploit reward weaknesses.
  • Increased pressure on reward model stability.

As model capability increases, reward collapse risk increases.

Reward model quality becomes a scaling bottleneck.

Alignment Implications

Reward collapse can lead to:

  • Policy stagnation.
  • Unpredictable behavior.
  • Exploitative but meaningless outputs.
  • Alignment regression.

It undermines the feedback mechanism.

Reward signal integrity is central to alignment.

Governance Perspective

Mitigation requires:

  • Reward variance monitoring.
  • Validation across diverse prompt sets.
  • Adversarial stress testing.
  • Periodic reward recalibration.
  • Independent evaluation datasets.

Reward collapse should trigger training audits.
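Variance monitoring can be automated as a simple health check that flags both flattening and explosion; the thresholds below are placeholders that would need tuning per system:

```python
from statistics import pstdev

def reward_health_check(batch_rewards, min_std=0.05, max_abs=50.0):
    """Return audit flags for a batch of reward scores:
    flattening (std below min_std) or explosion (magnitude above max_abs).
    Thresholds are illustrative, not recommended values."""
    flags = []
    if pstdev(batch_rewards) < min_std:
        flags.append("flattened: reward variance near zero")
    if max(abs(r) for r in batch_rewards) > max_abs:
        flags.append("exploded: reward magnitude out of range")
    return flags

print(reward_health_check([2.1, 0.8, -0.5]))       # [] -> healthy batch
print(reward_health_check([9.99, 9.98, 9.99]))     # flattening flag
print(reward_health_check([120.0, -300.0, 95.0]))  # explosion flag
```

In practice such a check would run over diverse prompt sets, since collapse on a narrow distribution can hide behind healthy statistics elsewhere.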

Mitigation Strategies

  • Early stopping.
  • Regularization.
  • Larger and more diverse preference data.
  • KL-anchored policy training.
  • Ensemble reward models.
  • Uncertainty estimation in reward.

Diversity preservation reduces collapse risk.
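Ensembling and uncertainty estimation combine naturally: disagreement across ensemble members serves as an uncertainty signal that can discount the reward. A minimal sketch, assuming the scores come from independently trained reward models and using an illustrative penalty weight:

```python
from statistics import mean, pstdev

def ensemble_reward(scores, uncertainty_penalty=1.0):
    """Combine an ensemble of reward-model scores for one output.
    Disagreement (std across members) acts as an uncertainty estimate and
    is subtracted, so the policy is not rewarded for exploiting regions
    where the reward models disagree."""
    return mean(scores) - uncertainty_penalty * pstdev(scores)

# Members agree: the combined reward stays close to the mean.
print(ensemble_reward([2.0, 2.1, 1.9]))
# Members disagree: the combined reward is discounted heavily.
print(ensemble_reward([9.0, 0.5, 4.0]))
```

The design intuition: a single collapsed or exploited reward model drags the ensemble apart, and the disagreement penalty converts that into a conservative score rather than an inflated one.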

Summary

Reward Model Collapse:

  • Degeneration of reward signal.
  • Loss of meaningful ranking.
  • Destabilizes policy optimization.
  • Often emerges via feedback loops.
  • Critical failure mode in RLHF systems.

Stable alignment requires robust reward modeling.

Related Concepts

  • Reward Model Overfitting
  • Reinforcement Learning from Human Feedback (RLHF)
  • KL Penalty in RLHF
  • Policy Collapse
  • Reward Hacking
  • Preference Drift
  • Alignment Fragility
  • Evaluation Governance