Short Definition
Reward Model Collapse occurs when a learned reward model in RLHF degenerates into assigning uniformly high (or low) rewards, losing meaningful discrimination between outputs.
It destroys the reward signal and destabilizes policy optimization.
Definition
In RLHF, a reward model \( R_\phi(x, y) \) is trained to predict human preferences between outputs.
Ideally:
\[
R_\phi(x, y_{\text{preferred}}) > R_\phi(x, y_{\text{rejected}})
\]
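This preference objective is typically a pairwise Bradley-Terry loss. A minimal sketch (the threshold values below are purely illustrative, not from any specific implementation):

```python
import math

def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred output wins under the
    Bradley-Terry model: P(preferred wins) = sigmoid(r_pref - r_rej)."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A healthy reward model separates the pair, giving a low loss.
# A collapsed one assigns near-identical scores, so the loss sits
# near log(2) ~ 0.693 and carries almost no learning signal.
healthy = bradley_terry_loss(2.1, 0.8)      # clear margin
collapsed = bradley_terry_loss(9.99, 9.98)  # near-constant scores
```

Note that the loss depends only on the score margin, which is why a collapsed model that preserves scale but loses separation is indistinguishable from random guessing.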
Reward Model Collapse occurs when:
- Reward scores saturate.
- Differences between outputs shrink or explode.
- The model assigns near-constant values.
- Ranking signal degrades.
As a result, policy optimization loses meaningful gradient guidance.
Core Failure Mode
Collapse typically manifests as:
- Reward Saturation
  - All outputs receive very high scores.
  - Little differentiation.
- Reward Flattening
  - Reward variance approaches zero.
  - Model cannot rank outputs effectively.
- Reward Explosion
  - Magnitudes grow uncontrollably.
  - Numerical instability occurs.
In all cases, the reward signal ceases to be informative.
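These three modes can be flagged with simple batch statistics. The sketch below assumes a nominal score ceiling and ad hoc thresholds; real monitoring would calibrate both:

```python
from statistics import pstdev

def diagnose(rewards, score_max=10.0, min_std=0.05, explode_at=1e3):
    """Illustrative checks for the three failure modes.
    `score_max`, `min_std`, and `explode_at` are hypothetical thresholds."""
    flags = []
    if pstdev(rewards) < min_std:                 # variance near zero
        flags.append("flattening")
    if min(rewards) > 0.9 * score_max:            # everything scored near the top
        flags.append("saturation")
    if max(abs(r) for r in rewards) > explode_at: # magnitudes blowing up
        flags.append("explosion")
    return flags

flags = diagnose([9.99, 9.98, 9.97])  # flattening + saturation
```

A healthy batch such as `[2.1, 0.8, -1.3]` raises no flags under the same thresholds.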
Minimal Conceptual Illustration
Healthy reward:
Output A = 2.1
Output B = 0.8
Collapsed reward:
Output A = 9.99
Output B = 9.98
No meaningful separation.
Causes
- Over-optimization of reward model.
- Poor regularization.
- Dataset imbalance.
- Feedback loops with policy.
- Excessive training epochs.
- Narrow distribution of prompts.
Collapse often emerges in later training stages.
Feedback Loop Dynamics
Policy optimization and reward model training may create feedback loops:
- The policy produces outputs.
- The reward model scores them.
- The policy optimizes toward the reward.
- The reward model is retrained on policy outputs.
If unchecked:
- Distribution narrows.
- Reward diversity decreases.
- Collapse becomes more likely.
This is a systemic instability.
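The narrowing dynamic can be seen in a toy simulation. Here a fixed scoring rule stands in for the reward model, and refitting the sampling distribution on top-scoring outputs stands in for the loop above; all quantities are illustrative, not a real RLHF pipeline:

```python
import random
from statistics import pstdev

def simulate_feedback_loop(rounds=8, samples=400, top_frac=0.2, target=1.0):
    """Toy sketch: each round, the 'policy' samples outputs, a stand-in
    'reward model' scores proximity to a fixed target, and the policy
    refits its sampling distribution on the top-scoring outputs.
    The spread of the training distribution shrinks round over round."""
    center, spread = 0.0, 2.0
    spreads = [spread]
    for _ in range(rounds):
        outputs = [random.gauss(center, spread) for _ in range(samples)]
        outputs.sort(key=lambda x: abs(x - target))  # best "reward" first
        top = outputs[: int(samples * top_frac)]
        center = sum(top) / len(top)   # policy recenters on the winners
        spread = pstdev(top)           # and its output diversity shrinks
        spreads.append(spread)
    return spreads

random.seed(0)
spreads = simulate_feedback_loop()
# spreads decreases monotonically toward zero: the distribution narrows
```

The same contraction is what starves a retrained reward model of diverse comparisons, making collapse self-reinforcing.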
Interaction with PPO
If reward collapses:
- Advantage estimates degrade.
- Gradients become noisy or meaningless.
- Policy training becomes unstable.
- KL penalty may dominate.
PPO assumes informative reward gradients.
Collapse breaks that assumption.
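The advantage degradation is concrete. With the batch-whitening commonly used in PPO-style training, a collapsed reward batch produces either vanishing raw advantages or full-scale advantages made of pure reward-model noise (a sketch only; real implementations also combine this with a learned value baseline):

```python
from statistics import mean, pstdev

def normalized_advantages(rewards, eps=1e-8):
    """Whitened advantages, as commonly used in PPO-style setups:
    A_i = (r_i - mean) / (std + eps)."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

collapsed = [9.99, 9.98, 9.99, 9.98]
raw = [r - mean(collapsed) for r in collapsed]  # ~±0.005: the signal vanishes
white = normalized_advantages(collapsed)        # ~±1.0: tiny noise amplified to full scale
```

Either way the policy gradient is uninformative: near-zero updates, or updates driven entirely by score differences too small to reflect real quality.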
Distinction from Reward Model Overfitting
Overfitting:
- Memorizes preference dataset.
- Fails to generalize.
Collapse:
- Loses ranking resolution entirely.
- Produces degenerate reward outputs.
Both are severe, but collapse is the more catastrophic of the two.
Scaling Context
Large models:
- More powerful policy optimization.
- Greater ability to exploit reward weaknesses.
- Increased pressure on reward model stability.
As model capability increases, so does the risk of reward collapse.
Reward model quality becomes a scaling bottleneck.
Alignment Implications
Reward collapse can lead to:
- Policy stagnation.
- Unpredictable behavior.
- Exploitative but meaningless outputs.
- Alignment regression.
It undermines the feedback mechanism.
Reward signal integrity is central to alignment.
Governance Perspective
Mitigation requires:
- Reward variance monitoring.
- Validation across diverse prompt sets.
- Adversarial stress testing.
- Periodic reward recalibration.
- Independent evaluation datasets.
Reward collapse should trigger training audits.
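Variance monitoring across prompt sets can be automated. A governance-style sketch, with a hypothetical threshold that would need per-deployment calibration:

```python
from statistics import pstdev

def audit_reward_batches(batches, min_std=0.05):
    """Return the prompt sets whose reward-score variance has collapsed,
    given scores grouped by prompt set. `min_std` is illustrative."""
    return [name for name, scores in batches.items() if pstdev(scores) < min_std]

flagged = audit_reward_batches({
    "code":   [1.2, -0.4, 0.9, 2.3],
    "safety": [5.01, 5.00, 5.02, 5.01],  # near-constant scores
})
# "safety" is flagged, which would trigger a training audit
```

Running the same check on held-out, independently curated prompt sets guards against the policy-induced distribution narrowing described earlier.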
Mitigation Strategies
- Early stopping.
- Regularization.
- Larger and more diverse preference data.
- KL-anchored policy training.
- Ensemble reward models.
- Uncertainty estimation in reward.
Diversity preservation reduces collapse risk.
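Ensembling and uncertainty estimation can be combined: disagreement across ensemble members serves as an uncertainty signal, and the reward can be penalized pessimistically where members diverge. A sketch with hypothetical stand-in members, not real reward models:

```python
from statistics import mean, pstdev

def ensemble_reward(output, reward_models, beta=1.0):
    """Ensemble scoring sketch: mean reward, with member disagreement
    (std across the ensemble) used as an uncertainty penalty.
    `reward_models` is any list of callables mapping output -> score."""
    scores = [rm(output) for rm in reward_models]
    mu, disagreement = mean(scores), pstdev(scores)
    return mu - beta * disagreement, disagreement  # pessimistic reward, uncertainty

rms = [
    lambda s: 0.10 * len(s),         # member 1
    lambda s: 0.10 * len(s) + 0.02,  # member 2: agrees closely
    lambda s: 3.0,                   # member 3: diverges, e.g. off-distribution
]
pessimistic, uncertainty = ensemble_reward("some output", rms)
```

Penalizing disagreement discourages the policy from optimizing into regions where the reward models no longer agree, which is exactly where collapse and reward hacking tend to begin.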
Summary
Reward Model Collapse:
- Degeneration of reward signal.
- Loss of meaningful ranking.
- Destabilizes policy optimization.
- Often emerges via feedback loops.
- Critical failure mode in RLHF systems.
Stable alignment requires robust reward modeling.
Related Concepts
- Reward Model Overfitting
- Reinforcement Learning from Human Feedback (RLHF)
- KL Penalty in RLHF
- Policy Collapse
- Reward Hacking
- Preference Drift
- Alignment Fragility
- Evaluation Governance