Reward Model Overfitting

Short Definition

Reward Model Overfitting occurs when a learned reward model in reinforcement learning from human feedback (RLHF) fits training preference data too closely, capturing noise or artifacts rather than true human intent.

It leads to reward misestimation and downstream policy misalignment.

Definition

In RLHF, a reward model ( R_\phi(x, y) ) is trained to predict human preferences over outputs:

[
\max_\phi \;
\mathbb{E}
\left[
\log \sigma\!\left(R_\phi(x, y_{\text{preferred}}) - R_\phi(x, y_{\text{rejected}})\right)
\right]
]

The reward model approximates human judgment.
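The objective above is the standard pairwise (Bradley-Terry style) preference loss. A minimal sketch of the per-comparison loss, using only the standard library (function name and values are illustrative, not from a specific codebase):

```python
import math

def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood of one preference comparison:
    -log sigma(r_preferred - r_rejected)."""
    margin = r_preferred - r_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# A larger margin between preferred and rejected rewards lowers the loss;
# ranking the pair the wrong way raises it.
correct = pairwise_loss(2.0, 0.0)
wrong = pairwise_loss(0.0, 2.0)
```

Minimizing this loss over all comparisons pushes the reward model to rank preferred outputs above rejected ones, which is exactly the signal an overparameterized model can fit too tightly.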

Overfitting occurs when:

  • The model memorizes training comparisons.
  • It captures dataset-specific quirks.
  • It fails to generalize to unseen prompts or behaviors.

As a result, reward predictions become unreliable outside the training distribution.

Core Problem

Reward models are trained on:

  • Finite preference comparisons
  • Noisy human feedback
  • Possibly biased sampling

If capacity is high relative to dataset size:

  • Overparameterized reward models memorize.
  • Spurious correlations are learned.
  • Preference artifacts become encoded.

The policy then optimizes against flawed reward signals.
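The classic diagnostic for this failure is the gap between training and held-out preference accuracy. A minimal sketch, using hypothetical reward margins (the numbers are illustrative, not measured data):

```python
def preference_accuracy(margins):
    """Fraction of comparisons where the reward model ranks the
    preferred response above the rejected one (margin > 0).
    Each margin is r(preferred) - r(rejected) for one comparison."""
    return sum(m > 0 for m in margins) / len(margins)

# Hypothetical margins on the training set vs. a held-out set.
train_margins = [1.2, 0.8, 2.1, 0.5, 1.7]        # near-perfect fit
heldout_margins = [0.4, -0.3, 0.9, -1.1, 0.2]    # mixed performance

# A large train/held-out accuracy gap is the overfitting signature.
gap = preference_accuracy(train_margins) - preference_accuracy(heldout_margins)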

Minimal Conceptual Illustration


Training comparisons: the model learns superficial patterns (e.g., that longer or more formally phrased responses tend to win).

New prompt: the reward model assigns a high score to an output that merely reproduces those surface patterns.

Policy optimization: the policy exploits these reward errors.

How Overfitting Manifests

Signs of reward model overfitting:

  • High training accuracy, poor validation accuracy.
  • Inconsistent reward scores on similar outputs.
  • Reward inflation for unnatural responses.
  • Sensitivity to prompt phrasing artifacts.

Overfitting weakens alignment guarantees.
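One of the signs above, inconsistent scores on similar outputs, can be probed directly: score pairs of semantically equivalent responses and flag large reward differences. A sketch with hypothetical reward values:

```python
def consistency_flags(paired_scores, threshold=1.0):
    """Flag paraphrase pairs whose reward difference exceeds the
    threshold. `paired_scores` holds (reward_a, reward_b) for
    semantically equivalent response pairs; a large gap suggests
    the model is keying on surface artifacts, not content."""
    return [abs(a - b) > threshold for a, b in paired_scores]

# Hypothetical rewards for three paraphrase pairs.
pairs = [(3.1, 3.0), (2.5, 0.2), (1.0, 1.1)]
flags = consistency_flags(pairs)   # the second pair is suspicious
```

The threshold here is arbitrary; in practice it would be calibrated against the reward model's score scale.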

Interaction with Policy Optimization

If the reward model overfits:

  • PPO or DPO optimizes toward reward model blind spots.
  • Policy discovers exploitative behaviors.
  • Output distribution drifts from intended behavior.

Reward model errors amplify through optimization.

Small misestimation becomes large behavioral distortion.
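This amplification can be demonstrated with a toy best-of-n selection experiment: candidates have a true quality, but the selector sees only a noisy proxy reward, standing in for an overfitted reward model. All distributions and parameters are illustrative assumptions:

```python
import random

random.seed(0)

def selection_stats(n: int, trials: int = 2000):
    """Average true quality and proxy reward of the candidate that a
    noisy (overfitted) reward model picks out of n samples."""
    true_sum = proxy_sum = 0.0
    for _ in range(trials):
        # Each candidate has true quality q; the reward model sees
        # only q plus its own estimation error.
        scored = [(q, q + random.gauss(0, 1))
                  for q in (random.gauss(0, 1) for _ in range(n))]
        q_best, r_best = max(scored, key=lambda pair: pair[1])
        true_sum += q_best
        proxy_sum += r_best
    return true_sum / trials, proxy_sum / trials

true1, proxy1 = selection_stats(1)
true16, proxy16 = selection_stats(16)
# Harder selection inflates the proxy reward much more than true
# quality: the optimizer harvests reward-model error, not value.
```

This is a Goodhart-style effect: the stronger the selection pressure, the larger the share of the measured reward that is pure estimation error.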

Reward Hacking Link

Overfitted reward models:

  • Expose vulnerabilities.
  • Enable shortcut behaviors.
  • Encourage surface-level compliance.
  • Encourage verbose or stylistic artifacts.

Policy maximizes reward proxy, not true intent.

Overfitting accelerates reward hacking.

Causes

  1. Limited preference data.
  2. Biased sampling.
  3. Insufficient regularization.
  4. High model capacity.
  5. Narrow prompt distribution.
  6. Weak validation procedures.

Overparameterization increases risk.

Mitigation Strategies

  • Larger and more diverse preference datasets.
  • Strong validation sets.
  • Regularization and dropout.
  • Early stopping.
  • Cross-prompt evaluation.
  • Adversarial red-teaming.

Some pipelines periodically retrain reward models on fresh preference data collected from the current policy.

Scaling Implications

As base models scale:

  • Optimization strength increases.
  • Reward exploitation becomes easier.
  • Small reward model flaws have large effects.

Reward model quality becomes a bottleneck in alignment.

Scaling model size without scaling reward supervision increases risk.

Alignment Perspective

Reward model overfitting:

  • Undermines alignment objectives.
  • Encourages superficial compliance.
  • Increases risk of deceptive alignment.
  • Amplifies proxy mis-specification.

Alignment robustness depends on reward model generalization.

Governance Perspective

Reward models should be:

  • Audited for generalization.
  • Evaluated under distribution shift.
  • Tested for adversarial robustness.
  • Monitored for reward drift.

Governance requires reward model reliability metrics.

Overfitted reward models create systemic risk.

Distinction from Policy Overfitting

Reward Model Overfitting:

  • The reward signal itself is wrong outside the training distribution.

Policy Overfitting:

  • The policy over-optimizes against a correct reward signal.

Both are dangerous, but reward model overfitting is upstream.

Summary

Reward Model Overfitting:

  • Occurs when reward model memorizes preference data.
  • Fails to generalize to unseen behaviors.
  • Enables reward hacking.
  • Amplified by strong policy optimization.
  • Critical failure mode in RLHF systems.

Alignment quality depends on reward model generalization.

Related Concepts

  • Reinforcement Learning from Human Feedback (RLHF)
  • KL Penalty in RLHF
  • Reward Hacking
  • Reward Modeling
  • Policy Collapse
  • Deceptive Alignment
  • Alignment in LLMs
  • Evaluation Governance