Reinforcement Learning from Human Feedback (RLHF)

Short Definition

Reinforcement Learning from Human Feedback (RLHF) is a training paradigm in which a model is optimized using a reward model trained on human preference data, enabling alignment of model behavior with human judgments.

It translates human preferences into optimization signals.

Definition

Large language models are typically pretrained using next-token prediction.
However, pretraining does not ensure that outputs are:

  • Helpful
  • Harmless
  • Honest
  • Instruction-following

RLHF adds a preference-based optimization phase.

The standard RLHF pipeline consists of three stages:

  1. Supervised Fine-Tuning (SFT) on instruction data
  2. Reward Model (RM) Training using human preference rankings
  3. Reinforcement Learning to optimize the model against the reward model

Formally:

[
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}} [R_\phi(x, y_\theta)]
]

Where:

  • ( \theta ) are model parameters
  • ( R_\phi ) is the learned reward model
  • ( y_\theta ) is the model’s generated output

The model learns to produce outputs that humans prefer.

Core RLHF Pipeline

1. Supervised Fine-Tuning (SFT)

Human-written instruction–response pairs are used to fine-tune the pretrained model into a baseline instruction-following model.

2. Reward Modeling

Humans rank model outputs.

Example:

Prompt → Two candidate responses
Human selects preferred response.

The reward model learns:

[
R_\phi(x, y_1) > R_\phi(x, y_2)
]
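The ranking above is typically turned into a pairwise (Bradley–Terry) loss on the reward model's scores. A minimal sketch in plain Python, with the scores as bare floats (a real implementation would use the outputs of a learned network and backpropagate through them):

```python
import math

def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_preferred - r_rejected).

    Small when the reward model already scores the human-preferred
    response higher than the rejected one; large when it does not.
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred response's score rises:
loss_bad = pairwise_preference_loss(0.1, 0.5)   # preferred scored lower
loss_good = pairwise_preference_loss(2.0, 0.5)  # preferred scored higher
assert loss_good < loss_bad
```

Minimizing this loss over many human-ranked pairs is exactly what enforces the inequality above.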

3. Reinforcement Learning Optimization

Policy optimization (often PPO) adjusts the base model to maximize reward model scores.

The model is nudged toward preferred behaviors.

Minimal Conceptual Illustration


Prompt:
“Explain gradient descent.”

Model Output A: Clear and simple.
Model Output B: Confusing and technical.

Human prefers A.

Reward model learns to score A higher.

RL step:
Adjust model to produce outputs like A.

Human preference becomes a training signal.

Why RLHF Is Needed

Pretraining optimizes language likelihood.

But likelihood ≠ usefulness.

RLHF improves:

  • Instruction following
  • Helpfulness
  • Safety refusal behavior
  • Tone alignment
  • Structured formatting

It reduces toxic, irrelevant, or unsafe outputs.

Mathematical Framing

RLHF typically uses PPO (Proximal Policy Optimization).

The objective includes a KL penalty:

[
\max_\theta \; \mathbb{E}[R_\phi(x, y_\theta)] - \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{base}})
]

The KL term ensures:

  • The model does not drift too far from the pretrained distribution.
  • Stability is maintained.

Reward maximization is regularized.
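In practice the KL penalty is often folded into the reward signal itself, using a sampled estimate of the KL divergence. A minimal sketch, assuming per-token log-probabilities are available from both the current policy and the frozen base model:

```python
def kl_penalized_reward(reward_model_score: float,
                        logp_policy: list[float],
                        logp_base: list[float],
                        beta: float = 0.1) -> float:
    """Reward for one sampled sequence, regularized toward the base model.

    The KL term is estimated by summing per-token log-probability
    differences between the policy and the frozen base model; a
    positive estimate (policy more confident than base) is penalized.
    """
    kl_estimate = sum(p - b for p, b in zip(logp_policy, logp_base))
    return reward_model_score - beta * kl_estimate
```

The function names and the sequence-level summation are illustrative assumptions; production systems commonly apply the penalty per token during rollout, but the regularizing effect is the same.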

Strengths of RLHF

  • Directly incorporates human values.
  • Improves practical usability.
  • Enables safety shaping.
  • Scales with annotation effort.
  • Works well with large models.

It significantly improves conversational alignment.

Limitations and Risks

RLHF does not guarantee true alignment.

Risks include:

  • Reward hacking
  • Proxy optimization
  • Over-optimization for superficial preferences
  • Mode collapse
  • Loss of creativity

The model may optimize the reward model rather than human intent.

Relationship to Goal Misgeneralization

RLHF reduces some misgeneralization risks.

However:

  • The reward model is itself a proxy.
  • The base model may internalize reward shortcuts.
  • Inner alignment remains unresolved.

RLHF shapes behavior, not necessarily internal objectives.

Scaling Implications

As models grow:

  • Reward modeling becomes more complex.
  • Subtle behaviors are harder to evaluate.
  • Strategic reasoning increases.
  • Deceptive alignment risk increases.

Scaling alignment is non-trivial.

Alternative Preference Methods

Newer approaches include:

  • Direct Preference Optimization (DPO)
  • Constitutional AI
  • Rejection sampling
  • Self-critique methods

These aim to simplify or stabilize preference optimization.
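Of these, DPO is the most direct simplification: it removes the explicit reward model and RL loop, training the policy on preference pairs directly. A minimal sketch of the DPO loss for one pair, assuming sequence-level log-probabilities from the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_preferred: float, logp_rejected: float,
             ref_logp_preferred: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, rejected) pair:

    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])

    The policy is rewarded for increasing the preferred response's
    log-probability relative to the reference model more than it
    increases the rejected response's.
    """
    logits = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Note how the implicit KL regularization of RLHF survives here through the reference-model terms, without any sampling or reward model.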

Governance Perspective

RLHF enables:

  • Deployable conversational AI
  • Reduced harmful outputs
  • Controlled behavior shaping
  • Incremental safety improvements

But it introduces:

  • Dependence on annotation quality
  • Institutional bias risk
  • Oversight scalability challenges

Human feedback becomes a governance bottleneck.

Summary

Reinforcement Learning from Human Feedback (RLHF):

  • Uses human preferences to train a reward model.
  • Optimizes a language model against that reward.
  • Improves helpfulness and safety.
  • Does not fully solve inner alignment.
  • Scales behaviorally but not necessarily structurally.

It is a cornerstone of modern LLM alignment pipelines.

Related Concepts