Reward Design

Short Definition

Reward design is the process of defining the reward signal that guides learning and decision-making in interactive machine learning systems.

Definition

Reward design specifies how outcomes, behaviors, and constraints are translated into a numerical signal used by learning algorithms—particularly bandits and reinforcement learning—to optimize decisions over time. The reward encodes what the system is incentivized to do.

The reward defines success.

Why It Matters

Learning systems optimize exactly what they are rewarded for—no more, no less. Poorly designed rewards lead to unintended behavior, metric gaming, and long-term harm, even when short-term metrics improve.

Bad rewards teach the wrong lesson.

Characteristics of an Effective Reward

A well-designed reward should be:

  • aligned with real objectives
  • sensitive to meaningful outcomes
  • robust to gaming
  • stable under distribution shift
  • interpretable and auditable

Rewards are value judgments.

Types of Reward Signals

Immediate Rewards

Observed shortly after action.

  • low latency
  • often proxies
  • easier to optimize

Delayed Rewards

Observed after an outcome horizon.

  • higher fidelity
  • harder to attribute
  • require temporal credit assignment

Sparse Rewards

Rare but high-impact signals.

  • common in safety and risk domains
  • difficult to learn from

Shaped Rewards

Augmented with intermediate signals.

  • accelerate learning
  • increase Goodhart risk

Shaping trades speed for risk.

Minimal Conceptual Illustration

Action → Reward Signal → Policy Update
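The loop above can be sketched in a few lines. This is an illustrative version with a placeholder environment and an incremental value estimate standing in for the policy update; none of the names come from the text:

```python
import random

random.seed(0)

value = 0.0   # current estimate of expected reward
n = 0         # number of observed rewards

def act():
    # Placeholder: in practice the policy chooses the action.
    return "a"

def reward_signal(action):
    # Placeholder environment: noisy reward around 1.0.
    return 1.0 + random.gauss(0, 0.1)

for _ in range(100):
    action = act()              # Action
    r = reward_signal(action)   # Reward Signal
    n += 1
    value += (r - value) / n    # Policy Update (incremental mean)

print(round(value, 2))
```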

Relationship to Proxy Metrics

Rewards are often implemented using proxy metrics due to delayed or costly true outcomes. This makes reward design a primary source of proxy risk and Goodhart effects.

Rewards are operationalized proxies.

Relationship to Goodhart’s Law

Reward optimization is the most direct trigger of Goodhart’s Law. Once a reward becomes the target, the system may exploit loopholes, shortcuts, or correlations that inflate reward without improving outcomes.

Rewards must be defended.

Reward Design in Bandit Systems

In bandits, rewards:

  • are observed only for chosen actions
  • define cumulative optimization objectives
  • directly influence exploration behavior

Reward choice shapes learning dynamics.

Reward Design vs Evaluation Metrics

Rewards drive learning; evaluation metrics assess performance. Conflating the two increases gaming risk and obscures failures.

What you train on should not be the only thing you evaluate.
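The separation can be made concrete in code: the learner selects on its training reward, while a held-out evaluation metric is logged but never fed back. Both signals below are illustrative stand-ins (a click proxy versus a long-term satisfaction measure):

```python
def training_reward(action):
    # Proxy signal used for learning (e.g. clicks); illustrative values.
    return 1.0 if action == "clickbait" else 0.5

def evaluation_metric(action):
    # Held-out outcome metric (e.g. satisfaction); illustrative values.
    return 0.2 if action == "clickbait" else 0.9

actions = ["clickbait", "useful"]
scores = {a: training_reward(a) for a in actions}
learned = max(scores, key=scores.get)   # learning optimizes the reward...
audit = evaluation_metric(learned)      # ...evaluation uses a separate metric

print(learned, audit)
```

Here the trained-on reward prefers the gamed action while the audit metric exposes the failure, which is exactly what conflating the two signals would hide.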

Handling Trade-offs and Constraints

Effective reward design may incorporate:

  • cost penalties
  • risk constraints
  • fairness regularizers
  • abstention or deferral costs
  • exploration budgets

Constraints belong in the reward—or alongside it.
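One common pattern is to fold the items above into a scalar reward via penalty terms. A minimal sketch; the weights, the 0.1 risk budget, and the deferral cost are illustrative design choices, not prescribed values:

```python
def composite_reward(outcome, cost, risk, deferred,
                     cost_weight=0.1, risk_weight=0.5, deferral_cost=0.2):
    # All weights below are illustrative, not prescribed values.
    r = outcome
    r -= cost_weight * cost                      # cost penalty
    r -= risk_weight * max(0.0, risk - 0.1)      # penalty above a risk budget
    if deferred:
        r -= deferral_cost                       # abstention / deferral cost
    return r

print(composite_reward(outcome=1.0, cost=2.0, risk=0.3, deferred=False))
```

Hard constraints (e.g. risk caps, exploration budgets) are often better enforced outside the reward, by a constrained optimizer or a guardrail layer, since penalty weights are themselves gameable knobs.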

Dealing with Delayed and Noisy Rewards

Common strategies include:

  • reward discounting
  • temporal aggregation
  • survival or time-to-event modeling
  • delayed credit assignment
  • outcome-aware auditing

Delayed rewards require patience.
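Discounting and temporal credit assignment can be sketched as a discounted return computed backward over a trajectory in which the reward arrives only at the final step; the discount factor and trajectory are illustrative:

```python
GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    # Sweep backward so each step is credited with exponentially
    # discounted future reward: G_t = r_t + gamma * G_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A delayed, sparse outcome: nothing until the final step.
trajectory = [0.0, 0.0, 0.0, 1.0]
print(round(discounted_return(trajectory), 3))
```

The earlier an action sits relative to the delayed outcome, the more the discount attenuates its credit, which is why long horizons demand either patience or intermediate signals.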

Governance and Review

Reward design should be:

  • documented explicitly
  • reviewed periodically
  • validated against long-term outcomes
  • revised when objectives change

Rewards encode organizational values.

Common Pitfalls

  • optimizing convenience over correctness
  • using single scalar rewards for multi-objective problems
  • ignoring long-term effects
  • failing to revisit reward definitions
  • assuming reward improvement implies outcome improvement

Rewards do not self-correct.

Summary Characteristics

  • Role: defines the learning objective
  • Risk: high if misaligned
  • Proxy reliance: common
  • Governance need: critical
  • Long-term impact: strong

Related Concepts

  • Generalization & Evaluation
  • Bandit Algorithms (Overview)
  • Contextual Bandits (Deep Dive)
  • Exploration vs Exploitation
  • Proxy Metrics
  • Goodhart’s Law (ML Context)
  • Outcome-Aware Evaluation
  • Decision Cost Functions