Bandit Algorithms (Overview)

Short Definition

Bandit algorithms are decision-making methods that balance exploration and exploitation to maximize cumulative reward under uncertainty.

Definition

Bandit algorithms address sequential decision problems where an agent repeatedly chooses among a set of actions (arms), observes partial feedback (reward for the chosen action only), and aims to maximize long-term reward. Unlike supervised learning, bandits learn from interaction rather than fixed datasets.

Bandits learn by acting.

Why It Matters

Many deployed ML systems—recommendation engines, ad selection, pricing, routing, and experimentation—must make decisions online while learning from their consequences. Bandit algorithms provide a principled framework for learning under uncertainty while controlling the exploration–exploitation trade-off.

Static optimization fails in dynamic environments.

The Multi-Armed Bandit Problem

The classical formulation involves:

  • a set of actions (arms)
  • unknown reward distributions per arm
  • sequential action selection
  • observation of reward for chosen arm only

Unchosen outcomes remain unknown.

Key Characteristics

Bandit settings are defined by:

  • partial feedback (no full labels)
  • online learning
  • non-IID data
  • action-dependent observations
  • cumulative reward objectives

Bandits violate standard ML assumptions.

Minimal Conceptual Illustration


Choose Action → Observe Reward → Update Beliefs → Repeat
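
This loop can be sketched in a few lines of Python. The Bernoulli arms, the uniformly random action choice, and the running-mean "beliefs" below are illustrative assumptions, not part of any specific algorithm:

```python
import random

rng = random.Random(0)

# Hypothetical environment: three Bernoulli arms with unknown success rates.
true_rates = [0.2, 0.5, 0.8]

counts = [0, 0, 0]        # pulls per arm
means = [0.0, 0.0, 0.0]   # running mean reward per arm (the agent's "beliefs")

for t in range(1000):
    arm = rng.randrange(3)                                  # choose action
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0  # observe reward
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]        # update beliefs
```

After enough rounds, the running means approximate the true arm rates, even though each round reveals the reward of only one arm.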

Exploration vs Exploitation in Bandits

Bandit algorithms explicitly manage:

  • exploration: trying uncertain actions to learn rewards
  • exploitation: choosing actions believed to be optimal

This trade-off is central, not incidental.
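
A minimal epsilon-greedy rule makes the trade-off concrete: explore with probability epsilon, exploit otherwise. The estimated mean rewards below are illustrative values, not learned:

```python
import random

def epsilon_greedy(means, epsilon, rng):
    """With probability epsilon explore (random arm); otherwise exploit (best arm)."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))                       # exploration
    return max(range(len(means)), key=lambda a: means[a])      # exploitation

rng = random.Random(42)
means = [0.1, 0.6, 0.3]   # illustrative estimated mean rewards per arm
choices = [epsilon_greedy(means, 0.1, rng) for _ in range(1000)]
```

With epsilon = 0.1, roughly 90% of choices exploit the currently best arm while the rest keep probing the alternatives.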

Common Classes of Bandit Algorithms

Stochastic Bandits

Assume each action has a fixed but unknown reward distribution.
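
Under this assumption, Thompson Sampling is a canonical approach: maintain a posterior per arm and play the arm whose posterior sample is highest. A Beta-Bernoulli sketch, with two made-up arm rates:

```python
import random

rng = random.Random(7)
true_rates = [0.4, 0.6]   # fixed but unknown to the learner (illustrative)
alpha = [1.0, 1.0]        # Beta(1, 1) prior per arm
beta = [1.0, 1.0]

for _ in range(2000):
    # Sample a plausible success rate for each arm from its posterior,
    # then play the arm whose sample is highest.
    samples = [rng.betavariate(alpha[a], beta[a]) for a in range(2)]
    arm = samples.index(max(samples))
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0
    alpha[arm] += reward          # posterior update on success
    beta[arm] += 1.0 - reward     # posterior update on failure
```

Posterior sampling explores automatically: uncertain arms occasionally produce high samples, while play concentrates on the better arm as evidence accumulates.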

Contextual Bandits

Incorporate side information (context) to condition action choice.

  • widely used in recommender systems

Adversarial Bandits

Make no assumptions about reward generation.

  • robust but often conservative

Different assumptions, different guarantees.

Representative Algorithms

Common bandit approaches include:

  • epsilon-greedy
  • Upper Confidence Bound (UCB)
  • Thompson Sampling
  • EXP3 (adversarial setting)
  • LinUCB (contextual)

Algorithms encode exploration strategies.
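
One listed approach, UCB, can be sketched as follows (the UCB1 variant, on Bernoulli arms; the arm rates and horizon are illustrative assumptions):

```python
import math
import random

def ucb1(counts, sums, t):
    """UCB1: pick the arm maximizing mean reward + sqrt(2 ln t / n)."""
    for a, n in enumerate(counts):
        if n == 0:
            return a   # play each arm once before using the bound
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

rng = random.Random(0)
true_rates = [0.3, 0.7, 0.5]   # unknown to the learner (illustrative)
counts = [0, 0, 0]
sums = [0.0, 0.0, 0.0]
for t in range(1, 2001):
    arm = ucb1(counts, sums, t)
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
```

The confidence term shrinks as an arm is pulled more, so under-explored arms get boosted scores: exploration is driven by uncertainty rather than by coin flips.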

Relationship to Online Evaluation

Bandit algorithms naturally integrate learning and evaluation. Performance is measured by cumulative reward rather than static accuracy.

Evaluation is continuous.

Relationship to Causal Evaluation

Because many bandit algorithms randomize action selection, they generate data suitable for causal inference and counterfactual estimation.


Bandits enable causal learning at scale.

Relationship to Counterfactual Logging

Bandit policies typically log action probabilities, making counterfactual logging feasible and enabling off-policy evaluation.

Bandits produce evaluable logs.
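
As a sketch, an inverse propensity scoring (IPS) estimator over such logs might look like this; the tuple layout, reward model, and both policies are illustrative assumptions:

```python
import random

def ips_estimate(logs, target_prob):
    """IPS off-policy value estimate.

    logs: (context, action, reward, propensity) tuples, where propensity is
    the logging policy's recorded probability of the logged action.
    target_prob(context, action): probability the target policy would have
    chosen that action.
    """
    return sum(r * target_prob(c, a) / p for c, a, r, p in logs) / len(logs)

rng = random.Random(1)
# Hypothetical logging policy: uniform over two actions (propensity 0.5).
logs = []
for _ in range(2000):
    context = rng.randrange(2)
    action = rng.randrange(2)
    reward = 1.0 if action == context else 0.0   # illustrative reward model
    logs.append((context, action, reward, 0.5))

# Target policy: always pick the action matching the context.
estimate = ips_estimate(logs, lambda c, a: 1.0 if a == c else 0.0)
```

Because the logging propensities are recorded, the value of a policy that was never deployed can be estimated from the same logs.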

Limitations and Risks

Bandit algorithms introduce:

  • short-term performance loss due to exploration
  • sensitivity to reward definition
  • difficulty handling delayed rewards
  • governance and safety challenges
  • metric gaming risks if rewards are proxies

Rewards shape behavior.

Bandits vs Supervised Learning

  Aspect          Bandits             Supervised Learning
  Feedback        Partial             Full labels
  Learning mode   Online              Offline
  IID data        No                  Often assumed
  Objective       Cumulative reward   Predictive accuracy
  Evaluation      Integrated          Separate

Bandits operate closer to reality.

Common Pitfalls

  • defining poor reward functions
  • disabling exploration prematurely
  • ignoring delayed or sparse rewards
  • assuming stationarity
  • evaluating bandits with offline accuracy metrics

Bandits require new evaluation thinking.

Summary Characteristics

  Aspect                 Bandit Algorithms
  Learning mode          Sequential
  Feedback               Action-dependent
  Core trade-off         Exploration vs exploitation
  Evaluation             Online
  Deployment relevance   High

Related Concepts

  • Generalization & Evaluation
  • Exploration vs Exploitation
  • Online vs Offline Evaluation
  • Counterfactual Logging
  • Causal Evaluation
  • Feedback Loops
  • Metric Gaming
  • Outcome-Aware Evaluation