Bandit Algorithms (Overview)

Short Definition

Bandit algorithms are decision-making methods that balance exploration and exploitation to maximize cumulative reward under uncertainty.

Definition

Bandit algorithms address sequential decision problems where an agent repeatedly chooses among a set of actions (arms), observes partial feedback (reward for the chosen action only), and aims to maximize long-term reward. Unlike supervised learning, bandits learn from interaction rather than fixed datasets.

Bandits learn by acting.

Why It Matters

Many deployed ML systems—recommendation engines, ad selection, pricing, routing, and experimentation—must make decisions online while learning from their consequences. Bandit algorithms provide a principled framework for learning under uncertainty while controlling the exploration–exploitation trade-off.

Static optimization fails in dynamic environments.

The Multi-Armed Bandit Problem

The classical formulation involves:

  • a set of actions (arms)
  • unknown reward distributions per arm
  • sequential action selection
  • observation of reward for chosen arm only

Unchosen outcomes remain unknown.

Key Characteristics

Bandit settings are defined by:

  • partial feedback (no full labels)
  • online learning
  • non-IID data
  • action-dependent observations
  • cumulative reward objectives

Bandits violate standard ML assumptions.

Minimal Conceptual Illustration


Choose Action → Observe Reward → Update Beliefs → Repeat
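
This loop can be sketched in a few lines of Python. The Bernoulli arms, the uniformly random action choice, and the running-mean "beliefs" below are illustrative assumptions, not part of any specific algorithm:

```python
import random

rng = random.Random(0)

# Hypothetical environment: three Bernoulli arms with unknown success rates.
true_rates = [0.2, 0.5, 0.8]

counts = [0, 0, 0]        # pulls per arm
means = [0.0, 0.0, 0.0]   # running mean reward per arm (the agent's "beliefs")

for t in range(1000):
    arm = rng.randrange(3)                                  # choose action
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0  # observe reward
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]        # update beliefs
```

After enough rounds, the running means approximate the true arm rates, even though each round reveals the reward of only one arm.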

Exploration vs Exploitation in Bandits

Bandit algorithms explicitly manage:

  • exploration: trying uncertain actions to learn rewards
  • exploitation: choosing actions believed to be optimal

This trade-off is central, not incidental.
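
A minimal epsilon-greedy rule makes the trade-off concrete: explore with probability epsilon, exploit otherwise. The estimated mean rewards below are illustrative values, not learned:

```python
import random

def epsilon_greedy(means, epsilon, rng):
    """With probability epsilon explore (random arm); otherwise exploit (best arm)."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))                       # exploration
    return max(range(len(means)), key=lambda a: means[a])      # exploitation

rng = random.Random(42)
means = [0.1, 0.6, 0.3]   # illustrative estimated mean rewards per arm
choices = [epsilon_greedy(means, 0.1, rng) for _ in range(1000)]
```

With epsilon = 0.1, roughly 90% of choices exploit the currently best arm while the rest keep probing the alternatives.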

Common Classes of Bandit Algorithms

Stochastic Bandits

Assume each action has a fixed but unknown reward distribution.
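
Under this assumption, Thompson Sampling is a canonical approach: maintain a posterior per arm and play the arm whose posterior sample is highest. A Beta-Bernoulli sketch, with two made-up arm rates:

```python
import random

rng = random.Random(7)
true_rates = [0.4, 0.6]   # fixed but unknown to the learner (illustrative)
alpha = [1.0, 1.0]        # Beta(1, 1) prior per arm
beta = [1.0, 1.0]

for _ in range(2000):
    # Sample a plausible success rate for each arm from its posterior,
    # then play the arm whose sample is highest.
    samples = [rng.betavariate(alpha[a], beta[a]) for a in range(2)]
    arm = samples.index(max(samples))
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0
    alpha[arm] += reward          # posterior update on success
    beta[arm] += 1.0 - reward     # posterior update on failure
```

Posterior sampling explores automatically: uncertain arms occasionally produce high samples, while play concentrates on the better arm as evidence accumulates.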

Contextual Bandits

Incorporate side information (context) to condition action choice.

  • widely used in recommender systems

Adversarial Bandits

Make no assumptions about reward generation.

  • robust but often conservative

Different assumptions, different guarantees.

Representative Algorithms

Common bandit approaches include:

  • epsilon-greedy
  • Upper Confidence Bound (UCB)
  • Thompson Sampling
  • EXP3 (adversarial setting)
  • LinUCB (contextual)

Algorithms encode exploration strategies.
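
One listed approach, UCB, can be sketched as follows (the UCB1 variant, on Bernoulli arms; the arm rates and horizon are illustrative assumptions):

```python
import math
import random

def ucb1(counts, sums, t):
    """UCB1: pick the arm maximizing mean reward + sqrt(2 ln t / n)."""
    for a, n in enumerate(counts):
        if n == 0:
            return a   # play each arm once before using the bound
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

rng = random.Random(0)
true_rates = [0.3, 0.7, 0.5]   # unknown to the learner (illustrative)
counts = [0, 0, 0]
sums = [0.0, 0.0, 0.0]
for t in range(1, 2001):
    arm = ucb1(counts, sums, t)
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
```

The confidence term shrinks as an arm is pulled more, so under-explored arms get boosted scores: exploration is driven by uncertainty rather than by coin flips.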

Relationship to Online Evaluation

Bandit algorithms naturally integrate learning and evaluation. Performance is measured by cumulative reward rather than static accuracy.

Evaluation is continuous.

Relationship to Causal Evaluation

Because many bandit algorithms randomize action selection, they generate data suitable for causal inference and counterfactual estimation.


Bandits enable causal learning at scale.

Relationship to Counterfactual Logging

Bandit policies typically log action probabilities, making counterfactual logging feasible and enabling off-policy evaluation.

Bandits produce evaluable logs.
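
As a sketch, an inverse propensity scoring (IPS) estimator over such logs might look like this; the tuple layout, reward model, and both policies are illustrative assumptions:

```python
import random

def ips_estimate(logs, target_prob):
    """IPS off-policy value estimate.

    logs: (context, action, reward, propensity) tuples, where propensity is
    the logging policy's recorded probability of the logged action.
    target_prob(context, action): probability the target policy would have
    chosen that action.
    """
    return sum(r * target_prob(c, a) / p for c, a, r, p in logs) / len(logs)

rng = random.Random(1)
# Hypothetical logging policy: uniform over two actions (propensity 0.5).
logs = []
for _ in range(2000):
    context = rng.randrange(2)
    action = rng.randrange(2)
    reward = 1.0 if action == context else 0.0   # illustrative reward model
    logs.append((context, action, reward, 0.5))

# Target policy: always pick the action matching the context.
estimate = ips_estimate(logs, lambda c, a: 1.0 if a == c else 0.0)
```

Because the logging propensities are recorded, the value of a policy that was never deployed can be estimated from the same logs.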

Limitations and Risks

Bandit algorithms introduce:

  • short-term performance loss due to exploration
  • sensitivity to reward definition
  • difficulty handling delayed rewards
  • governance and safety challenges
  • metric gaming risks if rewards are proxies

Rewards shape behavior.

Bandits vs Supervised Learning

  Aspect          Bandits             Supervised Learning
  Feedback        Partial             Full labels
  Learning mode   Online              Offline
  IID data        No                  Often assumed
  Objective       Cumulative reward   Predictive accuracy
  Evaluation      Integrated          Separate

Bandits operate closer to reality.

Common Pitfalls

  • defining poor reward functions
  • disabling exploration prematurely
  • ignoring delayed or sparse rewards
  • assuming stationarity
  • evaluating bandits with offline accuracy metrics

Bandits require new evaluation thinking.

Summary Characteristics

  Aspect                 Bandit Algorithms
  Learning mode          Sequential
  Feedback               Action-dependent
  Core trade-off         Exploration vs exploitation
  Evaluation             Online
  Deployment relevance   High

Related Concepts

  • Generalization & Evaluation
  • Exploration vs Exploitation
  • Online vs Offline Evaluation
  • Counterfactual Logging
  • Causal Evaluation
  • Feedback Loops
  • Metric Gaming
  • Outcome-Aware Evaluation