Short Definition
Bandit algorithms are decision-making methods that balance exploration and exploitation to maximize cumulative reward under uncertainty.
Definition
Bandit algorithms address sequential decision problems where an agent repeatedly chooses among a set of actions (arms), observes partial feedback (reward for the chosen action only), and aims to maximize long-term reward. Unlike supervised learning, bandits learn from interaction rather than fixed datasets.
Bandits learn by acting.
Why It Matters
Many deployed ML systems—recommendation engines, ad selection, pricing, routing, and experimentation—must make decisions online while learning from their consequences. Bandit algorithms provide a principled framework for learning under uncertainty while controlling the exploration–exploitation trade-off.
Static optimization fails in dynamic environments.
The Multi-Armed Bandit Problem
The classical formulation involves:
- a set of actions (arms)
- unknown reward distributions per arm
- sequential action selection
- observation of reward for chosen arm only
Unchosen outcomes remain unknown.
Key Characteristics
Bandit settings are defined by:
- partial feedback (no full labels)
- online learning
- non-IID data
- action-dependent observations
- cumulative reward objectives
Bandits violate standard ML assumptions.
Minimal Conceptual Illustration
Choose Action → Observe Reward → Update Beliefs → Repeat
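This loop can be sketched as a minimal epsilon-greedy agent on a two-arm Bernoulli bandit (the arm probabilities and exploration rate below are hypothetical illustration values, not part of any standard benchmark):

```python
import random

random.seed(0)

# Hypothetical two-arm Bernoulli bandit; the agent never sees these probabilities.
true_probs = [0.3, 0.7]
counts = [0, 0]       # pulls per arm
values = [0.0, 0.0]   # running mean reward per arm
epsilon = 0.1         # exploration rate

for t in range(2000):
    # Choose action: explore with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        arm = random.randrange(2)
    else:
        arm = max(range(2), key=lambda a: values[a])
    # Observe reward for the chosen arm only.
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    # Update beliefs with an incremental mean.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```

Over enough rounds the better arm accumulates most of the pulls, and its value estimate converges toward its true mean reward.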
Exploration vs Exploitation in Bandits
Bandit algorithms explicitly manage:
- exploration: trying uncertain actions to learn rewards
- exploitation: choosing actions believed to be optimal
This trade-off is central, not incidental.
Common Classes of Bandit Algorithms
Stochastic Bandits
Assume each action has a fixed but unknown reward distribution.
Contextual Bandits
Incorporate side information (context) to condition action choice.
- widely used in recommender systems
Adversarial Bandits
Make no statistical assumptions about how rewards are generated; rewards may even be chosen adversarially.
- robust but often conservative
Different assumptions, different guarantees.
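To illustrate the contextual setting, here is a minimal sketch of disjoint LinUCB on synthetic linear rewards. The dimensions, noise level, and parameter vectors are assumptions for the example only:

```python
import numpy as np

rng = np.random.default_rng(3)

d, n_arms, alpha = 2, 2, 1.0
theta = np.array([[0.1, 0.5],
                  [0.6, 0.1]])          # true per-arm parameters, unknown to the agent
A = [np.eye(d) for _ in range(n_arms)]  # per-arm Gram matrices (ridge prior)
b = [np.zeros(d) for _ in range(n_arms)]

for t in range(2000):
    x = rng.random(d)                   # observe a context vector
    scores = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]        # ridge estimate of arm a's parameters
        # LinUCB score: predicted reward plus a confidence bonus.
        scores.append(x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x))
    arm = int(np.argmax(scores))
    reward = x @ theta[arm] + 0.1 * rng.standard_normal()
    A[arm] += np.outer(x, x)
    b[arm] += reward * x
```

Because the choice depends on the context, each arm is learned from the contexts in which it was pulled; after enough rounds the per-arm ridge estimates approach the true parameter vectors.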
Representative Algorithms
Common bandit approaches include:
- epsilon-greedy
- Upper Confidence Bound (UCB)
- Thompson Sampling
- EXP3 (adversarial setting)
- LinUCB (contextual)
Algorithms encode exploration strategies.
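As an example of how an exploration strategy is encoded, UCB1 adds a confidence bonus of sqrt(2 ln t / n) to each arm's empirical mean, so rarely pulled arms get an optimism boost. A minimal sketch on hypothetical Bernoulli arms:

```python
import math
import random

random.seed(1)

def ucb1(true_probs, horizon):
    """UCB1 on Bernoulli arms: play each arm once, then maximize
    empirical mean + sqrt(2 ln t / pulls)."""
    k = len(true_probs)
    counts = [0] * k
    values = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initial round-robin pass over all arms
        else:
            arm = max(range(k), key=lambda a: values[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

# Hypothetical arm probabilities; the agent only sees sampled rewards.
counts, values = ucb1([0.2, 0.5, 0.8], horizon=3000)
```

The bonus shrinks as an arm accumulates pulls, so exploration tapers off automatically rather than relying on a fixed exploration rate.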
Relationship to Online Evaluation
Bandit algorithms naturally integrate learning and evaluation. Performance is measured by cumulative reward rather than static accuracy.
Evaluation is continuous.
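A common cumulative measure is regret: the gap between the reward actually earned and what the best fixed arm would have earned over the same rounds. A toy calculation with made-up numbers:

```python
# Regret at horizon T: sum over rounds of (best arm's mean reward - earned reward).
best_mean = 0.8                     # hypothetical best arm's mean reward
earned = [0.5] * 200 + [0.8] * 800  # made-up per-round mean rewards:
                                    # an exploration phase, then near-optimal play
regret = sum(best_mean - r for r in earned)
# here the entire regret comes from the 200 exploratory rounds
```

Good bandit algorithms keep regret growing sublinearly in the horizon, which is the online analogue of a convergence guarantee.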
Relationship to Causal Evaluation
Because bandits randomize action selection, they generate data suitable for causal inference and counterfactual estimation.
Bandits enable causal learning at scale.
Relationship to Counterfactual Logging
Bandit policies typically log action probabilities, making counterfactual logging feasible and enabling off-policy evaluation.
Bandits produce evaluable logs.
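A sketch of off-policy evaluation over such logs using inverse propensity scoring (IPS); the logging policy, actions, and reward probabilities here are hypothetical:

```python
import random

random.seed(2)

# Hypothetical logged data: a uniform-random logging policy over two actions,
# each record storing (action, reward, logging probability of that action).
true_reward = {0: 0.2, 1: 0.6}
logs = []
for _ in range(5000):
    a = random.randrange(2)
    r = 1.0 if random.random() < true_reward[a] else 0.0
    logs.append((a, r, 0.5))

def target_prob(action):
    """Deterministic target policy to evaluate offline: always play action 1."""
    return 1.0 if action == 1 else 0.0

# IPS estimate: reweight each logged reward by pi(a) / mu(a).
ips = sum(r * target_prob(a) / p for a, r, p in logs) / len(logs)
```

Because the logging probabilities are recorded, the target policy's value can be estimated without ever deploying it; the estimate converges to the target policy's true value as the log grows.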
Limitations and Risks
Bandit algorithms introduce:
- short-term performance loss due to exploration
- sensitivity to reward definition
- difficulty handling delayed rewards
- governance and safety challenges
- metric gaming risks if rewards are proxies
Rewards shape behavior.
Bandits vs Supervised Learning
| Aspect | Bandits | Supervised Learning |
|---|---|---|
| Feedback | Partial | Full labels |
| Learning mode | Online | Offline |
| IID data | No | Often assumed |
| Objective | Cumulative reward | Predictive accuracy |
| Evaluation | Integrated | Separate |
Bandits operate closer to reality.
Common Pitfalls
- defining poor reward functions
- disabling exploration prematurely
- ignoring delayed or sparse rewards
- assuming stationarity
- evaluating bandits with offline accuracy metrics
Bandits require new evaluation thinking.
Summary Characteristics
| Aspect | Bandit Algorithms |
|---|---|
| Learning mode | Sequential |
| Feedback | Action-dependent |
| Core trade-off | Exploration vs exploitation |
| Evaluation | Online |
| Deployment relevance | High |
Related Concepts
- Generalization & Evaluation
- Exploration vs Exploitation
- Online vs Offline Evaluation
- Counterfactual Logging
- Causal Evaluation
- Feedback Loops
- Metric Gaming
- Outcome-Aware Evaluation