Short Definition
Contextual bandits are bandit algorithms that condition action selection on observed context, enabling personalized decisions while balancing exploration and exploitation.
Definition
Contextual bandits extend the multi-armed bandit framework by incorporating side information (context) available at decision time. At each step, the system observes a context, selects an action based on that context, receives a reward for the chosen action only, and updates its policy accordingly.
Decisions adapt to situations, not just averages.
Why It Matters
Many real-world systems—recommendations, ads, notifications, pricing, clinical decision support—must personalize actions based on user or environment context while learning from partial feedback. Contextual bandits provide a principled framework for learning personalized policies online with controlled exploration.
Personalization requires exploration.
Core Components
A contextual bandit system consists of:
- Context (x): features describing the situation (user, item, time, state)
- Action (a): a choice among available options
- Reward (r): feedback observed only for the chosen action
- Policy (π): mapping from context to action probabilities
Learning happens at decision time.
Minimal Conceptual Illustration
Context → Policy → Action → Reward → Update Policy
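The loop above can be sketched as a small simulation. This is a minimal illustration rather than a reference implementation: the two contexts, the click probabilities in `true_p`, and the epsilon-greedy policy are all assumptions made up for the example.

```python
import random

def run_loop(n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate the context -> policy -> action -> reward -> update loop."""
    rng = random.Random(seed)
    contexts, actions = ["mobile", "desktop"], [0, 1]
    # Hypothetical environment: true reward probability per (context, action).
    true_p = {("mobile", 0): 0.2, ("mobile", 1): 0.6,
              ("desktop", 0): 0.7, ("desktop", 1): 0.3}
    counts = {key: 0 for key in true_p}
    means = {key: 0.0 for key in true_p}
    for _ in range(n_rounds):
        x = rng.choice(contexts)                       # observe context
        if rng.random() < epsilon:                     # policy: explore
            a = rng.choice(actions)
        else:                                          # policy: exploit
            a = max(actions, key=lambda k: means[(x, k)])
        # Reward is observed for the chosen action only.
        r = 1.0 if rng.random() < true_p[(x, a)] else 0.0
        # Update the running mean for this (context, action) pair.
        counts[(x, a)] += 1
        means[(x, a)] += (r - means[(x, a)]) / counts[(x, a)]
    return means, counts

means, counts = run_loop()
# Greedy action per context according to the learned estimates.
best = {x: max([0, 1], key=lambda a: means[(x, a)]) for x in ["mobile", "desktop"]}
print(best)
```

The learned greedy action differs by context, which is the behavior a context-free bandit cannot express.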
Relationship to Supervised Learning
Unlike supervised learning:
- labels are observed only for chosen actions
- data is action-dependent
- exploration affects data collection
- logged data are not IID, because the policy that chose each action changes as it learns
Contextual bandits learn by intervening.
Common Algorithmic Approaches
Linear Contextual Bandits
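Assume the expected reward is a linear function of the context features, either with one weight vector per action or with joint context-action features.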
- examples: LinUCB, linear Thompson Sampling
- efficient and interpretable
- sensitive to feature design
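As a rough sketch of the LinUCB idea, the class below keeps one ridge-regression model per action and adds a confidence bonus to its prediction. It assumes the "disjoint" variant (separate weights per action), a ridge parameter of 1, and a hypothetical `alpha` that scales the bonus. The names `LinUCBArm` and `choose` are illustrative and not taken from any library.

```python
import math

class LinUCBArm:
    """Per-action ridge regression with an upper confidence bonus."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # Inverse of A = I + sum(x x^T); starts as the identity matrix.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim                 # sum of reward * x

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def score(self, x):
        theta = self._A_inv_times(self.b)    # ridge estimate A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        Ax = self._A_inv_times(x)
        var = max(0.0, sum(xi * ai for xi, ai in zip(x, Ax)))
        return mean + self.alpha * math.sqrt(var)

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        Ax = self._A_inv_times(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, Ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def choose(arms, x):
    """Pick the action whose upper confidence score is highest for context x."""
    return max(range(len(arms)), key=lambda a: arms[a].score(x))
```

Note that this selection rule is deterministic given the history, so it does not produce non-degenerate action probabilities on its own.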
Generalized Linear Bandits
Extend linear models with a non-linear link function (for example, logistic or Poisson), as in generalized linear models.
- handle binary or count rewards
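For binary rewards, one possible reward model is a logistic link fitted online. The sketch below shows only the reward model and a single gradient step. It is not a complete generalized linear bandit, which would also need an exploration rule. The function names and the learning rate `lr` are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(theta, x):
    """Expected binary reward under a logistic link: sigmoid(theta . x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def sgd_step(theta, x, reward, lr=0.1):
    """One gradient step on the log-loss for a single (context, reward) pair."""
    err = reward - predict(theta, x)
    return [t + lr * err * xi for t, xi in zip(theta, x)]
```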
Non-Linear and Neural Bandits
Use neural networks to model reward functions.
- flexible representation
- harder to calibrate and evaluate
- higher governance needs
Model power increases complexity.
Exploration Strategies
Contextual bandits employ:
- uncertainty-based exploration (UCB)
- posterior sampling (Thompson Sampling)
- randomized policies
- epsilon-greedy variants
Exploration is context-dependent.
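Two of the strategies listed above can be sketched briefly. The first function returns explicit epsilon-greedy action probabilities for one context, which is also what a propensity log would need. The second draws from a Beta posterior per action, assuming binary rewards and success/failure counts kept separately for each context bucket. Both function names are illustrative.

```python
import random

def epsilon_greedy_probs(estimates, epsilon):
    """Action probabilities: epsilon spread uniformly, the rest on the greedy action."""
    k = len(estimates)
    greedy = max(range(k), key=lambda a: estimates[a])
    probs = [epsilon / k] * k
    probs[greedy] += 1.0 - epsilon
    return probs

def thompson_choice(successes, failures, rng):
    """Beta-Bernoulli posterior sampling for a single context bucket."""
    # Beta(s + 1, f + 1) is the posterior under a uniform prior.
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])
```

With few observations in a context the posterior draws vary widely, so exploration happens there; with many observations the draws concentrate and the choice becomes mostly greedy.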
Relationship to Exploration vs Exploitation
Contextual bandits operationalize the exploration–exploitation trade-off by conditioning uncertainty on context. The same action may be exploited in one context and explored in another.
Exploration is selective.
Relationship to Causal Evaluation
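When a contextual bandit randomizes actions conditional on context and records the probability of each choice, the logged data support causal inference and counterfactual estimation within the region of context space where each action had a non-zero chance of being selected. Deterministic selection rules, such as a pure UCB argmax, do not provide this guarantee by themselves.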
Context enables conditional causality.
Counterfactual Logging Requirements
Effective contextual bandits must log:
- chosen action
- action probabilities (propensities)
- context features
- policy version
- timestamps
Without logged propensities, importance-weighted estimators such as IPS cannot be computed, and off-policy evaluation breaks.
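One way to capture the fields listed above is a small record type that is validated at write time. The schema below is an assumption for illustration, not a standard format. The field names, the `v3` version label, and the choice to store the full probability vector rather than only the chosen action's propensity are all hypothetical.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class DecisionLog:
    """One logged decision; field names are illustrative, not a standard schema."""
    context: dict
    action: int
    action_probs: list               # probability of every available action
    policy_version: str
    timestamp: float
    reward: Optional[float] = None   # may be attached later if rewards are delayed

    def __post_init__(self):
        if not 0 <= self.action < len(self.action_probs):
            raise ValueError("action index is out of range")
        if abs(sum(self.action_probs) - 1.0) > 1e-6:
            raise ValueError("action probabilities must sum to 1")
        if self.action_probs[self.action] <= 0.0:
            raise ValueError("chosen action must have positive propensity")

    @property
    def propensity(self):
        """Probability with which the logged action was chosen."""
        return self.action_probs[self.action]

record = DecisionLog(
    context={"device": "mobile", "hour": 14},
    action=1,
    action_probs=[0.05, 0.90, 0.05],
    policy_version="v3",             # hypothetical version label
    timestamp=time.time(),
)
line = json.dumps(asdict(record))    # one JSON line per decision
```

Rejecting a zero propensity at logging time surfaces bugs early, since such a record could never have been produced by the stated policy.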
Evaluation Challenges
Evaluating contextual bandits is difficult due to:
- partial feedback
- policy-dependent data
- delayed rewards
- non-stationarity
- metric drift
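Offline predictive accuracy of the reward model is insufficient, because it does not measure the reward the policy would earn once deployed.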
Off-Policy Evaluation
Contextual bandits rely on off-policy evaluation methods such as:
- inverse propensity scoring (IPS)
- doubly robust estimators
- self-normalized estimators
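Evaluation is probabilistic, not deterministic: these estimators return estimates with variance, which grows when the target policy favors actions the logging policy rarely chose.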
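The three estimators listed above can be sketched in a few lines. The sketch assumes each log entry is a tuple `(context, action, reward, propensity)`, that `target_prob(x, a)` gives the probability the evaluated policy assigns to action `a` in context `x`, and that `reward_model(x, a)` is some separately fitted reward predictor. All three names are hypothetical.

```python
def ips(logs, target_prob):
    """Inverse propensity scoring: mean of w * r, with w = target / logging probability."""
    return sum(target_prob(x, a) / p * r for x, a, r, p in logs) / len(logs)

def snips(logs, target_prob):
    """Self-normalized IPS: divides by the sum of weights instead of the sample size."""
    weights = [target_prob(x, a) / p for x, a, _, p in logs]
    weighted = sum(w * r for w, (_, _, r, _) in zip(weights, logs))
    return weighted / sum(weights)   # undefined if every weight is zero

def doubly_robust(logs, target_prob, reward_model, actions):
    """Model-based baseline plus an importance-weighted correction on the residual."""
    total = 0.0
    for x, a, r, p in logs:
        baseline = sum(target_prob(x, b) * reward_model(x, b) for b in actions)
        correction = target_prob(x, a) / p * (r - reward_model(x, a))
        total += baseline + correction
    return total / len(logs)

# Toy data logged by a uniform-random policy over two actions.
logs = [("u", 0, 1.0, 0.5), ("u", 1, 0.0, 0.5),
        ("v", 0, 0.0, 0.5), ("v", 1, 1.0, 0.5)]
always_zero = lambda x, a: 1.0 if a == 0 else 0.0   # target policy: always action 0
print(ips(logs, always_zero), snips(logs, always_zero))
```

Self-normalization trades a small bias for lower variance, and the doubly robust form stays consistent if either the propensities or the reward model are correct.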
Risks and Failure Modes
- insufficient exploration in rare contexts
- feature leakage via context
- feedback loops narrowing context coverage
- proxy reward optimization
- calibration and uncertainty failures
Context amplifies both power and risk.
Governance Considerations
Governance should define:
- acceptable exploration rates
- safety constraints by context
- auditing for subgroup impact
- recalibration and retraining triggers
Personalization requires oversight.
Common Pitfalls
- treating contextual bandits as supervised models
- disabling exploration after early gains
- failing to log propensities
- assuming stationarity across contexts
- optimizing short-term proxy rewards
Contextual learning is fragile.
Summary Characteristics
| Aspect | Contextual Bandits |
|---|---|
| Personalization | High |
| Feedback | Partial |
| Learning mode | Online |
| Causal validity | Enabled by randomization and logged propensities |
| Governance need | High |
Related Concepts
- Generalization & Evaluation
- Bandit Algorithms (Overview)
- Exploration vs Exploitation
- Counterfactual Logging
- Causal Evaluation
- Off-Policy Evaluation
- Feedback Loops
- Metric Drift