Contextual Bandits (Deep Dive)

Short Definition

Contextual bandits are bandit algorithms that condition action selection on observed context, enabling personalized decisions while balancing exploration and exploitation.

Definition

Contextual bandits extend the multi-armed bandit framework by incorporating side information (context) available at decision time. At each step, the system observes a context, selects an action based on that context, receives a reward for the chosen action only, and updates its policy accordingly.

Decisions adapt to situations, not just averages.

Why It Matters

Many real-world systems (recommendations, ads, notifications, pricing, clinical decision support) must personalize actions based on user or environment context while learning from partial feedback. Contextual bandits provide a principled framework for learning personalized policies online with controlled exploration.

Personalization requires exploration.

Core Components

A contextual bandit system consists of:

Context (x): features describing the situation (user, item, time, state)
Action (a): a choice among available options
Reward (r): feedback observed only for the chosen action
Policy (π): mapping from context to action probabilities

Learning happens online, as rewards from past decisions arrive.

Minimal Conceptual Illustration

Context → Policy → Action → Reward → Update Policy
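
The loop above can be sketched in Python on a toy problem. This is a minimal illustration under stated assumptions, not a production design: the contexts, actions, and click rates are invented for the example, and the policy is a simple per-context epsilon-greedy rule over running mean rewards.

```python
import random
from collections import defaultdict

def run_epsilon_greedy(n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate the observe -> act -> reward -> update loop on a toy problem."""
    rng = random.Random(seed)
    contexts = ["mobile", "desktop"]      # hypothetical contexts
    actions = ["banner_a", "banner_b"]    # hypothetical actions
    # Hypothetical true click rates; the learner never reads these directly.
    true_rate = {("mobile", "banner_a"): 0.8, ("mobile", "banner_b"): 0.2,
                 ("desktop", "banner_a"): 0.2, ("desktop", "banner_b"): 0.8}
    counts = defaultdict(int)     # (context, action) -> number of pulls
    means = defaultdict(float)    # (context, action) -> running mean reward

    for _ in range(n_rounds):
        x = rng.choice(contexts)                  # 1. observe context
        if rng.random() < epsilon:                # 2. policy: explore ...
            a = rng.choice(actions)
        else:                                     #    ... or exploit
            a = max(actions, key=lambda act: means[(x, act)])
        # 3. reward is observed for the chosen action only
        r = 1.0 if rng.random() < true_rate[(x, a)] else 0.0
        # 4. update the running mean for this (context, action) pair
        counts[(x, a)] += 1
        means[(x, a)] += (r - means[(x, a)]) / counts[(x, a)]

    # Greedy action per context after learning.
    return {x: max(actions, key=lambda act: means[(x, act)]) for x in contexts}

learned = run_epsilon_greedy()
```

In this toy setup the best action differs by context, so a non-contextual bandit that tracks only one average per action could not separate the two cases.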

Relationship to Supervised Learning

Unlike supervised learning:

labels are observed only for chosen actions
data is action-dependent
exploration affects data collection
logged data are not IID, because the action distribution shifts as the policy updates

Contextual bandits learn by intervening.

Common Algorithmic Approaches

Linear Contextual Bandits

Assume linear reward models.

examples: LinUCB, linear Thompson Sampling
efficient and interpretable
sensitive to feature design
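
As one possible illustration of the linear approach, the sketch below implements a disjoint LinUCB-style scorer with one ridge-regression model per action. It is a simplified sketch: the class name LinUCBArm, the helper choose, and the exploration weight alpha are illustrative choices, the ridge prior is fixed to the identity matrix, and the inverse is maintained with a Sherman-Morrison rank-one update so that no linear-algebra library is needed.

```python
import math

class LinUCBArm:
    """Per-action ridge regression state for a disjoint LinUCB-style bandit."""

    def __init__(self, dim):
        # A starts as the identity (ridge prior), so A^-1 is also the identity.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def score(self, x, alpha):
        """Upper confidence bound: theta . x + alpha * sqrt(x^T A^-1 x)."""
        theta = self._A_inv_times(self.b)          # theta = A^-1 b
        A_inv_x = self._A_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, A_inv_x)))
        return mean + alpha * width

    def update(self, x, reward):
        """Apply A <- A + x x^T and b <- b + reward * x."""
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, A_inv_x))
        d = len(x)
        for i in range(d):
            for j in range(d):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def choose(arms, x, alpha=1.0):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.score(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)
```

A larger alpha widens the confidence bonus, so rarely tried actions are preferred more often; alpha = 0 reduces the rule to a purely greedy choice on the estimated mean.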

Generalized Linear Bandits

Extend linear models with a non-linear link function, such as a logistic or log link.

handle binary or count rewards
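
For binary rewards, the reward model can be sketched as a logistic regression updated one observation at a time. This covers only the reward model, not the exploration rule that a full generalized linear bandit would add on top; the function names and the learning rate lr are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(theta, x):
    """Expected binary reward under a logistic link: sigmoid(theta . x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def sgd_update(theta, x, reward, lr=0.1):
    """One stochastic-gradient step on the log-loss for an observed (x, reward)."""
    error = reward - predict(theta, x)
    return [t + lr * error * xi for t, xi in zip(theta, x)]
```

One set of weights would typically be kept per action, or the action would be encoded into the feature vector.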

Non-Linear and Neural Bandits

Use neural networks to model reward functions.

flexible representation
harder to calibrate and evaluate
higher governance needs

Model power increases complexity.

Exploration Strategies

Contextual bandits employ:

uncertainty-based exploration (UCB)
posterior sampling (Thompson Sampling)
randomized policies (for example, softmax over estimated rewards)
epsilon-greedy variants

Exploration is context-dependent.
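
Two of the randomized strategies listed above can be written as functions that return a full probability distribution over actions, which makes the probability of the chosen action available for logging. This is a sketch under the assumption that estimates already holds the per-action reward estimates for the current context; the function names and the temperature parameter are illustrative.

```python
import math
import random

def epsilon_greedy_probs(estimates, epsilon):
    """Spread epsilon uniformly over all actions; give the rest to the best one."""
    k = len(estimates)
    best = max(range(k), key=estimates.__getitem__)
    probs = [epsilon / k] * k
    probs[best] += 1.0 - epsilon
    return probs

def softmax_probs(estimates, temperature=1.0):
    """Boltzmann exploration: higher estimates get exponentially more probability."""
    top = max(estimates)  # subtract the max for numerical stability
    exps = [math.exp((e - top) / temperature) for e in estimates]
    total = sum(exps)
    return [v / total for v in exps]

def sample_action(probs, rng):
    """Draw an action index and return it with its selection probability."""
    a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return a, probs[a]
```

Because the estimates are computed per context, the same action can receive a high probability in one context and only the exploration floor in another.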

Relationship to Exploration vs Exploitation

Contextual bandits operationalize the exploration–exploitation trade-off by conditioning uncertainty on context. The same action may be exploited in one context and explored in another.

Exploration is selective.

Relationship to Causal Evaluation

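When a contextual bandit randomizes actions conditional on context and records the probability of each choice, the logged data support causal inference and counterfactual estimation within the region of context space the policy actually covered. Deterministic selection rules, such as a pure UCB argmax, assign probability 1 to a single action and do not provide this randomization on their own.

Randomization within context enables conditional causal estimates.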

Counterfactual Logging Requirements

Effective contextual bandits must log:

chosen action
action probabilities (propensities)
context features
policy version
timestamps

Without logged propensities, importance-weighted off-policy estimates cannot be computed reliably.
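
A log record covering the fields listed above might look like the following sketch. The class name DecisionLog, the field names, and the example values are hypothetical; the reward field is an added assumption, left empty at decision time and filled in once feedback arrives.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Dict, Optional

@dataclass
class DecisionLog:
    context: Dict[str, float]        # context features seen at decision time
    action: str                      # chosen action
    action_probs: Dict[str, float]   # probability of every available action
    policy_version: str              # which policy produced the decision
    timestamp: float                 # decision time, seconds since the epoch
    reward: Optional[float] = None   # assumption: joined later, may be delayed

    @property
    def propensity(self) -> float:
        """Probability with which the logged action was chosen."""
        return self.action_probs[self.action]

    def to_json_line(self) -> str:
        """Serialize the stored fields as one JSON line."""
        return json.dumps(asdict(self), sort_keys=True)

record = DecisionLog(
    context={"hour": 14.0, "is_mobile": 1.0},
    action="banner_b",
    action_probs={"banner_a": 0.1, "banner_b": 0.9},
    policy_version="v0-example",
    timestamp=time.time(),
)
```

Storing the full distribution rather than only the chosen action's probability keeps the record usable for estimators that need probabilities of actions that were not taken.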

Evaluation Challenges

Evaluating contextual bandits is difficult due to:

partial feedback
policy-dependent data
delayed rewards
non-stationarity
metric drift

Offline predictive accuracy of the reward model does not measure the value of the resulting policy.

Off-Policy Evaluation

Contextual bandits rely on off-policy evaluation methods such as:

inverse propensity scoring (IPS)
doubly robust estimators
self-normalized estimators

These estimates carry variance, so they are best reported with uncertainty rather than as exact values.
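
The three estimators listed above can be sketched as follows. The sketch assumes each log entry is a tuple (context, action, reward, propensity) with a strictly positive propensity, that target_prob(x, a) returns the probability the candidate policy assigns to action a in context x, and that reward_model(x, a) is some separately fitted reward predictor; all of these names are illustrative.

```python
def ips(logs, target_prob):
    """Inverse propensity scoring: reweight each reward by target / logging probability."""
    return sum(target_prob(x, a) / p * r for x, a, r, p in logs) / len(logs)

def snips(logs, target_prob):
    """Self-normalized IPS: divide by the sum of weights instead of the sample size.

    Undefined when every weight is zero (the target policy never agrees with the log).
    """
    weights = [target_prob(x, a) / p for x, a, _, p in logs]
    weighted_rewards = sum(w * r for w, (_, _, r, _) in zip(weights, logs))
    return weighted_rewards / sum(weights)

def doubly_robust(logs, target_prob, reward_model, actions):
    """Model-based baseline plus an importance-weighted correction of its residual."""
    total = 0.0
    for x, a, r, p in logs:
        baseline = sum(target_prob(x, b) * reward_model(x, b) for b in actions)
        correction = target_prob(x, a) / p * (r - reward_model(x, a))
        total += baseline + correction
    return total / len(logs)

# Toy log: uniform logging policy over two actions in a single context.
logs = [("u", "a", 1.0, 0.5), ("u", "b", 0.0, 0.5),
        ("u", "a", 1.0, 0.5), ("u", "b", 0.0, 0.5)]

def always_a(x, action):
    """Candidate policy that always picks action 'a'."""
    return 1.0 if action == "a" else 0.0

ips_value = ips(logs, always_a)
snips_value = snips(logs, always_a)
dr_value = doubly_robust(logs, always_a, lambda x, a: 0.0, ["a", "b"])
```

With a reward model that always predicts zero, the doubly robust estimate reduces to IPS; a better model shrinks the residual that the importance weights have to correct, which tends to lower variance.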

Risks and Failure Modes

insufficient exploration in rare contexts
feature leakage via context (features that encode the outcome or are unavailable at decision time)
feedback loops narrowing context coverage
proxy reward optimization
calibration and uncertainty failures

Context amplifies both power and risk.

Governance Considerations

Governance policies should define:

acceptable exploration rates
safety constraints by context
auditing for subgroup impact
recalibration and retraining triggers

Personalization requires oversight.

Common Pitfalls

treating contextual bandits as supervised models
disabling exploration after early gains
failing to log propensities
assuming stationarity across contexts
optimizing short-term proxy rewards

Contextual learning is fragile.

Summary Characteristics

Aspect	Contextual Bandits
Personalization	High
Feedback	Partial
Learning mode	Online
Causal validity	Enabled when actions are randomized and propensities are logged
Governance need	High

Related Concepts

Generalization & Evaluation
Bandit Algorithms (Overview)
Exploration vs Exploitation
Counterfactual Logging
Causal Evaluation
Off-Policy Evaluation
Feedback Loops
Metric Drift