Contextual Bandits (Deep Dive)

Short Definition

Contextual bandits are bandit algorithms that condition action selection on observed context, enabling personalized decisions while balancing exploration and exploitation.

Definition

Contextual bandits extend the multi-armed bandit framework by incorporating side information (context) available at decision time. At each step, the system observes a context, selects an action based on that context, receives a reward for the chosen action only, and updates its policy accordingly.

Decisions adapt to situations, not just averages.

Why It Matters

Many real-world systems—recommendations, ads, notifications, pricing, clinical decision support—must personalize actions based on user or environment context while learning from partial feedback. Contextual bandits provide a principled framework for learning personalized policies online with controlled exploration.

Personalization requires exploration.

Core Components

A contextual bandit system consists of:

  • Context (x): features describing the situation (user, item, time, state)
  • Action (a): a choice among available options
  • Reward (r): feedback observed only for the chosen action
  • Policy (π): mapping from context to action probabilities

Learning happens at decision time.

Minimal Conceptual Illustration


Context → Policy → Action → Reward → Update Policy
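
The loop above can be sketched in a few lines of Python. This is a toy illustration, not a production design: the environment, the single `is_mobile` feature, the reward probabilities, and the `TablePolicy` class are all hypothetical names invented for the example.

```python
import random

ACTIONS = ["a0", "a1"]

def observe_context(rng):
    # Hypothetical context with a single binary feature.
    return {"is_mobile": rng.random() < 0.5}

def true_reward(context, action, rng):
    # Hypothetical environment: a0 works better on mobile, a1 elsewhere.
    p = 0.8 if (action == "a0") == context["is_mobile"] else 0.2
    return 1.0 if rng.random() < p else 0.0

class TablePolicy:
    """Running mean reward per (context, action), with epsilon-greedy choice."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.sums = {}
        self.counts = {}

    def _mean(self, key, action):
        n = self.counts.get((key, action), 0)
        return self.sums.get((key, action), 0.0) / n if n else 0.0

    def select(self, context, rng):
        key = context["is_mobile"]
        if rng.random() < self.epsilon:
            return rng.choice(ACTIONS)  # explore
        return max(ACTIONS, key=lambda a: self._mean(key, a))  # exploit

    def update(self, context, action, reward):
        k = (context["is_mobile"], action)
        self.sums[k] = self.sums.get(k, 0.0) + reward
        self.counts[k] = self.counts.get(k, 0) + 1

def run(n_rounds=2000, seed=0):
    rng = random.Random(seed)
    policy = TablePolicy()
    total = 0.0
    for _ in range(n_rounds):
        x = observe_context(rng)      # Context
        a = policy.select(x, rng)     # Policy -> Action
        r = true_reward(x, a, rng)    # Reward, observed for the chosen action only
        policy.update(x, a, r)        # Update Policy
        total += r
    return policy, total / n_rounds
```

Calling `run()` returns the learned policy and its average reward. The reward for the action that was not chosen is never revealed to the policy, which is the partial-feedback property described above.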

Relationship to Supervised Learning

Unlike supervised learning:

  • labels are observed only for chosen actions
  • data is action-dependent
  • exploration affects data collection
  • logged data are not i.i.d., because the policy that collects them changes as it learns

Contextual bandits learn by intervening.

Common Algorithmic Approaches

Linear Contextual Bandits

Assume linear reward models.

  • examples: LinUCB, linear Thompson Sampling
  • efficient and interpretable
  • sensitive to feature design
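
As a rough sketch of the linear case, the following implements the disjoint form of LinUCB, with one ridge-regression model per action. It is written in pure Python, and using a Sherman-Morrison update to maintain the inverse matrix is a choice made for this example. The class name, the `alpha` exploration weight, and the `ridge` regularizer are assumptions for illustration.

```python
import math

def _matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class LinUCB:
    """Disjoint LinUCB sketch: an independent linear model for each action."""

    def __init__(self, n_actions, n_features, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        d = n_features
        # A_inv starts as the inverse of ridge * I; b starts at zero.
        self.A_inv = [
            [[(1.0 / ridge if i == j else 0.0) for j in range(d)] for i in range(d)]
            for _ in range(n_actions)
        ]
        self.b = [[0.0] * d for _ in range(n_actions)]

    def score(self, action, x):
        A_inv = self.A_inv[action]
        theta = _matvec(A_inv, self.b[action])          # ridge estimate
        mean = _dot(theta, x)                           # predicted reward
        width = math.sqrt(max(_dot(x, _matvec(A_inv, x)), 0.0))
        return mean + self.alpha * width                # optimism bonus

    def select(self, x):
        return max(range(len(self.b)), key=lambda a: self.score(a, x))

    def update(self, action, x, reward):
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        A_inv = self.A_inv[action]
        u = _matvec(A_inv, x)
        denom = 1.0 + _dot(x, u)
        for i in range(len(x)):
            for j in range(len(x)):
                A_inv[i][j] -= u[i] * u[j] / denom
        self.b[action] = [b + reward * xi for b, xi in zip(self.b[action], x)]
```

The bonus term shrinks for directions of the feature space that an action has already seen often, so the same action can receive a large bonus in one context and almost none in another.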

Generalized Linear Bandits

Extend linear models by passing the linear predictor through a non-linear link function.

  • handle binary or count rewards (for example, with a logistic or log link)
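
For binary rewards, one possible reward model is a logistic link over a linear predictor, updated by online gradient steps. The sketch below covers only the reward model; it does not include the confidence bounds or posterior sampling that a full generalized linear bandit would add. The class name and the learning rate `lr` are assumptions for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class LogisticRewardModel:
    """Models P(reward = 1 | x) as sigmoid(theta . x) for a single action."""

    def __init__(self, n_features, lr=0.1):
        self.theta = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sigmoid(sum(t * xi for t, xi in zip(self.theta, x)))

    def update(self, x, reward):
        # Gradient of the Bernoulli log-likelihood with respect to theta is (r - p) * x.
        err = reward - self.predict(x)
        self.theta = [t + self.lr * err * xi for t, xi in zip(self.theta, x)]
```

In a bandit setting one such model would typically be kept per action, or the action would be encoded in the features.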

Non-Linear and Neural Bandits

Use neural networks to model reward functions.

  • flexible representation
  • harder to calibrate and evaluate
  • higher governance needs

Model power increases complexity.

Exploration Strategies

Contextual bandits employ:

  • uncertainty-based exploration (UCB)
  • posterior sampling (Thompson Sampling)
  • randomized policies
  • epsilon-greedy variants

Exploration is context-dependent.
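
As one example from the list above, an epsilon-greedy variant can be written so that it returns the full action distribution along with the sampled action. The function names are hypothetical, and `estimated_rewards` is assumed to come from whatever context-dependent reward model is in use.

```python
import random

def epsilon_greedy_probs(estimated_rewards, epsilon):
    """Spread epsilon uniformly over all actions; put the rest on the greedy one."""
    n = len(estimated_rewards)
    greedy = max(range(n), key=lambda a: estimated_rewards[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def sample_action(probs, rng):
    """Sample an action index and return it with its propensity."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs[action]
```

Because the estimates depend on the context, the greedy action and therefore the whole distribution change from one context to the next. Returning the propensity with the action makes it straightforward to log.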

Relationship to Exploration vs Exploitation

Contextual bandits operationalize the exploration–exploitation trade-off by conditioning uncertainty on context. The same action may be exploited in one context and explored in another.

Exploration is selective.

Relationship to Causal Evaluation

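When a contextual bandit randomizes actions conditional on context, with known and nonzero probabilities, it generates data suitable for causal inference and counterfactual estimation within the observed context space. A policy that always takes the highest-scoring action does not provide this guarantee on its own.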

Context enables conditional causality.

Counterfactual Logging Requirements

Effective contextual bandits must log:

  • chosen action
  • action probabilities (propensities)
  • context features
  • policy version
  • timestamps

Without propensities, off-policy evaluation breaks.
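
A minimal sketch of a log record carrying the fields listed above might look as follows. The class and field names are hypothetical, and the optional `reward` field is an addition beyond the list, included on the assumption that feedback is joined to the decision later.

```python
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class DecisionLog:
    context: dict          # context features at decision time
    action: str            # chosen action
    propensity: float      # probability the logging policy gave the chosen action
    action_probs: dict     # full action distribution, when available
    policy_version: str
    timestamp: float = field(default_factory=time.time)
    reward: Optional[float] = None  # assumption: filled in when feedback arrives

    def __post_init__(self):
        # A zero propensity cannot be inverted by importance-weighted estimators.
        if not 0.0 < self.propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)
```

Validating the propensity at write time catches the most damaging logging error early, since a missing or zero propensity cannot be reconstructed afterwards.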

Evaluation Challenges

Evaluating contextual bandits is difficult due to:

  • partial feedback
  • policy-dependent data
  • delayed rewards
  • non-stationarity
  • metric drift

Offline accuracy is insufficient.

Off-Policy Evaluation

Estimating how a new policy would have performed on logged data relies on off-policy evaluation methods such as:

  • inverse propensity scoring (IPS)
  • doubly robust estimators
  • self-normalized estimators

Evaluation is probabilistic, not deterministic.
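
The three estimators can be sketched as below. The log format (tuples of context, action, reward, propensity) and the `target_prob` and `reward_model` callables are assumptions for illustration: `target_prob(x, a)` is the probability the policy under evaluation gives action `a` in context `x`, and `reward_model(x, a)` is any predicted reward.

```python
def ips(logs, target_prob):
    """Inverse propensity scoring: reweight each logged reward by pi_target / pi_logging."""
    return sum(target_prob(x, a) / p * r for x, a, r, p in logs) / len(logs)

def snips(logs, target_prob):
    """Self-normalized IPS: divide by the sum of weights instead of the sample count."""
    weights = [target_prob(x, a) / p for x, a, _, p in logs]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # the target policy never agrees with the logged actions
    return sum(w * r for w, (_, _, r, _) in zip(weights, logs)) / total

def doubly_robust(logs, target_prob, reward_model, actions):
    """Model-based estimate plus an importance-weighted correction of its residual."""
    total = 0.0
    for x, a, r, p in logs:
        baseline = sum(target_prob(x, b) * reward_model(x, b) for b in actions)
        correction = target_prob(x, a) / p * (r - reward_model(x, a))
        total += baseline + correction
    return total / len(logs)
```

All three divide by the logged propensity, which is why a missing or zero propensity makes the estimate undefined. The self-normalized form trades a small bias for lower variance, and the doubly robust form stays consistent if either the propensities or the reward model are correct.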

Risks and Failure Modes

  • insufficient exploration in rare contexts
  • feature leakage via context
  • feedback loops narrowing context coverage
  • proxy reward optimization
  • calibration and uncertainty failures

Context amplifies both power and risk.

Governance Considerations

Evaluation governance should define:

  • acceptable exploration rates
  • safety constraints by context
  • auditing for subgroup impact
  • recalibration and retraining triggers

Personalization requires oversight.

Common Pitfalls

  • treating contextual bandits as supervised models
  • disabling exploration after early gains
  • failing to log propensities
  • assuming stationarity across contexts
  • optimizing short-term proxy rewards

Contextual learning is fragile.

Summary Characteristics

Aspect            Contextual Bandits
Personalization   High
Feedback          Partial
Learning mode     Online
Causal validity   Enabled
Governance need   High

Related Concepts

  • Generalization & Evaluation
  • Bandit Algorithms (Overview)
  • Exploration vs Exploitation
  • Counterfactual Logging
  • Causal Evaluation
  • Off-Policy Evaluation
  • Feedback Loops
  • Metric Drift