White-Box Attacks

Short Definition

White-box attacks are adversarial attacks in which the attacker has full access to the model.

Definition

White-box attacks assume that the attacker has complete knowledge of the target model, including its architecture, parameters, gradients, and training objective. Using this information, the attacker crafts inputs that intentionally cause the model to produce incorrect predictions.

These attacks represent a worst-case threat model and are commonly used to study fundamental robustness limits.

Why It Matters

White-box attacks reveal the maximum vulnerability of a model under ideal attacker conditions. A model that fails under white-box attack exhibits structural weaknesses in its learned decision boundaries.

They are essential for:

  • stress-testing model robustness
  • comparing defensive techniques
  • understanding theoretical limits of generalization
  • avoiding false security assumptions

Robustness claims are often evaluated relative to white-box threats.

How White-Box Attacks Work (Conceptually)

  • The attacker defines an objective (e.g., increasing the model's loss)
  • The attacker computes the gradient of the loss with respect to the input
  • The attacker applies small, targeted perturbations to the input
  • The perturbed input crosses a decision boundary, changing the prediction

Access to gradients enables precise and efficient attacks.
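The steps above can be sketched end-to-end. The following is a minimal, runnable sketch using a hypothetical logistic-regression model, for which the input gradient has a simple closed form; the weights, input, and epsilon are illustrative assumptions, not taken from any particular system.

```python
import math

# Hypothetical 2-feature logistic model: p(y=1|x) = sigmoid(w.x + b)
# (weights chosen for illustration; any differentiable model works)
w = [2.0, -3.0]
b = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    # probability that x belongs to class 1
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(x, y):
    # binary cross-entropy; the attacker's objective is to increase this
    p = predict(x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(x, y):
    # closed-form d(loss)/dx for logistic regression: (p - y) * w
    p = predict(x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

x, y = [1.0, 0.2], 1.0   # clean input with true label 1
epsilon = 0.5            # perturbation budget (illustrative)

# Steps 1-3: define the objective, compute the input gradient,
# take a small signed step in the loss-increasing direction
g = input_gradient(x, y)
x_adv = [xi + epsilon * sign(gi) for xi, gi in zip(x, g)]

# Step 4: the perturbed input crosses the decision boundary
print(predict(x), predict(x_adv))  # clean ≈ 0.87, adversarial ≈ 0.35
```

With this choice of epsilon, a single signed gradient step moves the predicted probability across 0.5, so the classification flips even though each input coordinate changed by at most epsilon.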

Common Characteristics

  • Require full model access
  • Use gradient-based optimization
  • Produce minimal but effective perturbations
  • Highly effective against standard neural networks

White-box attacks are typically stronger than black-box attacks, since direct gradient access removes the need to estimate attack directions through queries.

Example Attack Objectives

  • Untargeted: maximize prediction error
  • Targeted: force prediction into a chosen class

Both objectives exploit gradient information.
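The two objectives differ only in which loss the attacker follows and in which direction. The following is a minimal sketch on a hypothetical three-class softmax-regression model; the weights, input, epsilon, and class indices are illustrative assumptions.

```python
import math

# Hypothetical 3-class linear model on 2 features (weights are illustrative)
W = [[1.0, 0.0],    # class 0
     [0.0, 1.0],    # class 1
     [-1.0, -1.0]]  # class 2

def logits(x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(x):
    z = logits(x)
    return z.index(max(z))

def nll_gradient(x, k):
    # gradient of -log p_k with respect to x: sum_j (p_j - [j == k]) * W[j]
    p = softmax(logits(x))
    return [sum((p[j] - (1.0 if j == k else 0.0)) * W[j][i] for j in range(3))
            for i in range(len(x))]

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

x = [2.0, 0.5]                   # the model predicts class 0 here
true_class, target_class = 0, 2
epsilon = 2.0                    # perturbation budget (illustrative)

# Untargeted: ascend the loss of the true class (make it less likely)
g = nll_gradient(x, true_class)
x_untargeted = [xi + epsilon * sign(gi) for xi, gi in zip(x, g)]

# Targeted: descend the loss of the chosen class (make it more likely)
g = nll_gradient(x, target_class)
x_targeted = [xi - epsilon * sign(gi) for xi, gi in zip(x, g)]

print(predict(x), predict(x_untargeted), predict(x_targeted))  # → 0 1 2
```

The untargeted step only needs the prediction to change (here it lands on class 1), while the targeted step steers the prediction to the attacker-chosen class 2; both use exactly the same gradient machinery.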

Minimal Conceptual Example

# conceptual illustration (the fast gradient sign method, FGSM)
perturbation = epsilon * sign(gradient(loss, input))  # step where the loss grows fastest
adversarial_input = input + perturbation              # small, often imperceptible change

This illustrates how gradient direction guides the attack.

Limitations of White-Box Attacks

  • Assume unrealistic attacker access in many real-world settings
  • May overestimate practical vulnerability
  • Do not capture transfer or query constraints

Despite these limits, they remain the strongest diagnostic tool.

Common Pitfalls

  • Assuming robustness to white-box attacks implies real-world security
  • Evaluating only a single attack method
  • Ignoring calibration and confidence failures under attack
  • Treating white-box robustness as binary

Robustness is always relative to a threat model.

Related Concepts