White-Box Attacks

Short Definition

White-box attacks are adversarial attacks in which the attacker has full access to the model.

Definition

White-box attacks assume that the attacker has complete knowledge of the target model, including its architecture, parameters, gradients, and training objective. Using this information, the attacker crafts inputs that intentionally cause the model to produce incorrect predictions.

These attacks represent a worst-case threat model and are commonly used to study fundamental robustness limits.

Why It Matters

White-box attacks reveal the maximum vulnerability of a model under ideal attacker conditions. A model that fails under white-box attack exhibits structural weaknesses in its learned decision boundaries.

They are essential for:

  • stress-testing model robustness
  • comparing defensive techniques
  • understanding theoretical limits of generalization
  • avoiding false security assumptions

Robustness claims are often evaluated relative to white-box threats.

How White-Box Attacks Work (Conceptually)

  • The attacker defines an objective (e.g., increasing the model's loss)
  • The attacker computes the gradient of the loss with respect to the input
  • The attacker applies small, targeted perturbations to the input
  • The perturbed input crosses a decision boundary, changing the prediction

Access to gradients enables precise and efficient attacks.
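The steps above can be sketched end-to-end. The following is a minimal, runnable sketch using a hypothetical logistic-regression model, for which the input gradient has a simple closed form; the weights, input, and epsilon are illustrative assumptions, not taken from any particular system.

```python
import math

# Hypothetical 2-feature logistic model: p(y=1|x) = sigmoid(w.x + b)
# (weights chosen for illustration; any differentiable model works)
w = [2.0, -3.0]
b = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    # probability that x belongs to class 1
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(x, y):
    # binary cross-entropy; the attacker's objective is to increase this
    p = predict(x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(x, y):
    # closed-form d(loss)/dx for logistic regression: (p - y) * w
    p = predict(x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

x, y = [1.0, 0.2], 1.0   # clean input with true label 1
epsilon = 0.5            # perturbation budget (illustrative)

# Steps 1-3: define the objective, compute the input gradient,
# take a small signed step in the loss-increasing direction
g = input_gradient(x, y)
x_adv = [xi + epsilon * sign(gi) for xi, gi in zip(x, g)]

# Step 4: the perturbed input crosses the decision boundary
print(predict(x), predict(x_adv))  # clean ≈ 0.87, adversarial ≈ 0.35
```

With this choice of epsilon, a single signed gradient step moves the predicted probability across 0.5, so the classification flips even though each input coordinate changed by at most epsilon.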

Common Characteristics

  • Require full model access
  • Use gradient-based optimization
  • Produce minimal but effective perturbations
  • Highly effective against standard neural networks

White-box attacks are typically stronger than black-box attacks, since direct gradient access removes the need to estimate attack directions through queries.

Example Attack Objectives

  • Untargeted: maximize prediction error
  • Targeted: force prediction into a chosen class

Both objectives exploit gradient information.
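The two objectives differ only in which loss the attacker follows and in which direction. The following is a minimal sketch on a hypothetical three-class softmax-regression model; the weights, input, epsilon, and class indices are illustrative assumptions.

```python
import math

# Hypothetical 3-class linear model on 2 features (weights are illustrative)
W = [[1.0, 0.0],    # class 0
     [0.0, 1.0],    # class 1
     [-1.0, -1.0]]  # class 2

def logits(x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(x):
    z = logits(x)
    return z.index(max(z))

def nll_gradient(x, k):
    # gradient of -log p_k with respect to x: sum_j (p_j - [j == k]) * W[j]
    p = softmax(logits(x))
    return [sum((p[j] - (1.0 if j == k else 0.0)) * W[j][i] for j in range(3))
            for i in range(len(x))]

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

x = [2.0, 0.5]                   # the model predicts class 0 here
true_class, target_class = 0, 2
epsilon = 2.0                    # perturbation budget (illustrative)

# Untargeted: ascend the loss of the true class (make it less likely)
g = nll_gradient(x, true_class)
x_untargeted = [xi + epsilon * sign(gi) for xi, gi in zip(x, g)]

# Targeted: descend the loss of the chosen class (make it more likely)
g = nll_gradient(x, target_class)
x_targeted = [xi - epsilon * sign(gi) for xi, gi in zip(x, g)]

print(predict(x), predict(x_untargeted), predict(x_targeted))  # → 0 1 2
```

The untargeted step only needs the prediction to change (here it lands on class 1), while the targeted step steers the prediction to the attacker-chosen class 2; both use exactly the same gradient machinery.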

Minimal Conceptual Example

# conceptual illustration (the fast gradient sign method, FGSM)
perturbation = epsilon * sign(gradient(loss, input))  # step where the loss grows fastest
adversarial_input = input + perturbation              # small, often imperceptible change

This illustrates how gradient direction guides the attack.

Limitations of White-Box Attacks

  • Assume unrealistic attacker access in many real-world settings
  • May overestimate practical vulnerability
  • Do not capture transfer or query constraints

Despite these limits, they remain the strongest diagnostic tool.

Common Pitfalls

  • Assuming robustness to white-box attacks implies real-world security
  • Evaluating only a single attack method
  • Ignoring calibration and confidence failures under attack
  • Treating white-box robustness as binary

Robustness is always relative to a threat model.

Related Concepts