Short Definition
Targeted attacks are adversarial attacks designed to force a model to predict a specific incorrect output.
Definition
Targeted attacks are a class of adversarial attacks in which the attacker’s goal is not merely to cause an incorrect prediction but to steer the model toward a specific, attacker-chosen class or output. The attacker explicitly specifies the desired misclassification.
Because the attacker must land on one specific class rather than any incorrect one, these attacks demand finer control over the model’s behavior than untargeted attacks and represent a stronger, more deliberate form of adversarial manipulation.
Why It Matters
Targeted attacks demonstrate that models can be manipulated in precise and predictable ways. This has serious implications for systems where specific outcomes carry meaning, such as:
- identity recognition
- authorization systems
- content moderation
- medical or legal classification
A model vulnerable to targeted attacks may be coerced into producing strategically chosen errors.
How Targeted Attacks Work (Conceptually)
- The attacker selects a target class different from the true label
- An objective is defined to increase the model’s confidence in that target
- Perturbations are optimized to move the input toward the target decision region
- The attack succeeds when the model predicts the chosen target
The attack objective is directional, not merely disruptive.
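The four steps above can be sketched end to end. This is a minimal illustration on a toy linear softmax classifier in NumPy, not a production attack: the model, function names, step size, and iteration count are all assumptions chosen for clarity. The loop repeatedly descends the cross-entropy loss measured against the chosen target class, which pushes the input into the target’s decision region.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_attack(W, x, target, eps=0.1, steps=50):
    """Iterative targeted attack on a toy linear classifier (logits = W @ x).

    Each step takes a signed gradient step (FGSM-style) that DECREASES the
    cross-entropy loss toward `target`, i.e. the objective is directional:
    it moves the input toward the target class, not merely away from the
    true one.
    """
    onehot = np.eye(W.shape[0])[target]
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - onehot)        # gradient of targeted loss w.r.t. input
        x_adv = x_adv - eps * np.sign(grad)  # step toward the target region
    return x_adv

# Toy 3-class model: force the prediction to class 2, whatever the clean label.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
x_adv = targeted_attack(W, x, target=2)
print("clean prediction:", np.argmax(W @ x))
print("adversarial prediction:", np.argmax(W @ x_adv))
```

The attack succeeds when the final prediction equals the chosen target; with this toy model and a generous perturbation budget, a few dozen steps suffice.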
Targeted vs Untargeted Attacks
- Targeted attacks: aim for a specific wrong outcome
- Untargeted attacks: aim for any wrong outcome
Targeted attacks are generally harder to execute but more revealing of model vulnerabilities.
Minimal Conceptual Example
# conceptual targeted objective
minimize loss(model(input + perturbation), target_class)
This objective explicitly pushes the prediction toward a chosen target.
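The contrast with the untargeted objective can be made concrete. In this sketch (again a toy linear softmax model in NumPy; all names and the step size are illustrative), the untargeted step *ascends* the loss on the true label, while the targeted step *descends* the loss on the chosen target class; the two differ in both the sign of the step and the label the loss is measured against.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad(W, x, label):
    """Gradient of cross-entropy loss w.r.t. the input, for logits = W @ x."""
    onehot = np.eye(W.shape[0])[label]
    return W.T @ (softmax(W @ x) - onehot)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))
x = rng.normal(size=8)
true_label = int(np.argmax(W @ x))
target_class = (true_label + 1) % 4  # any class other than the true one
eps = 0.001                          # small step, for illustration only

# Untargeted: ascend the loss on the TRUE label (any wrong class will do).
x_untargeted = x + eps * np.sign(ce_grad(W, x, true_label))

# Targeted: descend the loss on the CHOSEN target class (directional objective).
x_targeted = x - eps * np.sign(ce_grad(W, x, target_class))
```

One step is enough to see the direction of each objective: the untargeted step lowers the model’s confidence in the true label, while the targeted step raises its confidence in the target class.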
Common Characteristics
- Require defining a target output
- Often rely on gradient-based optimization
- More constrained than untargeted attacks
- Can produce semantically meaningful misclassifications
Targeted attacks probe how finely an attacker can control where an input lands relative to the model’s decision boundaries.
Common Pitfalls
- Assuming targeted attacks are unrealistic or impractical
- Evaluating robustness only against untargeted attacks
- Ignoring high-confidence targeted failures
- Treating targeted robustness as binary
Targeted robustness is relative to the attacker’s goals.
Related Concepts
- Model Robustness
- Adversarial Attacks (Overview)
- Untargeted Attacks
- White-Box Attacks
- Black-Box Attacks
- Adversarial Examples