Self-Attention

Short Definition

Self-Attention is a mechanism in neural networks that allows each element of a sequence to dynamically focus on other elements in the same sequence when computing its representation.

It enables models to capture relationships between tokens regardless of their distance in the sequence.

Definition

In sequence modeling, each token often depends on the context provided by other tokens.

Self-attention allows a model to compute contextualized representations by comparing each token with all others in the sequence.

Given a sequence:

[
x_1, x_2, …, x_n
]

the model computes attention scores between every pair of tokens.

These scores determine how strongly one token should influence another during representation learning.

Core Idea

Instead of processing tokens sequentially like recurrent models, self-attention allows every token to interact with every other token simultaneously.

x1 ↔ x2 ↔ x3 ↔ x4 ↔ x5

Each token gathers information from the entire sequence to build its representation.

This enables the model to capture long-range dependencies efficiently.

Query–Key–Value Mechanism

Self-attention uses three learned projections of each token representation:

  • Query (Q)
  • Key (K)
  • Value (V)

For each token embedding (x):

[
Q = xW_Q
]

[
K = xW_K
]

[
V = xW_V
]

The attention scores between tokens, and the resulting output, are computed using the scaled dot product:

[
Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V
]

Where:

  • (QK^T) measures similarity between tokens
  • (d_k) is the dimensionality of keys
  • softmax converts similarities into attention weights

These weights determine how much information flows between tokens.
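The scaled dot-product computation above can be sketched in a few lines of NumPy. The function name, shapes, and weight matrices here are illustrative, not a reference implementation:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    X: (n, d_model) token embeddings; W_q, W_k, W_v: (d_model, d_k) projections.
    """
    Q = X @ W_q                      # queries
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) pairwise similarities QK^T / sqrt(d_k)
    # numerically stable softmax over the key axis -> attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # (n, d_k) contextualized representations

# Toy example with random embeddings and projections
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```

Note that each output row mixes value vectors from every position, which is exactly the global pairwise interaction described above.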

Minimal Conceptual Illustration

Sentence:

The cat sat on the mat


When computing the representation for **sat**, the model may attend strongly to:

  • cat
  • sat (itself)
  • mat

This helps the model understand the grammatical and semantic relationships in the sentence.

Self-Attention vs Other Mechanisms

| Mechanism      | Context Modeling             |
|----------------|------------------------------|
| RNN            | sequential recurrence        |
| CNN            | local receptive fields       |
| Self-Attention | global pairwise interactions |

Self-attention directly connects all tokens in a sequence.

Multi-Head Self-Attention

Transformers extend self-attention using multiple attention heads.

Each head learns different relationships between tokens.

[
MultiHead(Q,K,V) = Concat(head_1, …, head_h)W^O
]

Different heads may capture:

  • syntactic relationships
  • semantic dependencies
  • positional interactions
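The multi-head formula above amounts to running several independent attention computations and concatenating their outputs before a final projection. A minimal sketch, assuming the single-head attention is available as a helper (all names and shapes here are illustrative):

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One head of scaled dot-product self-attention."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o: output projection."""
    outputs = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(1)
n, d_model, h = 6, 8, 2
d_k = d_model // h  # common convention: split d_model evenly across heads
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (6, 8)
```

Because each head has its own projections, each can specialize in a different kind of relationship while the output projection W^O recombines them.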

Advantages

Self-attention provides several benefits:

  • captures long-range dependencies
  • parallel computation across tokens
  • flexible representation learning
  • strong scalability

These properties have made it central to modern large language models.

Computational Complexity

Self-attention computes interactions between all token pairs.

For sequence length (n):

[
O(n^2)
]

This quadratic cost in both time and memory makes long sequences expensive to process.

Many research efforts focus on improving attention efficiency.
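To make the quadratic cost concrete, the following sketch computes the memory footprint of a single (n, n) attention-score matrix stored in float32 (one head, no batching; the sequence lengths are arbitrary examples):

```python
import numpy as np

# Memory for one (n, n) float32 attention-score matrix at various sequence lengths
for n in (512, 1024, 2048, 4096):
    mb = n * n * np.dtype(np.float32).itemsize / 2**20
    print(f"n={n:5d}: {mb:8.1f} MiB")
# Doubling n quadruples the memory: 1.0, 4.0, 16.0, 64.0 MiB
```

Multiplied across heads, layers, and batch elements, this is why long-context attention is a major memory bottleneck and a target for efficiency research.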

Applications

Self-attention is used in many modern architectures:

  • Transformers
  • Large Language Models
  • Vision Transformers
  • Multimodal models

It is a core component of modern deep learning systems.

Summary

Self-attention is a mechanism that allows each token in a sequence to dynamically attend to all other tokens when computing its representation.

By enabling global contextual interactions, it provides powerful modeling capabilities and forms the foundation of the Transformer architecture.

Related Concepts

  • Transformer Architecture
  • Multi-Head Attention
  • Positional Encoding
  • Encoder–Decoder Models
  • Attention Mechanism
  • Recurrent Neural Networks (RNN)
  • State-Space Models