Positional Encoding

Short Definition

Positional encoding is a method used in Transformer models to inject information about token order into input representations.

Definition

Transformers process sequences in parallel and lack inherent recurrence or convolution to encode order. Positional encoding adds explicit position information to token embeddings so the model can distinguish between different token positions within a sequence.

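Without positional information, self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs, so the model cannot tell one ordering from another.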

Why It Matters

Self-attention treats inputs as a set:

it does not inherently know which token comes first
it cannot distinguish reordered sequences

For example:

“The cat chased the dog”
“The dog chased the cat”

Both sentences contain the same tokens, so without positional information pure self-attention computes the same representation for each token in either ordering.

Order defines meaning.
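The claim above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: `self_attention` is a hypothetical single-head helper, and the embeddings and weight matrices are random stand-ins rather than trained values. It shows that swapping two tokens only swaps the corresponding output rows.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))  # stand-in embeddings for 5 tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Swap tokens 1 and 4, as in "The cat chased the dog" vs "The dog chased the cat".
perm = np.array([0, 4, 2, 3, 1])

out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Each token receives the same output vector regardless of where it sits.
equivariant = np.allclose(out[perm], out_perm)
# Any order-insensitive pooling (here, the mean) is therefore identical.
same_pooled = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
print(equivariant, same_pooled)
```

Because every token's output is unchanged by the reordering, nothing downstream can recover which noun was the subject unless position is added to the input.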

Core Mechanism

The original Transformer introduced sinusoidal positional encodings:

For position $pos$ and dimension-pair index $i$ (so $2i$ and $2i+1$ are the even and odd dimensions of the embedding):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

These values are added to token embeddings:

InputEmbedding + PositionalEncoding

Position becomes part of the representation.
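The formula above translates fairly directly into array code. This is a minimal NumPy sketch under the assumption that `d_model` is even; the function name `sinusoidal_encoding` and the random `token_embeddings` are illustrative, not taken from any particular library.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pos = np.arange(max_len)[:, None]            # shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # shape (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions 2i + 1
    return pe

# Add the encoding to stand-in token embeddings for a 10-token sequence.
rng = np.random.default_rng(0)
seq_len, d_model = 10, 16
token_embeddings = rng.normal(size=(seq_len, d_model))
pe = sinusoidal_encoding(max_len=50, d_model=d_model)
model_input = token_embeddings + pe[:seq_len]
```

The table is computed once for a chosen `max_len` and sliced per sequence; since it has no trainable parameters, it can be regenerated for a longer length at any time.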

Minimal Conceptual Illustration

Token Embedding:       [0.2, 0.8, 0.5, ...]
Positional Encoding:   [0.1, 0.3, 0.9, ...]
Final Input:           [0.3, 1.1, 1.4, ...]

Position modifies meaning.

Why Sinusoidal?

Sinusoidal encodings:

are defined for any position, which in principle allows extrapolation to sequences longer than those seen in training
encode relative distances implicitly
provide smooth, continuous positional variation

Relative position emerges from phase differences.
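One way to see this: since sin(a)sin(b) + cos(a)cos(b) = cos(a - b), the dot product of two sinusoidal encodings depends only on the offset between the positions, not on where they sit in the sequence. The sketch below checks that numerically; `sinusoidal_encoding` is an illustrative helper that assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal encodings plus the per-pair frequencies used to build them."""
    pos = np.arange(max_len)[:, None]
    freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    angles = pos * freqs
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe, freqs

pe, freqs = sinusoidal_encoding(max_len=64, d_model=16)

offset = 7
dot_early = pe[3] @ pe[3 + offset]      # positions 3 and 10
dot_late = pe[40] @ pe[40 + offset]     # positions 40 and 47

# Closed form: sum over frequencies of cos(offset * frequency).
expected = np.cos(offset * freqs).sum()
print(np.isclose(dot_early, dot_late), np.isclose(dot_early, expected))
```

The same identity implies that the encoding at position pos + k is a fixed rotation of the encoding at pos within each sine/cosine pair, which is what makes relative offsets recoverable by a linear map.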

Learned Positional Embeddings

An alternative approach:

learn position embeddings directly
treat position like a token

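This can match or exceed sinusoidal encodings at the lengths seen in training, but:

cannot represent positions beyond the trained maximum length
ties the model to a fixed maximum sequence size

Learned encodings adapt to the data at the cost of length generalization.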
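The length limit is easiest to see in code. Below is a framework-free sketch: `LearnedPositionalEmbedding` is a hypothetical class whose table is randomly initialized here and would be updated by gradient descent in a real model. The point is only that the table has exactly `max_len` rows.

```python
import numpy as np

class LearnedPositionalEmbedding:
    """A position lookup table with one trainable row per position index."""

    def __init__(self, max_len, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.max_len = max_len
        # Stand-in for trained parameters; real values come from training.
        self.table = rng.normal(scale=0.02, size=(max_len, d_model))

    def __call__(self, token_embeddings):
        seq_len = token_embeddings.shape[0]
        if seq_len > self.max_len:
            raise ValueError(
                f"sequence length {seq_len} exceeds max_len {self.max_len}: "
                "no embedding exists for the extra positions"
            )
        return token_embeddings + self.table[:seq_len]

pos_emb = LearnedPositionalEmbedding(max_len=8, d_model=4)
short_input = pos_emb(np.zeros((5, 4)))     # fine: 5 <= 8

try:
    pos_emb(np.zeros((12, 4)))              # 12 > 8: no rows for positions 8..11
    too_long_failed = False
except ValueError:
    too_long_failed = True
```

Raising an error is one design choice; silently truncating or wrapping position indices would hide the problem and is usually worse.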

Absolute vs Relative Position Encoding

Type	Description
Absolute	Each position has a unique embedding
Relative	Encodes distance between tokens
Rotary (RoPE)	Rotates query and key vectors by a position-dependent angle, so attention scores depend on relative offset
ALiBi	Adds a linear, distance-based bias to attention scores

Modern architectures increasingly use relative methods.
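As an illustration of the rotary row in the table, the sketch below rotates consecutive pairs of a vector by an angle proportional to its position. The helper name `rope`, the interleaved pairing of dimensions, and the base of 10000 are assumptions of this sketch; implementations differ in how they pair dimensions. The check at the end shows that the query-key dot product depends only on the offset between the two positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of a 1-D vector by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same offset (2) at different absolute positions gives the same score.
score_near_start = rope(q, 3) @ rope(k, 1)
score_far_along = rope(q, 12) @ rope(k, 10)
print(np.isclose(score_near_start, score_far_along))
```

Because the operation is a pure rotation, it also leaves vector norms unchanged, so position affects only the angle between queries and keys.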

Relationship to Self-Attention

Self-attention computes similarity:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Positional encoding ensures:

queries and keys incorporate position
attention weights reflect order
temporal relationships are learnable

Position influences attention scores.

Impact on Long-Sequence Modeling

Positional encoding:

determines extrapolation behavior
affects stability at longer sequence lengths
influences model scaling

Poor positional handling limits generalization.
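ALiBi is a useful example of how the choice of method shapes extrapolation, because its position signal is a bias defined for any sequence length rather than a table of fixed size. The sketch below builds a causal ALiBi-style bias; `alibi_slopes` and `alibi_bias` are hypothetical helper names, and the slope schedule is the geometric sequence usually described for power-of-two head counts, stated here as an assumption.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """Return an (n_heads, seq_len, seq_len) bias to add to causal attention scores."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]      # query index minus key index
    bias = -alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]
    causal = distance >= 0                      # keys at or before the query
    return np.where(causal[None, :, :], bias, -np.inf)

short = alibi_bias(n_heads=8, seq_len=4)
long = alibi_bias(n_heads=8, seq_len=16)

# The bias for a longer sequence extends the shorter one without changing it.
consistent = np.array_equal(short, long[:, :4, :4])
print(short[0])
```

The bias is added to the pre-softmax scores, so more distant keys are penalized linearly and future keys are masked out entirely.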

Common Pitfalls

forgetting positional encoding entirely
exceeding a fixed maximum length at inference time
assuming sinusoidal encodings always outperform learned ones
mismanaging padding positions

Position is subtle but critical.
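For the padding pitfall, one common approach is to derive position indices from the attention mask so that padding does not shift the positions of real tokens. The sketch below assumes a mask of 1 for real tokens and 0 for padding; `position_ids_from_mask` is a hypothetical helper, and assigning index 0 to padded slots is an arbitrary choice, since those slots should be masked out of attention anyway.

```python
import numpy as np

def position_ids_from_mask(attention_mask):
    """Number real tokens 0, 1, 2, ... per row, ignoring padding slots."""
    mask = np.asarray(attention_mask)
    ids = np.cumsum(mask, axis=-1) - 1      # running count of real tokens, zero-based
    return np.where(mask == 1, ids, 0)      # padded slots get a dummy index

# Two sequences in a batch; the first is left-padded by two tokens.
attention_mask = [[0, 0, 1, 1, 1],
                  [1, 1, 1, 1, 1]]
position_ids = position_ids_from_mask(attention_mask)
print(position_ids)
# [[0 0 0 1 2]
#  [0 1 2 3 4]]
```

Using `np.arange(seq_len)` for every row instead would give the first real token of the padded sequence position 2, so the same text would be encoded differently depending on how much padding the batch happened to need.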

Positional Encoding vs Recurrence

Aspect	Recurrence (RNN)	Positional Encoding
Order awareness	Inherent	Explicitly added
Parallelism	No	Yes
Memory mechanism	Hidden state	Attention weights

Transformers externalize order instead of encoding it sequentially.

Practical Considerations

When implementing positional encoding:

ensure alignment with embedding dimension
handle padding tokens carefully
consider relative encodings for long sequences
test extrapolation performance

Order handling affects scaling behavior.

Summary Characteristics

Aspect	Positional Encoding
Purpose	Inject order information
Used in	Transformers
Default type	Sinusoidal (original paper)
Modern variants	Relative, Rotary, ALiBi
Critical for	Long-sequence modeling

Related Concepts

Architecture & Representation
Attention Mechanism
Self-Attention
Multi-Head Attention
Transformer Architecture
Representation Learning
Scaling Laws