Positional Encoding

Understanding positional encoding in transformers - Neural Networks Lexicon

Short Definition

Positional encoding is a method used in Transformer models to inject information about token order into input representations.

Definition

Transformers process sequences in parallel and lack inherent recurrence or convolution to encode order. Positional encoding adds explicit position information to token embeddings so the model can distinguish between different token positions within a sequence.

Without position information, self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs.

Why It Matters

Self-attention treats inputs as a set:

  • it does not inherently know which token comes first
  • it cannot distinguish reordered sequences

For example:


“The cat chased the dog”
“The dog chased the cat”

Without positional information, pure attention sees the same set of tokens in both sentences and produces the same representation for each token, only in a different order.

Order defines meaning.
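
The point can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and a made-up four-word vocabulary; the names self_attention, emb, w_q, w_k and w_v are illustrative and not taken from any library. It runs both sentences through attention with no position signal and compares the outputs.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no position signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 8
# Hypothetical vocabulary and random embeddings, for illustration only.
vocab = {"the": 0, "cat": 1, "chased": 2, "dog": 3}
emb = rng.normal(size=(len(vocab), d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

a = [vocab[t] for t in "the cat chased the dog".split()]
b = [vocab[t] for t in "the dog chased the cat".split()]

out_a = self_attention(emb[a], w_q, w_k, w_v)
out_b = self_attention(emb[b], w_q, w_k, w_v)

# Sentence b is sentence a with positions 1 and 4 swapped.
perm = [0, 4, 2, 3, 1]
same_up_to_reordering = np.allclose(out_a[perm], out_b)
```

Under these assumptions the two outputs contain the same rows in a different order, so nothing downstream of a pooled or per-token readout can tell which animal did the chasing.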

Core Mechanism

The original Transformer introduced sinusoidal positional encodings:

For position pos and dimension index i:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

These values are added to token embeddings:

InputEmbedding + PositionalEncoding

Position becomes part of the representation.
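
As a minimal sketch of the two formulas above, the following NumPy function builds the full encoding table and adds it to placeholder token embeddings. The function name, the even-d_model restriction, and the random embeddings are assumptions made for this illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array following the sin/cos formulas above."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pos = np.arange(max_len)[:, None]            # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the values 2i: 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)

# Placeholder token embeddings; a real model would look these up from a learned table.
token_embeddings = np.random.default_rng(0).normal(size=(50, 16))
model_input = token_embeddings + pe              # InputEmbedding + PositionalEncoding
```

Position 0 is encoded as alternating 0 and 1, since sin(0) = 0 and cos(0) = 1; later positions move through the sine and cosine curves at a different rate in each dimension pair.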

Minimal Conceptual Illustration

Token Embedding: [0.2, 0.8, 0.5, ...]
Positional Encoding: [0.1, 0.3, 0.9, ...]
Final Input: [0.3, 1.1, 1.4, ...]

Position modifies meaning.

Why Sinusoidal?

Sinusoidal encodings:

  • are defined for any position, which may allow extrapolation to sequences longer than those seen in training
  • encode relative distances implicitly
  • provide smooth, continuous positional variation

Relative position emerges from phase differences.
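
One way to make the phase-difference claim concrete: for a fixed offset k, the encoding of position pos + k is a fixed linear transformation (a set of 2x2 rotations) of the encoding of pos, independent of pos. The sketch below checks this numerically; pe and shift_matrix are hypothetical helper names, and d_model = 16 is an arbitrary choice.

```python
import numpy as np

d_model = 16
# One frequency per (sin, cos) pair: 1 / 10000^(2i/d_model).
freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)

def pe(pos):
    """Sinusoidal encoding of one position, interleaved as [sin, cos, sin, cos, ...]."""
    out = np.empty(d_model)
    out[0::2] = np.sin(pos * freqs)
    out[1::2] = np.cos(pos * freqs)
    return out

def shift_matrix(k):
    """Block-diagonal matrix M_k such that M_k @ pe(pos) equals pe(pos + k)."""
    m = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # Angle-addition identities for sin(a + kw) and cos(a + kw).
        m[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return m

k = 5
m = shift_matrix(k)
positions = (0, 3, 40, 1000)
shift_ok = all(np.allclose(m @ pe(p), pe(p + k)) for p in positions)

# The dot product between two encodings depends only on their offset.
dots = [pe(p) @ pe(p + k) for p in positions]
dot_depends_only_on_offset = np.allclose(dots, dots[0])
```

Because the same matrix works at every position, a linear layer can in principle learn to look a fixed number of steps away.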

Learned Positional Embeddings

An alternative approach:

  • learn position embeddings directly
  • treat position like a token

This can perform as well as or better than fixed encodings, but:

  • may not generalize beyond training length
  • depends on maximum sequence size

Learned encodings trade length generalization for the ability to adapt to the training data.
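
The dependence on a maximum sequence size is easy to see in a sketch. The class below is a hypothetical stand-in for a trainable lookup table (the training loop is omitted and the rows are just randomly initialized); the class name and the max_len value are assumptions for illustration.

```python
import numpy as np

class LearnedPositionalEmbedding:
    """A lookup table with one row per position; rows would be updated by training."""

    def __init__(self, max_len, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.max_len = max_len
        self.table = rng.normal(scale=0.02, size=(max_len, d_model))

    def __call__(self, seq_len):
        # There is simply no row for positions at or beyond max_len.
        if seq_len > self.max_len:
            raise ValueError(
                f"sequence length {seq_len} exceeds max_len {self.max_len}"
            )
        return self.table[:seq_len]

pos_emb = LearnedPositionalEmbedding(max_len=128, d_model=16)
short = pos_emb(10)            # rows for positions 0..9

try:
    pos_emb(200)               # longer than anything the table covers
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
```

A formula-based encoding has no such table to run out of, which is the generalization side of the trade-off.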

Absolute vs Relative Position Encoding

Type             Description
Absolute         Each position has a unique embedding
Relative         Encodes distance between tokens
Rotary (RoPE)    Rotates embeddings to encode position
ALiBi            Adds linear bias to attention scores

Modern architectures increasingly use relative methods.
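
To illustrate how a rotation can encode relative position, here is a minimal sketch of the rotary idea applied to a single query and key vector. It assumes the interleaved-pair layout and a base of 10000; real implementations differ in layout and batching, and the function name rope is chosen for this example only.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by angle pos * theta_i."""
    d = x.shape[-1]
    theta = 1.0 / np.power(base, np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 7) @ rope(k, 3)        # query at 7, key at 3: offset 4
score_b = rope(q, 104) @ rope(k, 100)    # query at 104, key at 100: offset 4
score_c = rope(q, 7) @ rope(k, 6)        # offset 1, for comparison

same_offset_same_score = np.isclose(score_a, score_b)
```

The first two scores agree because only the difference between the two rotation angles survives the dot product, so the attention score depends on the offset rather than on the absolute positions.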

Relationship to Self-Attention

Self-attention computes similarity:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the dimension of the keys.

Positional encoding ensures:

  • queries and keys incorporate position
  • attention weights reflect order
  • temporal relationships are learnable

Position influences attention scores.
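
A small experiment shows the effect on the scores. In "the cat chased the dog", the word "the" appears at positions 0 and 3. The sketch below, which assumes random embeddings and projection weights plus the sinusoidal encoding from the original Transformer, compares the pre-softmax score rows of those two occurrences with and without position added. All names are illustrative.

```python
import numpy as np

def attention_scores(x, w_q, w_k):
    """Pre-softmax scores Q K^T / sqrt(d_k)."""
    q, k = x @ w_q, x @ w_k
    return q @ k.T / np.sqrt(k.shape[-1])

def sinusoidal_pe(n, d):
    """Sinusoidal encodings for positions 0..n-1 (d assumed even)."""
    pos = np.arange(n)[:, None]
    angles = pos / np.power(10000.0, np.arange(0, d, 2)[None, :] / d)
    pe = np.zeros((n, d))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

rng = np.random.default_rng(0)
d = 8
emb = rng.normal(size=(4, d))      # hypothetical embeddings: the, cat, chased, dog
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
ids = [0, 1, 2, 0, 3]              # "the cat chased the dog"

x = emb[ids]
plain = attention_scores(x, w_q, w_k)
with_pos = attention_scores(x + sinusoidal_pe(len(ids), d), w_q, w_k)

# Rows 0 and 3 are the two occurrences of "the".
rows_identical_without_pos = np.allclose(plain[0], plain[3])
rows_differ_with_pos = not np.allclose(with_pos[0], with_pos[3])
```

Without position, both occurrences issue the same query and therefore attend identically; once position is added, each occurrence gets its own query and its own attention pattern.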

Impact on Long-Sequence Modeling

Positional encoding:

  • determines extrapolation behavior
  • affects stability at longer sequence lengths
  • influences model scaling

Poor positional handling limits generalization.
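
As one illustration of a scheme with no length-dependent parameters, the sketch below builds an ALiBi-style bias: a penalty proportional to the query-key distance, added to the attention scores before the softmax. It assumes causal attention and the geometric slope schedule commonly used when the number of heads is a power of two; the function names are hypothetical.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def alibi_bias(seq_len, n_heads):
    """Return (n_heads, seq_len, seq_len) biases: -slope * (query_pos - key_pos)."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]            # query index minus key index
    bias = -alibi_slopes(n_heads)[:, None, None] * distance
    # Causal mask: a query may not attend to later keys.
    return np.where(distance >= 0, bias, -np.inf)

bias_short = alibi_bias(seq_len=4, n_heads=8)
bias_long = alibi_bias(seq_len=64, n_heads=8)         # same rule, no table to resize
```

The bias for a longer sequence is computed by the same rule and agrees with the shorter one wherever they overlap, which is why this family of methods is often discussed in the context of length extrapolation.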

Common Pitfalls

  • forgetting positional encoding entirely
  • feeding sequences longer than the fixed maximum length the position table supports
  • assuming sinusoidal always outperforms learned
  • mismanaging padding positions

Position is subtle but critical.
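
For the padding pitfall, one common pattern is to derive position indices from the attention mask so that padding does not shift the positions of real tokens. The sketch below assumes a batch where 1 marks a real token and 0 marks padding, with one left-padded row; the variable names are illustrative.

```python
import numpy as np

# 1 marks a real token, 0 marks padding; the second row is left-padded.
attention_mask = np.array([
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
])

# Count real tokens only, so the first real token always gets position 0.
position_ids = np.cumsum(attention_mask, axis=-1) - 1
# Padding slots get a dummy position; they are masked out of attention anyway.
position_ids = np.where(attention_mask == 1, position_ids, 0)

# Additive mask for attention scores: padded keys get -inf before the softmax.
key_mask = np.where(attention_mask[:, None, :] == 1, 0.0, -np.inf)  # (batch, 1, seq)
```

Using a plain 0..seq_len-1 range instead would give the first real token of the padded row position 2, so the same sentence would be encoded differently depending on how much padding the batch happened to need.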

Positional Encoding vs Recurrence

Aspect              Recurrence (RNN)    Positional Encoding
Order awareness     Inherent            Explicitly added
Parallelism         No                  Yes
Memory mechanism    Hidden state        Attention weights

Transformers externalize order instead of encoding it sequentially.

Practical Considerations

When implementing positional encoding:

  • ensure alignment with embedding dimension
  • handle padding tokens carefully
  • consider relative encodings for long sequences
  • test extrapolation performance

Order handling affects scaling behavior.

Summary Characteristics

Aspect             Positional Encoding
Purpose            Inject order information
Used in            Transformers
Default type       Sinusoidal (original paper)
Modern variants    Relative, Rotary, ALiBi
Critical for       Long-sequence modeling

Related Concepts

  • Architecture & Representation
  • Attention Mechanism
  • Self-Attention
  • Multi-Head Attention
  • Transformer Architecture
  • Representation Learning
  • Scaling Laws