Short Definition
Long Short-Term Memory (LSTM) is a recurrent neural network architecture designed to capture long-term dependencies by using gated memory cells.
Definition
Long Short-Term Memory (LSTM) networks are a specialized type of recurrent neural network (RNN) that address the vanishing gradient problem by introducing gated mechanisms to control information flow. LSTMs maintain a persistent memory cell that selectively retains, updates, or forgets information across time steps.
Memory becomes controlled, not accidental.
Why It Matters
Vanilla RNNs struggle with long-range dependencies due to gradient decay during backpropagation through time. LSTMs:
- preserve information across long sequences
- stabilize gradient flow
- enable learning of temporal dependencies spanning many steps
They extended the practical limits of sequence modeling.
Core Mechanism
At each time step, an LSTM cell computes:
- Forget gate: decides what to discard
- Input gate: decides what new information to store
- Output gate: decides what to expose
Key equations (bias terms omitted for brevity):
f_t = σ(W_f x_t + U_f h_{t-1})
i_t = σ(W_i x_t + U_i h_{t-1})
o_t = σ(W_o x_t + U_o h_{t-1})
c̃_t = tanh(W_c x_t + U_c h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
Where:
- c_t = cell state (long-term memory)
- h_t = hidden state (short-term output)
- c̃_t = candidate cell state
- σ = sigmoid (logistic) function
- ⊙ = element-wise multiplication
Gates regulate memory flow.
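The gate equations above can be traced in a minimal NumPy sketch of one forward step. The dictionary-of-matrices layout, random initialization, and toy dimensions are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM forward step. W maps the input x_t, U maps the previous
    hidden state, for each of: forget (f), input (i), output (o), candidate (c)."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)      # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                 # additive cell update
    h_t = o_t * np.tanh(c_t)                           # exposed hidden state
    return h_t, c_t

# Toy dimensions: 3-dim input, 4-dim hidden state (illustrative choices).
rng = np.random.default_rng(0)
x_dim, h_dim = 3, 4
W = {k: rng.standard_normal((h_dim, x_dim)) for k in "fioc"}
U = {k: rng.standard_normal((h_dim, h_dim)) for k in "fioc"}
h, c = np.zeros(h_dim), np.zeros(h_dim)
for _ in range(5):  # run five time steps on random inputs
    h, c = lstm_step(rng.standard_normal(x_dim), h, c, W, U)
print(h.shape, c.shape)  # → (4,) (4,)
```

Note that h_t is squashed through tanh and gated by o_t, so every component of the hidden state stays strictly inside (-1, 1), while the cell state c_t is unbounded.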
Minimal Conceptual Illustration
x_t ─────┐       ┌─→ [Forget Gate] ─┐
         ├───────┤                  ├─→ Cell State c_t ─→ [Output Gate] ─→ h_t
h_{t-1} ─┘       └─→ [Input Gate] ──┘
The cell state acts as a conveyor belt for long-term memory.
Key Architectural Features
- Separate cell state and hidden state
- Additive memory updates (mitigates vanishing gradients)
- Sigmoid gating functions
- Parameter sharing across time
Additive updates preserve gradient flow.
Relationship to Vanishing Gradients
The additive update of the cell state allows gradients to propagate more effectively across time steps, reducing gradient decay compared to vanilla RNNs.
Memory paths become gradient highways.
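The "gradient highway" can be made concrete with a scalar caricature: along the cell path, ∂c_t/∂c_{t-1} = f_t, so the end-to-end gradient is a product of forget-gate values, whereas a vanilla RNN repeatedly multiplies by a tanh-derivative-times-weight factor that is often well below 1. The specific factors below (0.99 and 0.5) are illustrative assumptions.

```python
# Compare how a gradient signal decays over 100 time steps along
# the two recurrence styles (scalar caricature, illustrative numbers).
T = 100
forget_gate = 0.99   # LSTM cell path: d c_t / d c_{t-1} = f_t, often near 1
rnn_factor = 0.5     # vanilla RNN: tanh'(z) * w, often well below 1

lstm_grad = forget_gate ** T   # ~0.366: signal survives
rnn_grad = rnn_factor ** T     # ~8e-31: signal vanishes

print(f"LSTM cell-path gradient after {T} steps: {lstm_grad:.3f}")
print(f"vanilla RNN gradient after {T} steps:    {rnn_grad:.2e}")
```

The same product structure also shows the caveat from the Limitations section: if the forget gate saturates near 0, the LSTM path decays just as fast, which is why gating helps but does not guarantee long-term memory.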
Relationship to Gating Mechanisms
LSTMs are a canonical example of gating in neural networks, inspiring later architectures such as GRUs and attention mechanisms.
Gates became foundational.
Applications
LSTMs have been widely used in:
- language modeling
- machine translation
- speech recognition
- handwriting generation
- time-series forecasting
They defined pre-transformer NLP.
Limitations
Despite improvements over vanilla RNNs:
- Training remains sequential (limited parallelism)
- Computationally heavier than GRUs
- Transformers outperform LSTMs in large-scale NLP tasks
Parallel attention surpassed recurrence.
LSTM vs GRU
- LSTM: separate cell and hidden states
- GRU: simpler gating, fewer parameters
- LSTM: more expressive but heavier
Complexity vs simplicity trade-off.
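The parameter-count difference follows directly from the gate structure: an LSTM layer has four weight blocks (three gates plus the candidate) against the GRU's three. A rough count per layer, ignoring framework-specific details such as separate input and recurrent bias vectors:

```python
def lstm_params(x_dim, h_dim):
    # Four blocks (forget, input, output gates + candidate), each with
    # input weights, recurrent weights, and a bias vector.
    return 4 * (h_dim * x_dim + h_dim * h_dim + h_dim)

def gru_params(x_dim, h_dim):
    # Three blocks (update gate, reset gate, candidate state).
    return 3 * (h_dim * x_dim + h_dim * h_dim + h_dim)

x_dim, h_dim = 256, 512  # illustrative sizes
print(lstm_params(x_dim, h_dim))  # → 1574912
print(gru_params(x_dim, h_dim))   # → 1181184
```

At equal dimensions the LSTM carries 4/3 the parameters of a GRU, which is the source of the "more expressive but heavier" trade-off above.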
LSTM vs Transformer
- LSTM: sequential recurrence
- Transformer: global attention
- LSTM: better for streaming or resource-constrained settings
- Transformer: better for large-scale parallel training
Attention scales better in practice.
Practical Considerations
When using LSTMs:
- apply gradient clipping
- consider dropout between layers
- manage hidden state initialization carefully
- use bidirectional variants when full context is available
Sequential modeling requires care.
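Gradient clipping, the first recommendation above, can be sketched in NumPy as the usual clip-by-global-norm recipe; the threshold and gradient values here are illustrative assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm (a standard recipe for recurrent nets)."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# Deliberately large toy gradients to trigger clipping.
grads = [np.full((2, 2), 10.0), np.full((3,), 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(round(norm_before, 2), round(norm_after, 2))  # → 26.46 5.0
```

Rescaling all parameter gradients by one shared factor preserves the update direction while bounding its magnitude, which is what keeps exploding gradients from derailing BPTT.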
Common Pitfalls
- ignoring exploding gradients
- using LSTMs for very long sequences without truncation
- mismanaging hidden state across batches
- assuming LSTMs automatically solve long-term memory
Gating helps but does not guarantee perfection.
Summary Characteristics
| Aspect | LSTM |
|---|---|
| Architecture type | Gated recurrent |
| Memory structure | Cell state + hidden state |
| Main advantage | Long-term dependency modeling |
| Main limitation | Sequential training |
| Modern alternative | Transformers |
Related Concepts
- Architecture & Representation
- Recurrent Neural Network (RNN)
- Vanishing Gradients
- Exploding Gradients
- Gating Mechanisms
- Backpropagation Through Time (BPTT)
- Gated Recurrent Unit (GRU)
- Transformers