Short Definition
Strided convolution and pooling are two techniques for reducing the spatial resolution of feature maps in convolutional neural networks (CNNs). Pooling downsamples with a fixed operation (e.g., max or average), while strided convolution learns the downsampling through trainable filters.
Definition
In convolutional neural networks, spatial downsampling reduces the width and height of feature maps to:
- Increase receptive field
- Reduce computational cost
- Improve translation invariance
- Extract higher-level features
Two primary methods are used:
- Pooling layers (fixed aggregation functions)
- Strided convolutions (learned downsampling)
The core difference:
Pooling is non-parametric.
Strided convolution is parametric.
I. Pooling
Pooling applies a fixed aggregation function over local regions.
Common types:
- Max Pooling
- Average Pooling
Example (2×2 max pooling):
Input:
[1 3
2 0]
Output:
3
Pooling reduces resolution by selecting or averaging values.
It introduces no trainable parameters.
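As a concrete illustration, 2×2 max pooling with stride 2 can be sketched in a few lines of plain Python (the function name is ours; real frameworks operate on tensors, not nested lists):

```python
# Minimal sketch of 2x2 max pooling with stride 2.
# No trainable parameters: the aggregation rule (max) is fixed.
def max_pool_2x2(x):
    """Downsample a 2D grid by taking the max of each non-overlapping 2x2 block."""
    h, w = len(x), len(x[0])
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

print(max_pool_2x2([[1, 3], [2, 0]]))  # → [[3]], matching the example above
```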
II. Strided Convolution
A strided convolution applies a convolutional filter while skipping positions according to a stride value.
Example:
Stride = 2
Kernel = 3×3
The filter moves 2 pixels at a time instead of 1.
This reduces spatial resolution while simultaneously learning feature transformations.
Strided convolution combines two roles in a single operation:
- Feature extraction
- Downsampling
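The skipping behavior can be sketched as a single-channel 2D convolution with a configurable stride. In a real network the kernel weights would be learned; here a fixed center-pick kernel is used purely for illustration:

```python
# Minimal sketch of a single-channel 2D convolution with stride.
# In a trained network, `kernel` would hold learned weights.
def conv2d(x, kernel, stride=1):
    h, w = len(x), len(x[0])
    k = len(kernel)
    out = []
    for i in range(0, h - k + 1, stride):      # step `stride` rows at a time
        row = []
        for j in range(0, w - k + 1, stride):  # step `stride` columns at a time
            row.append(sum(
                x[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k)
            ))
        out.append(row)
    return out

# 5x5 input, 3x3 kernel, stride 2 → 2x2 output (resolution reduced in one pass).
x = [[r * 5 + c for c in range(5)] for r in range(5)]
center_pick = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv2d(x, center_pick, stride=2))  # → [[6, 8], [16, 18]]
```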
Minimal Conceptual Illustration
Pooling:
Feature Map → Fixed Downsample

Strided Convolution:
Feature Map → Learned Transform + Downsample
Pooling simplifies.
Strided convolution learns.
Mathematical Perspective
If:
Input size = N
Kernel size = K
Stride = S
Output size (no padding):
⌊(N − K) / S⌋ + 1
Increasing stride reduces output resolution.
Pooling also uses stride but without learned weights.
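The output-size formula applies identically to both operations, which a short calculation makes concrete:

```python
# Output size of a sliding window (convolution or pooling), no padding:
# floor((N - K) / S) + 1
def output_size(n, k, s):
    return (n - k) // s + 1

print(output_size(32, 3, 1))  # → 30 (stride 1 barely shrinks the map)
print(output_size(32, 3, 2))  # → 15 (stride 2 roughly halves it)
print(output_size(32, 2, 2))  # → 16 (typical 2x2 pooling halves resolution)
```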
Expressivity Difference
Pooling:
- Cannot adapt to data.
- Enforces invariance.
- Discards information rigidly.
Strided convolution:
- Learns how to compress.
- Retains discriminative information.
- May preserve more representational detail.
Strided convolution is more expressive.
Relationship to Receptive Fields
Both increase effective receptive field.
Strided convolution increases it while learning filters.
Pooling increases receptive field passively.
Learned downsampling influences hierarchical feature formation.
Information Loss Considerations
Pooling:
- May discard important features.
- Enforces strong invariance.
Strided convolution:
- May preserve more task-relevant information.
- Can alias high-frequency content unless paired with anti-aliasing (e.g., blurring before downsampling).
A related caveat: strided transposed convolutions, used for upsampling, are a well-known source of checkerboard artifacts.
Historical Evolution
Early CNNs:
- LeNet used average pooling (subsampling).
- AlexNet relied heavily on max pooling.
Modern CNN architectures (e.g., ResNet variants):
- Increasing use of strided convolutions.
- Fewer pooling layers.
Modern design trend favors learned downsampling.
Computational Trade-Off
| Aspect | Pooling | Strided Convolution |
|---|---|---|
| Parameters | None | C_in × C_out × K × K (+ bias) |
| Learnable | No | Yes |
| Computation | Lower | Slightly higher |
| Flexibility | Low | High |
| Expressivity | Limited | Higher |
Pooling is simpler.
Strided convolution is more powerful.
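The parameter-count difference in the table can be made concrete. Assuming a standard convolutional layer (the function and channel sizes below are illustrative), the cost of switching from pooling to a strided convolution is:

```python
# Pooling adds zero parameters. A conv layer (strided or not) adds
# out_channels * in_channels * K * K weights, plus out_channels biases.
def conv_params(in_ch, out_ch, k, bias=True):
    return out_ch * in_ch * k * k + (out_ch if bias else 0)

pooling_params = 0                      # any pooling layer
print(conv_params(64, 128, 3))          # → 73856 for a 3x3 conv, 64→128 channels
```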
When to Prefer Pooling
- Small models
- Low-resource environments
- Enforcing translation invariance
- Simplicity-focused designs
When to Prefer Strided Convolution
- Deep CNNs
- Large-scale image models
- Tasks requiring nuanced feature compression
- Architectures aiming for end-to-end learnability
Modern CNNs often replace pooling with strided convolution.
Architectural Implications
Choice affects:
- Information flow
- Feature hierarchy
- Robustness
- Spatial invariance
- Downsampling stability
Downsampling design influences generalization and robustness.
Relationship to Attention-Based Models
Transformers:
- Often avoid pooling entirely.
- Use patch embeddings (a form of strided projection).
Downsampling in CNNs and patch extraction in Vision Transformers share conceptual similarity.
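The analogy is direct: ViT-style patch embedding is a convolution whose stride equals its kernel size, so patches never overlap. A sketch of the patch-extraction step (the learned linear projection that follows is omitted; names are ours):

```python
# Split a 2D image into non-overlapping p x p patches, each flattened to a
# vector. Equivalent to a strided "projection" with stride == kernel size == p.
def extract_patches(x, p):
    h, w = len(x), len(x[0])
    return [
        [x[i + di][j + dj] for di in range(p) for dj in range(p)]
        for i in range(0, h - p + 1, p)
        for j in range(0, w - p + 1, p)
    ]

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(extract_patches(img, 2))
# → [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```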
Related Concepts
- Convolution Operation
- Pooling Layers
- Stride and Padding
- Receptive Fields
- Feature Maps
- Residual Networks (ResNet)
- Attention vs Convolution