Short Definition
Strided convolution and pooling are two techniques for reducing the spatial resolution of feature maps in convolutional neural networks (CNNs). Pooling downsamples with a fixed operation (e.g., max or average), while strided convolution learns the downsampling through trainable filters.
Definition
In convolutional neural networks, spatial downsampling reduces the width and height of feature maps to:
- Increase receptive field
- Reduce computational cost
- Improve translation invariance
- Extract higher-level features
Two primary methods are used:
- Pooling layers (fixed aggregation functions)
- Strided convolutions (learned downsampling)
The core difference:
Pooling is non-parametric.
Strided convolution is parametric.
I. Pooling
Pooling applies a fixed aggregation function over local regions.
Common types:
- Max Pooling
- Average Pooling
Example (2×2 max pooling):
Input:
[1 3
2 0]
Output:
3
Pooling reduces resolution by selecting or averaging values.
It introduces no trainable parameters.
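As a concrete illustration, 2×2 max pooling with stride 2 can be sketched in a few lines of plain Python (the function name is ours; real frameworks operate on tensors, not nested lists):

```python
# Minimal sketch of 2x2 max pooling with stride 2.
# No trainable parameters: the aggregation rule (max) is fixed.
def max_pool_2x2(x):
    """Downsample a 2D grid by taking the max of each non-overlapping 2x2 block."""
    h, w = len(x), len(x[0])
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

print(max_pool_2x2([[1, 3], [2, 0]]))  # → [[3]], matching the example above
```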
II. Strided Convolution
A strided convolution applies a convolutional filter while skipping positions according to a stride value.
Example:
Stride = 2
Kernel = 3×3
The filter moves 2 pixels at a time instead of 1.
This reduces spatial resolution while simultaneously learning feature transformations.
Strided convolution combines two roles in a single operation:
- Feature extraction
- Downsampling
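The skipping behavior can be sketched as a single-channel 2D convolution with a configurable stride. In a real network the kernel weights would be learned; here a fixed center-pick kernel is used purely for illustration:

```python
# Minimal sketch of a single-channel 2D convolution with stride.
# In a trained network, `kernel` would hold learned weights.
def conv2d(x, kernel, stride=1):
    h, w = len(x), len(x[0])
    k = len(kernel)
    out = []
    for i in range(0, h - k + 1, stride):      # step `stride` rows at a time
        row = []
        for j in range(0, w - k + 1, stride):  # step `stride` columns at a time
            row.append(sum(
                x[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k)
            ))
        out.append(row)
    return out

# 5x5 input, 3x3 kernel, stride 2 → 2x2 output (resolution reduced in one pass).
x = [[r * 5 + c for c in range(5)] for r in range(5)]
center_pick = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv2d(x, center_pick, stride=2))  # → [[6, 8], [16, 18]]
```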
Minimal Conceptual Illustration
Pooling:
Feature Map → Fixed Downsample

Strided Convolution:
Feature Map → Learned Transform + Downsample
Pooling simplifies.
Strided convolution learns.
Mathematical Perspective
If:
Input size = N
Kernel size = K
Stride = S
Output size (no padding):
⌊(N − K) / S⌋ + 1
Increasing stride reduces output resolution.
Pooling also uses stride but without learned weights.
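The output-size formula applies identically to both operations, which a short calculation makes concrete:

```python
# Output size of a sliding window (convolution or pooling), no padding:
# floor((N - K) / S) + 1
def output_size(n, k, s):
    return (n - k) // s + 1

print(output_size(32, 3, 1))  # → 30 (stride 1 barely shrinks the map)
print(output_size(32, 3, 2))  # → 15 (stride 2 roughly halves it)
print(output_size(32, 2, 2))  # → 16 (typical 2x2 pooling halves resolution)
```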
Expressivity Difference
Pooling:
- Cannot adapt to data.
- Enforces invariance.
- Discards information rigidly.
Strided convolution:
- Learns how to compress.
- Retains discriminative information.
- May preserve more representational detail.
Strided convolution is more expressive.
Relationship to Receptive Fields
Both increase effective receptive field.
Strided convolution increases it while learning filters.
Pooling increases receptive field passively.
Learned downsampling influences hierarchical feature formation.
Information Loss Considerations
Pooling:
- May discard important features.
- Enforces strong invariance.
Strided convolution:
- May preserve more task-relevant information.
- Can alias high-frequency content unless paired with anti-aliasing (e.g., blurring before downsampling).
A related caveat: strided transposed convolutions, used for upsampling, are a well-known source of checkerboard artifacts.
Historical Evolution
Early CNNs:
- LeNet used average pooling (subsampling).
- AlexNet relied heavily on max pooling.
Modern CNN architectures (e.g., ResNet variants):
- Increasing use of strided convolutions.
- Fewer pooling layers.
Modern design trend favors learned downsampling.
Computational Trade-Off
| Aspect | Pooling | Strided Convolution |
|---|---|---|
| Parameters | None | C_in × C_out × K × K (+ bias) |
| Learnable | No | Yes |
| Computation | Lower | Slightly higher |
| Flexibility | Low | High |
| Expressivity | Limited | Higher |
Pooling is simpler.
Strided convolution is more powerful.
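The parameter-count difference in the table can be made concrete. Assuming a standard convolutional layer (the function and channel sizes below are illustrative), the cost of switching from pooling to a strided convolution is:

```python
# Pooling adds zero parameters. A conv layer (strided or not) adds
# out_channels * in_channels * K * K weights, plus out_channels biases.
def conv_params(in_ch, out_ch, k, bias=True):
    return out_ch * in_ch * k * k + (out_ch if bias else 0)

pooling_params = 0                      # any pooling layer
print(conv_params(64, 128, 3))          # → 73856 for a 3x3 conv, 64→128 channels
```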
When to Prefer Pooling
- Small models
- Low-resource environments
- Enforcing translation invariance
- Simplicity-focused designs
When to Prefer Strided Convolution
- Deep CNNs
- Large-scale image models
- Tasks requiring nuanced feature compression
- Architectures aiming for end-to-end learnability
Modern CNNs often replace pooling with strided convolution.
Architectural Implications
Choice affects:
- Information flow
- Feature hierarchy
- Robustness
- Spatial invariance
- Downsampling stability
Downsampling design influences generalization and robustness.
Relationship to Attention-Based Models
Transformers:
- Often avoid pooling entirely.
- Use patch embeddings (a form of strided projection).
Downsampling in CNNs and patch extraction in Vision Transformers share conceptual similarity.
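The analogy is direct: ViT-style patch embedding is a convolution whose stride equals its kernel size, so patches never overlap. A sketch of the patch-extraction step (the learned linear projection that follows is omitted; names are ours):

```python
# Split a 2D image into non-overlapping p x p patches, each flattened to a
# vector. Equivalent to a strided "projection" with stride == kernel size == p.
def extract_patches(x, p):
    h, w = len(x), len(x[0])
    return [
        [x[i + di][j + dj] for di in range(p) for dj in range(p)]
        for i in range(0, h - p + 1, p)
        for j in range(0, w - p + 1, p)
    ]

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(extract_patches(img, 2))
# → [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```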
Related Concepts
- Convolution Operation
- Pooling Layers
- Stride and Padding
- Receptive Fields
- Feature Maps
- Residual Networks (ResNet)
- Attention vs Convolution