Strided Convolution vs Pooling

Short Definition

Strided Convolution and Pooling are two techniques used to reduce spatial resolution in convolutional neural networks (CNNs). Pooling downsamples using fixed operations (e.g., max or average), while strided convolution learns the downsampling operation through trainable filters.

Definition

In convolutional neural networks, spatial downsampling reduces the width and height of feature maps to:

  • Increase receptive field
  • Reduce computational cost
  • Improve translation invariance
  • Extract higher-level features

Two primary methods are used:

  1. Pooling layers (fixed aggregation functions)
  2. Strided convolutions (learned downsampling)

The core difference:

Pooling is non-parametric.
Strided convolution is parametric.

I. Pooling

Pooling applies a fixed aggregation function over local regions.

Common types:

  • Max Pooling
  • Average Pooling

Example (2×2 max pooling, stride 2):

Input:
[1 3
 2 0]

Output:
3

Pooling reduces resolution by selecting or averaging values.

It introduces no trainable parameters.
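The fixed aggregation can be sketched in a few lines of plain Python. This is an illustrative sketch, not a framework implementation (deep learning libraries expose this as a layer); the helper name `max_pool_2x2` is ours:

```python
# Minimal 2x2 max pooling with stride 2 on a 2D grid, in plain Python.
# Note there are no weights anywhere: the operation is fixed.

def max_pool_2x2(x):
    """Apply 2x2 max pooling with stride 2 to a 2D list of numbers."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            # Fixed aggregation: take the maximum of each 2x2 window.
            row.append(max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]))
        out.append(row)
    return out

# The 2x2 example from above: one window, maximum is 3.
print(max_pool_2x2([[1, 3], [2, 0]]))  # [[3]]
```

Replacing `max` with an average over the window would give average pooling; the structure is identical.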

II. Strided Convolution

A strided convolution applies a convolutional filter while skipping positions according to a stride value.

Example:

Stride = 2
Kernel = 3×3

The filter moves 2 pixels at a time instead of 1.

This reduces spatial resolution while simultaneously learning feature transformations.

Strided convolution combines two operations in a single step:

  • Feature extraction
  • Downsampling
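A minimal single-channel strided convolution can be sketched in plain Python. This is a conceptual sketch with illustrative names (`conv2d_strided`), not how frameworks implement it; in a trained network the kernel weights would be learned, not hand-picked:

```python
# Minimal 2D strided convolution (valid padding, single channel), plain Python.
# Unlike pooling, the kernel is a set of trainable weights.

def conv2d_strided(x, kernel, stride):
    """Slide `kernel` over `x`, moving `stride` pixels at a time."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            # Weighted sum over the window:
            # feature extraction and downsampling in one pass.
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += kernel[di][dj] * x[i + di][j + dj]
            row.append(acc)
        out.append(row)
    return out

# A 4x4 input, 2x2 kernel, stride 2: the output is 2x2.
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[0.25, 0.25], [0.25, 0.25]]
print(conv2d_strided(x, kernel, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```

With this particular kernel the operation happens to equal 2×2 average pooling; training would instead adapt the weights to the task.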

Minimal Conceptual Illustration

Pooling:
Feature Map → Fixed Downsample
Strided Convolution:
Feature Map → Learned Transform + Downsample

Pooling simplifies.
Strided convolution learns.

Mathematical Perspective

If:

Input size = N
Kernel size = K
Stride = S

Output size (no padding):

⌊(N − K) / S⌋ + 1

With padding P, this generalizes to ⌊(N + 2P − K) / S⌋ + 1.

Increasing stride reduces output resolution.

Pooling also uses stride but without learned weights.
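Assuming no padding, the output size is ⌊(N − K)/S⌋ + 1 for both operations. A tiny helper (the name `output_size` is illustrative) checks a few cases:

```python
# Output size of a valid (no-padding) convolution or pooling window:
# floor((N - K) / S) + 1.

def output_size(n, k, s):
    """Spatial output size for input n, kernel k, stride s (no padding)."""
    return (n - k) // s + 1

# 7x7 input, 3x3 kernel, stride 2 -> 3x3 output.
print(output_size(7, 3, 2))  # 3
# The same formula governs pooling: 4x4 input, 2x2 window, stride 2 -> 2x2.
print(output_size(4, 2, 2))  # 2
```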

Expressivity Difference

Pooling:

  • Cannot adapt to data.
  • Enforces invariance.
  • Discards information rigidly.

Strided convolution:

  • Learns how to compress.
  • Retains discriminative information.
  • May preserve more representational detail.

Strided convolution is more expressive.

Relationship to Receptive Fields

Both increase the effective receptive field.

Strided convolution increases it while learning filters.

Pooling increases it passively.

Learned downsampling influences hierarchical feature formation.
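This growth can be quantified with standard receptive-field arithmetic: each layer with kernel size k adds (k − 1) times the product of all earlier strides, and any stride-s layer (pooling or strided convolution) multiplies that product by s. A small sketch, with an illustrative helper name:

```python
# Receptive-field arithmetic for a stack of layers.
# Each layer is a (kernel_size, stride) pair; pooling and strided
# convolution affect the receptive field identically.

def receptive_field(layers):
    """Return the receptive field size of the final output, in input pixels."""
    r, j = 1, 1  # r: receptive field; j: "jump" (product of strides so far)
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Two 3x3 stride-1 convolutions: 5x5 receptive field.
print(receptive_field([(3, 1), (3, 1)]))  # 5
# A stride-2 downsampling first doubles the reach of later layers:
print(receptive_field([(3, 2), (3, 1)]))  # 7
```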

Information Loss Considerations

Pooling:

  • May discard important features.
  • Enforces strong invariance.

Strided convolution:

  • May preserve more task-relevant information.
  • Can mitigate aliasing if the learned filters behave as low-pass (anti-aliasing) filters.

However, strided convolutions can introduce checkerboard artifacts, particularly when the kernel size is not divisible by the stride.

Historical Evolution

Early CNNs (e.g., LeNet, AlexNet):

  • Heavy use of pooling (average-style subsampling in LeNet, max pooling in AlexNet).

Modern CNN architectures (e.g., ResNet variants):

  • Increasing use of strided convolutions.
  • Fewer pooling layers.

Modern design trend favors learned downsampling.

Computational Trade-Off

Aspect         Pooling    Strided Convolution
Parameters     None       Yes
Learnable      No         Yes
Computation    Lower      Slightly higher
Flexibility    Low        High
Expressivity   Limited    Higher

Pooling is simpler.
Strided convolution is more powerful.

When to Prefer Pooling

  • Small models
  • Low-resource environments
  • Enforcing translation invariance
  • Simplicity-focused designs

When to Prefer Strided Convolution

  • Deep CNNs
  • Large-scale image models
  • Tasks requiring nuanced feature compression
  • Architectures aiming for end-to-end learnability

Modern CNNs often replace pooling with strided convolution.

Architectural Implications

Choice affects:

  • Information flow
  • Feature hierarchy
  • Robustness
  • Spatial invariance
  • Downsampling stability

Downsampling design influences generalization and robustness.

Relationship to Attention-Based Models

Transformers:

  • Often avoid pooling entirely.
  • Use patch embeddings (a form of strided projection).

Downsampling in CNNs and patch extraction in Vision Transformers share conceptual similarity.
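As a sketch of that similarity: setting the stride equal to the kernel size makes the windows non-overlapping, which is exactly patch extraction. The helper below (`extract_patches`, an illustrative name) shows the windowing in plain Python; a ViT patch embedding would additionally apply a learned linear projection to each flattened patch:

```python
# Stride = kernel size => non-overlapping windows, i.e. patch extraction.

def extract_patches(x, p):
    """Split a 2D grid into flattened, non-overlapping p x p patches."""
    h, w = len(x), len(x[0])
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patches.append([x[i + di][j + dj]
                            for di in range(p) for dj in range(p)])
    return patches

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(extract_patches(x, 2))
# [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```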

Related Concepts

  • Convolution Operation
  • Pooling Layers
  • Stride and Padding
  • Receptive Fields
  • Feature Maps
  • Residual Networks (ResNet)
  • Attention vs Convolution