Quantization

Short Definition

Quantization reduces the numerical precision of a model's parameters.

Definition

Quantization converts high-precision values (such as 32-bit floats) into lower-precision representations (such as 8-bit integers). This reduces memory usage and improves inference speed, especially on memory- or compute-constrained hardware.

Quantization is most commonly applied after training (post-training quantization), but it can also be simulated during training (quantization-aware training), which typically preserves more accuracy.
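The core operation in quantization-aware training is a "fake quantize" step: values are rounded onto the integer grid and immediately mapped back to float, so the rest of the network sees the rounding error during training. A minimal sketch, assuming symmetric 8-bit quantization; the name `fake_quantize` is illustrative, not a library API:

```python
def fake_quantize(x, scale, num_bits=8):
    """Simulate integer quantization: quantize, then immediately
    dequantize, so downstream computation sees the rounding error."""
    qmin = -(2 ** (num_bits - 1))      # -128 for 8 bits
    qmax = 2 ** (num_bits - 1) - 1     # 127 for 8 bits
    q = round(x / scale)               # snap to the integer grid
    q = max(qmin, min(qmax, q))        # clamp to the representable range
    return q * scale                   # back to float, error baked in
```

The returned value is the nearest representable point on the int8 grid; values beyond the range saturate at the clamp boundaries.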

Why It Matters

By shrinking model size and replacing floating-point arithmetic with cheaper integer operations, quantization enables deployment on edge devices, mobile hardware, and inference accelerators with limited memory or no fast floating-point units.

How It Works (Conceptually)

  • Map continuous floating-point values onto a small set of discrete integer levels
  • Trade numerical precision for lower memory use and faster arithmetic
  • Apply a scale factor (and optionally a zero point) so the integer grid covers the tensor's value range, preserving accuracy
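The three steps above can be sketched end to end: derive a scale from the tensor's range, map floats to int8 levels, then map back. A minimal pure-Python sketch, assuming symmetric quantization (no zero point); the function names are illustrative:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: the scale is chosen so the
    largest-magnitude weight maps to the edge of the int8 range."""
    scale = max(abs(w) for w in weights) / 127   # step size between levels
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [v * scale for v in q]

weights = [0.9, -0.4, 0.05, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value lies within half a step (scale / 2) of the original.
```

Because rounding is the only lossy step, the worst-case per-element error is half the scale, which is why a tight value range (and hence a small scale) matters.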

Minimal Python Example

Python
# Round to the nearest integer level and clamp to the int8 range
quantized = max(-128, min(127, round(weight / scale)))
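For tensors whose values are not centered at zero (post-ReLU activations, for example), an affine scheme adds a zero point so the full unsigned integer range is used. A sketch, assuming unsigned 8-bit quantization of a known range [lo, hi]; the names are illustrative:

```python
def affine_quantize(x, lo, hi, num_bits=8):
    """Affine (asymmetric) quantization: map [lo, hi] onto [0, 2**n - 1]."""
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax           # float step per integer level
    zero_point = round(-lo / scale)    # the integer that represents 0.0
    q = max(0, min(qmax, round(x / scale) + zero_point))
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    """Recover an approximate float from an integer level."""
    return (q - zero_point) * scale
```

Storing the zero point alongside the scale lets real zero be represented exactly, which matters because zero is extremely common after ReLU and padding.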

Common Pitfalls

  • Quantizing too early, before the model and its value ranges have stabilized
  • Ignoring accuracy degradation: always validate the quantized model against the full-precision baseline
  • Targeting hardware or runtimes that do not support the chosen integer format

Related Concepts

  • Model Compression
  • Deployment
  • Efficiency