Large models can achieve high accuracy, but often at the cost of significant resource requirements.
Smaller models require less storage space, and consume less memory and compute during inference.
Compressed models enable deployment on resource-constrained devices such as smartphones, embedded systems, edge computing devices, and other consumer electronics.
Efficient inference is also valuable for organizations that serve large-model inference over an API, since it reduces computational costs and improves response times for users.
Pruning sparsifies a large model by setting some parameters to exactly zero.
Pruning criteria can be based on the magnitudes of parameters, the statistical patterns of neural activations, Hessian values, etc.[1][2]
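As a minimal sketch of the simplest of these criteria, magnitude-based pruning, the following NumPy snippet zeroes out the smallest-magnitude fraction of a weight matrix. The function name and threshold logic are illustrative assumptions, not drawn from any particular library.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the smallest-magnitude weights."""
    k = int(sparsity * weights.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Threshold equal to the k-th smallest absolute value.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    # Keep only weights whose magnitude exceeds the threshold.
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune 90% of a random weight matrix.
w = np.random.randn(256, 256)
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"Fraction of zeros: {np.mean(w_pruned == 0):.2f}")
```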
Quantization reduces the numerical precision of weights and activations.
For example, instead of storing weights as 32-bit floating-point numbers, they can be represented using 8-bit integers.
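A minimal sketch of this idea, assuming symmetric linear quantization with a single per-tensor scale, is shown below; the function names and the choice of a per-tensor (rather than per-channel) scale are illustrative assumptions.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of float32 weights to int8."""
    # Choose a scale so the largest-magnitude weight maps to 127.
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("Max quantization error:", np.max(np.abs(w - dequantize(q, scale))))
```

Storing int8 values instead of float32 cuts weight storage by a factor of four, at the cost of the small rounding error printed above.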
Low-rank factorization approximates a weight matrix W with the product of two smaller matrices, W ≈ UV, where U is m×k and V is k×n. When k is small, this both reduces the number of parameters needed to represent the matrix and lowers the cost of the corresponding matrix multiplication. Low-rank approximations can be found by singular value decomposition (SVD).
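A minimal sketch of a truncated-SVD factorization, with illustrative function names and a hypothetical rank of 32, might look like this:

```python
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int):
    """Approximate W by a product U_r @ V_r of the given rank using truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the top `rank` singular values and vectors.
    U_r = U[:, :rank] * s[:rank]   # shape (m, rank)
    V_r = Vt[:rank, :]             # shape (rank, n)
    return U_r, V_r

W = np.random.randn(512, 256)
U_r, V_r = low_rank_approx(W, rank=32)
rel_error = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
print(f"Relative approximation error: {rel_error:.3f}")
# Storage drops from 512*256 parameters to (512 + 256)*32.
```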