Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?
I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
| File Name | Size |
|---|---|
| `model.onnx` | 654 MB |
| `model_fp16.onnx` | 327 MB |
| `model_q4.onnx` | 200 MB |
| `model_q4f16.onnx` | 134 MB |
I understand that:

- `model.onnx` is the fp32 model,
- `model_fp16.onnx` is the model whose weights are quantized to fp16.

I don't understand the sizes of `model_q4.onnx` and `model_q4f16.onnx`:
- Why is `model_q4.onnx` 200 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4.onnx` meant that the weights are quantized to 4 bits.
- Why is `model_q4f16.onnx` 134 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4f16.onnx` meant that the weights are quantized to 4 bits and the activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

  > `qAfB(_id)`, where `A` represents the number of bits for storing weights and `B` represents the number of bits for storing activations.

  and "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably).
1 answer
Going from fp32 to fp16 is usually done by essentially just rounding the weights, so you should expect pretty close to a 2:1 reduction in size. (Of course, not everything is weights, but the weights dominate.)
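As a rough illustration (a toy NumPy sketch, not what any exporter does internally), casting an fp32 weight tensor to fp16 just rounds each value, and the byte count halves:

```python
import numpy as np

# A stand-in for one fp32 weight tensor (shape and values are made up).
w_fp32 = np.random.randn(4096, 4096).astype(np.float32)

# fp32 -> fp16 is (essentially) per-element rounding to the nearest fp16 value.
w_fp16 = w_fp32.astype(np.float16)

print(w_fp32.nbytes)  # 67108864 bytes (64 MiB)
print(w_fp16.nbytes)  # 33554432 bytes (32 MiB) -- the expected 2:1 reduction
```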
Going from fp32 (or fp16) to an integral type is no longer simply a matter of rounding. Keeping the activations in floating point and at higher precision (e.g. fp16) is often very important for quality, so what usually happens is that the quantized weights are dequantized on the fly so that an fp16 matmul can be performed. You can see this in the second part of Figure 1 in the LUT-GEMM paper. Dequantization requires a scale and bias, which are computed during the calibration process.
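To make the "dequantize, then matmul" idea concrete, here is a toy NumPy sketch of one group-quantized weight matrix being expanded back to fp16 before the multiply. The layout, group size, and names are illustrative only, not what onnxruntime or LUT-GEMM actually store:

```python
import numpy as np

rows, cols, g = 8, 64, 32            # toy dimensions; g = quantization group size

# 4-bit quantized weights (stored here as one int8 per weight for clarity;
# real formats pack two 4-bit values per byte).
q = np.random.randint(0, 16, size=(rows, cols), dtype=np.int8)

# One fp16 scale and bias (zero point) per group of g weights in each row.
scale = np.random.rand(rows, cols // g).astype(np.float16)
bias = np.random.rand(rows, cols // g).astype(np.float16)

# Dequantize on the fly: w = q * scale + bias, broadcast per group.
scale_full = np.repeat(scale, g, axis=1)          # (rows, cols)
bias_full = np.repeat(bias, g, axis=1)            # (rows, cols)
w_fp16 = q.astype(np.float16) * scale_full + bias_full

# ...then an ordinary fp16 matmul against the fp16 activations.
x = np.random.randn(cols, 4).astype(np.float16)   # toy activations
y = w_fp16 @ x                                    # (rows, 4), fp16
```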
If we had a separate scale and bias for each component of the weight tensors, they'd be even larger than the weights, and you'd end up with a roundabout and inefficient way of storing the original weights. Early versions of this for LLMs instead used tensor-at-a-time dequantization, where a single scale and bias was used for an entire tensor. Suffice it to say, such a coarse scheme meant many weights got suboptimal or even entirely inadequate scales and biases. So row-at-a-time dequantization followed, which gave separate scales and biases per row, lessening the compromise but increasing the size of the scales and biases.

This trend has continued with groupwise dequantization, which stores scales and biases per group of a configurable group size g. Appendix E of the LUT-GEMM paper shows that the compression ratio varies from 2.5x to 4.0x as the group size increases from g=32 to full row-at-a-time for q4, and from 3.5x to 5.3x for q3 (all relative to fp16). As you can see from this, the row-at-a-time compression ratios are what you'd naively expect.
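To see why the groupwise scales and biases eat into the naive 4x, here is a back-of-envelope calculation (my own arithmetic, assuming one fp16 scale and one fp16 bias per group, which is a common but not universal layout):

```python
# Effective bits per weight for 4-bit groupwise quantization with
# an fp16 scale and fp16 bias stored per group of g weights.
for g in (32, 64, 128, 1024):
    overhead_bits = (16 + 16) / g            # scale + bias amortized over the group
    bits_per_weight = 4 + overhead_bits
    ratio_vs_fp16 = 16 / bits_per_weight
    print(f"g={g:>5}: {bits_per_weight:.2f} bits/weight, "
          f"{ratio_vs_fp16:.2f}x smaller than fp16")

# g=   32: 5.00 bits/weight, 3.20x smaller than fp16
# g=   64: 4.50 bits/weight, 3.56x smaller than fp16
# g=  128: 4.25 bits/weight, 3.76x smaller than fp16
# g= 1024: 4.03 bits/weight, 3.97x smaller than fp16
```

The exact figures in the LUT-GEMM appendix differ because of the LUTs and other bookkeeping, but the trend is the same: you only approach the naive 4x as the groups get very coarse.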
This is almost certainly the bulk of the "discrepancy" you see. The difference between q4 and q4f16 is presumably just the size difference of these scales and biases. Except that LUT-GEMM doesn't actually do the dequantization. Instead, the scales and biases are baked into look-up tables (LUTs). Nevertheless, these LUTs presumably also vary in size with the activation data type, and they also have their own hyperparameters that can lead to variations in size even for a fixed level of quantization.
While I assume these models are uniformly quantized with the specified algorithms, there can be other factors that complicate things. Some outlier weights may be stored separately without quantization, as discussed in the AWQ paper. You may also choose not to quantize some tensors at all (or only to fp16), such as the fully-connected (FC) weights.
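If you want to see where the bytes in these particular files actually go, you can sum the initializer sizes per data type with the onnx Python package. This is only a rough sketch (the file name is illustrative, and it only counts tensors stored in `raw_data`, ignoring external data and typed fields), but it will show which tensors stayed fp32/fp16 and which were quantized:

```python
from collections import defaultdict

import onnx

model = onnx.load("model_q4.onnx")   # path is illustrative

# Sum the stored bytes of each initializer, grouped by its data type.
bytes_by_dtype = defaultdict(int)
for init in model.graph.initializer:
    dtype_name = onnx.TensorProto.DataType.Name(init.data_type)
    bytes_by_dtype[dtype_name] += len(init.raw_data)

for dtype_name, nbytes in sorted(bytes_by_dtype.items(), key=lambda kv: -kv[1]):
    print(f"{dtype_name:>10}: {nbytes / 2**20:8.1f} MiB")
```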