Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?
I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
| File Name | Size |
|---|---|
| `model.onnx` | 654 MB |
| `model_fp16.onnx` | 327 MB |
| `model_q4.onnx` | 200 MB |
| `model_q4f16.onnx` | 134 MB |
I understand that:

- `model.onnx` is the fp32 model,
- `model_fp16.onnx` is the model whose weights are quantized to fp16.

I don't understand the sizes of `model_q4.onnx` and `model_q4f16.onnx`:
- Why is `model_q4.onnx` 200 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4.onnx` meant that the weights are quantized to 4 bits.
- Why is `model_q4f16.onnx` 134 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4f16.onnx` meant that the weights are quantized to 4 bits and the activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

  > `qAfB(_id)`, where `A` represents the number of bits for storing weights and `B` represents the number of bits for storing activations.

  and "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably).
1 answer
Going from fp32 to fp16 is usually done by essentially just rounding the weights, so you should expect pretty close to a 2:1 reduction in size. (Of course, not everything is weights, but the weights dominate.)
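As a rough illustration (a toy NumPy sketch, not what any exporter does internally), casting an fp32 weight tensor to fp16 just rounds each value, and the byte count halves:

```python
import numpy as np

# A stand-in for one fp32 weight tensor (shape and values are made up).
w_fp32 = np.random.randn(4096, 4096).astype(np.float32)

# fp32 -> fp16 is (essentially) per-element rounding to the nearest fp16 value.
w_fp16 = w_fp32.astype(np.float16)

print(w_fp32.nbytes)  # 67108864 bytes (64 MiB)
print(w_fp16.nbytes)  # 33554432 bytes (32 MiB) -- the expected 2:1 reduction
```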
Going from fp32 (or fp16) to an integral type is no longer simply a matter of rounding. Keeping the activations in floating point and at higher precision (e.g. fp16) is often very important for quality, so what usually happens is that the quantized weights are dequantized on the fly so that an fp16 matmul can be performed. You can see this in the second part of Figure 1 in the LUT-GEMM paper. Dequantization requires a scale and bias, which are computed during the calibration process.
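To make the "dequantize, then matmul" idea concrete, here is a toy NumPy sketch of one group-quantized weight matrix being expanded back to fp16 before the multiply. The layout, group size, and names are illustrative only, not what onnxruntime or LUT-GEMM actually store:

```python
import numpy as np

rows, cols, g = 8, 64, 32            # toy dimensions; g = quantization group size

# 4-bit quantized weights (stored here as one int8 per weight for clarity;
# real formats pack two 4-bit values per byte).
q = np.random.randint(0, 16, size=(rows, cols), dtype=np.int8)

# One fp16 scale and bias (zero point) per group of g weights in each row.
scale = np.random.rand(rows, cols // g).astype(np.float16)
bias = np.random.rand(rows, cols // g).astype(np.float16)

# Dequantize on the fly: w = q * scale + bias, broadcast per group.
scale_full = np.repeat(scale, g, axis=1)          # (rows, cols)
bias_full = np.repeat(bias, g, axis=1)            # (rows, cols)
w_fp16 = q.astype(np.float16) * scale_full + bias_full

# ...then an ordinary fp16 matmul against the fp16 activations.
x = np.random.randn(cols, 4).astype(np.float16)   # toy activations
y = w_fp16 @ x                                    # (rows, 4), fp16
```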
If we had a separate scale and bias for each component of the weight tensors, they'd be even larger than the weights, and you'd end up with a roundabout and inefficient way of storing the original weights. Early versions of this for LLMs instead used tensor-at-a-time dequantization, where a single scale and bias was used for an entire tensor. Suffice it to say, such a coarse scheme meant many weights got suboptimal or even entirely inadequate scales and biases. So row-at-a-time dequantization followed, which gave separate scales and biases per row, lessening the compromise but increasing the size of the scales and biases.

This trend has continued with groupwise dequantization, which stores scales and biases per group of a configurable group size g. Appendix E of the LUT-GEMM paper shows that the compression ratio varies from 2.5x to 4.0x as the group size increases from g=32 to full row-at-a-time for q4, and from 3.5x to 5.3x for q3 (all relative to fp16). As you can see from this, the row-at-a-time compression ratios are what you'd naively expect.
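To see why the groupwise scales and biases eat into the naive 4x, here is a back-of-envelope calculation (my own arithmetic, assuming one fp16 scale and one fp16 bias per group, which is a common but not universal layout):

```python
# Effective bits per weight for 4-bit groupwise quantization with
# an fp16 scale and fp16 bias stored per group of g weights.
for g in (32, 64, 128, 1024):
    overhead_bits = (16 + 16) / g            # scale + bias amortized over the group
    bits_per_weight = 4 + overhead_bits
    ratio_vs_fp16 = 16 / bits_per_weight
    print(f"g={g:>5}: {bits_per_weight:.2f} bits/weight, "
          f"{ratio_vs_fp16:.2f}x smaller than fp16")

# g=   32: 5.00 bits/weight, 3.20x smaller than fp16
# g=   64: 4.50 bits/weight, 3.56x smaller than fp16
# g=  128: 4.25 bits/weight, 3.76x smaller than fp16
# g= 1024: 4.03 bits/weight, 3.97x smaller than fp16
```

The exact figures in the LUT-GEMM appendix differ because of the LUTs and other bookkeeping, but the trend is the same: you only approach the naive 4x as the groups get very coarse.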
This is almost certainly the bulk of the "discrepancy" you see. The difference between q4 and q4f16 is presumably just the size difference of these scales and biases. Except that LUT-GEMM doesn't actually do the dequantization. Instead, the scales and biases are baked into look-up tables (LUTs). Nevertheless, these LUTs presumably also vary in size with the activation data type, and they also have their own hyperparameters that can lead to variations in size even for a fixed level of quantization.
While I assume these models are uniformly quantized with the specified algorithms, there can be other factors that complicate things. Some outlier weights may be stored separately without quantization, as discussed in the AWQ paper. You may also choose not to quantize some tensors at all (or only to fp16), such as the fully-connected (FC) weights.
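If you want to see where the bytes in these particular files actually go, you can sum the initializer sizes per data type with the onnx Python package. This is only a rough sketch (the file name is illustrative, and it only counts tensors stored in `raw_data`, ignoring external data and typed fields), but it will show which tensors stayed fp32/fp16 and which were quantized:

```python
from collections import defaultdict

import onnx

model = onnx.load("model_q4.onnx")   # path is illustrative

# Sum the stored bytes of each initializer, grouped by its data type.
bytes_by_dtype = defaultdict(int)
for init in model.graph.initializer:
    dtype_name = onnx.TensorProto.DataType.Name(init.data_type)
    bytes_by_dtype[dtype_name] += len(init.raw_data)

for dtype_name, nbytes in sorted(bytes_by_dtype.items(), key=lambda kv: -kv[1]):
    print(f"{dtype_name:>10}: {nbytes / 2**20:8.1f} MiB")
```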