Post History
Answer
#1: Initial revision
Going from fp32 to fp16 is usually done by essentially just rounding the weights, so you should expect pretty close to a 2:1 reduction in size. (Of course, not everything is weights, but the weights dominate.)

Going from fp32 (or fp16) to an integral type is no longer simply a matter of rounding. Keeping the activations in floating point at higher precision (e.g. fp16) is often very important for quality, so what usually happens is that the quantized weights are dequantized on the fly so that an fp16 matmul can be performed. You can see this in the second part of Figure 1 in the [LUT-GEMM](https://arxiv.org/abs/2206.09557) paper. Dequantization requires a scale and bias, which are computed during the calibration process.

If we had a separate scale and bias for each component of the weight tensors, they'd be even larger than the weights, and you'd end up with a roundabout and inefficient way of storing the original weights. Early versions of this for LLMs instead used tensor-at-a-time dequantization, where a single scale and bias were used for an entire tensor. Suffice it to say, such a coarse scheme meant many weights got suboptimal or even entirely inadequate scales and biases. So row-at-a-time dequantization followed, which gave separate scales and biases per row, lessening the compromise but increasing the size of the scales and biases. This trend has continued with group-wise dequantization, which stores scales and biases per group of a configurable group size *g*. Appendix E of the LUT-GEMM paper shows that the compression ratio varies from 2.5x to 4.0x as the group size increases from *g*=32 to a full row for q4, and from 3.5x to 5.3x for q3 (all relative to fp16).

As you can see from this, the row-at-a-time compression ratios are what you'd naively expect. This is almost certainly the bulk of the "discrepancy" you see. The difference between q4 and q4f16 is presumably just the size difference of these scales and biases.

Except that LUT-GEMM doesn't actually do the dequantization. Instead, the scales and biases are baked into look-up tables (LUTs). Nevertheless, these LUTs presumably also vary in size with the activation data type, and they have their own hyperparameters that can lead to variations in size even for a fixed level of quantization.

While I assume these models are uniformly quantized with the specified algorithms, there can be other factors that complicate things. Some outlier weights may be stored separately without quantization, as discussed in the [AWQ paper](https://arxiv.org/abs/2306.00978). You may also choose not to quantize some tensors at all (or only to fp16), such as the fully-connected (FC) weights.
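To make the size arithmetic concrete, here's a minimal sketch of a *generic* uniform group-wise scheme with one fp16 scale and one fp16 bias per group. This is not LUT-GEMM's actual BCQ/LUT format (which stores more per group, one reason the paper's measured ratios come out lower than this naive estimate); the helper names (`quantize_groupwise`, `dequantize_groupwise`, `compression_ratio`) are purely illustrative.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=32):
    """Uniform group-wise quantization of a 1-D weight row.

    Each group of `group_size` weights gets its own fp16 scale and bias,
    so per-weight storage is `bits` plus the amortized cost of those two
    fp16 values.
    """
    levels = 2 ** bits - 1
    w = w.reshape(-1, group_size)              # (num_groups, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = ((w_max - w_min) / levels).astype(np.float16)
    bias = w_min.astype(np.float16)
    q = np.clip(np.round((w - bias) / scale), 0, levels).astype(np.uint8)
    return q, scale, bias

def dequantize_groupwise(q, scale, bias):
    """On-the-fly dequantization back to fp16, as done before the fp16 matmul."""
    return (q.astype(np.float16) * scale + bias).reshape(-1)

def compression_ratio(bits, group_size, params_per_group=2):
    """Rough compression ratio vs. fp16 storage.

    `params_per_group` is the number of fp16 values stored per group
    (one scale + one bias here); formats that store more per group
    compress correspondingly less.
    """
    bits_per_weight = bits + params_per_group * 16 / group_size
    return 16 / bits_per_weight

# Quantize a random 4096-wide row and check the rough size arithmetic.
rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)
q, scale, bias = quantize_groupwise(row, bits=4, group_size=32)
row_hat = dequantize_groupwise(q, scale, bias)
print("max abs reconstruction error:", np.abs(row - row_hat.astype(np.float32)).max())
print("q4, g=32    :", round(compression_ratio(4, 32), 2), "x vs fp16")
print("q4, full row:", round(compression_ratio(4, 4096), 2), "x vs fp16")
```

With this naive accounting, q4 at *g*=32 works out to 16 / (4 + 32/32) = 3.2x and a full row to roughly 4.0x, so you can see how the per-group metadata (and whatever else a particular format stores per group) eats into the nominal 4:1 ratio as the group size shrinks.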