
Post History

Q&A Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?


posted 2mo ago by Derek Elkins‭

Answer
#1: Initial revision by Derek Elkins · 2024-11-09T01:53:45Z (about 2 months ago)
Going from fp32 to fp16 is usually done by essentially just rounding the weights, so you should expect a size reduction of close to 2:1. (Of course, not everything is weights, but the weights dominate.)
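To make the 2:1 figure concrete, here is a minimal NumPy sketch. The random tensor is only a stand-in for real model weights, and the cast uses NumPy's default round-to-nearest behaviour; an actual conversion tool may treat some tensors differently.

```python
import numpy as np

# Stand-in for a weight tensor; real weights would come from the model file.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((256, 256)).astype(np.float32)

# The cast rounds each value to the nearest representable fp16 value.
weights_fp16 = weights_fp32.astype(np.float16)

# Storage shrinks from 4 bytes to 2 bytes per weight.
ratio = weights_fp32.nbytes / weights_fp16.nbytes

# The rounding error is small relative to the magnitude of the weights.
max_err = float(np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32))))
print(f"size ratio: {ratio}, max rounding error: {max_err:.2e}")
```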

Going from fp32 (or fp16) to an integral type is no longer simply a matter of rounding. Keeping the activations in floating point at higher precision (e.g. fp16) is often very important for quality, so what usually happens is that the quantized weights are dequantized on the fly so that an fp16 matmul can be performed. You can see this in the second part of Figure 1 in the [LUT-GEMM](https://arxiv.org/abs/2206.09557) paper. Dequantization requires a scale and bias, which are computed during the calibration process.
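As a rough illustration of "dequantize on the fly, then do an fp16 matmul", here is a sketch with a single scale and bias for the whole tensor. The function names are made up for this example, the min/max rule is a simple stand-in for a real calibration process, and the 4-bit codes are stored one per byte rather than packed two per byte as a real format would do.

```python
import numpy as np

def quantize_uint4(w):
    """Map weights to integer codes 0..15 using one scale and bias (assumed min/max rule)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 15.0 or 1.0  # fall back to 1.0 for a constant tensor
    bias = w_min
    q = np.clip(np.round((w - bias) / scale), 0, 15).astype(np.uint8)
    return q, np.float16(scale), np.float16(bias)

def dequantize(q, scale, bias):
    """Recover approximate fp16 weights from the integer codes."""
    return q.astype(np.float16) * scale + bias

def matmul_dequant(x_fp16, q, scale, bias):
    """Activations stay in fp16; weights are dequantized just before the matmul."""
    return x_fp16 @ dequantize(q, scale, bias)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
x = rng.standard_normal((4, 64)).astype(np.float16)
q, scale, bias = quantize_uint4(w)
y = matmul_dequant(x, q, scale, bias)
print(y.shape, y.dtype)
```

What has to be stored is the integer codes plus the scale and bias, which is why the size of that metadata matters for the final file size.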

If we had a separate scale and bias for each component of the weight tensors, they'd be even larger than the weights, and you'd end up with a roundabout and inefficient way of storing the original weights. Early versions of this for LLMs instead used tensor-at-a-time dequantization, where a single scale and bias was used for an entire tensor. Suffice it to say, such a coarse scheme meant many weights would get suboptimal or even entirely inadequate scales and biases. Row-at-a-time dequantization followed, giving each row its own scale and bias, which lessens the compromise but increases the size of the scales and biases. This trend has continued with groupwise dequantization, which stores a scale and bias per group of weights, with a configurable group size *g*. Appendix E of the LUT-GEMM paper shows that the compression ratio varies from 2.5x to 4.0x as the group size increases from *g*=32 to full row-at-a-time for q4, and from 3.5x to 5.3x for q3 (all relative to fp16). As you can see from this, the row-at-a-time compression ratios are what you'd naively expect.
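The effect of the group size on the compression ratio can be estimated with some back-of-the-envelope arithmetic. This sketch assumes one fp16 scale and one fp16 bias per group, which is my simplification; the LUT-GEMM format stores its metadata differently, so these figures are not expected to reproduce the Appendix E numbers exactly, only the trend.

```python
def bits_per_weight(weight_bits, group_size, scale_bits=16, bias_bits=16):
    """Effective storage per weight: the quantized code plus its share of the group metadata."""
    return weight_bits + (scale_bits + bias_bits) / group_size

def compression_vs_fp16(weight_bits, group_size):
    """Compression ratio relative to storing every weight as fp16."""
    return 16.0 / bits_per_weight(weight_bits, group_size)

# Larger groups amortize the scale and bias over more weights.
for g in (32, 128, 4096):
    print(f"q4, g={g}: {compression_vs_fp16(4, g):.2f}x")
```

Under these assumptions q4 comes out at 3.2x for *g*=32 and approaches the naive 4x only as the groups get large.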

This is almost certainly the bulk of the "discrepancy" you see. The difference between q4 and q4f16 is presumably just the size difference of these scales and biases. That said, LUT-GEMM doesn't actually perform the dequantization. Instead, the scales and biases are baked into look-up tables (LUTs). Nevertheless, these LUTs presumably also vary in size with the activation data type, and they have their own hyperparameters that can lead to variations in size even for a fixed level of quantization.

While I assume these models are uniformly quantized with the specified algorithms, there can be other factors that complicate things. Some outlier weights may be stored separately without quantization, as discussed in the [AWQ paper](https://arxiv.org/abs/2306.00978). You may also choose not to quantize some tensors at all (or only to fp16), such as the fully-connected (FC) weights.
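To see how much unquantized tensors can drag down the overall ratio, here is a rough size estimate. The tensor names and parameter counts are entirely hypothetical (they do not describe the models in the question), and the per-group metadata is again assumed to be an fp16 scale plus an fp16 bias.

```python
def tensor_bytes(numel, weight_bits, group_size=None, meta_bits=32):
    """Approximate storage for one tensor; meta_bits is the scale+bias cost per group."""
    bits = numel * weight_bits
    if group_size is not None:
        bits += (numel / group_size) * meta_bits
    return bits / 8

# (name, number of weights, bits per weight, group size) -- hypothetical values.
layers = [
    ("embedding (kept in fp16)", 50_000_000, 16, None),
    ("transformer blocks (q4, g=32)", 150_000_000, 4, 32),
]

fp32_bytes = sum(numel * 4 for _, numel, _, _ in layers)
quant_bytes = sum(tensor_bytes(numel, bits, g) for _, numel, bits, g in layers)
overall_ratio = fp32_bytes / quant_bytes
print(f"fp32: {fp32_bytes / 1e6:.1f} MB, quantized: {quant_bytes / 1e6:.2f} MB, "
      f"ratio: {overall_ratio:.2f}x")
```

In this made-up example, a quarter of the weights staying in fp16 is enough to pull the overall ratio against fp32 down to roughly 4x, well short of the 8x you'd naively expect from 4-bit weights.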