
Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?


I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:

File name           Size
model.onnx          654 MB
model_fp16.onnx     327 MB
model_q4.onnx       200 MB
model_q4f16.onnx    134 MB

I understand that:

  • model.onnx is the fp32 model,
  • model_fp16.onnx is the model whose weights are quantized to fp16.

I don't understand the sizes of model_q4.onnx and model_q4f16.onnx:

  1. Why is model_q4.onnx 200 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4.onnx meant that the weights are quantized to 4 bits.

  2. Why is model_q4f16.onnx 134 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4f16.onnx meant that the weights are quantized to 4 bits and activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

    qAfB(_id), where A represents the number of bits for storing weights and B represents the number of bits for storing activations.

and the question "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably). The naive arithmetic behind my expectations is sketched below.
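For reference, this is the back-of-the-envelope arithmetic behind my expectation, assuming every weight is stored at exactly the target bit width and nothing else contributes to file size:

```python
fp32_size_mb = 654  # model.onnx, from the table above

# Naive expectation: every weight stored at the target bit width,
# nothing else contributing to file size.
for name, bits in [("model_fp16.onnx", 16), ("model_q4.onnx", 4), ("model_q4f16.onnx", 4)]:
    print(f"{name}: expected ~ {fp32_size_mb * bits / 32:.1f} MB")

# model_fp16.onnx: expected ~ 327.0 MB   (actual: 327 MB)
# model_q4.onnx: expected ~ 163.5 MB     (actual: 200 MB)
# model_q4f16.onnx: expected ~ 163.5 MB  (actual: 134 MB)
```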


1 answer


Going from fp32 to fp16 is usually done by essentially just rounding the weights, so you should expect pretty close to a 2:1 reduction in size. (Of course, not everything is weights, but the weights dominate.)
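As a toy illustration in numpy (not the actual ONNX export path, just the cast itself):

```python
import numpy as np

w32 = np.random.randn(1024, 1024).astype(np.float32)
w16 = w32.astype(np.float16)  # round-to-nearest cast, no calibration needed

print(w32.nbytes / 1e6)  # ~4.19 MB
print(w16.nbytes / 1e6)  # ~2.10 MB -> the 2:1 reduction
```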

Going from fp32 (or fp16) to an integral type is no longer simply a matter of rounding. Keeping the activations in floating point at higher precision (e.g. fp16) is often very important for quality, so what usually happens is that the quantized weights are dequantized on the fly so that an fp16 matmul can be performed. You can see this in the second part of Figure 1 in the LUT-GEMM paper. Dequantization requires a scale and bias, which are computed during the calibration process.
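Here's a minimal numpy sketch of that pattern, assuming a simple affine scheme with one fp16 scale and bias per row (real kernels fuse the dequantization into the matmul instead of materializing the fp16 weight matrix):

```python
import numpy as np

def quantize_rowwise(w, bits=4):
    """Affine 4-bit quantization with one fp16 scale and bias per row."""
    qmax = 2 ** bits - 1
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = ((hi - lo) / qmax).astype(np.float16)
    bias = lo.astype(np.float16)
    q = np.round((w - bias) / scale).clip(0, qmax).astype(np.uint8)  # 4-bit codes (held in uint8 here)
    return q, scale, bias

def matmul_dequant(x16, q, scale, bias):
    """Dequantize the codes on the fly, then run the matmul in fp16."""
    w16 = q.astype(np.float16) * scale + bias
    return x16 @ w16.T

w = np.random.randn(8, 64).astype(np.float32)   # 8 output rows, 64 inputs
q, scale, bias = quantize_rowwise(w)
x = np.random.randn(2, 64).astype(np.float16)   # fp16 activations
print(matmul_dequant(x, q, scale, bias).shape)  # (2, 8)
```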

If we had a separate scale and bias for each component of the weight tensors, they'd be even larger than the weights, and you'd end up with a roundabout and inefficient way of storing the original weights. Early versions of this for LLMs instead used tensor-at-a-time dequantization, where a single scale and bias was used for an entire tensor. Suffice it to say, such a coarse scheme meant many weights would get suboptimal or even entirely inadequate scales and biases. So row-at-a-time dequantization followed, which gave separate scales and biases per row, lessening the compromise but increasing the size of the scales and biases.

This trend has continued with groupwise dequantization, which stores scales and biases per group of a configurable group size g. Appendix E of the LUT-GEMM paper shows that the compression ratio varies from 2.5x to 4.0x as the group size increases from g=32 to full row-at-a-time for q4, and from 3.5x to 5.3x for q3 (all relative to fp16). As you can see, the row-at-a-time compression ratios are what you'd naively expect.
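Here's the back-of-the-envelope version of that trade-off, assuming an fp16 scale and fp16 bias per group (the exact numbers in Appendix E come out a bit different depending on what metadata and precision get counted):

```python
def bits_per_weight(weight_bits, group_size, meta_bits=16 + 16):
    # 4-bit code per weight plus an fp16 scale and fp16 bias shared by each group
    return weight_bits + meta_bits / group_size

for g in (32, 128, 1024, float("inf")):  # inf ~ one scale/bias for a whole (long) row
    bpw = bits_per_weight(4, g)
    print(f"g={g}: {bpw:.2f} bits/weight, {16 / bpw:.2f}x smaller than fp16")

# g=32: 5.00 bits/weight, 3.20x smaller than fp16
# ...
# g=inf: 4.00 bits/weight, 4.00x smaller than fp16
```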

This is almost certainly the bulk of the "discrepancy" you see, and the difference between q4 and q4f16 is presumably just the size difference of these scales and biases. Except that LUT-GEMM doesn't actually do the dequantization: instead, the scales and biases are baked into look-up tables (LUTs). Nevertheless, these LUTs presumably also vary in size with the activation data type, and they have their own hyperparameters that can lead to variations in size even for a fixed level of quantization.
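As a heavily simplified picture of what "baking the scale and bias into a LUT" means: with 4-bit codes there are only 16 possible dequantized values per group, so you can precompute them once and replace the multiply-add with an index. (LUT-GEMM's actual tables encode partial dot products and are more elaborate than this.)

```python
import numpy as np

scale, bias = np.float16(0.02), np.float16(-0.15)     # made-up per-group metadata
lut = scale * np.arange(16, dtype=np.float16) + bias  # the 16 possible fp16 values

codes = np.random.randint(0, 16, size=64).astype(np.uint8)  # a group of 4-bit codes sharing scale/bias
w16 = lut[codes]                                            # dequantization as a table lookup
```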

While I assume these models are uniformly quantized with the specified algorithms, there can be other factors that complicate things. Some outlier weights may be stored separately without quantization, as discussed in the AWQ paper. You may also choose not to quantize some tensors at all (or only to fp16), such as the fully-connected (FC) weights.
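If you want to see which tensors actually got quantized in these particular files, one way (a sketch, assuming the onnx Python package and a local copy of the file) is to tally the initializer data types in the graph:

```python
from collections import Counter

import onnx

model = onnx.load("model_q4.onnx")  # path to your local copy
counts = Counter(
    onnx.TensorProto.DataType.Name(init.data_type)
    for init in model.graph.initializer
)
print(counts)  # e.g. how many tensors are FLOAT, FLOAT16, UINT8, ...
```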

