Quantization & Compression

A 70-billion-parameter model stored in FP32 takes 280GB of memory, so you need multiple GPUs just to load it. Quantized to INT4, the same model fits in about 35GB and loses very little quality. It's like compressing RAW to JPEG: technically you lose information, but your eyes can't tell.

That's what quantization does for neural networks. It takes those absurdly large models and squeezes them down to a fraction of their size by reducing the precision of every single weight. Instead of storing each number with 32 bits of precision, you use 8 bits. Or 4 bits. Sometimes even less.

And here's the kicker: for most models, you barely notice the quality drop. A 4-bit quantized Llama can run on a laptop and still give you coherent answers. The same model in full precision would need a server rack. This isn't some niche optimization trick — it's how local AI actually works in practice.

Making models smaller without breaking them

Neural networks are packed with millions or billions of floating-point numbers called weights. Each weight is a knob that slightly adjusts how information flows through the model. In their original training format (usually FP32 or BF16), every single weight gets stored with extreme precision — 32 bits or 16 bits per number.

Quantization shrinks models dramatically

[Diagram: FP32 Model (280GB) -> Quantization -> INT4 Model (35GB), nearly same quality]

But here's the thing: most of those weights cluster around zero. They're not random numbers scattered across the entire range of possible values. They follow patterns. They're redundant. And they definitely don't need 32 bits of precision to do their job.

Quantization exploits that redundancy. Instead of storing each weight with full 32-bit floating-point precision, you map it to a much smaller set of possible values — say, 256 values for INT8, or just 16 values for INT4. You're intentionally throwing away precision. But because the weights are clustered and somewhat redundant, the model still behaves almost identically.

The payoff is huge. A 70B-parameter model drops from 280GB to 35GB. That's the difference between needing multiple enterprise GPUs and running locally on a MacBook. It's why tools like llama.cpp and Ollama exist — they let you run quantized models efficiently on consumer hardware.

From 32 bits to 4 bits

Quantization is conceptually simple: map a continuous range of floating-point numbers to a discrete set of integers. But the devil is in the details — how you do that mapping, when you do it, and what tricks you use to minimize quality loss. Let's walk through the key approaches.

What is quantization?

At its core, quantization is just rounding. You take a floating-point weight like 0.3847 and map it to the nearest value in a much smaller set of allowed values.

FP32: 0.3847291 (exact)

INT8: 0.3858 (rounded to nearest 1/127)

INT4: 0.4286 (rounded to nearest 1/7)
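
The three values above can be reproduced in a few lines of Python. This is only a minimal sketch: it assumes a symmetric grid over [-1, 1] with 2^(bits-1) - 1 positive steps (127 for INT8, 7 for INT4), and the function name snap_to_grid is made up for illustration.

```python
def snap_to_grid(value: float, bits: int) -> float:
    """Round value to the nearest multiple of 1/(2**(bits-1) - 1), clamped to [-1, 1]."""
    steps = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4 (assumed symmetric grid)
    clamped = max(-1.0, min(1.0, value))  # values outside the range saturate
    level = round(clamped * steps)        # the integer that would actually be stored
    return level / steps                  # the value the model sees after mapping back


w = 0.3847291
int8 = snap_to_grid(w, 8)
int4 = snap_to_grid(w, 4)
print(f"INT8: {int8:.4f}  error {abs(w - int8):.4f}")
print(f"INT4: {int4:.4f}  error {abs(w - int4):.4f}")
```

The INT8 result lands on 49/127 and the INT4 result on 3/7, which is where the 0.3858 and 0.4286 figures come from. Real quantizers usually pick the grid from the data rather than fixing it at [-1, 1].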

Fewer bits = less memory. An FP32 weight takes 4 bytes. INT8 takes 1 byte. INT4 takes half a byte. Multiply that across billions of parameters and you've shrunk the entire model by 4x or 8x.
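
The arithmetic is simple enough to check yourself. The sketch below counts weight storage only and uses decimal gigabytes (1 GB = 1e9 bytes); the helper name model_size_gb is an illustrative choice, and real quantized files tend to come out somewhat larger because they also store scales and metadata.

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 70e9  # the 70B-parameter model used throughout this article

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(PARAMS, bits):.0f} GB")
```

This gives 280, 140, 70, and 35 GB for the four precisions.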

The trade-off is obvious: you lose precision. The question is whether that precision actually mattered for the model's behavior. Spoiler: usually it doesn't.

Now try it yourself

Time to see the trade-offs in action. The simulation below lets you toggle between different precisions and watch how a floating-point value gets quantized. Move the slider to try different numbers. Notice how INT4 snaps values to coarse steps, while FP16 barely changes anything. Check the model stats panel to see the size, speed, and quality implications of each choice.

Precision vs Quality Trade-off

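[Interactive widget: a slider sets the value to quantize between -1.0 and 1.0. Example reading: original 0.3847, error < 0.0001. Stats panel for a 70B-parameter model: Model Size (280GB), Inference Speed (vs FP32), Quality Score (100 out of 100), Bits per Weight (precision).]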

Comparison across precisions

Model Size (lower is better): FP32 (32-bit) 280GB, FP16 (16-bit) 140GB, INT8 (8-bit) 70GB, INT4 (4-bit) 35GB

[Chart: Inference Speed (higher is better) for FP32, FP16, INT8, INT4]

[Chart: Quality Score (higher is better) for FP32, FP16, INT8, INT4; FP32 scores 100]

Sample weight values at FP32 (32-bit)

0.8234, -0.4521, 0.1567, -0.9012, 0.3421, -0.6789, 0.5234, 0.0987, -0.2145, 0.7654, -0.3987, 0.6123

Notice how lower precision loses decimal resolution, especially in INT4 where values snap to fixed steps.
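
To see the snapping on the sample weights without the widget, here is a rough sketch that stores them as integer codes plus a single scale. It assumes symmetric "absmax" scaling (the largest magnitude maps to the top code), which is one common scheme and not necessarily what the simulation uses; the function names are illustrative.

```python
SAMPLE_WEIGHTS = [0.8234, -0.4521, 0.1567, -0.9012, 0.3421, -0.6789,
                  0.5234, 0.0987, -0.2145, 0.7654, -0.3987, 0.6123]


def quantize_absmax(weights, bits):
    """Return (integer codes, scale) using one scale for the whole list.

    Assumes at least one weight is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax     # float size of one integer step
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


for bits in (8, 4):
    codes, scale = quantize_absmax(SAMPLE_WEIGHTS, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(SAMPLE_WEIGHTS, restored))
    print(f"INT{bits}: step {scale:.4f}, {len(set(codes))} distinct codes, max error {worst:.4f}")
```

The worst-case error is at most half a step, so the coarser INT4 step shows up directly as larger rounding error.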

Key Takeaways

Quantization reduces the precision of model weights from 32-bit or 16-bit floats down to 8-bit or 4-bit integers. This shrinks model size dramatically — often 4x to 8x smaller.
Post-training quantization (GPTQ, AWQ) works by quantizing a pre-trained model layer by layer, using calibration data to minimize reconstruction error. It's the most common approach for open-source LLMs.
GGUF is the standard file format for running quantized models locally with llama.cpp. Q4_K_M (4-bit k-quant) is the sweet spot: small size, near-original quality.
Quantization to INT4 typically retains 92-95% of the original model quality. For most tasks, the difference is imperceptible — but the memory savings are massive.
Beyond quantization, pruning (removing near-zero weights) and knowledge distillation (training a smaller student model) offer additional compression paths. These can be combined for extreme size reductions.
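
As a rough illustration of the pruning idea in the last takeaway, the sketch below zeroes out the smallest-magnitude weights. The weights and the 50% sparsity target are made up, and the helper name prune_by_magnitude is hypothetical; real pruning pipelines typically retrain afterwards to recover quality.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Made-up weights for illustration only.
weights = [0.82, -0.03, 0.15, -0.90, 0.01, -0.67, 0.52, 0.09]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
```

Half of the entries become exact zeros, which only saves memory if the storage format or kernel can skip them.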

Common Misconceptions

"Quantization always hurts quality." Not really. With modern methods like GPTQ and AWQ, INT4 models often perform nearly identically to FP16 on standard benchmarks. The precision loss matters less than you'd expect.
"You need special hardware to use quantized models." Nope. GGUF models run on CPUs just fine. They're actually designed for consumer hardware. A MacBook with enough memory to hold the roughly 35GB of weights can run a 70B INT4 model with no dedicated GPU at all.
"Quantization is just rounding weights randomly." That would be terrible. Modern quantization methods are calibrated: they use real data to figure out how to round intelligently. Some weights get more precision, others get less, based on importance.
"Lower bits always mean faster inference." Not always. INT4 is smaller, but if your hardware doesn't have optimized INT4 kernels, it might run slower than INT8 or even FP16. The speedup depends on the inference engine and hardware support.

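One simple way to round more intelligently than a single global grid is to give each small group of weights its own scale, so one outlier doesn't stretch the grid for everyone. The sketch below is not GPTQ or AWQ (those rely on calibration data); it only compares one scale for the whole list against one scale per group, using made-up weights and an arbitrary group size of 4.

```python
def quantize_dequantize(weights, bits):
    """Absmax round trip: quantize to signed integers, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)          # nothing to scale
    scale = peak / qmax
    return [round(w / scale) * scale for w in weights]


def groupwise(weights, bits, group_size):
    """Apply the round trip separately to each consecutive group of weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_dequantize(weights[start:start + group_size], bits))
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights: mostly small, with one large outlier in the last group.
weights = [0.02, -0.05, 0.03, 0.04, -0.01, 0.06, -0.03, 0.02, 0.05, -0.04, 0.01, 2.50]

per_tensor_err = mean_abs_error(weights, quantize_dequantize(weights, 4))
per_group_err = mean_abs_error(weights, groupwise(weights, 4, group_size=4))
print(f"one scale for everything: {per_tensor_err:.4f}")
print(f"one scale per group of 4: {per_group_err:.4f}")
```

With a single scale, the outlier forces every small weight to round to zero. With per-group scales, only the outlier's own group pays that price, at the cost of storing one extra scale per group.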
Quick check

Why is GGUF the most common format for running LLMs locally on consumer hardware?
