What is Quantization?
A beginner's guide to Q4, Q8, and FP16 — and how they affect your AI experience.
TL;DR
AI models store their knowledge as numbers. Quantization compresses these numbers to use less memory, so the model fits on your GPU. Less memory = lower quality, but modern quantization methods (like GPTQ and GGUF Q4_K_M) are smart enough that the quality loss is small. The rule of thumb: use the highest precision that fits in your VRAM.
Why Does This Matter?
A 70 billion parameter model at full precision (FP16) needs about 140GB of memory. No consumer GPU has 140GB of VRAM. The RTX 5090 — the most powerful consumer GPU — has 32GB.
So how do people run 70B models on consumer hardware? Quantization. By compressing each parameter from 16 bits down to 4 bits, that 140GB model shrinks to about 40GB — still tight, but manageable with some CPU offloading on a 32GB GPU.
The trade-off is quality. More compression = smaller model = fits on cheaper hardware, but the model becomes slightly less intelligent. The art is finding the sweet spot for your GPU.
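To make the idea concrete, here is a minimal NumPy sketch of what 4-bit quantization does under the hood: weights are grouped into small blocks, and each block stores tiny integers plus one scale factor. This is a toy illustration of the principle, not the exact scheme GGUF or GPTQ uses (those add tricks like mixed precision and calibration data); the block size and function names here are just for illustration.

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    """Toy symmetric 4-bit quantization: each block of weights is mapped
    to integers in [-8, 7] plus one FP16 scale factor per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # one scale per block
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate weights from the 4-bit integers and scales."""
    return (q * scales.astype(np.float32)).reshape(-1)

# A fake "layer" of FP16 weights, standing in for a slice of a transformer matrix
w = np.random.randn(1024).astype(np.float16)
q, scales = quantize_q4(w.astype(np.float32))
w_restored = dequantize_q4(q, scales)

print("mean absolute error:", np.abs(w.astype(np.float32) - w_restored).mean())
# Storage: 4 bits per weight + one 16-bit scale per 32 weights ≈ 4.5 bits/weight,
# versus 16 bits/weight for FP16 — roughly a 3.5x reduction, which is how
# a 140GB model ends up closer to 40GB.
```

The key insight is that the model keeps the same number of parameters; each one is simply stored with less precision, and the small rounding error is what shows up as a slight quality loss.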
Precision Levels Explained
FP16 (Full Precision)
When to use: When your GPU has enough VRAM and you want the best possible quality.
Q8 (8-bit Quantization)
When to use: The recommended default for most users. Best balance of quality and efficiency.
Q4 (4-bit Quantization)
When to use: When your GPU can't fit Q8. The minimum for usable quality on large models.
Q2 / Q3 (Extreme Quantization)
When to use: Generally not recommended. Consider using a smaller model at Q4/Q8 instead.
Quick VRAM Calculation
To estimate how much VRAM the model weights need (leave some headroom for the context/KV cache):
FP16: parameters (in billions) × 2 ≈ GB needed
Q8: parameters (in billions) × 1 ≈ GB needed
Q4: parameters (in billions) × 0.6 ≈ GB needed
Example: Llama 3.1 70B at Q4 = 70 × 0.6 = ~42GB. That's just over what a single A100 40GB can hold, fits comfortably on an A100 80GB, and needs partial CPU offload on an RTX 5090's 32GB.
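If you'd rather not do the multiplication by hand, here is a small Python helper encoding the same rules of thumb (the bytes-per-parameter figures are just the estimates above, not exact file sizes, and real downloads vary a little by quantization variant):

```python
# Rough VRAM estimate for the weights alone, in GB.
# Add headroom for the KV cache and activations on top of this.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.6}

def estimate_vram_gb(params_billion: float, precision: str) -> float:
    """Estimated GB of memory for the model weights at a given precision."""
    return params_billion * BYTES_PER_PARAM[precision]

for precision in ("FP16", "Q8", "Q4"):
    print(f"Llama 3.1 70B at {precision}: ~{estimate_vram_gb(70, precision):.0f} GB")
# FP16: ~140 GB, Q8: ~70 GB, Q4: ~42 GB
```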
Which Format Should I Download?
GGUF (recommended for most users) — the standard format for llama.cpp and Ollama. Files are named like model-Q4_K_M.gguf. The "K_M" suffix marks one of the "K-quant" variants (here the medium size), which mix precisions across the model's weights to preserve quality better than naive Q4.
GPTQ — GPU-optimized quantization format. Faster inference than GGUF on NVIDIA GPUs but less flexible. Used with AutoGPTQ or ExLlama.
AWQ — newer GPU quantization format. Similar to GPTQ but often slightly better quality. Used with vLLM and other serving frameworks.
Rule of thumb: If you're using Ollama, it handles formats automatically. If downloading from HuggingFace, grab the Q4_K_M GGUF to start — it's the best balance of size and quality.
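If you're loading a GGUF yourself rather than going through Ollama, doing it with llama-cpp-python looks roughly like this. This is a sketch assuming you've installed llama-cpp-python (with GPU support compiled in, if you want offload) and already downloaded a Q4_K_M file; the file name below is hypothetical.

```python
# Requires: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct-Q4_K_M.gguf",  # hypothetical local file name
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to the GPU; lower this if you run out of VRAM
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The n_gpu_layers setting is where the "partial offload" trade-off from earlier shows up: layers that don't fit in VRAM stay on the CPU, so the model still runs, just more slowly.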