What is Quantization?

A beginner's guide to Q4, Q8, and FP16 — and how they affect your AI experience.

TL;DR

AI models store their knowledge as numbers. Quantization compresses these numbers to use less memory, so the model fits on your GPU. Less memory = lower quality, but modern quantization methods (like GPTQ and GGUF Q4_K_M) are smart enough that the quality loss is small. The rule of thumb: use the highest precision that fits in your VRAM.

Why Does This Matter?

A 70 billion parameter model at full precision (FP16) needs about 140GB of memory. No consumer GPU has 140GB of VRAM. The RTX 5090 — the most powerful consumer GPU — has 32GB.

So how do people run 70B models on consumer hardware? Quantization. By compressing each parameter from 16 bits down to 4 bits, that 140GB model shrinks to about 40GB — still tight, but manageable with some CPU offloading on a 32GB GPU.

The trade-off is quality. More compression = smaller model = fits on cheaper hardware, but the model becomes slightly less intelligent. The art is finding the sweet spot for your GPU.
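To make the core idea concrete, here's a minimal sketch in Python of what quantization actually does: map each floating-point weight to a small integer plus a shared scale, then multiply back at inference time. This is a toy illustration only; real schemes like GPTQ and GGUF's K-quants use per-block scales and much smarter rounding. The rounding error printed at the end is the "quality loss" in miniature.

```python
import numpy as np

# Toy example: quantize FP16 weights to 4-bit integers and back.
weights = np.random.randn(8).astype(np.float16)

# 4-bit signed integers cover [-8, 7]; pick a scale so the largest
# weight lands at the edge of that range.
scale = np.abs(weights).max() / 7
q4 = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize: multiply back by the scale. The small differences
# from the originals are the quality loss quantization introduces.
restored = q4.astype(np.float16) * scale
print("original: ", weights)
print("restored: ", restored)
print("max error:", np.abs(weights - restored).max())
```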

Precision Levels Explained

FP16 (Full Precision)

Memory per parameter: 16 bits (2 bytes), so VRAM in GB ≈ 2× the parameter count in billions
Example: 70B model ≈ 140GB
Quality: Maximum. FP16 is the reference baseline that quantization loss is measured against.
Speed: Baseline

When to use: When your GPU has enough VRAM and you want the best possible quality

Q8 (8-bit Quantization)

Memory per parameter: 8 bits (1 byte), so VRAM in GB ≈ 1× the parameter count in billions
Example: 70B model ≈ 70GB
Quality: Near-lossless; output is virtually indistinguishable from FP16 for most tasks
Speed: Often slightly faster than FP16, because less data has to move through memory

When to use: The recommended default for most users. Best balance of quality and efficiency.

Q4 (4-bit Quantization)

Memory per parameter: 4 bits plus per-block overhead, so VRAM in GB ≈ 0.6× the parameter count in billions (see the sketch after this section)
Example: 70B model ≈ 40GB
Quality: Good for most tasks. Slight degradation on complex reasoning, math, and nuanced writing.
Speed: Faster than Q8, since there is less data to move through memory

When to use: When your GPU can't fit Q8. The minimum for usable quality on large models.
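Curious why the Q4 factor is ~0.6 bytes per parameter rather than a strict 0.5? Block-wise formats store a scale (and often an offset) for every small group of weights, which pushes the effective cost above 4 bits. Here's a rough back-of-the-envelope in Python, using illustrative block numbers; real GGUF K-quant layouts differ in the details.

```python
# Why Q4 costs ~0.6 bytes per parameter instead of exactly 0.5:
# block-wise formats carry extra scale/offset data per group of
# weights. Illustrative numbers for a simple 32-weight block.
block_size = 32        # weights per block
bits_per_weight = 4
scale_bits = 16        # one FP16 scale per block
offset_bits = 16       # one FP16 offset (zero-point) per block

total_bits = block_size * bits_per_weight + scale_bits + offset_bits
effective_bits = total_bits / block_size
print(f"{effective_bits:.2f} bits/weight")       # 5.00
print(f"{effective_bits / 8:.3f} bytes/weight")  # 0.625, close to the 0.6x rule
```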

Q2 / Q3 (Extreme Quantization)

Memory per parameter: 2-3 bits, so VRAM in GB ≈ 0.3-0.4× the parameter count in billions
Example: 70B model ≈ 20-28GB
Quality: Noticeable degradation. Models may produce incoherent or repetitive output on complex tasks.
Speed: Fastest, but the quality trade-off usually isn't worth it

When to use: Generally not recommended. Consider using a smaller model at Q4/Q8 instead.

Quick VRAM Calculation

To estimate how much VRAM a model needs, take the parameter count in billions and multiply:

FP16: parameters (in billions) × 2 = GB needed

Q8: parameters (in billions) × 1 = GB needed

Q4: parameters (in billions) × 0.6 = GB needed

Example: Llama 3.1 70B at Q4 = 70 × 0.6 ≈ 42GB. That slightly exceeds an A100 40GB and is well over an RTX 5090's 32GB, so both need partial CPU offload; an A100 80GB runs it with room to spare.
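The same arithmetic as a tiny Python helper, if you'd rather not do it in your head. The bytes-per-parameter factors are the rough rules of thumb above; they ignore the KV cache and runtime overhead, which typically add a few more GB.

```python
# Rough VRAM estimator based on the rules of thumb above.
# Ignores KV cache and runtime overhead (add a few GB on top).
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.6}

def estimate_vram_gb(params_billion: float, quant: str) -> float:
    return params_billion * BYTES_PER_PARAM[quant]

for quant in ("FP16", "Q8", "Q4"):
    print(f"70B at {quant}: ~{estimate_vram_gb(70, quant):.0f} GB")
```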

Which Format Should I Download?

GGUF (recommended for most users) — the standard format for llama.cpp and Ollama. Files are named like model-Q4_K_M.gguf. The "K_M" means it uses a smart mixed quantization that preserves quality better than naive Q4.

GPTQ — GPU-optimized quantization format. Faster inference than GGUF on NVIDIA GPUs but less flexible. Used with AutoGPTQ or ExLlama.

AWQ — newer GPU quantization format. Similar to GPTQ but often slightly better quality. Used with vLLM and other serving frameworks.

Rule of thumb: If you're using Ollama, it handles formats automatically. If downloading from HuggingFace, grab the Q4_K_M GGUF to start — it's the best balance of size and quality.
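If you want to see a GGUF in action, here's a minimal loading sketch using the llama-cpp-python bindings (pip install llama-cpp-python). The model filename is a placeholder; point it at whatever Q4_K_M file you downloaded.

```python
from llama_cpp import Llama

# Load a quantized GGUF model. The path below is a placeholder;
# substitute the Q4_K_M file you downloaded from HuggingFace.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload as many layers as possible to the GPU
)

# A simple completion call; the quantized model is used like any other.
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```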