GPU AI Benchmarks

Estimated tokens/sec for every GPU across popular AI models. The metric that actually matters for interactive AI.

How these numbers are calculated

These are physics-based estimates, not hardware benchmarks. We did not run tests on physical GPUs. LLM inference is memory-bandwidth-limited — each generated token requires reading the entire model from VRAM. The formula:

tok/s = (memory_bandwidth / model_size) × architecture_efficiency

Architecture efficiency factors (45-82% of theoretical) are calibrated against published community benchmarks from llama.cpp and LM Studio. Real-world results vary by ±20% depending on context length, software version, cooling, and system configuration. Use these figures for relative comparisons between GPUs, not as exact performance guarantees.
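To make the arithmetic concrete, here is a minimal sketch of that estimate in Python. The bandwidth figure is the published RTX 4090 spec; the ~5 GB model size (a 4-bit quant of an 8B-class model) and the 72% efficiency factor are illustrative assumptions, not measured values.

    def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float, efficiency: float) -> float:
        """Bandwidth-bound estimate: each generated token reads the whole model from VRAM once."""
        theoretical = bandwidth_gb_s / model_size_gb  # theoretical ceiling in tok/s
        return theoretical * efficiency               # scale by the architecture efficiency factor

    # Illustrative example: RTX 4090 (~1008 GB/s GDDR6X, ~72% efficiency factor)
    # running a roughly 5 GB 4-bit quant of an 8B-class model.
    print(round(estimate_tok_per_s(1008, 5.0, 0.72), 1))  # ≈ 145 tok/s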

Rating scale: Excellent (≥60 tok/s) · Fast (30-59) · Usable (15-29) · Slow (5-14) · Very slow (<5)

Llama 3.1 70B — Tokens/sec Leaderboard

The frontier model benchmark. Only GPUs with 24GB+ VRAM can run this.

How We Estimate Performance

LLM inference is memory-bandwidth limited. Each generated token requires reading the entire model from VRAM. The theoretical maximum is:

tok/s = (memory_bandwidth / model_size) × efficiency

Real-world throughput is roughly 45-82% of theoretical due to KV-cache reads, attention computation, dequantization cost, and driver/framework overhead. We apply architecture-specific efficiency factors:

  • H100 (HBM3): ~82%
  • NVIDIA RTX 50 (GDDR7): ~76%
  • RTX 40 (GDDR6X): ~72%
  • RTX 30 (GDDR6X): ~64%
  • AMD RX 7000: ~48% (ROCm overhead)
  • Tesla P40: ~45% (old arch)
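As a worked illustration, the sketch below applies those factors to a handful of GPUs for Llama 3.1 70B and maps each estimate onto the rating scale above. The bandwidth figures are approximate vendor specs and the ~42 GB model size (a Q4_K_M-class quant) is an assumption; whether the model actually fits in a card's VRAM is a separate constraint that the sketch ignores.

    # Assumed inputs: approximate vendor memory-bandwidth specs (GB/s), the efficiency
    # factors listed above, and a ~42 GB 4-bit (Q4_K_M-class) Llama 3.1 70B quant.
    # VRAM capacity (whether the model fits at all) is not checked here.
    GPUS = {
        "H100 SXM (HBM3)":   (3350, 0.82),
        "RTX 5090 (GDDR7)":  (1792, 0.76),
        "RTX 4090 (GDDR6X)": (1008, 0.72),
        "RTX 3090 (GDDR6X)": (936,  0.64),
        "RX 7900 XTX":       (960,  0.48),
        "Tesla P40":         (346,  0.45),
    }
    MODEL_SIZE_GB = 42.0

    def rating(tok_s: float) -> str:
        """Map an estimate onto the page's rating scale."""
        if tok_s >= 60: return "Excellent"
        if tok_s >= 30: return "Fast"
        if tok_s >= 15: return "Usable"
        if tok_s >= 5:  return "Slow"
        return "Very slow"

    for name, (bandwidth_gb_s, efficiency) in GPUS.items():
        tok_s = bandwidth_gb_s / MODEL_SIZE_GB * efficiency
        print(f"{name:20s} ~{tok_s:5.1f} tok/s  ({rating(tok_s)})")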

These are estimates for single-user, autoregressive generation at short-to-medium context lengths. Longer contexts reduce tok/s due to larger KV caches. Batched serving (vLLM) achieves higher aggregate throughput.
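A rough sketch of the context-length effect follows, extending the same bandwidth model so that each generated token also reads the growing KV cache. The layer and head counts below match the published Llama 3.1 70B configuration and assume an fp16 cache; attention compute growth is ignored, so treat the output as illustrative only.

    def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
        """Approximate KV-cache size for a GQA model with an fp16 cache."""
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
        return context_len * per_token / 1e9

    def tok_s_at_context(bandwidth_gb_s: float, model_size_gb: float,
                         efficiency: float, context_len: int) -> float:
        # Each token reads the model weights plus the current KV cache.
        bytes_read_gb = model_size_gb + kv_cache_gb(context_len)
        return bandwidth_gb_s / bytes_read_gb * efficiency

    # H100-class numbers with the same ~42 GB quant; tok/s declines as the cache grows.
    for ctx in (2_000, 8_000, 32_000):
        print(f"{ctx:>6} tokens of context: ~{tok_s_at_context(3350, 42.0, 0.82, ctx):.1f} tok/s")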