GPU AI Benchmarks
Estimated tokens/sec for every GPU across popular AI models. The metric that actually matters for interactive AI.
How these numbers are calculated
These are physics-based estimates, not hardware benchmarks. We did not run tests on physical GPUs. LLM inference is memory-bandwidth-limited — each generated token requires reading the entire model from VRAM. The formula:
tok/s = (memory_bandwidth / model_size) × architecture_efficiency
Architecture efficiency factors (55-82% of theoretical) are calibrated against published community benchmarks from llama.cpp and LM Studio. Real-world results vary by ±20% depending on context length, software version, cooling, and system configuration. Use these as relative comparisons between GPUs, not as exact performance guarantees.
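As a concrete illustration, here is a minimal Python sketch of that calculation. The bandwidth figure, model size, and efficiency factor in the example are illustrative values (RTX 4090 spec bandwidth and an approximate Q4_K_M weight size), not the exact inputs used for the tables on this page.

```python
# Rough tok/s estimate: bandwidth-bound decoding reads every weight once per token.
# Bandwidth in GB/s, model size in GB (weights at the chosen quantization).
def estimate_tok_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                         efficiency: float) -> float:
    return bandwidth_gb_s / model_size_gb * efficiency

# Illustrative example: Llama 3.1 8B at Q4_K_M (~4.9 GB of weights)
# on an RTX 4090 (~1008 GB/s, ~72% efficiency factor).
print(round(estimate_tok_per_sec(1008, 4.9, 0.72)))  # ~148 tok/s
```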
Llama 3.1 8B — Tokens/sec Leaderboard
The most popular local LLM. Shows the best achievable precision for each GPU.
Llama 3.1 70B — Tokens/sec Leaderboard
The frontier model benchmark. Only GPUs with 24GB+ VRAM can run this.
Full Performance Matrix
Estimated tok/s across multiple models and GPUs. Click any cell for the full analysis.
How We Estimate Performance
LLM inference is memory-bandwidth limited. Each generated token requires reading the entire model from VRAM. The theoretical maximum is:
tok/s = memory_bandwidth / model_size
Real-world throughput is 55-82% of theoretical due to KV cache overhead, attention computation, dequantization cost, and driver/framework overhead. We apply architecture-specific efficiency factors:
- H100 (HBM3): ~82%
- NVIDIA RTX 50 series (GDDR7): ~76%
- RTX 40 series (GDDR6X): ~72%
- RTX 30 series (GDDR6X): ~64%
- AMD RX 7000 series: ~48% (ROCm overhead)
- Tesla P40: ~45% (older architecture)
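A minimal sketch of how such per-architecture factors can be applied across a small GPU table. The bandwidth numbers are approximate published specs, and the efficiency factors are taken from the list above; treat the output as rough estimates only.

```python
# Apply the per-architecture efficiency factors from the list above to a few GPUs.
# Bandwidths are approximate published specs in GB/s.
EFFICIENCY = {
    "hbm3": 0.82, "gddr7_rtx50": 0.76, "gddr6x_rtx40": 0.72,
    "gddr6x_rtx30": 0.64, "amd_rx7000": 0.48, "tesla_p40": 0.45,
}

GPUS = [
    ("H100 SXM",    3350, "hbm3"),
    ("RTX 5090",    1792, "gddr7_rtx50"),
    ("RTX 4090",    1008, "gddr6x_rtx40"),
    ("RTX 3090",     936, "gddr6x_rtx30"),
    ("RX 7900 XTX",  960, "amd_rx7000"),
    ("Tesla P40",    346, "tesla_p40"),
]

MODEL_SIZE_GB = 4.9  # e.g. Llama 3.1 8B at Q4_K_M

for name, bandwidth, arch in GPUS:
    tok_s = bandwidth / MODEL_SIZE_GB * EFFICIENCY[arch]
    print(f"{name:12s} ~{tok_s:5.0f} tok/s")
```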
These are estimates for single-user, autoregressive generation at short-to-medium context lengths. Longer contexts reduce tok/s due to larger KV caches. Batched serving (e.g., vLLM) achieves higher aggregate throughput.
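To see why longer contexts hurt, here is a rough KV-cache size estimate, a sketch assuming Llama 3.1 8B's published shape (32 layers, 8 KV heads of dimension 128) and an FP16 cache; the extra gigabytes at long contexts compete with the weights for both VRAM and memory bandwidth.

```python
# Rough KV-cache size: 2 (K and V) × layers × kv_heads × head_dim × bytes per element,
# per token of context. Assumes Llama 3.1 8B's shape and an FP16 cache.
def kv_cache_gb(context_len: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 1e9

print(f"{kv_cache_gb(2048):.2f} GB at 2K context")    # ~0.27 GB
print(f"{kv_cache_gb(32768):.2f} GB at 32K context")  # ~4.3 GB
```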