GPU AI Benchmarks

Estimated tokens/sec for every GPU across popular AI models. The metric that actually matters for interactive AI.

How these numbers are calculated

These are physics-based estimates, not hardware benchmarks. We did not run tests on physical GPUs. LLM inference is memory-bandwidth-limited — each generated token requires reading the entire model from VRAM. The formula:

tok/s = (memory_bandwidth / model_size) × architecture_efficiency

Architecture efficiency factors (45-82% of theoretical) are calibrated against published community benchmarks from llama.cpp and LM Studio. Real-world results vary by ±20% depending on context length, software version, cooling, and system configuration. Use these figures for relative comparisons between GPUs, not as exact performance guarantees.
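To make the arithmetic concrete, here is a minimal sketch of that estimate in Python. The bandwidth figure is the published RTX 4090 spec; the ~5 GB model size (a 4-bit quant of an 8B-class model) and the 72% efficiency factor are illustrative assumptions, not measured values.

    def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float, efficiency: float) -> float:
        """Bandwidth-bound estimate: each generated token reads the whole model from VRAM once."""
        theoretical = bandwidth_gb_s / model_size_gb  # theoretical ceiling in tok/s
        return theoretical * efficiency               # scale by the architecture efficiency factor

    # Illustrative example: RTX 4090 (~1008 GB/s GDDR6X, ~72% efficiency factor)
    # running a roughly 5 GB 4-bit quant of an 8B-class model.
    print(round(estimate_tok_per_s(1008, 5.0, 0.72), 1))  # ≈ 145 tok/s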

Rating scale: Excellent (≥60 tok/s) · Fast (30-59) · Usable (15-29) · Slow (5-14) · Very slow (<5)

Llama 3.1 70B — Tokens/sec Leaderboard

The frontier model benchmark. Only GPUs with 24GB+ VRAM can run this.

How We Estimate Performance

LLM inference is memory-bandwidth limited. Each generated token requires reading the entire model from VRAM. The theoretical maximum is:

tok/s = (memory_bandwidth / model_size) × efficiency

Real-world throughput is roughly 45-82% of theoretical due to KV-cache reads, attention computation, dequantization cost, and driver/framework overhead. We apply architecture-specific efficiency factors:

  • H100 (HBM3): ~82%
  • NVIDIA RTX 50 (GDDR7): ~76%
  • RTX 40 (GDDR6X): ~72%
  • RTX 30 (GDDR6X): ~64%
  • AMD RX 7000: ~48% (ROCm overhead)
  • Tesla P40: ~45% (old arch)
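As a worked illustration, the sketch below applies those factors to a handful of GPUs for Llama 3.1 70B and maps each estimate onto the rating scale above. The bandwidth figures are approximate vendor specs and the ~42 GB model size (a Q4_K_M-class quant) is an assumption; whether the model actually fits in a card's VRAM is a separate constraint that the sketch ignores.

    # Assumed inputs: approximate vendor memory-bandwidth specs (GB/s), the efficiency
    # factors listed above, and a ~42 GB 4-bit (Q4_K_M-class) Llama 3.1 70B quant.
    # VRAM capacity (whether the model fits at all) is not checked here.
    GPUS = {
        "H100 SXM (HBM3)":   (3350, 0.82),
        "RTX 5090 (GDDR7)":  (1792, 0.76),
        "RTX 4090 (GDDR6X)": (1008, 0.72),
        "RTX 3090 (GDDR6X)": (936,  0.64),
        "RX 7900 XTX":       (960,  0.48),
        "Tesla P40":         (346,  0.45),
    }
    MODEL_SIZE_GB = 42.0

    def rating(tok_s: float) -> str:
        """Map an estimate onto the page's rating scale."""
        if tok_s >= 60: return "Excellent"
        if tok_s >= 30: return "Fast"
        if tok_s >= 15: return "Usable"
        if tok_s >= 5:  return "Slow"
        return "Very slow"

    for name, (bandwidth_gb_s, efficiency) in GPUS.items():
        tok_s = bandwidth_gb_s / MODEL_SIZE_GB * efficiency
        print(f"{name:20s} ~{tok_s:5.1f} tok/s  ({rating(tok_s)})")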

These are estimates for single-user, autoregressive generation at short-to-medium context lengths. Longer contexts reduce tok/s due to larger KV caches. Batched serving (vLLM) achieves higher aggregate throughput.
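A rough sketch of the context-length effect follows, extending the same bandwidth model so that each generated token also reads the growing KV cache. The layer and head counts below match the published Llama 3.1 70B configuration and assume an fp16 cache; attention compute growth is ignored, so treat the output as illustrative only.

    def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
        """Approximate KV-cache size for a GQA model with an fp16 cache."""
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
        return context_len * per_token / 1e9

    def tok_s_at_context(bandwidth_gb_s: float, model_size_gb: float,
                         efficiency: float, context_len: int) -> float:
        # Each token reads the model weights plus the current KV cache.
        bytes_read_gb = model_size_gb + kv_cache_gb(context_len)
        return bandwidth_gb_s / bytes_read_gb * efficiency

    # H100-class numbers with the same ~42 GB quant; tok/s declines as the cache grows.
    for ctx in (2_000, 8_000, 32_000):
        print(f"{ctx:>6} tokens of context: ~{tok_s_at_context(3350, 42.0, 0.82, ctx):.1f} tok/s")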