How Much VRAM Do You Actually Need?
The honest, data-driven answer. No marketing fluff, no "it depends" cop-outs — real numbers for every workload.
The Short Answer
8GB — You can experiment with 7-8B parameter models (Mistral, Llama 8B) and run Stable Diffusion. Fine for learning, not enough for serious work.
16GB — The minimum for real AI work. Runs 14B models at Q8, handles SDXL and AlphaFold for medium proteins. The entry point for scientific computing.
24GB — The sweet spot. Runs 70B models at Q4, handles virtually every image gen model, and comfortably runs AlphaFold on long proteins. This is what most professionals should target.
32GB+ — No compromises. 70B models at Q8, large video generation, ESMFold at full precision. For people who need to run the biggest models at the highest quality.
Why VRAM Is the Only Spec That Matters for AI
GPU marketing focuses on TFLOPS, clock speeds, and ray tracing cores. For AI workloads, none of that matters as much as VRAM. Here's why:
An AI model is essentially a giant matrix of numbers — its "weights." To run the model, every single weight must be loaded into VRAM. If the model doesn't fit, it either won't load at all, or it spills into system RAM which is 10-50x slower. A model running from system RAM generates 1-3 tokens per second. The same model fully in VRAM generates 30-80 tokens per second.
This is a binary cliff, not a gradient. Either the model fits in VRAM and runs fast, or it doesn't and the experience is unusable for interactive work. There is no "almost fits" — you're either above the line or below it.
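If you want to check the cliff for yourself before downloading, a quick sanity check is to compare the model file's size against your card's free VRAM. Here is a minimal sketch using PyTorch — it assumes a CUDA-capable GPU, the GGUF path is a hypothetical example, and file size is only a rough proxy for loaded size (the KV cache and CUDA context add more on top):

```python
# Minimal sketch: will a downloaded GGUF file plausibly fit in free VRAM?
# Assumes PyTorch with a CUDA device; the model path is a hypothetical local file.
import os
import torch

MODEL_PATH = "models/qwen2.5-14b-instruct-q8_0.gguf"  # hypothetical

def fits_in_vram(model_path: str, overhead_gb: float = 1.5) -> bool:
    """Compare file size (rough proxy for loaded weight size) against free VRAM.

    overhead_gb reserves room for the KV cache and CUDA context; the real
    requirement depends on context length and runtime, so treat this as a
    first-pass sanity check, not a guarantee.
    """
    weights_gb = os.path.getsize(model_path) / 1024**3
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    print(f"Model ~{weights_gb:.1f} GB, free VRAM {free_gb:.1f} GB of {total_bytes / 1024**3:.1f} GB")
    return weights_gb + overhead_gb <= free_gb

if __name__ == "__main__":
    print("Likely to fit" if fits_in_vram(MODEL_PATH) else "Expect offloading or a failed load")
```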
For scientific computing, the same principle applies differently. AlphaFold's VRAM usage scales with protein sequence length — the attention matrices grow quadratically. scGPT's VRAM scales with dataset size. The model weights are small, but the working memory during computation is the bottleneck.
The practical implication: buy the most VRAM you can afford. A slower GPU with more VRAM will serve you better than a faster GPU with less. An RTX 3090 (24GB, older architecture) is more useful for AI than an RTX 4070 Ti (12GB, newer architecture) despite being a generation behind.
The Complete VRAM Compatibility Matrix
Every AI model we track, at every VRAM level. Each cell shows the best precision that fits at that VRAM size; the legend below the table explains the labels.
| Model | Params | 8GB | 12GB | 16GB | 24GB | 32GB | 48GB | 80GB |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | No | No | No | No | Offload | Q4 | Q8 |
| Llama 3.1 8B | 8B | Q8 | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 |
| Qwen 2.5 72B | 72B | No | No | No | No | Offload | Q4 | Q8 |
| Qwen 2.5 32B | 32B | No | No | Offload | Q4 | Q8 | Q8 | FP16 |
| Qwen 2.5 14B | 14B | Offload | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 |
| Mistral 7B | 7B | Q8 | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 |
| DeepSeek R1 70B | 70B | No | No | No | No | Offload | Q4 | Q8 |
| FLUX.1 Dev | 12B | Offload | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 |
| Stable Diffusion XL | 6.6B | Q8 | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 |
| Stable Diffusion 3.5 Large | 8B | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 | FP16 |
| HunyuanVideo | 13B | No | Offload | Q4 | Q8 | Q8 | FP16 | FP16 |
| CogVideoX-5B | 5B | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 | FP16 |
| Mochi 1 | 10B | Offload | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 |
| LTX Video | 2B | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 | FP16 |
| Stable Video Diffusion | 1.5B | FP16 | FP16 | FP16 | FP16 | FP16 | FP16 | FP16 |
| Wan Video 14B | 14B | Offload | Q4 | Q4 | Q8 | FP16 | FP16 | FP16 |
| Codestral 22B | 22B | No | Offload | Q4 | Q8 | Q8 | FP16 | FP16 |
| Qwen 2.5 Coder 32B | 32B | No | No | Offload | Q4 | Q8 | Q8 | FP16 |
| LLaVA 1.6 34B | 34B | No | No | Offload | Q4 | Q4 | Q8 | FP16 |
| AlphaFold 2 | 93M | Q4 | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 |
| ESMFold (ESM-2 15B) | 15B | Offload | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 |
| ESM-2 3B | 3B | FP16 | FP16 | FP16 | FP16 | FP16 | FP16 | FP16 |
| scGPT | 50M | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 | FP16 |
| RFdiffusion | 200M | Q4 | Q8 | FP16 | FP16 | FP16 | FP16 | FP16 |
| Fine-tune Llama 8B | 8B | Q4 | Q4 | Q8 | Q8 | Q8 | FP16 | FP16 |
| Fine-tune Llama 70B | 70B | No | No | No | No | Offload | Q4 | Q8 |
| Train SDXL LoRA | 6.6B | Q4 | Q8 | Q8 | FP16 | FP16 | FP16 | FP16 |
| Train FLUX LoRA | 12B | No | Offload | Q4 | Q8 | Q8 | FP16 | FP16 |
FP16 = full precision. Q8 = 8-bit quantized. Q4 = 4-bit quantized. Offload = partially in system RAM (very slow).
VRAM Tier Breakdown
8GB gets you started but hits walls quickly. You can run 7-8B parameter LLMs like Mistral 7B and Llama 3.1 8B at Q4-Q8 quantization. These are capable for basic chat, simple coding help, and text processing, but they're noticeably less intelligent than larger models.
For image generation, 8GB handles Stable Diffusion XL at Q8 and the lightweight video models (LTX Video, Stable Video Diffusion). You won't run FLUX.1 or the larger video generators.
For scientific computing, 8GB is enough for ESM-2 3B (protein embeddings), short protein AlphaFold predictions, and small single-cell experiments with scGPT. It's a viable entry point for learning, but you'll outgrow it fast.
Who should buy 8GB: Students learning AI/ML, hobbyists experimenting, anyone on a strict budget who wants to get started rather than wait.
12GB is an underrated sweet spot for budget builders. The RTX 3060 12GB is the cheapest NVIDIA card with enough VRAM for meaningful AI work, and it's widely available used for under $200.
You can run 14B models at Q4 (Qwen 2.5 14B is excellent at this size), 8B models at full FP16 precision, and most image generation models. AlphaFold runs medium-length proteins, and scGPT handles datasets up to ~30K cells.
The key limitation: 70B-class models (the frontier) are completely out of reach. You're capped at the "smart but not brilliant" tier of language models. For many use cases — coding assistance, quick queries, image generation, small-scale scientific computing — that's perfectly fine.
Who should buy 12GB: Budget-conscious AI experimenters, scientists running smaller analyses, anyone who wants more than 8GB without spending $500+.
16GB is where AI transitions from "toy" to "tool." You can run 14B models at Q8 (near-lossless quality), Qwen 2.5 32B at Q4, FLUX.1 Dev at Q8 for state-of-the-art image generation, and ESMFold at Q8 for fast protein structure predictions.
For scientific computing, 16GB is the practical minimum. AlphaFold handles most single-chain proteins at full precision. scGPT runs medium-scale experiments. RFdiffusion runs most protein design tasks. You're not limited to toy problems anymore.
The 70B models still don't fit — even at Q4, Llama 3.1 70B needs ~40GB. But the 14-32B models are remarkably capable in 2026. Qwen 2.5 14B at Q8 handles most coding, writing, and analysis tasks that people used to need 70B models for just a year ago.
Who should buy 16GB: Scientists running AlphaFold and single-cell analysis, developers wanting local code completion, AI enthusiasts who want real capability without the premium price.
24GB is the magic number in 2026. At this tier, the frontier opens up: Llama 3.1 70B and DeepSeek R1 70B run at Q4 with usable speed. Qwen 2.5 32B runs at Q8 — near-lossless quality on one of the best open models available. Every image generation model runs comfortably. Most video generation models fit.
For scientific computing, 24GB is professional-grade. AlphaFold handles long sequences and multimer predictions without VRAM anxiety. ESMFold runs at Q8 for most proteins. scGPT handles atlas-scale datasets. RFdiffusion runs complex multi-chain protein design. You can do real research, not just demos.
The RTX 3090 (24GB) and RTX 4090 (24GB) are the two most popular GPUs in the AI community for good reason. The 3090 is available used for ~$900 — making it the best VRAM-per-dollar consumer card available. The 4090 is faster but costs over twice as much.
Who should buy 24GB: Most people reading this guide. Researchers running production workloads, developers who want the best local AI models, scientists who need AlphaFold and ESMFold at full capability.
32GB removes nearly every constraint. 32B models run at Q8 — the quality sweet spot — and 70B models at Q4 come close to fitting, needing only a few layers offloaded to system RAM. FLUX.1 Dev runs at full FP16 precision. HunyuanVideo and the larger video generators become practical. ESMFold at full precision handles any protein.
The RTX 5090 is currently the only consumer card at 32GB, and its 1,792 GB/s memory bandwidth means the big models that do fit run fast. If you can stomach the price premium over a 24GB card, this is the "set it and forget it" tier.
Who should buy 32GB: Power users who want the biggest models at high quality, video generation enthusiasts, researchers who need ESMFold at full precision for large proteins, anyone who doesn't want to think about VRAM limits.
At 48GB+ you're in workstation and data center territory. The RTX 6000 Ada (48GB), A100 (40/80GB), and H100 (80GB) serve organizations that need to run 70B+ models at high precision, train models, or process massive scientific datasets.
Llama 70B at Q8 (70GB) fits on a single A100 80GB or H100. At FP16 (140GB), you need multi-GPU setups. For scientific computing, 80GB lets you run AlphaFold on the longest protein complexes, train custom ESM models, and process single-cell datasets with millions of cells.
The used market for data center cards is interesting: the Tesla P40 (24GB) goes for ~$300, and A100 40GB cards are becoming accessible at ~$4,500. These lack display output and need server cooling, but for pure compute, the VRAM per dollar is unbeatable.
Who should buy 48GB+: Research labs, companies deploying AI at scale, scientists working with very large datasets, anyone training (not just running) models.
VRAM Requirements by Workload
Local LLMs (ChatGPT-style Models)
Running large language models locally is the most VRAM-hungry consumer AI workload. The rule of thumb: parameters × 2 = GB at FP16, × 1 at Q8, × 0.6 at Q4. A 70B model needs ~140GB at full precision, ~70GB at Q8, or ~40GB at Q4.
The practical reality: In 2026, the 14-32B models have caught up to where 70B models were a year ago. Qwen 2.5 14B at Q8 (14GB VRAM) handles most everyday tasks — coding, writing, analysis — that used to require 70B. The 70B models are still better for complex reasoning and nuanced work, but the gap is narrowing fast. Don't overspec if 16GB solves your actual use case.
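As a quick sanity check on the rule of thumb above, here is a small back-of-envelope calculator. The fixed overhead added for the KV cache and runtime is an illustrative assumption, not a measured value; real usage varies with context length.

```python
# Tiny calculator for the rule of thumb above: bytes per parameter by precision,
# plus a flat overhead for KV cache and runtime (an assumption for illustration).
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.6}  # Q4 factor includes quantization metadata

def estimated_vram_gb(params_billion: float, precision: str, overhead_gb: float = 2.0) -> float:
    return params_billion * BYTES_PER_PARAM[precision] + overhead_gb

for model, size in [("Llama 3.1 8B", 8), ("Qwen 2.5 14B", 14), ("Llama 3.1 70B", 70)]:
    row = ", ".join(f"{p}: ~{estimated_vram_gb(size, p):.0f} GB" for p in ("FP16", "Q8", "Q4"))
    print(f"{model:>14} -> {row}")
```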
Image Generation (Stable Diffusion, FLUX)
Image generation models have more forgiving VRAM requirements than LLMs. SDXL runs on 8GB at Q8. FLUX.1 Dev — the current state-of-the-art — needs 16GB for comfortable operation. The VRAM bottleneck for image gen is usually resolution and batch size, not the model itself.
Key insight: If you only want image generation, 12-16GB is plenty. If you want image gen and local LLMs, you need enough VRAM for the LLM — the image gen model will fit by default. This is why we recommend buying for your most demanding workload.
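For concreteness, here is roughly what "fits by default" looks like in practice with the Hugging Face diffusers library: load SDXL in half precision and, on smaller cards, enable CPU offload for the idle sub-models. This is a hedged sketch under the assumption that diffusers and accelerate are installed, not the only way to run SDXL.

```python
# Sketch: SDXL in half precision with CPU offload so it fits on ~8-12GB cards.
# Assumes Hugging Face diffusers + accelerate; model ID is the public SDXL base checkpoint.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,      # halves weight memory vs FP32
)
pipe.enable_model_cpu_offload()     # parks idle sub-models (text encoders, VAE) in system RAM

# Resolution and batch size drive peak activation memory; one 1024x1024 image
# keeps SDXL comfortable on mid-range cards.
image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    height=1024, width=1024, num_images_per_prompt=1,
).images[0]
image.save("lighthouse.png")
```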
Video Generation
Video generation is the most VRAM-hungry creative workload. The best models (HunyuanVideo, Wan Video 14B) need 16-24GB for usable quality. Lightweight options exist (LTX Video runs on 4GB) but the quality difference is stark.
Scientific Computing (AlphaFold, Single-Cell, Protein Design)
Scientific computing VRAM requirements are different from AI models. The model weights are often small, but working memory during computation scales with input size. AlphaFold's attention grows quadratically with sequence length. scGPT's memory scales with cell count. The numbers below represent typical workloads — your specific needs depend on your data.
AlphaFold 2: The model weights are only ~200MB. VRAM usage is dominated by the MSA (Multiple Sequence Alignment) attention matrices. Short proteins (<500 residues) fit on 8GB. Long proteins (>1000 residues) or multimer predictions need 16GB+. ColabFold with MMseqs2 reduces memory by avoiding the full BFD database.
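To see why sequence length dominates, here is a rough back-of-envelope for the pairwise representation alone. The channel count and "live copies" multiplier are illustrative assumptions, not values measured from AlphaFold itself; the point is the L² growth.

```python
# Rough illustration of quadratic scaling: AlphaFold keeps a pairwise representation of
# shape (L, L, channels). The channel count and live-copies multiplier below are
# illustrative assumptions for the sketch, not the implementation's actual internals.
def pair_memory_gb(seq_len: int, channels: int = 128, bytes_per_val: int = 4, live_copies: int = 8) -> float:
    return seq_len**2 * channels * bytes_per_val * live_copies / 1024**3

for L in (300, 500, 1000, 2000):
    print(f"{L:>5} residues -> ~{pair_memory_gb(L):.1f} GB of pairwise activations")
```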
ESMFold: Unlike AlphaFold, ESMFold uses a single-sequence approach — no MSA needed. The trade-off is that the model itself is 15B parameters (30GB at FP16). But predictions complete in seconds, not minutes. If you're screening hundreds of proteins, ESMFold on a 24GB+ card is dramatically faster than AlphaFold.
scGPT: The model is small (50M parameters), but VRAM scales with your dataset. 10,000 cells at training time needs ~4-6GB. 100,000 cells needs ~12GB. If you're working with atlas-scale datasets (500K+ cells), you need 24GB+ or gradient checkpointing.
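The general mitigation is to keep the full matrix in CPU memory and stream batches of cells to the GPU, so peak VRAM is bounded by batch size rather than total cell count. This is a generic PyTorch sketch of that pattern; the dataset shape and batch size are assumptions, and the loop body stands in for whatever forward pass your model actually runs.

```python
# Sketch: stream a large cell-by-gene matrix to the GPU in batches.
# Shapes and batch size are illustrative assumptions, not scGPT's actual pipeline.
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

n_cells, n_genes = 100_000, 2_000                       # illustrative dataset size (~800MB in RAM)
expr = np.random.rand(n_cells, n_genes).astype(np.float32)

loader = DataLoader(TensorDataset(torch.from_numpy(expr)),
                    batch_size=512, shuffle=False, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for (batch,) in loader:
    batch = batch.to(device, non_blocking=True)         # only ~512 cells resident on the GPU at once
    # ... run the model's forward pass on `batch` here ...
```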
Code Completion & Generation
Local code models are increasingly competitive with cloud services. Qwen 2.5 Coder 32B is the current open-source leader, but it needs 20-32GB VRAM. Codestral 22B fits on 16GB at Q4. For IDE integration, speed matters — you want the model fully in VRAM for sub-second completions.
Training & Fine-tuning (The VRAM Multiplier)
Everything above covers inference — running pre-trained models. Training needs 2-4x more VRAM because the GPU must store not just the model weights, but also optimizer states (Adam uses 2x the model size), gradients, and activations for backpropagation. A model that runs on 16GB might need 40GB to train.
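Here is the accounting behind that multiplier, following the breakdown above (weights, gradients, and two Adam statistics per weight). Activations are left out because they depend on batch size and sequence length, and memory-saving tricks like 8-bit optimizers and gradient checkpointing pull the real figure back down.

```python
# Back-of-envelope for the training multiplier, in units of the model's own weight memory.
# This mirrors the breakdown above and ignores activations; it is a rough bound, not a measurement.
def full_finetune_gb(weights_gb: float) -> float:
    gradients = weights_gb        # one gradient per weight, same precision as the weights
    adam      = 2 * weights_gb    # Adam keeps two running statistics per weight
    return weights_gb + gradients + adam   # ~4x the inference footprint, before activations

print(f"Llama 8B at FP16: ~16 GB to run, ~{full_finetune_gb(16):.0f} GB+ to fully fine-tune")
```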
The breakthrough: LoRA and QLoRA make training accessible on consumer hardware. Instead of updating all parameters (full fine-tune), LoRA trains small adapter matrices — typically 1-5% of total parameters. QLoRA goes further by loading the base model at 4-bit precision. This slashes VRAM from "needs a data center" to "fits on your desktop GPU."
For LLM fine-tuning: QLoRA with Unsloth is the sweet spot. Fine-tune Llama 8B on 8GB, or Llama 70B on 40GB. Quality is within 1-2% of full fine-tuning for most tasks. Training takes hours, not days.
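As a concrete reference point, a QLoRA setup with the generic Hugging Face stack (transformers + peft + bitsandbytes) looks roughly like the sketch below; Unsloth wraps a similar recipe with extra optimizations. The hyperparameters and target modules are common defaults rather than a tuned recipe, and the Llama 3.1 checkpoint is gated on the Hub.

```python
# Hedged sketch of a QLoRA setup: 4-bit base model + small trainable LoRA adapters.
# Hyperparameters are illustrative defaults, not a tuned recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights stored in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",     # gated on the Hub; requires accepting the license
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # typically ~1% of total parameters are trainable
```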
For image model training: LoRA is the standard — train custom styles, characters, and concepts in 15-60 minutes. SDXL LoRAs need 8-12GB. FLUX LoRAs need 16-24GB but produce dramatically better results.
The rule: If you plan to both run and train models, buy for the training requirement — it's always higher. A 24GB card that trains comfortably also runs everything.
VRAM vs System RAM — Can You Just Use More RAM?
Short answer: technically yes, practically no.
Tools like llama.cpp support "partial offloading" — keeping some model layers in system RAM and some in VRAM. This lets you run models that don't fully fit. The problem is speed: dual-channel DDR5-5600 system RAM has roughly 90 GB/s of bandwidth. VRAM (GDDR6X on an RTX 4090) has 1,008 GB/s. That's more than a 10x difference.
In practice, a model that's 50% offloaded to system RAM runs roughly 3-5x slower than one fully in VRAM. A model that's 80% offloaded is essentially unusable for interactive work — you'll get 1-3 tokens per second instead of 30-80.
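In llama.cpp's Python bindings, the split is controlled by a single parameter: n_gpu_layers sets how many transformer layers go to VRAM, and the rest stay in system RAM. A minimal sketch (the model path is a hypothetical local file):

```python
# Sketch of partial offloading via llama-cpp-python. The model path is hypothetical;
# tune n_gpu_layers to whatever your card can actually hold.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=40,     # layers kept in VRAM; -1 offloads every layer when VRAM allows
    n_ctx=8192,          # context length also consumes VRAM via the KV cache
)

out = llm("Summarize the trade-off between VRAM and system RAM in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```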
The rule: system RAM is for overflow, not as a substitute. Budget for at least 2x your GPU's VRAM in system RAM (e.g., 48-64GB DDR5 for a 24GB GPU). This gives the operating system, model loading, and other processes enough headroom without stealing from the GPU.
For scientific computing, system RAM matters more. AlphaFold's MSA database processing happens in CPU RAM before the GPU computation. Single-cell datasets are loaded into CPU memory first, then batched to the GPU. Budget 64-128GB for serious scientific workloads.
Common Mistakes When Buying for VRAM
Buying a fast GPU with less VRAM over a slower one with more
An RTX 4070 Ti Super (16GB) is faster than an RTX 3090 (24GB) at everything — except running models that need more than 16GB. If your workload needs 24GB, the 3090 is infinitely faster because it can actually run the model. Speed doesn't matter if the model doesn't load.
Assuming you can upgrade VRAM later
VRAM is soldered to the GPU. You cannot add more. The only upgrade path is buying a new card. Buy for your target workload, not your current one — the frontier models next year will be bigger than today's.
Buying AMD for AI because the specs look good
AMD GPUs have competitive specs on paper, but CUDA — NVIDIA's compute platform — is required by most AI frameworks. ROCm (AMD's alternative) is improving but still unreliable on consumer cards. Only buy AMD if gaming is your primary use case. For scientific computing, this is even more critical — AlphaFold, ESM, and scGPT all assume CUDA.
Thinking 8GB is "enough to start"
8GB was enough in 2023. In 2026, even entry-level models benefit from 12-16GB. The RTX 3060 12GB is available used for under $200 — that extra 4GB is worth more than the $50-100 savings of an 8GB card.
Focusing on MSRP instead of street price
GPU street prices can be 20-50% above MSRP due to demand. Our pricing uses real street prices, not manufacturer suggestions. A card with great $/GB at MSRP might be mediocre at street price.
How Much VRAM Will You Need in 2027?
Models are getting bigger, but they're also getting more efficient. Two competing trends:
Models grow: The frontier moves from 70B to 100B+ parameters. Video generation models are getting larger and longer. Scientific models are being trained on more data with larger context windows.
Efficiency improves: Better quantization methods (GGUF Q4_K_M is dramatically better than naive Q4 from two years ago). Mixture-of-experts architectures only activate a fraction of total parameters. Speculative decoding and other tricks reduce effective memory needs.
The practical answer: Buy one tier above what you need today. If 16GB solves your current workloads, buy 24GB. If 24GB works, 32GB gives you a year of headroom. The extra cost now is cheaper than replacing the entire GPU next year.