
The Hardware You Actually Need for Local LLMs

  • ShiftQuality Contributor
  • Aug 19, 2025
  • 10 min read

You want to run language models on your own machine. Maybe you've tried Ollama and it worked but felt slow. Maybe you're planning a purchase and don't want to waste money. Maybe you're staring at GPU specs and have no idea which numbers matter.

This guide is the hardware reality check. We'll cover what actually determines performance when running LLMs locally, what hardware you need for different model sizes, and how to spend your money wisely. No hype, no affiliate-driven recommendations for hardware you don't need. Just the facts.

VRAM Is the Bottleneck

If you take one thing from this post, take this: VRAM (Video RAM) on your GPU is the single most important factor for running LLMs locally. Not CPU speed. Not system RAM. Not disk speed. VRAM.

Here's why. A language model is, at its core, a massive collection of numerical weights. To run the model, those weights need to be in memory where the processor can access them quickly. For GPU inference, that means VRAM. If the model doesn't fit in VRAM, one of three things happens:

  1. The model partially loads to VRAM and the rest stays in system RAM. Inference works but it's significantly slower because data has to shuttle between system RAM and VRAM constantly.

  2. The model loads entirely to system RAM and runs on CPU. This works but is 5-20x slower than GPU inference.

  3. The model doesn't fit anywhere and you can't run it at all.

That's it. VRAM determines which models you can run at full speed. Everything else is secondary.
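The three outcomes above can be sketched as a small decision function. This is only an illustration under simplifying assumptions: the `placement` helper is hypothetical, it compares raw sizes in GB, and it ignores the memory that real runtimes set aside for the KV cache and the operating system.

```python
def placement(model_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Classify where a model of a given size would end up (hypothetical helper)."""
    if model_gb <= vram_gb:
        return "gpu"            # fits entirely in VRAM: full speed
    if vram_gb > 0 and model_gb <= vram_gb + ram_gb:
        return "split"          # part in VRAM, rest in system RAM: slower
    if model_gb <= ram_gb:
        return "cpu"            # runs from system RAM on the CPU: much slower
    return "does not fit"


print(placement(5, 12, 32))    # a ~5 GB model on a 12GB card
print(placement(40, 24, 64))   # a ~40 GB model on a 24GB card
print(placement(5, 0, 16))     # no GPU at all
```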

How Much VRAM Do You Need?

Here's a rough guide to VRAM requirements by model size. The first column assumes 4-bit quantization (Q4_K_M), which is the most common format for local use and offers a good balance of quality and size. The Q8_0 and FP16 columns show what the same models need at higher precision.

| Model Size | VRAM Needed (Q4_K_M) | VRAM Needed (Q8_0) | VRAM Needed (FP16) |
|-----------|----------------------|--------------------|--------------------|
| 1-3B | 2-3 GB | 3-4 GB | 4-6 GB |
| 7-8B | 4-6 GB | 8-10 GB | 14-16 GB |
| 13-14B | 8-10 GB | 14-16 GB | 26-28 GB |
| 30-34B | 18-22 GB | 32-36 GB | 60-68 GB |
| 70B | 35-42 GB | 70-80 GB | 140+ GB |

These numbers include overhead for the KV cache (the memory used during inference to track the conversation context). Longer conversations and larger context windows use more KV cache, so your actual VRAM usage will vary.
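For a rough sense of where figures like these come from, the weights and the KV cache can be estimated separately. The sketch below is a back-of-the-envelope calculation, not a measurement. The bits-per-weight values (about 4.8 for Q4_K_M, 8.5 for Q8_0) and the example architecture (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, and both helper functions are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB: one key and one value vector per layer per token."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Assumed effective bits per weight for each format.
for label, bits in [("Q4_K_M", 4.8), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"7B {label}: ~{weights_gb(7, bits):.1f} GB of weights")

# Assumed architecture and an 8192-token context with a 16-bit cache.
print(f"KV cache: ~{kv_cache_gib(32, 8, 128, 8192):.1f} GiB")
```

Doubling the context length doubles the KV cache term, which is why long conversations push real usage above the weight size alone.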

The practical takeaway: with 8GB VRAM, you can run 7B models. With 16GB, you can run 13B models. With 24GB, you can run 30B models or very comfortably run 7-13B models with long context. With 48GB+, you can run 70B models.
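The same takeaway can be expressed as a lookup. This sketch uses the upper end of each Q4_K_M range from the table as the requirement, which is a conservative assumption; `largest_model_class` is a hypothetical helper, not part of any tool.

```python
from typing import Optional

# Upper end of the Q4_K_M column from the table: (model class, GB needed).
Q4_REQUIREMENTS = [("1-3B", 3), ("7-8B", 6), ("13-14B", 10), ("30-34B", 22), ("70B", 42)]


def largest_model_class(vram_gb: float) -> Optional[str]:
    """Return the largest model class whose Q4_K_M requirement fits in vram_gb."""
    best = None
    for name, needed_gb in Q4_REQUIREMENTS:
        if needed_gb <= vram_gb:
            best = name
    return best


for vram in (8, 16, 24, 48):
    print(f"{vram} GB -> {largest_model_class(vram)}")
```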

GPU Options

Let's go through the actual GPUs available and what they're good for.

NVIDIA Consumer GPUs (GeForce RTX)

These are the most common and most cost-effective GPUs for local LLM use.

RTX 3060 12GB (~$250-300 used) The entry point. 12GB of VRAM is enough for 7B models at Q4 with room for a decent context window. The 3060 is one of the best value GPUs for AI because NVIDIA gave it 12GB of VRAM while the more expensive 3060 Ti only got 8GB. Performance is modest — expect 15-25 tokens per second on 7B models — but it works.

Note: avoid the 8GB version of the RTX 3060. The 12GB version is specifically what makes this card worthwhile.

RTX 3090 / 3090 Ti 24GB (~$700-900 used) The previous-generation sweet spot. 24GB of VRAM handles 13B models comfortably and can squeeze in 30B models with aggressive quantization. Performance is solid — 30-50 tokens/second on 7B models. These are excellent value on the used market.

RTX 4060 Ti 16GB (~$400-450 new) Newer architecture, 16GB of VRAM. Runs 7B models very well and 13B models at Q4. Faster per-VRAM-GB than the 3060 thanks to architectural improvements. A good new-purchase option if you want current-gen hardware without breaking the bank.

RTX 4090 24GB (~$1,600-2,000 new) The consumer king. 24GB of fast GDDR6X VRAM, massive CUDA core count, excellent tensor core performance. Runs 7B models at 60-80+ tokens/second and 13B models at 30-50 tokens/second. This is the GPU to buy if you want the best single-GPU local AI experience and you can justify the price.

RTX 5090 32GB (~$2,000+ new) The latest generation. 32GB of VRAM is a meaningful step up — it comfortably fits 30B models and handles 7-13B models with very long context windows. Newer architecture improvements deliver better performance per watt. Worth it if you're buying new in 2026, but don't upgrade from a 4090 just for this.

NVIDIA Professional GPUs

Professional GPUs cost more but offer more VRAM and features that matter for production use.

NVIDIA RTX A6000 48GB (~$2,500-4,000 used) 48GB of VRAM in a single card. Runs 70B models at Q4. Designed for workstation use with features like ECC memory and better driver stability. A strong choice for a team inference server.

NVIDIA A100 40GB/80GB (~$3,000-8,000 used) The data center standard. The 80GB version runs 70B models at Q4 with plenty of room for context, and can just about fit them at Q8_0. NVLink support for multi-GPU configurations. Widely available on the used market as companies refresh to H100s. If you can deal with the higher power requirements and potentially needing server-class hardware to mount it, A100s are excellent value.

NVIDIA H100 80GB (~$20,000+ new) The current top of the line. Only makes sense if you're running large-scale inference for many users or need to run the largest models. Overkill for personal or small team use.

AMD GPUs

AMD GPUs can run LLMs through ROCm (AMD's CUDA equivalent). Support has improved significantly but is still behind NVIDIA in compatibility and community support.

RX 7900 XTX 24GB (~$900 new) 24GB of VRAM at a lower price than the RTX 4090. Performance is good when it works. The caveat: ROCm support varies by software. Ollama and llama.cpp support AMD GPUs, but you may encounter more setup friction and occasional compatibility issues. If you're comfortable troubleshooting driver issues, this is a strong value option.

RX 7900 GRE 16GB (~$500 new) 16GB at a budget price. Similar caveats about AMD software support. A reasonable budget option if you're willing to work through potential setup issues.

The AMD reality check: AMD hardware is capable and often cheaper per GB of VRAM. The software ecosystem is less mature. If you want things to "just work" with every tool and framework, NVIDIA is the safer bet. If you're comfortable with some extra configuration and occasional troubleshooting, AMD can save you money.

Apple Silicon: A Different Game

Apple's M-series chips (M1, M2, M3, M4 and their Pro/Max/Ultra variants) take a fundamentally different approach. Instead of separate CPU RAM and GPU VRAM, Apple Silicon uses unified memory that both the CPU and GPU share.

This has a huge implication for LLMs: most of the system RAM can act as VRAM (macOS reserves a share for the system itself). An M4 Max with 128GB of unified memory can load models that would require an enterprise GPU on a PC.

Apple Silicon for LLMs

M1/M2/M3/M4 (base, 8-16GB): Runs 7B models. Slower than a dedicated GPU but functional. The 8GB base model is tight, so you'll be running small models with limited context.

M1/M2/M3/M4 Pro (16-64GB): Runs 7B models comfortably, 13B models with quantization. Good performance, especially on the M3 Pro and M4 Pro with their improved GPU cores.

M1/M2/M3/M4 Max (32-128GB): The serious option. 64GB runs 30B models comfortably. 96-128GB runs 70B models. The Max chips have higher memory bandwidth, which directly impacts inference speed because LLM inference is memory-bandwidth-bound.

M1/M2 Ultra (64-192GB): In Mac Studio or Mac Pro form factor. 192GB of unified memory can run the largest open-weight models. Expensive, but it's a silent, low-power workstation that runs models requiring enterprise GPU hardware.

The Apple Silicon Tradeoff

Apple Silicon is slower per-token than a dedicated NVIDIA GPU. An RTX 4090 will generate tokens roughly 2-3x faster than an M4 Max for the same model size. But Apple Silicon can run larger models than any consumer NVIDIA GPU because of the unified memory advantage.

If you need to run a 70B model and you don't want server-class hardware: a Mac Studio with an M4 Max or Ultra is the most practical option. If you need maximum speed on models that fit in 24GB of VRAM: an RTX 4090 is faster and cheaper.

Memory bandwidth matters more than you think. LLM inference during token generation is almost entirely limited by how fast you can read model weights from memory. Apple Silicon's memory bandwidth varies significantly by chip variant:

  • M4: ~100 GB/s

  • M4 Pro: ~200-270 GB/s

  • M4 Max: ~400-540 GB/s

The Max variant is 4-5x faster than the base chip not because the GPU cores are that much faster, but because the memory bandwidth is that much higher. When buying Apple Silicon for LLMs, prioritize memory amount first, then memory bandwidth (which comes with the Pro/Max/Ultra tiers).
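The bandwidth argument can be turned into a rough ceiling on generation speed. The sketch below assumes a dense model whose weights are all read once per generated token, and it ignores compute time and KV cache reads, so real speeds will be lower. The 4.2 GB model size is an assumed figure for a 7B model at 4-bit quantization, and the bandwidth values are the upper ends of the ranges listed above.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed if every token requires reading all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.2  # assumed size of a 7B model at 4-bit quantization

for chip, bandwidth in [("M4", 100), ("M4 Pro", 270), ("M4 Max", 540)]:
    print(f"{chip}: at most ~{max_tokens_per_second(bandwidth, MODEL_GB):.0f} tokens/second")
```

The ratio between chips is what matters here: the ceiling scales linearly with bandwidth, whatever the model size.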

CPU Inference: When You Don't Have a GPU

You can run LLMs on just a CPU. It's slow, but it works, and for some use cases "slow but private" beats "fast but cloud."

What Matters for CPU Inference

RAM amount: Same principle as VRAM. The model needs to fit in memory. A 7B Q4 model needs about 4-6GB of free RAM.

RAM speed: Faster RAM means faster inference. DDR5 is measurably faster than DDR4 for LLM inference. If you're building a new system for CPU inference, get the fastest RAM your platform supports.

Core count: More cores help with the parallel matrix operations in LLM inference. But the returns diminish past 8-12 cores for inference specifically.

AVX-512 support: This instruction set extension provides wide vector operations that significantly speed up quantized inference. Intel's 11th-gen Core chips and most of its Xeon server and workstation chips support AVX-512, but Intel's 12th-gen and later consumer Core chips do not. AMD Zen 4 and later support it. llama.cpp specifically benefits from it.
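If you're unsure whether a given machine has AVX-512, one way to check on Linux is to look for the `avx512f` (foundation) flag in `/proc/cpuinfo`. The sketch below covers Linux only; `has_avx512` is a hypothetical helper, and other operating systems need a different method.

```python
from pathlib import Path


def has_avx512(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "avx512f" in flags:
                return True
    return False


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # Linux only; other platforms need a different check
    print("AVX-512 foundation support:", has_avx512(cpuinfo.read_text()))
else:
    print("No /proc/cpuinfo here; this sketch only covers Linux.")
```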

Realistic CPU Performance

On a modern CPU (Ryzen 7/9, Intel 12th gen+), expect:

  • 7B Q4 model: 5-15 tokens/second

  • 13B Q4 model: 2-8 tokens/second

  • 30B+ models: Painfully slow, under 2 tokens/second

For context, comfortable reading speed for generated text is about 3-5 tokens/second. Below that, you're visibly waiting for each word. So CPU inference is usable for 7B models and borderline for 13B models.
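One way to sanity-check these figures is to compute the theoretical peak bandwidth of your RAM, since token generation on CPU is limited by how fast weights can be read from memory. This sketch assumes a dual-channel desktop configuration with 64-bit channels and a 7B Q4 model of about 4.2 GB. Both are assumptions, and real throughput lands below the theoretical ceiling.

```python
def peak_ram_bandwidth_gb_s(mega_transfers_per_s: int, channels: int = 2,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth: transfers/s x 8 bytes per 64-bit channel x channels."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


MODEL_GB = 4.2  # assumed size of a 7B model at 4-bit quantization

for name, rate in [("DDR4-3200", 3200), ("DDR5-5600", 5600)]:
    bandwidth = peak_ram_bandwidth_gb_s(rate)
    print(f"{name} dual-channel: {bandwidth:.1f} GB/s peak, "
          f"ceiling of ~{bandwidth / MODEL_GB:.0f} tokens/second on a 7B Q4 model")
```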

System RAM Requirements

Even with a GPU, system RAM matters. Here's what it's used for:

  • Operating system and other applications — 8-16GB baseline

  • Model layers that don't fit on GPU — If your model is partially offloaded, the overflow goes to system RAM

  • KV cache overflow — Long conversations can push KV cache to system RAM

  • Vector database — If you're running RAG, your vector store uses RAM

  • Document processing — Ingesting and embedding documents uses RAM

Minimum: 16GB. This is tight. You'll need to close other applications when running models.

Recommended: 32GB. Comfortable for GPU inference with one model plus normal computer use.

Ideal: 64GB. Run models, vector databases, and everything else without worrying. Overkill for personal use, appropriate for a team server.

Storage

Models are big files. You need fast storage to load them quickly.

NVMe SSD — Required for reasonable model loading times. A 7B model loads from NVMe in 5-10 seconds. From a hard drive, expect 30-60 seconds.

Storage amount — Budget 10GB per small model, 30GB per medium model. If you're experimenting with many models, 1TB of free space is comfortable. If you're running a specific set of models, 500GB is plenty.

SATA SSD — Adequate. Loading times are 2-3x slower than NVMe but still far better than hard drives.

Hard drives — Don't. The loading time makes model switching painful, and if any model weights get paged to disk during inference, performance becomes unusable.
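The storage budget above is easy to check against a real drive. This sketch uses the per-model figures from this section (10GB small, 30GB medium) and Python's standard `shutil.disk_usage`; the model counts are a made-up example, and `storage_needed_gb` is a hypothetical helper.

```python
import shutil

# Per-model budgets from the guidance above, in GB.
BUDGET_GB = {"small": 10, "medium": 30}


def storage_needed_gb(planned: dict[str, int]) -> int:
    """Total GB to budget, given counts of models per size class."""
    return sum(BUDGET_GB[size] * count for size, count in planned.items())


needed = storage_needed_gb({"small": 5, "medium": 3})  # hypothetical model collection
free_gb = shutil.disk_usage(".").free / 1e9            # free space on the current drive
print(f"Need ~{needed} GB, have {free_gb:.0f} GB free:",
      "OK" if free_gb >= needed else "not enough space")
```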

Realistic Build Recommendations

The Starter Build ($300-500)

Goal: Run 7B models, learn how everything works.

  • Used RTX 3060 12GB: ~$250

  • Put it in your existing desktop (assuming 16GB+ RAM and an adequate power supply)

That's it. If you have a desktop PC with a PCIe slot and a power supply with a spare 8-pin connector, adding a 3060 is the cheapest way to start running models at GPU speed. Check that your power supply has enough wattage — the 3060 needs a 550W+ PSU.

The Sweet Spot Build ($1,500-2,500)

Goal: Run 7-13B models comfortably for daily use.

  • RTX 4090 24GB: ~$1,800 (or RTX 3090 used for ~$800)

  • 32GB DDR5 RAM: ~$100

  • 1TB NVMe SSD: ~$80

  • Mid-range CPU (Ryzen 7 or Intel i5): ~$250

  • Motherboard, case, 850W PSU: ~$350

This system runs 7B models at blazing speed and 13B models at comfortable speed. It's also a perfectly good workstation for other tasks. The 4090 is expensive but it's genuinely the best value in current-gen consumer GPUs for AI work — nothing else comes close in VRAM + performance per dollar.
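To see how the two GPU choices change the total, the part prices listed above can simply be added up. The prices are the approximate figures from the list; the dictionary labels and the `build_total` helper are only there for illustration.

```python
# Approximate part prices from the list above, in USD.
SHARED_PARTS = {
    "32GB DDR5 RAM": 100,
    "1TB NVMe SSD": 80,
    "Mid-range CPU": 250,
    "Motherboard, case, 850W PSU": 350,
}
GPU_OPTIONS = {"RTX 4090 24GB (new)": 1800, "RTX 3090 24GB (used)": 800}


def build_total(gpu_price: int) -> int:
    """Sum the GPU price and the shared parts."""
    return gpu_price + sum(SHARED_PARTS.values())


for gpu, price in GPU_OPTIONS.items():
    print(f"{gpu}: ~${build_total(price):,}")
```

With these figures the build comes to roughly $2,580 with the 4090 and $1,580 with a used 3090, so the GPU choice accounts for the entire spread.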

The Apple Alternative ($2,500-4,000)

Goal: Silent, portable, run larger models.

  • MacBook Pro M4 Pro 48GB: ~$2,900

  • Or Mac Mini M4 Pro 48GB: ~$2,100

48GB of unified memory runs 13B models at higher-precision quantization (such as Q8_0) and 30B models at Q4. The experience is smooth, the machine is silent, and you get a great general-purpose computer. Performance is slower than an RTX 4090 for models that fit in 24GB, but the ability to run larger models and the form factor make it worthwhile for many people.

The Team Server ($4,000-8,000)

Goal: Serve 10-20 team members.

  • 2x RTX 3090 used (~$800 each) or 1x RTX 4090 + 1x RTX 3090

  • 64GB DDR5 ECC RAM: ~$250

  • 2TB NVMe: ~$150

  • Server/workstation CPU (Ryzen 9 or Xeon): ~$400

  • Server case with adequate cooling, 1200W PSU: ~$500

Multiple GPUs let you serve more concurrent users or run larger models with tensor parallelism. This is the starting point for a team inference server.
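As a rough check on whether a model fits across several cards, you can assume the weights are split evenly, which makes the smallest card the limit. That even split is a simplifying assumption about how tensor parallelism distributes weights, and the sketch ignores per-GPU overhead such as the KV cache and activation buffers. `fits_across_gpus` is a hypothetical helper.

```python
def fits_across_gpus(model_gb: float, gpu_vram_gb: list[float]) -> bool:
    """Rough check for an even split across GPUs: each card must hold an equal share."""
    if not gpu_vram_gb:
        return False
    share = model_gb / len(gpu_vram_gb)
    return share <= min(gpu_vram_gb)


# 70B at Q4_K_M needs roughly 35-42 GB according to the table earlier in this guide.
print(fits_across_gpus(42, [24, 24]))   # two 24GB cards
print(fits_across_gpus(42, [24]))       # a single 24GB card
```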

Common Mistakes

Buying too little VRAM. An 8GB GPU sounds reasonable until you realize it limits you to 7B models with short context. If you're spending money on a GPU for AI, buy the most VRAM you can afford.

Ignoring power requirements. High-end GPUs need serious power supplies. An RTX 4090 draws 450W under load. Check your PSU before you buy the card.

Forgetting about cooling. GPUs under sustained AI inference load generate significant heat. Consumer cases with bad airflow will throttle your GPU and reduce performance. Make sure your case has adequate airflow or your server room has appropriate cooling.

Buying based on benchmarks for the wrong task. Gaming benchmarks are irrelevant for LLM inference. A GPU that's great for gaming might have too little VRAM for AI. VRAM amount and memory bandwidth are what matter for LLMs, not shader performance.

Waiting for the next generation. There's always a better GPU coming. If you need AI capability now, buy now. The model ecosystem moves fast enough that today's hardware will be running better models in six months — your hardware won't be obsolete, the software will have gotten better at using it.

The Bottom Line

The hardware you need depends on what you're trying to do:

  • Just experimenting: Your existing computer, CPU inference, free

  • Personal use with decent speed: One GPU with 12-24GB VRAM, $300-2,000

  • Daily driver for serious work: RTX 4090 or Apple Silicon Max, $1,800-4,000

  • Team server: Multiple GPUs or professional GPU, $4,000-15,000

Start where you are. If you have a computer, you can run a small model on CPU today for free. If that's useful enough to justify spending money, buy the most VRAM you can afford. That's the whole strategy.
