The VRAM Sweet Spot That Fixes Slow Local AI Models
Here's what nobody tells you when you first set up Ollama: choosing the wrong model size relative to your GPU's VRAM is the difference between a snappy 60 tokens-per-second assistant and a glacial 2 tokens-per-second crawl that makes you want to close the terminal and go back to ChatGPT. The model didn't fail. Your hardware didn't fail. The math was wrong. And the math is fixable — if you understand three numbers: your VRAM, the model's parameter count, and the quantization level. This guide breaks all three down, and there's a free calculator that does the work for you in seconds.
The difference between 60 T/s and 2 T/s is almost entirely determined by whether your model fits in GPU VRAM — not by how powerful your CPU is or how much system RAM you have.
Running local LLMs in 2026 is more accessible than ever. Ollama, LM Studio, and similar tools have made the setup trivial. The problem isn't the software anymore.
The problem is that most tutorials tell you to download a model and run it — without explaining that if that model is too large for your GPU's VRAM, you've just signed up for the worst user experience in AI: watching a progress bar move two characters at a time.
Why VRAM Is the Only Number That Actually Matters
Your GPU's VRAM (Video RAM) is not the same as your system RAM. They operate at completely different speeds — and for AI inference, that speed gap is everything.
Memory Bandwidth: Why VRAM vs. System RAM Changes Everything
Every token your LLM generates requires reading model weights from memory. When those weights live in VRAM at 600 GB/s, generation is instant. When they've spilled to system RAM at 60 GB/s, it's 10x slower. When running CPU-only, it's 50–100x slower than VRAM.
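If you want a back-of-the-envelope sanity check on those numbers: generation is memory-bound, so the ceiling on tokens/second is roughly bandwidth divided by the bytes read per token (about the size of the model weights). Below is a minimal sketch under that assumption, ignoring compute and KV-cache overhead; the bandwidth and model-size figures are the approximate ones used in this article, not measurements.

```python
# Back-of-the-envelope ceiling on generation speed for a memory-bound LLM.
# Assumption: each generated token reads every weight once, so
# tokens/sec ≈ memory bandwidth / model size. Real throughput is lower
# (compute, KV-cache reads, framework overhead), but the ratio holds.

def estimate_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens/second for weights resident in one memory pool."""
    return bandwidth_gb_s / model_size_gb

llama3_8b_q4 = 4.9  # GB, approximate Q4_K_M weight size

print(f"VRAM at 600 GB/s: ~{estimate_tokens_per_second(llama3_8b_q4, 600):.0f} tok/s")
print(f"DDR5 at  60 GB/s: ~{estimate_tokens_per_second(llama3_8b_q4, 60):.0f} tok/s")
```

Same model, same GPU, same prompt: the only thing that changed is which memory pool the weights are read from, and the estimated token rate drops by an order of magnitude.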
Quantization — How to Compress a 70B Model Until It Fits in Your GPU
A Llama-3 70B model at full precision (FP16) weighs approximately 140GB. That's not fitting in any consumer GPU. But the same model at Q4_K_M quantization weighs about 43GB — and at Q2_K, around 26GB.
Quantization compresses the model's numerical weights from 16-bit floating-point numbers to lower-bit representations. The tradeoff: smaller file size and VRAM requirement, with some quality degradation depending on how aggressively you compress.
| Format | Bits/Weight | Size (7B Model) | Quality vs FP16 | Verdict |
|---|---|---|---|---|
| FP16 | 16 bit | ~14 GB | 100% — Reference | Requires premium VRAM — overkill for most |
| Q8_0 | 8 bit | ~7.7 GB | ~99.5% | Near-perfect quality, high VRAM |
| Q6_K | 6 bit | ~5.9 GB | ~99% | Excellent quality/VRAM balance |
| Q4_K_M ⭐ | 4 bit | ~4.8 GB | ~98–99% — SWEET SPOT | Best intelligence-to-VRAM ratio |
| Q4_K_S | 4 bit | ~4.3 GB | ~97% | Slightly lower quality than K_M |
| Q3_K_M | 3 bit | ~3.3 GB | ~93% | Emergency VRAM saving — noticeable drop |
| Q2_K | 2 bit | ~2.9 GB | ~85% | Model becomes noticeably "dumber" |
Sizes based on 7B parameter model. Larger models scale proportionally. Quality percentages are approximate community benchmarks — exact figures vary by task type and model architecture.
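If your model isn't in the table, the arithmetic is easy to approximate yourself: parameters times bits-per-weight, divided by 8. Here's a minimal sketch; the effective bits-per-weight values are rough assumptions on my part rather than official figures, and real GGUF files run slightly larger because embeddings and some sensitive layers are kept at higher precision.

```python
# Approximate quantized model size: parameters × effective bits-per-weight / 8.
# The bits-per-weight values below are rough effective averages (assumptions,
# not official figures); real GGUF files run slightly larger because embeddings
# and some sensitive layers stay at higher precision.

def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 3.0)]:
    print(f"70B @ {label:6s}: ~{approx_model_size_gb(70, bits):.0f} GB")
```

Run for a 70B model, this reproduces the figures quoted earlier: roughly 140GB at FP16, ~42GB at Q4_K_M, and ~26GB at Q2_K.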
The Three Hardware Tiers — What Each GPU VRAM Tier Can Actually Run Well
⚠️ The Context Window Multiplier Nobody Warns You About
VRAM requirements aren't just about the model weights. A larger context window (the amount of conversation history the model processes) adds significant VRAM overhead. A model comfortable at 4k context (standard chat) may exceed VRAM at 32k context (codebase RAG). The VRAM Calculator factors in your target context window as a separate variable — because most VRAM guides only show base model weights and forget this entirely.
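That overhead is the KV cache, and it grows linearly with context length. A minimal sketch of the estimate is below, assuming Llama-3 8B's architecture (32 layers, 8 KV heads, 128-dim heads) and an unquantized FP16 cache; the calculator's own formula may differ.

```python
# KV-cache VRAM grows linearly with context length, on top of the model weights.
# Per-token cache = 2 (K and V) × layers × KV heads × head dim × bytes per element.
# Defaults below assume Llama-3 8B's architecture and an FP16 cache.

def kv_cache_gb(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1e9

for ctx in (4_096, 16_384, 32_768):
    print(f"{ctx:>6}-token context: +{kv_cache_gb(ctx):.1f} GB of KV cache")
```

Under these assumptions that's about +0.5GB at 4k context but over +4GB at 32k, which is exactly how a model that fits comfortably at 4k ends up spilling out of VRAM for codebase RAG.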
How the Local LLM VRAM Calculator Works — Three Inputs, One Perfect Answer
The "Sweet Spot" Finder — What It Calculates
The Local LLM VRAM Calculator at solidaitech.com takes three inputs you can check in under 30 seconds:
- GPU VRAM (Video Memory): The most critical factor. Check Task Manager → Performance → GPU, or your GPU specs page. This is the hard ceiling — models above this number will spill into system RAM.
- System RAM (DDR4/DDR5): Used to determine how much layer offloading is possible if the model partially exceeds VRAM. More system RAM helps when you're in the partial-offload zone but doesn't replace VRAM for speed.
- Target Context Window: Standard chat (4k tokens), extended conversation (16k), or codebase RAG (32k). Larger context windows consume more VRAM even with the same model weights.
The output gives you a specific model recommendation — not a range, a specific model name and quantization — along with estimated VRAM usage in GB, tokens/second estimate, the recommended file format (GGUF), and offload status (100% GPU, partial, or CPU bottleneck warning).
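To make those inputs and outputs concrete, here is a heavily simplified sketch of that kind of decision logic. It is not the calculator's actual code: the catalog entries, KV-cache rates, and 1GB headroom figure are illustrative assumptions. The core check is the same, though: weights plus KV cache must fit inside VRAM.

```python
# Hypothetical, heavily simplified "sweet spot" lookup. The real calculator at
# solidaitech.com uses its own catalog and formulas; this only illustrates the
# shape of the decision: weights + KV cache (+ headroom) must fit inside VRAM.

CATALOG = [  # (name, quant, weights in GB, KV cache in GB per 1k tokens) - illustrative
    ("Llama-3 70B",  "Q4_K_M", 43.0, 0.33),
    ("Mixtral 8x7B", "Q4_K_M", 26.0, 0.13),
    ("Llama-3 8B",   "Q6_K",    6.1, 0.13),
    ("Llama-3 8B",   "Q4_K_M",  4.9, 0.13),
    ("Llama-3.2 3B", "Q4_K_M",  2.0, 0.11),
]

def sweet_spot(vram_gb: float, context_tokens: int, headroom_gb: float = 1.0):
    """Return the largest catalog entry that fits entirely in VRAM, or None."""
    for name, quant, weights, kv_per_1k in CATALOG:  # ordered largest to smallest
        needed = weights + kv_per_1k * context_tokens / 1000
        if needed + headroom_gb <= vram_gb:
            return name, quant, round(needed, 1)
    return None  # nothing fits fully: expect partial offload and slow generation

print(sweet_spot(vram_gb=8, context_tokens=8192))  # ('Llama-3 8B', 'Q4_K_M', 6.0)
```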
🖥️ Need More VRAM? Browse NVIDIA RTX 50-Series GPUs
The RTX 5070 (12GB, ~$549) and RTX 5080 (16GB, ~$999) are the current sweet-spot cards for local AI in 2026. Check current pricing before buying.
Browse RTX 50-Series GPUs on Amazon →
GPU prices change frequently. Verify availability and pricing before purchasing.
What Most Local LLM Guides Get Wrong
💡 System RAM Amount Is Almost Irrelevant — VRAM Is What Counts
Every guide that says "make sure you have 32GB of RAM to run a 13B model" is conflating two very different things. System RAM determines how many layers can be offloaded when a model exceeds your VRAM, but those offloaded layers still run roughly 10x slower than they would in VRAM. Having 64GB of system RAM won't make an oversized model run at acceptable speed; it just means the model runs at all instead of crashing. The only fix for slow inference is less model or more VRAM. The sketch below shows why.
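This is a rough model of the arithmetic behind that claim, assuming generation time is dominated by reading the weights and using the same bandwidth figures as earlier (VRAM at 600 GB/s, system RAM at 60 GB/s). It ignores compute time, PCIe transfers, and the KV cache.

```python
# Why partial offload is slow: every token still has to read the CPU-resident
# layers over system-RAM bandwidth, and that term dominates time per token.
# Simplified model; ignores compute time, PCIe transfers, and the KV cache.

def tokens_per_second(model_gb: float, frac_in_vram: float,
                      vram_bw_gb_s: float = 600.0, ram_bw_gb_s: float = 60.0) -> float:
    gpu_seconds = (model_gb * frac_in_vram) / vram_bw_gb_s        # GPU-resident layers
    cpu_seconds = (model_gb * (1 - frac_in_vram)) / ram_bw_gb_s   # offloaded layers
    return 1 / (gpu_seconds + cpu_seconds)

model_gb = 9.0  # e.g. a 13B-class Q4 model squeezed onto an 8GB card
for frac in (1.0, 0.8, 0.5):
    print(f"{frac:.0%} of weights in VRAM: ~{tokens_per_second(model_gb, frac):.0f} tok/s")
```

Even with only 20% of the weights spilled to system RAM, throughput drops by roughly two-thirds. Extra RAM capacity doesn't change that, because the bottleneck is the RAM's bandwidth, not its size.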
💡 Q4_K_M vs. Q4_K_S — The Difference That Actually Matters
The GGUF format has several 4-bit variants and most tutorials just say "download Q4." Q4_K_M (Medium) uses a k-quants algorithm that applies higher precision to the most sensitive weight layers — the ones that affect reasoning and instruction-following most. Q4_K_S (Small) compresses more aggressively, saving an extra 300–500MB but with a slightly larger quality drop. For a 7B model, Q4_K_M is the better choice if your VRAM allows it. Only drop to Q4_K_S if you need that last half-gigabyte to fit the model in VRAM.
💡 Apple Silicon Macs Get a Free Performance Bonus
NVIDIA GPUs have separate VRAM that's physically isolated from system RAM. Apple Silicon (M2, M3, M4, M5) uses unified memory: the same physical chips serve both CPU and GPU. This means an M4 MacBook Pro with 32GB of unified memory effectively has all 32GB available for LLM inference, at roughly 120–270 GB/s of bandwidth depending on the chip tier (Max-tier chips go higher still). That's slower than a high-end NVIDIA card's VRAM but much faster than typical dual-channel system RAM on a PC. An M5 MacBook with 64GB of unified memory can comfortably run Llama-3 70B at Q4_K_M (about 43GB), something no single consumer NVIDIA card can hold in VRAM; on a PC you'd need a 48GB workstation-class GPU or a multi-GPU setup.
💡 Flash Attention Saves ~15–25% of Context VRAM at Longer Contexts
Ollama 0.5+ and LM Studio 0.3+ both support Flash Attention — a memory-efficient attention mechanism that reduces the context window's VRAM overhead by approximately 15–25% at long context lengths (16k+ tokens). If you're trying to squeeze a model into VRAM for a codebase RAG workflow, make sure Flash Attention is enabled in your inference tool's settings before concluding the model won't fit. The VRAM Calculator accounts for this in its context-adjusted estimates.
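In Ollama, the documented switch is the OLLAMA_FLASH_ATTENTION environment variable, set before the server starts; LM Studio exposes an equivalent toggle in its settings UI. Below is a minimal sketch of launching the server that way from Python, assuming Ollama is already installed and on your PATH.

```python
# Launch the Ollama server with Flash Attention enabled via the documented
# OLLAMA_FLASH_ATTENTION environment variable. Equivalent to running
# `OLLAMA_FLASH_ATTENTION=1 ollama serve` in a shell; stop it with Ctrl+C.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "1"

subprocess.run(["ollama", "serve"], env=env)
```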
Local LLM "Sweet Spot" Finder
Don't crash your PC. Enter your GPU VRAM, system RAM, and target context window — get the exact model and quantization that maximizes intelligence for your hardware, instantly.
Find My LLM Sweet Spot →
Free · Supports NVIDIA RTX 50/40/30-series · AMD RX 7000/9000 · Apple M-series
Frequently Asked Questions
How much VRAM do I need to run a local LLM?
The minimum practical VRAM for a useful local LLM at acceptable speed is 8GB. With 8GB, you can run Llama-3 8B at Q4_K_M — a capable model for chat, writing, and coding at approximately 40–60 tokens/second. With 12GB, Llama-3 8B at Q8_0 (near-full quality) or Mistral 12B at Q4_K_M. With 16–24GB, Mixtral 8×7B provides GPT-3.5-class intelligence locally (its Q4_K_M quant is roughly 26GB, so at 24GB pick a slightly smaller quant to stay fully in VRAM). With 48GB+, Llama-3 70B delivers GPT-4-class intelligence fully offline. The VRAM Calculator gives you a specific recommendation for your exact setup.
What is Q4_K_M quantization and why is it the sweet spot?
Q4_K_M is a 4-bit quantization method in the GGUF format that compresses model weights from 16 bits to ~4 bits per parameter — reducing file size and VRAM requirements by 70–75% while causing roughly 1–2% degradation in output quality on most tasks. It's the sweet spot because Q8_0 needs nearly twice the VRAM of Q4_K_M for a quality gain most users can't detect, while Q2_K causes noticeable quality loss. Q4_K_M gives you the best intelligence-to-VRAM ratio of any quantization level. It uses the k-quants algorithm, which applies higher precision selectively to the most sensitive weight layers.
Why is my local LLM running at only 2–5 tokens per second?
Your model has exceeded your GPU's VRAM capacity and is offloading layers to system RAM. GPU VRAM operates at ~300–900 GB/s bandwidth. System RAM (DDR4/DDR5) operates at ~50–100 GB/s. When AI inference reads model weights from system RAM instead of VRAM, it's 10–50x slower. The fix: run a smaller model or a more aggressively quantized version that fits entirely in your GPU's VRAM. A model running 100% in VRAM generates at 30–100+ tokens/second. Check Ollama's output — it shows how many layers are loaded to GPU vs. CPU.
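If you'd rather script that check than eyeball the logs, `ollama ps` lists each loaded model with a PROCESSOR column such as "100% GPU" or a CPU/GPU split. Here is a minimal sketch that shells out to it, assuming the Ollama CLI is installed.

```python
# Print Ollama's placement report for currently loaded models. The PROCESSOR
# column reads "100% GPU" when everything fits in VRAM; a CPU/GPU split means
# layers have spilled to system RAM and generation will be slow.
import subprocess

result = subprocess.run(["ollama", "ps"], capture_output=True, text=True, check=True)
print(result.stdout)
```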
What local LLM should I run on an 8GB GPU?
For an 8GB GPU (RTX 5060, RTX 4060, RTX 3070): the sweet spot is Llama-3 8B at Q4_K_M, using approximately 4.9GB of VRAM with headroom for context, running at ~40–60 tokens/second. For near-full quality, Llama-3 8B at Q6_K uses ~6.1GB and still fits comfortably. Do not run 13B+ models on 8GB VRAM — they will spill into system RAM and crawl regardless of how much system RAM you have. The VRAM Calculator recommends the specific optimal model for your exact VRAM + context combination.
What's the difference between VRAM and system RAM for running LLMs?
VRAM (Video RAM) is your GPU's dedicated memory, operating at 300–900 GB/s bandwidth. System RAM (DDR4/DDR5) is your computer's main memory, operating at 50–100 GB/s. For local LLM inference, VRAM is 5–15x faster per operation. A model 100% in VRAM generates at 30–100+ tokens/second; partially in system RAM generates at 5–20 tokens/second; CPU-only generates at 1–5 tokens/second. Apple Silicon Macs are the exception — their unified memory runs at ~200 GB/s and serves both CPU and GPU, making 32–64GB M-series Macs surprisingly competitive for local LLM inference.
Stop Guessing — Match Your Model to Your VRAM
The single most common mistake in local AI setup in 2026 is downloading a model that sounds impressive and running it on hardware that can't keep it in VRAM. The result is an experience that makes you question whether local AI is worth the effort at all.
It is. You just need the right model for your hardware. A Llama-3 8B at Q4_K_M on an 8GB GPU gives you a genuinely impressive local AI at 60 tokens per second. That's faster than most people read. It's a better experience than many cloud services at their free tier. And it's running completely offline, privately, for the cost of the electricity to spin the GPU.
Know your VRAM. Pick the right quantization. Enjoy the sweet spot.