
Stable Diffusion & Flux Speed Calculator: IT/s Benchmark 2026

The Math Behind Your GPU Crash: A Free AI Speed Calculator

Here's the situation nobody prepares you for: you set up a Flux.1 Dev workflow in ComfyUI, queue a batch of 10 images, walk away to get coffee, come back to find a red OOM crash screen. Or worse — the job started but it's been running for 20 minutes on what should take 30 seconds. Both have the same root cause: you didn't know how much VRAM and compute your specific model-GPU-batch combination would actually require. IT/s benchmarks aren't academic. They're the number that tells you exactly what's possible — before you waste the time finding out the hard way.

[Image: Stable Diffusion speed calculator — IT/s benchmarks for Flux.1, SDXL, and SD3 on RTX 5090/5080/5070 GPUs, 2026]

IT/s (Iterations Per Second) is the most honest single benchmark for AI image generation performance — and it varies dramatically between model architectures even on identical hardware.

I used to benchmark GPUs the lazy way: look at raw VRAM numbers and assume more is better. Then I started seriously running Flux.1 workflows and realized that the relationship between VRAM, model architecture, and generation speed is a three-way equation — and getting any variable wrong means either an OOM crash or generation speeds so slow they defeat the purpose of running locally.

The IT/s metric cuts through all of that. Once you understand what it measures and how it changes with different models, you can predict your generation time for any batch before you queue it.

~95 it/s: RTX 5090 peak speed for SD 1.5 FP16, the benchmark ceiling for current consumer hardware
~0.20×: Flux.1 Dev architecture penalty, meaning a 95 it/s card generates Flux.1 at only ~19 it/s
49 GB: VRAM required for a Flux.1 Dev batch of 10, enough to trigger OOM on every current consumer GPU, including the RTX 5090

The IT/s Math — How Generation Speed Is Actually Calculated

IT/s (Iterations Per Second) measures how many denoising steps your GPU processes every second. The formula for total generation time is straightforward:

The Three-Variable Formula

Total Time = (Sampling Steps ÷ Real IT/s) × Batch Size + (0.3s VAE overhead per image)

The key is that "Real IT/s" isn't your GPU's raw speed — it's that speed multiplied by a model architecture penalty. A card that runs SD 1.5 at 95 it/s doesn't run Flux.1 Dev at 95 it/s. Not even close.

The Stable Diffusion Speed Calculator applies specific penalty multipliers based on model architecture to give you the actual expected IT/s — not the theoretical ceiling that never reflects real workflows.
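The formula is easy to sketch in code. This is an illustrative Python version, not the calculator's actual source; the `real_its` and `total_time` names are assumptions, and the penalty values match the multipliers discussed in this article.

```python
# Illustrative sketch of the three-variable formula (not the tool's source).
# Penalty values match the article's architecture multipliers.
ARCH_PENALTY = {"sd15": 1.00, "sdxl": 0.45, "flux_dev": 0.20}
VAE_OVERHEAD_S = 0.3  # approximate per-image VAE decode time

def real_its(peak_sd15_its: float, arch: str) -> float:
    """Effective it/s: raw SD 1.5 speed scaled by the architecture penalty."""
    return peak_sd15_its * ARCH_PENALTY[arch]

def total_time(peak_sd15_its: float, arch: str, steps: int, batch: int) -> float:
    """Total seconds for a batch, including per-image VAE decode."""
    its = real_its(peak_sd15_its, arch)
    return (steps / its) * batch + VAE_OVERHEAD_S * batch

# RTX 5090 (~95 it/s SD 1.5 baseline), single 25-step Flux.1 Dev image:
print(round(total_time(95, "flux_dev", 25, 1), 1))  # ~1.6 s
```

Run it with your own baseline it/s and you get the same numbers the benchmark table below reports.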


The Model Architecture Penalty — Why Flux.1 Is So Much Slower Than SDXL

This is the insight that changes how you plan your workflows. Every model architecture demands a different fraction of your GPU's raw compute power per denoising step. The Stable Diffusion Speed Calculator uses three penalty multipliers:

Legacy UNet (SD 1.5) | 1.00× | Reference baseline. Fastest architecture per step; lightest VRAM per image.
Transformer UNet (SDXL 1.0) | ~0.45× | The high-quality standard. ~2× heavier than SD 1.5; ~6–10GB VRAM per image.
Diffusion Transformer (Flux.1 Dev/Pro) | ~0.20× | Extremely heavy. ~5× slower than SD 1.5; requires 12–16GB VRAM per image.
"A GPU that runs SD 1.5 at 95 it/s will run SDXL at ~43 it/s and Flux.1 Dev at only ~19 it/s. Same card, same VRAM — completely different performance due to architecture complexity." — Stable Diffusion Speed Calculator methodology, solidaitech.com

Stable Diffusion 3.5 Large sits between SDXL and Flux.1 Dev in penalty — heavier than SDXL, but doesn't hit the full Flux.1 weight. SD3.5 Medium is closer to SDXL in compute demands.

Want to know your exact IT/s for your GPU and model before you queue anything? The Stable Diffusion Speed Calculator takes your GPU, model architecture, step count, and batch size — and returns exact generation time predictions plus OOM warnings before you waste compute.

Real-World IT/s Benchmarks — 2026 GPU Hardware vs. Model Architectures

GPU | VRAM | SD 1.5 | SDXL | Flux.1 Dev | 1 Flux image (25 steps)
NVIDIA RTX 5090 | 32GB GDDR7 | ~95 it/s | ~43 it/s | ~19 it/s | ~1.6 sec
NVIDIA RTX 5080 | 16GB GDDR7 | ~72 it/s | ~32 it/s | ~14 it/s | ~2.1 sec
NVIDIA RTX 5070 | 12GB GDDR7 | ~52 it/s | ~23 it/s | ~10 it/s | ~2.8 sec
NVIDIA RTX 5060 Ti | 16GB GDDR7 | ~38 it/s | ~17 it/s | ~7.6 it/s | ~3.6 sec
Apple M5 Max | 48GB+ Unified | ~28 it/s | ~12 it/s | ~5.6 it/s | ~4.8 sec
Apple M5 (base) | 16GB Unified | ~14 it/s | ~6 it/s | ~2.8 it/s | ~9.2 sec
CPU-only inference | System RAM | ~1–2 it/s | ~0.5 it/s | Not practical | Minutes

IT/s figures are calculated estimates based on the Speed Calculator's heuristic model using raw hardware benchmarks and architecture penalty multipliers. Actual results vary by system configuration, precision format (fp16/bf16), xformers, and ComfyUI/A1111 version. Apple figures use Metal backend.
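The SDXL and Flux.1 columns can be derived from the SD 1.5 baseline alone. A quick sketch (`BASELINES` is just the table's SD 1.5 column transcribed; small differences from the table come from rounding it/s before computing per-image time):

```python
# Derive the SDXL and Flux.1 Dev columns from the SD 1.5 baseline using
# the architecture penalties (~0.45x SDXL, ~0.20x Flux).
BASELINES = {  # GPU -> approximate SD 1.5 it/s, from the table above
    "RTX 5090": 95, "RTX 5080": 72, "RTX 5070": 52,
    "RTX 5060 Ti": 38, "Apple M5 Max": 28, "Apple M5": 14,
}

for gpu, sd15 in BASELINES.items():
    sdxl, flux = sd15 * 0.45, sd15 * 0.20
    flux_image = 25 / flux + 0.3  # 25-step Flux image + VAE decode
    print(f"{gpu:12} SDXL ~{sdxl:.0f} it/s | Flux ~{flux:.1f} it/s | ~{flux_image:.1f} s/image")
```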


VRAM Is the Gatekeeper of Batch Size — Here's the Math

Generating a single image has a roughly fixed VRAM cost. Every image you add to a batch stacks its own activation memory on top of the model weights, so the requirement climbs steeply with batch size. This is where OOM errors strike — not when running one image, but when you increase batch size without doing the math first.

Flux.1 Dev VRAM Requirements by Batch Size

Batch of 1: ~12 GB
Batch of 2: ~21 GB
Batch of 4: ~38 GB
Batch of 10: ~49 GB 💀

Even the RTX 5090 (32GB) OOMs on a batch of 10 Flux.1 Dev images at 1024×1024. The Speed Calculator shows the OOM Warning before you generate — so you see the red flag without wasting the time.
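A minimal sketch of that check, using the batch-to-VRAM figures above as a lookup table. The figures are the calculator's heuristic estimates, not measurements, and `oom_warning` is an illustrative helper, not the tool's API.

```python
# Rough OOM check against the Flux.1 Dev batch -> VRAM figures above.
FLUX_DEV_VRAM_GB = {1: 12, 2: 21, 4: 38, 10: 49}

def oom_warning(batch: int, gpu_vram_gb: float) -> bool:
    """True when the estimated requirement exceeds the card's VRAM."""
    return FLUX_DEV_VRAM_GB[batch] > gpu_vram_gb

for batch, need in FLUX_DEV_VRAM_GB.items():
    status = "OOM!" if oom_warning(batch, 32) else "fits"
    print(f"batch {batch:2}: ~{need} GB -> {status} on a 32GB RTX 5090")
```

Even batch-4 trips the warning on a 32GB card under these estimates, which is exactly why checking before queueing matters.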

⚠️ OOM Warning — What It Means and What to Do

When the Speed Calculator shows "OOM Warning — Model requires more VRAM than your GPU has," your options are: (1) Reduce batch size — the fastest fix, scale back until VRAM requirement drops below your card's limit. (2) Use Flux.1 Schnell instead of Dev — Schnell is designed for 4-step generation with significantly lower per-image VRAM overhead. (3) Enable tiled VAE in ComfyUI — reduces VAE decode VRAM at the cost of slightly longer decode time. (4) Use fp8 or quantized Flux weights — reduces the model's memory footprint by ~30–40%.

🖥️ Need More VRAM for Flux.1 Batch Workflows?

The RTX 5080 (16GB, ~$999) is the 2026 high-end standard — ~72 it/s at the SD 1.5 baseline, enough for fluid real-time Flux.1 Schnell workflows. Check current pricing on Amazon.

Browse RTX 5080 / 5090 on Amazon →

GPU prices change frequently — verify availability before purchasing.


The 2026 Hardware Debate — NVIDIA RTX 5090 vs. Apple M5 Max

If you're building an AI image generation workstation in 2026, the choice largely comes down to two philosophies: raw CUDA throughput (NVIDIA) or massive memory buffer (Apple).

NVIDIA RTX 5090 (32GB GDDR7)

The RTX 5090 is the undisputed champion of raw IT/s for AI image generation. The massive leap in GDDR7 memory bandwidth allows it to tear through heavy Flux.1 and SD3 workloads at speeds no previous consumer card could approach. For high-volume batch generation — especially 8K upscaled outputs or large concurrent multi-model workflows — nothing in the consumer market touches NVIDIA's CUDA architecture in 2026.

The limitation is VRAM ceiling. At 32GB, the 5090 still OOMs on very large Flux.1 batches. It's not a constraint for typical workflows, but it exists.

Apple M5 Max / Ultra

Apple took a completely different route. The M5 Ultra's unified memory architecture allows up to 192GB of RAM shared between CPU and GPU. This means Mac users can load massive 100B+ parameter LLMs and large image models into memory simultaneously — a feat that would require multiple RTX 5090s on a PC build.

The trade-off is raw IT/s. An M5 Max runs Flux.1 Dev at approximately 5–6 IT/s — competitive with a mid-range NVIDIA card. For prosumers who want near-silent operation, massive memory buffers, and the ability to run LLM + image generation workflows simultaneously without swapping, the M5 Ultra is genuinely compelling despite the slower raw throughput.


What Most AI Art Benchmark Guides Skip

💡 Flux.1 Schnell Needs Only 4 Steps — Completely Changes the Math

Flux.1 Dev is designed for 20–25 steps. Flux.1 Schnell was specifically distilled to generate high-quality images in just 4 steps. At the same 19 it/s on an RTX 5090, that's a 4-step Schnell image in approximately 0.5 seconds versus a 25-step Dev image in 1.6 seconds. For batch workflows where you need volume over maximum quality, Flux.1 Schnell on a mid-range GPU can outpace Flux.1 Dev on higher-end hardware in images-per-hour throughput.
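The throughput math is quick to verify. A short sketch (`images_per_hour` is an illustrative helper, not part of any tool):

```python
# Images-per-hour: 4-step Flux.1 Schnell vs 25-step Flux.1 Dev at the
# same ~19 it/s (RTX 5090), including the ~0.3 s VAE decode per image.
def images_per_hour(steps: int, its: float, vae_s: float = 0.3) -> float:
    seconds_per_image = steps / its + vae_s
    return 3600 / seconds_per_image

schnell = images_per_hour(4, 19)   # ~0.51 s/image
dev = images_per_hour(25, 19)      # ~1.62 s/image
print(f"Schnell ~{schnell:.0f} img/h vs Dev ~{dev:.0f} img/h")
```

At identical it/s, the step-count difference alone gives Schnell roughly a 3× edge in hourly output.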

💡 Batching Doesn't Make Each Image Faster — It Multiplies Throughput and VRAM Together

Counterintuitive but true: batching images doesn't reduce per-image compute cost — it multiplies VRAM usage while generating all images in parallel. On a GPU where VRAM allows batch-4 without OOM, the 4 images finish in approximately the same time as 1 image — meaning batch-4 is effectively 4x faster in total throughput. But if batch-4 causes partial VRAM spillover, you lose more than you gain. The Speed Calculator's batch-time estimate accounts for this so you can find the optimal batch size for your specific GPU and model combination.

💡 xformers vs. Flash Attention — The 15% Speed Difference Most People Leave on the Table

Both xformers (for NVIDIA, via A1111 and older ComfyUI) and Flash Attention 2 (ComfyUI native, newer NVIDIA and Apple Metal) are memory-efficient attention implementations that reduce VRAM usage by 15–25% and can improve IT/s by 10–20% simultaneously. If you're running ComfyUI without Flash Attention enabled, you're leaving real speed on the table. Check your ComfyUI startup logs — if you see "No xformers or Flash Attention available," that's costing you meaningful performance.

💡 The VAE Decode Step Is Invisible in Most Benchmarks — But It Adds 0.3 Seconds Per Image

Most IT/s benchmarks measure the denoising loop only. They don't include VAE decode time — the step where the latent representation is converted to a viewable pixel image. At standard resolutions, this adds approximately 0.3 seconds per image. For a batch of 10, that's 3 extra seconds that never appears in IT/s comparisons but always shows up in actual workflow time. The Speed Calculator includes this overhead in its total time estimate because real workflows include it.
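A quick illustration of how that overhead shifts wall-clock time for a 10-image SDXL batch at the table's ~43 it/s (illustrative constants, not the calculator's source):

```python
# How the ~0.3 s per-image VAE decode shifts wall-clock time versus a
# denoise-only estimate. Numbers: 30-step SDXL batch of 10 at ~43 it/s.
STEPS, ITS, BATCH, VAE_S = 30, 43, 10, 0.3

denoise_only = (STEPS / ITS) * BATCH       # what raw it/s benchmarks report
wall_clock = denoise_only + VAE_S * BATCH  # what real workflows experience

print(f"denoise-only: {denoise_only:.1f} s, with VAE decode: {wall_clock:.1f} s")
# the 3.0 s gap never appears in it/s comparisons
```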

🧮 Free Benchmark Tool

Stable Diffusion Speed Calculator

Select your GPU, model architecture, step count, and batch size — get your exact IT/s, total generation time, VRAM requirement, and OOM warning before you queue a single image.

Calculate My Generation Speed →

Supports: RTX 50/40/30-series · Apple M5/M4 Max/Ultra · Flux.1 Dev/Schnell · SDXL · SD 1.5 · SD3


Frequently Asked Questions

What is IT/s in Stable Diffusion and why does it matter?

IT/s stands for Iterations Per Second — it measures how many denoising steps your GPU processes each second during image generation. Total generation time = (Sampling Steps ÷ IT/s) + ~0.3s VAE decode. An RTX 5090 running Flux.1 Dev at 19 IT/s takes approximately 1.6 seconds for a 25-step image. The same GPU running SDXL at 43 IT/s generates a 30-step image in about 1.0 second. IT/s is the single most useful benchmark for comparing GPU performance for AI image workflows — and it varies dramatically between model architectures even on identical hardware.

How much VRAM does Flux.1 Dev require compared to SDXL?

Flux.1 Dev requires approximately 12–16GB of VRAM for a single 1024×1024 image at full precision. SDXL at the same resolution requires approximately 6–8GB. For batch generation, requirements scale proportionally — a Flux.1 batch of 4 can exceed 38GB VRAM, and a batch of 10 exceeds 49GB, triggering OOM on every current consumer GPU including the RTX 5090 (32GB). Flux.1 Schnell has a lower per-image VRAM overhead and is the better choice for batch workflows on cards below 32GB.

What GPU should I buy for Stable Diffusion in 2026?

For 2026 AI art workflows: Entry-level (12–16GB): RTX 5060 Ti or RTX 5070 — runs SDXL well, Flux.1 Schnell at batch-1. Mid-range sweet spot (16GB): RTX 5080 — the 2026 high-end standard at ~72 it/s SD 1.5 baseline (~14 it/s Flux.1 Dev), fast enough for fluid Flux.1 workflows. Professional (32GB): RTX 5090 — current benchmark ceiling at ~95 IT/s SD 1.5 baseline, handles large Flux.1 batch generation. Apple M5 Max/Ultra: slower raw IT/s but massive unified memory (up to 192GB) for running LLM + image workflows simultaneously.

Why does Flux.1 generate more slowly than SDXL on the same GPU?

Flux.1 uses a Diffusion Transformer (DiT) architecture with far more parameters and attention layers than SDXL's UNet-based design. Each denoising step in Flux.1 requires approximately 5× more floating-point operations than an equivalent SD 1.5 step — giving it a ~0.20× penalty multiplier on raw hardware speed. SDXL operates at a ~0.45× penalty. The Speed Calculator applies these multipliers to your GPU's peak SD 1.5 speed to show your real expected IT/s for each model architecture.

What is an OOM error in Stable Diffusion and how do I prevent it?

OOM (Out Of Memory) occurs when VRAM required exceeds your GPU's capacity, causing a crash. Prevention: (1) Reduce batch size until VRAM requirement fits your card. (2) Use Flux.1 Schnell instead of Dev — significantly lower VRAM overhead. (3) Enable tiled VAE in ComfyUI — reduces decode VRAM. (4) Use fp8 or quantized Flux model weights (30–40% VRAM reduction). (5) Enable Flash Attention 2 for 15–25% VRAM reduction on supported hardware. The Stable Diffusion Speed Calculator shows VRAM requirements before generation — including OOM warnings — so you avoid the crash entirely.


Know Your Numbers Before You Generate

The AI image generation landscape in 2026 is the most capable it's ever been — and the most demanding. Flux.1 Dev produces genuinely stunning results that outclass anything available two years ago. SD3.5 Large brings creative range that professional designers are using for client work. The trade-off is compute and VRAM demands that call for far more deliberate planning than the SDXL era ever did.

IT/s benchmarks aren't a nerdy detail. They're the number that tells you whether your workflow is viable before you spend 20 minutes watching a progress bar — or five minutes staring at an OOM error.

Calculate your numbers before your next session. Your GPU will thank you.

Disclosure: This post contains an affiliate link to Amazon for GPU hardware. If you purchase through this link, I may earn a small commission at no extra cost to you. All IT/s figures are calculated estimates using the Speed Calculator's heuristic model — actual results vary by system configuration, precision format, and software version.