Stable Diffusion Benchmarked: Which GPU Runs AI Fastest (Updated)


Artificial intelligence and deep learning are constantly making headlines these days, whether it's ChatGPT giving poor advice, self-driving cars, artists accused of using AI, or medical advice from AI. Most of these tools rely on complex servers with lots of hardware for training, but using a trained network via inference can be done on your PC, with its graphics card doing the work. So how fast are consumer GPUs at AI inference?

We benchmarked the popular AI image generator Stable Diffusion on the latest Nvidia, AMD, and even Intel GPUs to see how they stack up. If you've tried to install and run Stable Diffusion on your own PC, you'll have some idea of how complex (or simple!) that can be. The short version is that Nvidia's GPUs rule the roost, since most of the software is designed around CUDA and other Nvidia toolkits. But that doesn't mean you can't run Stable Diffusion on other GPUs.

We used three different Stable Diffusion projects for our testing, mostly because no single package worked on every GPU. For Nvidia, we chose Automatic 1111's webui version; it performed best, had more options, and was easy to get running. AMD GPUs were tested using Nod.ai's Shark version; we also tested it on Nvidia GPUs (in Vulkan mode) and will revisit those results. Getting Intel's Arc GPUs running was a bit more difficult, due to lack of support, but Stable Diffusion OpenVINO gave us some very basic functionality.

There's a disclaimer: we didn't code any of these tools, but we did look for things that were easy to get running (under Windows) and that appeared reasonably well optimized. We're relatively confident that our Nvidia 30-series tests do a good job of extracting close to optimal performance, particularly with xformers enabled, which provides an additional boost of up to around 20% (though the reduced precision can affect quality). The RTX 40-series results were initially lower, but George SV8ARJ provided a fix: replacing the PyTorch CUDA DLLs gave a healthy boost to performance.
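As a point of reference, here's a minimal sketch (assuming a Hugging Face diffusers setup, not the webui's own code) of how xformers memory-efficient attention gets switched on; Automatic 1111's webui exposes the same optimization through its --xformers launch flag.

```python
# Minimal sketch, assuming a diffusers-based script (not the webui's code):
# enabling Facebook's xformers memory-efficient attention. Requires the
# xformers package to be installed alongside PyTorch with CUDA.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # FP16 weights for GPU inference
).to("cuda")

# Typically faster attention, but reduced precision can subtly change outputs.
pipe.enable_xformers_memory_efficient_attention()
```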

AMD's results are also a bit of a mixed bag: the RDNA 3 GPUs perform quite well, while the RDNA 2 GPUs look rather mediocre. Nod.ai says it is still working on 'tuned' models for RDNA 2, which should significantly improve (potentially double) performance once they're released. Finally, even though the Intel GPUs' ultimate performance seems to roughly line up with the AMD options, in practice their render times are much longer: it takes 5-10 seconds before the actual generation task even starts, presumably because of additional background work that slows things down.

We're also using different Stable Diffusion models, as dictated by our choice of software projects. Nod.ai's Shark version uses SD2.1, while Automatic 1111 and OpenVINO use SD1.4 (you can enable SD2.1 on Automatic 1111). Again, if you have inside knowledge of Stable Diffusion and want to recommend a different open source project that might run better than the ones we used, let us know in the comments (or email Jarred).

Our test parameters are the same for all GPUs, though the Intel version doesn't have a negative prompt option (at least none that we could find). The gallery above was generated using Automatic 1111's webui on Nvidia GPUs, with higher resolution outputs (which take much longer to complete). It uses the same prompts but targets 2048×1152 instead of the 512×512 we used for our benchmarks. The settings were chosen to work on all three SD projects; some options that can improve throughput are only available on Automatic 1111's build, but more on that later. The relevant settings are:

Positive Prompt:
post-apocalyptic steampunk city, exploration, cinematic, realistic, ultra detailed, realistic max detail, volumetric lighting, (((focus))), wide, (((bright lighting))), (((vegetation))), lightning, vines, destruction, devastation, wartorn, ruins

Negative Prompt:
(((cloudy))), ((fog)), (((dark))), ((black and white)), sun, (((depth of field)))

Steps:
100

Classifier Free Guidance:
15.0

Sampling Algorithm:
Some Euler variant (Ancestral on Automatic 1111, Shark Euler Discrete on AMD)

The sampling algorithm can affect the output, but it doesn't appear to have a significant impact on performance. Automatic 1111 provides the most options, while the Intel OpenVINO build gives you no choice at all.
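To make the settings above concrete, here's a rough sketch of the equivalent generation call using the Hugging Face diffusers library. None of the three projects we benchmarked uses this exact code; it's just an illustration of how the prompt, steps, guidance scale, sampler, and 512×512 target map onto typical Stable Diffusion parameters.

```python
# Illustrative only: the benchmark settings above expressed as a single
# diffusers call. Not the code used by Automatic 1111, Shark, or OpenVINO.
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # SD1.4, as used by Automatic 1111 / OpenVINO
    torch_dtype=torch.float16,
).to("cuda")

# Euler Ancestral sampling, matching the Automatic 1111 setting
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

prompt = ("post-apocalyptic steampunk city, exploration, cinematic, realistic, "
          "ultra detailed, realistic max detail, volumetric lighting, (((focus))), "
          "wide, (((bright lighting))), (((vegetation))), lightning, vines, "
          "destruction, devastation, wartorn, ruins")
negative = "(((cloudy))), ((fog)), (((dark))), ((black and white)), sun, (((depth of field)))"

image = pipe(
    prompt,
    negative_prompt=negative,
    num_inference_steps=100,  # Steps: 100
    guidance_scale=15.0,      # Classifier Free Guidance: 15.0
    height=512, width=512,    # the 512x512 benchmark resolution
).images[0]
image.save("steampunk_city.png")
```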

Here are the test results for the AMD RX 7000/6000-series, Nvidia RTX 40/30-series, and Intel Arc A-series GPUs. Note that each Nvidia GPU has two results: one using the default computational model (slower, in black) and a second using Facebook's faster "xformers" library (faster, in green).

(Image credit: Tom’s Hardware)

As expected, Nvidia’s GPUs deliver better performance (sometimes by a huge margin) than anything from AMD or Intel. The DLL fix for Torch gives the RTX 4090 50% higher performance than the RTX 3090 Ti with xformers and 43% better performance without xformers. Each image takes just over 3 seconds to generate, and even the RTX 4070 Ti can outperform the 3090 Ti (but not with xformers disabled).

Things fall off in a fairly consistent fashion from the top Nvidia cards on down, from the 3090 to the 3050. Meanwhile, AMD's RX 7900 cards land roughly on par with the RTX 3080, and all of the RTX 30-series cards end up beating AMD's RX 6000-series parts (for now). Finally, the Intel Arc GPUs come in nearly last, with only the A770 managing to outpace the RX 6600. Let's talk a bit more about the discrepancies.

Proper optimization could double the performance of the RX 6000-series cards. Nod.ai says it should have tuned models for RDNA 2 in the coming days, at which point the overall standings should start to correlate better with theoretical performance. Speaking of Nod.ai, we also did some testing of some Nvidia GPUs using that project, and in Vulkan mode the Nvidia cards trailed Automatic 1111's build (15.52 it/s on the 4090, 13.31 on the 4080, 11.41 on the 3090 Ti, and 10.76 on the 3090; we couldn't test the other cards, as they need to be enabled first).

Based on the performance of the 7900 cards using the tuned models, we also suspect the Nvidia cards may not be leveraging their Tensor cores at all. If that's true, fully utilizing the Tensor cores could give Nvidia a massive boost. The same logic applies to Intel's cards.

Intel's Arc GPUs currently deliver very disappointing results, especially since they support XMX (matrix) operations that should provide up to 4X the throughput of regular FP32 compute. We also believe the Stable Diffusion OpenVINO project we used has plenty of room for improvement. As a side note, to run SD on an Arc GPU you need to edit the 'stable_diffusion_engine.py' file and change "CPU" to "GPU"; otherwise it won't use the graphics card for the calculations and takes substantially longer.
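For anyone trying this, here's roughly what that edit amounts to. The exact contents of 'stable_diffusion_engine.py' vary between versions of the project, so treat this as an assumed sketch of OpenVINO's device selection rather than a verbatim patch; the model file names here are placeholders.

```python
# Assumed sketch of the device selection inside stable_diffusion_engine.py.
# OpenVINO compiles each model for a target device; the project defaults to
# "CPU", and switching the string to "GPU" targets the Arc graphics card.
from openvino.runtime import Core

core = Core()
device = "GPU"  # was "CPU" by default

# The project loads its models as OpenVINO IR files and compiles them for
# the chosen device; "unet.xml" here is just a placeholder name.
unet = core.read_model("unet.xml")
compiled_unet = core.compile_model(unet, device)
```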

With the software versions we tested, Nvidia's RTX cards are generally the fastest choice, especially the top models (3080 and above). AMD's RX 7000-series cards do well too, but the RX 6000-series underperforms, and the Arc GPUs generally look poor. Updated software could change things drastically, and given the popularity of AI, we expect it's only a matter of time before we see better tuning (or find the right projects that are already tuned to deliver better performance).

Again, it's unclear exactly how optimized these projects are, or whether they're making use of things like Nvidia's Tensor cores or Intel's XMX cores. That makes it interesting to look at the maximum theoretical performance (TFLOPS) of the different GPUs. The following chart shows the theoretical FP16 performance for each GPU, using Tensor/Matrix cores where applicable.

(Image credit: Tom’s Hardware)

Nvidia's Tensor cores clearly pack a punch, and our Stable Diffusion testing doesn't match these figures exactly. For example, on paper the RTX 4090 (with FP16) is up to 106% faster than the RTX 3090 Ti, while in our testing it was 43% faster without xformers and 50% faster with xformers. We're also assuming the Stable Diffusion project we used (Automatic 1111) makes no attempt to leverage the new FP8 instructions on Ada Lovelace GPUs, which could potentially double the performance of the RTX 40-series again.

Then take a look at the Arc GPUs. Their matrix cores should deliver performance comparable to the RTX 3060 Ti and RX 7900 XTX, with the A380 landing around the RX 6800. In practice, the Arc GPUs are nowhere near those marks: the fastest A770 lands between the RX 6600 and RX 6600 XT, the A750 falls just behind the RX 6600, and the A380 is about one-quarter the speed of the A750. So they're all roughly a quarter of the expected performance, which would make sense if the XMX cores aren't being used.

The internal ratios on Arc do look about right, though. Theoretical compute on the A380 is about one-fourth that of the A750, and that's roughly where it sits in current Stable Diffusion performance. Most likely, the Arc GPUs are using shaders for the computations in full-precision FP32 mode and missing out on some additional optimizations.

Another thing to note is that theoretical compute on AMD's RX 7900 XTX/XT improved a lot compared to the RX 6000-series. We'll have to see whether the tuned 6000-series models close the gap. Memory bandwidth wasn't a factor, at least for the 512×512 target resolution we used; the 3080 10GB and 12GB models land relatively close together. It's a bit odd that the 7900 XT performs almost the same as the XTX, given that raw compute should favor the XTX by about 19%, versus the 3% difference we measured.
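As a quick sanity check on that 19% figure, here's the back-of-the-envelope math using AMD's published shader counts and boost clocks (our own numbers, not anything taken from the benchmark data):

```python
# Rough theoretical-compute comparison of the RX 7900 XTX vs the 7900 XT.
# Shader counts and boost clocks are AMD's published specs (assumed here);
# FP32/FP16 throughput scales with shaders x clock, so only the ratio matters.
xtx_shaders, xtx_clock_ghz = 6144, 2.5
xt_shaders, xt_clock_ghz = 5376, 2.4

xtx_rate = xtx_shaders * xtx_clock_ghz
xt_rate = xt_shaders * xt_clock_ghz

print(f"XTX advantage on paper: {100 * (xtx_rate / xt_rate - 1):.0f}%")
# roughly 19%, versus the ~3% difference we measured in Stable Diffusion
```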

Ultimately, this is more of a snapshot in time of Stable Diffusion performance on AMD, Intel, and Nvidia GPUs than a definitive statement of performance. With full optimization, performance should move closer to the theoretical TFLOPS chart, and the latest RTX 40-series cards certainly shouldn't lag behind the older RTX 30-series parts.

(Image credit: Tom’s Hardware)

Our final chart shows the results of our higher resolution testing. We haven't tested the new AMD GPUs here, as we had to use Linux on the AMD RX 6000-series cards we tested. But check out the RTX 40-series results with the Torch DLLs replaced: the RTX 4090 is now 72% faster than the 3090 Ti without xformers, and a whopping 134% faster with xformers. The 4080 also beats the 3090 Ti by 55%/18% with/without xformers, and interestingly, the 4070 Ti was 22% slower than the 3090 Ti without xformers but 20% faster with xformers.

It looks like the more complex target resolution of 2048×1152 starts to put the potential compute resources to better use, and the longer run times give the Tensor cores a chance to flex their muscle. (For what it's worth, it's not yet clear whether the Tensor cores are actually being used, or whether the various SD projects are doing FP16 on the GPU shaders.) We'd expect similar improvements with AMD's new GPUs, and perhaps Intel's as well. We'll revisit this topic next year, hopefully with better optimized code for all the different GPUs.
