<?xml version="1.0" encoding="utf-8" standalone="yes"?><feed xmlns="http://www.w3.org/2005/Atom"><title>Easy Diffusion</title><link rel="alternate" type="text/html" hreflang="en" href="/"/><link rel="self" type="application/atom+xml" href="/index.xml"/><subtitle>Recent content on Easy Diffusion</subtitle><author><uri>/</uri></author><id>/</id><generator>Hugo -- gohugo.io</generator><updated>2026-03-31T09:09:17Z</updated><entry><title>Post from Mar 31, 2026</title><link rel="alternate" type="text/html" href="/blog/1774948157/"/><published>2026-03-31T09:09:17Z</published><updated>2026-03-31T09:09:17Z</updated><summary>&lt;p&gt;Development update for Easy Diffusion: the &lt;code&gt;beta&lt;/code&gt; branch has been merged into &lt;code&gt;main&lt;/code&gt;, so this releases v3.5 (webui) and v4 to everyone. This shouldn&amp;rsquo;t affect existing users who are on the main branch, i.e. people using the v3 engine will continue doing so. The two engines (v3.5 and v4) are marked as optional, so new users will continue to get and use v3 by default.&lt;/p&gt;
&lt;p&gt;The main purpose of this update is to merge the two forked codebases that we&amp;rsquo;ve had for over 1.5 years. Now the &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;beta&lt;/code&gt; branches are back in sync. This brings back the streamlined release process that we had previously, where new changes would first land in &lt;code&gt;beta&lt;/code&gt;, and then get merged into &lt;code&gt;main&lt;/code&gt; after testing.&lt;/p&gt;</summary><id>/blog/1774948157/</id></entry><entry><title>Post from Mar 27, 2026</title><link rel="alternate" type="text/html" href="/blog/1774595493/"/><published>2026-03-27T07:11:33Z</published><updated>2026-03-27T07:11:33Z</updated><summary>&lt;p&gt;Got Easy Diffusion v4 working on Apple and Intel Macs. The performance gap (vs ED v3) is similar to the gap on Windows (with CUDA) and other deployment targets, which points to optimization opportunities in sd.cpp. It&amp;rsquo;s currently about 1.5x slower than diffusers-based Stable Diffusion.&lt;/p&gt;
&lt;p&gt;In other news, &lt;a href="https://github.com/cmdr2/easyinstaller"&gt;easyinstaller&lt;/a&gt; is also out with its first release, which means that Easy Diffusion can now start shipping AppImage, Flatpak, rpm, deb, pkg, dmg etc for the different platforms, instead of requiring Linux and Mac users to use the terminal to install and start Easy Diffusion. I&amp;rsquo;ll work on this soon.&lt;/p&gt;</summary><id>/blog/1774595493/</id></entry><entry><title>Post from Jan 18, 2026</title><link rel="alternate" type="text/html" href="/blog/1768722210/"/><published>2026-01-18T07:43:30Z</published><updated>2026-01-18T07:43:30Z</updated><summary>&lt;p&gt;Started the long-pending rewrite of Easy Diffusion&amp;rsquo;s server code. v4 intends to replace the Python (and PyTorch) based server with a simple C++ version. The reason for rewriting the server in C++ is to achieve sub-second startup time for the UI, and to reduce the download size (since we won&amp;rsquo;t need to distribute Python along with Easy Diffusion, or mess with conda/venv etc). And it&amp;rsquo;s also something that I want to do for personal taste, i.e. de-bloating what doesn&amp;rsquo;t need to be bloated.&lt;/p&gt;</summary><id>/blog/1768722210/</id></entry><entry><title>Post from Jan 08, 2026</title><link rel="alternate" type="text/html" href="/blog/1767852707/"/><published>2026-01-08T06:11:47Z</published><updated>2026-01-08T06:11:47Z</updated><summary>&lt;p&gt;For Z-Image, the performance of the stock version of chromaForge is poorer than sd.cpp&amp;rsquo;s, mainly because chromaForge isn&amp;rsquo;t able to run the smaller gguf quantized models that sd.cpp is able to run (chromaForge fails with the errors that I was fixing yesterday).&lt;/p&gt;
&lt;p&gt;If I really want to push through with this, it would be good to fix the remaining issues with gguf models in chromaForge. Only then can the performance be truly compared (in order to decide whether to release this into ED 3.5). I want to compare the performance of the smaller gguf models, because that&amp;rsquo;s what ED&amp;rsquo;s users will typically run.&lt;/p&gt;</summary><id>/blog/1767852707/</id></entry><entry><title>Post from Jan 07, 2026</title><link rel="alternate" type="text/html" href="/blog/1767809976/"/><published>2026-01-07T18:19:36Z</published><updated>2026-01-07T18:19:36Z</updated><summary>&lt;p&gt;Worked on fixing Z-Image support in ED&amp;rsquo;s fork of chromaForge (a fork of Forge WebUI). Fixed a number of integration issues. It&amp;rsquo;s now crashing on a matrix multiplication error, which looks like an incorrectly transposed matrix (mostly due to reading the weights in the wrong order).&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll try to install a stock version of chromaForge to see its raw performance with Z-Image (and whether it&amp;rsquo;s worth pursuing the integration), and also use it to help investigate the matrix multiplication error (and any future errors).&lt;/p&gt;</summary><id>/blog/1767809976/</id></entry><entry><title>Post from Dec 25, 2025</title><link rel="alternate" type="text/html" href="/blog/1766652519/"/><published>2025-12-25T08:48:39Z</published><updated>2025-12-25T08:48:39Z</updated><summary>&lt;p&gt;Collecting the worklog over the past few weeks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enabled Flash-Attention and CPU offloading by default in sdkit3 (i.e. Easy Diffusion v4).&lt;/li&gt;
&lt;li&gt;Added optional VAE tiling (and VAE tile size configuration) via &lt;code&gt;config.yaml&lt;/code&gt; in Easy Diffusion v4.&lt;/li&gt;
&lt;li&gt;Created Easy Diffusion&amp;rsquo;s &lt;a href="https://github.com/easydiffusion/stable-diffusion-webui-forge"&gt;fork&lt;/a&gt; of Forge WebUI, in order to apply the patches required to run with ED, and to try adding new features like Z-Image (which are missing in the seemingly-abandoned main Forge repo).&lt;/li&gt;
&lt;li&gt;Improved the heuristics used for killing and restarting the backend child process, since &lt;code&gt;/ping&lt;/code&gt; requests are unreliable if the backend is under heavy load.&lt;/li&gt;
&lt;li&gt;Merged a few PRs (&lt;a href="https://github.com/easydiffusion/torchruntime/pull/28"&gt;1&lt;/a&gt; &lt;a href="https://github.com/easydiffusion/torchruntime/pull/30"&gt;2&lt;/a&gt;) for &lt;code&gt;torchruntime&lt;/code&gt; that improve support for pinning pre-cu128 torch versions and fix the order of detection of DirectML and CUDA (prefers CUDA).&lt;/li&gt;
&lt;li&gt;Added progress bars when downloading v4 backend artifacts.&lt;/li&gt;
&lt;/ul&gt;</summary><id>/blog/1766652519/</id></entry><entry><title>Post from Dec 08, 2025</title><link rel="alternate" type="text/html" href="/blog/1765198531/"/><published>2025-12-08T12:55:31Z</published><updated>2025-12-08T12:55:31Z</updated><summary>&lt;p&gt;The new engine that&amp;rsquo;ll power Easy Diffusion&amp;rsquo;s upcoming v4 release (i.e. &lt;a href="https://github.com/easydiffusion/sdkit/tree/v3"&gt;sdkit3&lt;/a&gt;) has now been integrated into Easy Diffusion. It&amp;rsquo;s available to test by selecting &lt;strong&gt;v4 engine&lt;/strong&gt; in the &lt;code&gt;Settings&lt;/code&gt; tab (after enabling &lt;code&gt;Beta&lt;/code&gt;). Please press &lt;code&gt;Save&lt;/code&gt; and restart Easy Diffusion after selecting this.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/leejet/stable-diffusion.cpp"&gt;stable-diffusion.cpp&lt;/a&gt; and &lt;a href="https://github.com/ggml-org/ggml"&gt;ggml&lt;/a&gt; under-the-hood, and produces optimized, lightweight builds for the target hardware.&lt;/p&gt;
&lt;p&gt;The main benefits of Easy Diffusion&amp;rsquo;s new engine are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Very lightweight - Less than 100 MB install footprint, compared to 3 GB+ for Forge and other PyTorch-based engines.&lt;/li&gt;
&lt;li&gt;Much better for AMD/Intel/Integrated users - avoids the hot mess of ROCm and DirectML, by using a reliable Vulkan backend (that&amp;rsquo;s also used in &lt;code&gt;llama.cpp&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Opportunity for even faster image generation in the future - this currently uses stock sd.cpp, which has room for further optimization.&lt;/li&gt;
&lt;li&gt;Support for older GPUs - Vulkan supports older GPUs, especially older AMD GPUs unsupported by ROCm/PyTorch.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This supports:&lt;/p&gt;</summary><id>/blog/1765198531/</id></entry><entry><title>Post from Nov 27, 2025</title><link rel="alternate" type="text/html" href="/blog/1764237912/"/><published>2025-11-27T10:05:12Z</published><updated>2025-11-27T10:05:12Z</updated><summary>&lt;p&gt;Managed to get &lt;a href="https://github.com/leejet/stable-diffusion.cpp"&gt;stable-diffusion.cpp&lt;/a&gt; integrated into &lt;a href="https://github.com/easydiffusion/sdkit/tree/v3"&gt;sdkit v3&lt;/a&gt; and Easy Diffusion.&lt;/p&gt;
&lt;p&gt;sdkit v3 wraps &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; with an API server. For now, the API server exposes an API compatible with Forge WebUI. This saves me time, and allows Easy Diffusion to work out-of-the-box with the new C++ based sdkit.&lt;/p&gt;
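&lt;p&gt;As an illustration of what &amp;ldquo;Forge-compatible&amp;rdquo; means in practice, here is a hypothetical sketch of the txt2img endpoint shape. The request/response field names follow the A1111/Forge WebUI API; the generator stub is invented for the example, and is not sdkit v3&amp;rsquo;s actual binding:&lt;/p&gt;

```python
import base64
import json

# Hypothetical stub standing in for a call into stable-diffusion.cpp;
# the real sdkit v3 binding will look different.
def generate_image(prompt, negative_prompt, steps, width, height):
    return b"PNG-bytes-would-go-here"

# Translate a Forge/A1111-style txt2img request into a generation call,
# and wrap the result in the response shape that Forge clients expect
# (a list of base64-encoded images, plus an "info" JSON string).
def txt2img(request: dict) -> dict:
    png = generate_image(
        prompt=request.get("prompt", ""),
        negative_prompt=request.get("negative_prompt", ""),
        steps=request.get("steps", 20),
        width=request.get("width", 512),
        height=request.get("height", 512),
    )
    return {
        "images": [base64.b64encode(png).decode("ascii")],
        "info": json.dumps({"prompt": request.get("prompt", "")}),
    }
```

&lt;p&gt;Exposing this shape is what lets Easy Diffusion&amp;rsquo;s existing Forge integration talk to the new C++ backend without changes.&lt;/p&gt;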
&lt;p&gt;It compiles and runs quite well. Ran it with Easy Diffusion&amp;rsquo;s UI. Tested with Vulkan and CUDA, on Windows.&lt;/p&gt;
&lt;p&gt;There are a few &lt;a href="https://github.com/users/cmdr2/projects/16/views/5"&gt;feature gaps&lt;/a&gt; (e.g. gfpgan, more controlnet models, more controlnet filters, more schedulers/samplers, reload specific models instead of everything), but &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; has come a long way over the past year. The performance is reasonable. Not as fast as Forge or diffusers, but respectable. I haven&amp;rsquo;t spent any time on performance optimizations yet.&lt;/p&gt;</summary><id>/blog/1764237912/</id></entry><entry><title>Post from Nov 19, 2025</title><link rel="alternate" type="text/html" href="/blog/1763531042/"/><published>2025-11-19T05:44:02Z</published><updated>2025-11-19T05:44:02Z</updated><summary>&lt;p&gt;Following up to &lt;a href="https://cmdr2.github.io/notes/2025/11/1762336053/"&gt;the previous post&lt;/a&gt; on sdkit v3&amp;rsquo;s design:&lt;/p&gt;
&lt;p&gt;The initial experiments with &lt;a href="https://cmdr2.github.io/notes/2025/11/1763464399/"&gt;generating ggml from onnx models&lt;/a&gt; were promising, and it looks like a fairly solid path forward. It produces numerically-identical results, and there&amp;rsquo;s a clear path to reach performance-parity with &lt;a href="https://github.com/leejet/stable-diffusion.cpp"&gt;stable-diffusion.cpp&lt;/a&gt; with a few basic optimizations (since both will eventually generate the same underlying ggml graph).&lt;/p&gt;
&lt;p&gt;But I think it&amp;rsquo;s better to use the simpler option first, i.e. use &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; directly. It mostly meets the &lt;a href="https://cmdr2.github.io/notes/2025/10/1760085894/"&gt;design goals for sdkit v3&lt;/a&gt; (after a bit of performance tuning). Everything else is premature optimization and scope bloat.&lt;/p&gt;</summary><id>/blog/1763531042/</id></entry><entry><title>Post from Nov 18, 2025</title><link rel="alternate" type="text/html" href="/blog/1763464399/"/><published>2025-11-18T11:13:19Z</published><updated>2025-11-18T11:13:19Z</updated><summary>&lt;p&gt;Successfully compiled the VAE of Stable Diffusion 1.5 using &lt;a href="https://github.com/cmdr2/graph-compiler"&gt;graph-compiler&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The compiled model is terribly slow because I haven&amp;rsquo;t written any performance optimizations, and it (conservatively) converts a lot of intermediate tensors to contiguous copies. But we don&amp;rsquo;t need any clever optimizations to get to decent performance, just basic ones.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s pretty exciting because I was able to bypass the need to port the model to C++ manually. Instead, I was able to just compile the exported ONNX model and get the same output values as the original PyTorch implementation (given the same input and weights). I could compile to any platform supported by ggml by just changing one flag (e.g. CPU, CUDA, ROCm, Vulkan, Metal etc).&lt;/p&gt;</summary><id>/blog/1763464399/</id></entry><entry><title>Post from Nov 13, 2025</title><link rel="alternate" type="text/html" href="/blog/1763027191/"/><published>2025-11-13T09:46:31Z</published><updated>2025-11-13T09:46:31Z</updated><summary>&lt;p&gt;&lt;a href="https://docs.polymagelabs.com/articles/polyblocks-quantization.html#polyblocks"&gt;PolyBlocks&lt;/a&gt; is another interesting ML compiler, written using MLIR. It&amp;rsquo;s built by a startup incubated in IISc Bangalore, run by Uday Bondhugula, who co-authored a &lt;a href="https://www.ece.lsu.edu/jxr/Publications-pdf/ics08.pdf"&gt;paper on compiler optimizations for GPGPUs&lt;/a&gt; back in 2008 (17 years ago)!&lt;/p&gt;
&lt;p&gt;Some of the compiler passes to keep in mind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fusion&lt;/li&gt;
&lt;li&gt;tiling&lt;/li&gt;
&lt;li&gt;use hardware acceleration (like tensor cores)&lt;/li&gt;
&lt;li&gt;constant folding&lt;/li&gt;
&lt;li&gt;perform redundant computation to avoid global memory accesses where profitable&lt;/li&gt;
&lt;li&gt;pack into buffers&lt;/li&gt;
&lt;li&gt;loop transformation&lt;/li&gt;
&lt;li&gt;unroll-and-jam (register tiling?)&lt;/li&gt;
&lt;li&gt;vectorization&lt;/li&gt;
&lt;li&gt;reorder execution for better spatial, temporal and group reuse&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Scheduling approaches:&lt;/p&gt;</summary><id>/blog/1763027191/</id></entry><entry><title>Post from Nov 07, 2025</title><link rel="alternate" type="text/html" href="/blog/1762514507/"/><published>2025-11-07T11:21:47Z</published><updated>2025-11-07T11:21:47Z</updated><summary>&lt;p&gt;Wrote a simple script to convert ONNX to GGML. It auto-generates C++ code that calls the corresponding ggml functions (for each ONNX operator). This file can then be compiled and run like a normal C++ ggml program, and will produce the same results as the original model in PyTorch.&lt;/p&gt;
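&lt;p&gt;An illustrative sketch of the idea: walk the graph nodes and emit one ggml builder call per ONNX operator. In the real script the nodes come from onnx.load(...).graph.node; here they are plain tuples to keep the example self-contained, and only a handful of operators are mapped:&lt;/p&gt;

```python
# Map from ONNX operator name to the corresponding ggml builder function.
# (ggml_add, ggml_mul, ggml_mul_mat and ggml_relu are real ggml functions;
# a full converter maps many more operators.)
OP_MAP = {
    "Add": "ggml_add",
    "Mul": "ggml_mul",
    "MatMul": "ggml_mul_mat",
    "Relu": "ggml_relu",
}

def emit_cpp(nodes):
    """Emit one line of C++ per node: (op_type, input names, output name)."""
    lines = []
    for op_type, inputs, output in nodes:
        fn = OP_MAP[op_type]
        args = ", ".join(["ctx"] + list(inputs))
        lines.append(f"struct ggml_tensor * {output} = {fn}({args});")
    return "\n".join(lines)

# A toy 3-node graph: y = relu(w @ x + b)
print(emit_cpp([
    ("MatMul", ["w", "x"], "t0"),
    ("Add", ["t0", "b"], "t1"),
    ("Relu", ["t1"], "y"),
]))
```

&lt;p&gt;The emitted file then builds a ggml compute graph, which is what makes the multi-backend compilation below possible.&lt;/p&gt;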
&lt;p&gt;The generated file can work on multiple backends: CPU, CUDA, ROCm, Vulkan, Metal etc, by providing the correct compiler flags during &lt;code&gt;cmake -B&lt;/code&gt;, e.g. &lt;code&gt;-D GGML_CUDA=1&lt;/code&gt; for CUDA.&lt;/p&gt;</summary><id>/blog/1762514507/</id></entry><entry><title>Post from Nov 05, 2025</title><link rel="alternate" type="text/html" href="/blog/1762336053/"/><published>2025-11-05T09:47:33Z</published><updated>2025-11-05T09:47:33Z</updated><summary>&lt;p&gt;Following up to the &lt;a href="https://cmdr2.github.io/notes/2025/11/1762335811/"&gt;deep-dive on ML compilers&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;sdkit v3 won&amp;rsquo;t use general-purpose ML compilers. They aren&amp;rsquo;t yet ready for sdkit&amp;rsquo;s target platforms, and need a lot of work (well beyond sdkit v3&amp;rsquo;s scope). But I&amp;rsquo;m quite certain that sdkit v4 will use them, and sdkit v3 will start making steps in that direction.&lt;/p&gt;
&lt;p&gt;For sdkit v3, I see two possible paths:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use an array of vendor-specific compilers (like TensorRT-RTX, MiGraphX, OpenVINO etc), one for each target platform.&lt;/li&gt;
&lt;li&gt;Auto-generate ggml code from onnx (or pytorch), and beat it on the head until it meets sdkit v3&amp;rsquo;s &lt;a href="https://cmdr2.github.io/notes/2025/10/1760085894/"&gt;performance goals&lt;/a&gt;. Hand-tune kernels, contribute to ggml, and take advantage of ggml&amp;rsquo;s multi-backend kernels.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Both approaches provide a big step-up from sdkit v2 in terms of install size and performance. So it makes sense to tap into these first, and leave ML compilers for v4 (as another leap forward).&lt;/p&gt;</summary><id>/blog/1762336053/</id></entry><entry><title>Post from Nov 05, 2025</title><link rel="alternate" type="text/html" href="/blog/1762335811/"/><published>2025-11-05T09:43:31Z</published><updated>2025-11-05T09:43:31Z</updated><summary>&lt;p&gt;This post concludes (for now) my &lt;a href="https://cmdr2.github.io/notes/2025/10/1760088945/"&gt;ongoing deep-dive into ML compilers&lt;/a&gt;, while researching for &lt;a href="https://cmdr2.github.io/notes/2025/10/1760085894/"&gt;sdkit v3&lt;/a&gt;. I&amp;rsquo;ve linked (at the end) to some of the papers that I read related to graph execution on GPUs.&lt;/p&gt;
&lt;p&gt;Some final takeaways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;ML compilers might break CUDA&amp;rsquo;s moat (and fix AMD&amp;rsquo;s ROCm support).&lt;/li&gt;
&lt;li&gt;A single compiler is unlikely to fit every scenario.&lt;/li&gt;
&lt;li&gt;The scheduler needs to be grounded in truth.&lt;/li&gt;
&lt;li&gt;Simulators might be worth exploring more.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="ml-compilers-might-break-cudas-moat-and-fix-amds-rocm-support"&gt;ML compilers might break CUDA&amp;rsquo;s moat (and fix AMD&amp;rsquo;s ROCm support)&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s pretty clear that ML compilers are going to be a big deal. NVIDIA&amp;rsquo;s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.&lt;/p&gt;</summary><id>/blog/1762335811/</id></entry><entry><title>Post from Oct 27, 2025</title><link rel="alternate" type="text/html" href="/blog/1761560082/"/><published>2025-10-27T10:14:42Z</published><updated>2025-10-27T10:14:42Z</updated><summary>&lt;p&gt;A possible intuition for understanding GPU memory hierarchy (and the performance penalty for data transfer between various layers) is to think of it like a manufacturing logistics problem:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;CPU (host) to GPU (device) is like travelling overnight between two cities. The CPU city is like the &amp;ldquo;headquarters&amp;rdquo;, and contains a mega-sized warehouse of parts (think football field sizes), also known as &amp;lsquo;Host memory&amp;rsquo;.&lt;/li&gt;
&lt;li&gt;Each GPU is like a different city, containing its own warehouse outside the city, also known as &amp;lsquo;Global Memory&amp;rsquo;. This warehouse stockpiles whatever it needs from the headquarters city (CPU).&lt;/li&gt;
&lt;li&gt;Each SM/Core/Tile is a factory located in different areas of the city. Each factory contains a small warehouse for stockpiling whatever inventory it needs, also known as &amp;lsquo;Shared Memory&amp;rsquo;.&lt;/li&gt;
&lt;li&gt;Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. There&amp;rsquo;s a tray next to each machine, also known as &amp;lsquo;Registers&amp;rsquo;. This tray is used for keeping stuff temporarily for each stamping process.&lt;/li&gt;
&lt;/ol&gt;
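&lt;p&gt;To attach ballpark numbers to the analogy (typical figures for a mid-range discrete NVIDIA GPU; exact sizes vary by model):&lt;/p&gt;

```python
# Rough hardware counterparts for the analogy above. Capacities are
# order-of-magnitude ballparks, not exact specs for any one GPU.
hierarchy = [
    # (analogy, memory level, typical capacity in bytes)
    ("headquarters mega-warehouse", "Host (CPU) RAM", 32 * 2**30),
    ("city warehouse", "GPU global memory (VRAM)", 8 * 2**30),
    ("factory stockroom", "Shared memory per SM", 100 * 2**10),
    ("tray by the stamping machine", "Registers per thread (255 x 4 B)", 255 * 4),
]

# Capacity drops by orders of magnitude at each step closer to the compute
# units, which is why avoiding round-trips to the outer levels dominates
# GPU performance tuning.
for analogy, memory, size in hierarchy:
    print(f"{memory}: {size:,} bytes ({analogy})")
```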
&lt;p&gt;This analogy can help understand the scale and performance penalty for data transfers.&lt;/p&gt;</summary><id>/blog/1761560082/</id></entry><entry><title>Post from Oct 24, 2025</title><link rel="alternate" type="text/html" href="/blog/1761283271/"/><published>2025-10-24T05:21:11Z</published><updated>2025-10-24T05:21:11Z</updated><summary>&lt;p&gt;Good post on using MLIR for compiling ML models to GPUs. It gives a good broad overview of a GPU architecture, and how MLIR fits into that. The overall series looks pretty interesting too!&lt;/p&gt;
&lt;p&gt;Making a note here for future reference - &lt;a href="https://www.stephendiehl.com/posts/mlir_gpu/"&gt;https://www.stephendiehl.com/posts/mlir_gpu/&lt;/a&gt;&lt;/p&gt;</summary><id>/blog/1761283271/</id></entry><entry><title>Post from Oct 22, 2025</title><link rel="alternate" type="text/html" href="/blog/1761119134/"/><published>2025-10-22T07:45:34Z</published><updated>2025-10-22T07:45:34Z</updated><summary>&lt;p&gt;Wrote a fresh implementation of most of the popular samplers and schedulers used for image generation (Stable Diffusion and Flux) at &lt;a href="https://github.com/cmdr2/samplers.cpp"&gt;https://github.com/cmdr2/samplers.cpp&lt;/a&gt;. A few other schedulers (like &lt;code&gt;Align Your Steps&lt;/code&gt;) have been left out for now, but are pretty easy to implement.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s still work-in-progress, and is not ready for public use. The algorithmic port has been completed, and the next step is to test the output values against reference values (from another implementation, e.g. Forge WebUI). After that, I&amp;rsquo;ll translate it to C++.&lt;/p&gt;</summary><id>/blog/1761119134/</id></entry><entry><title>Post from Oct 10, 2025</title><link rel="alternate" type="text/html" href="/blog/1760088945/"/><published>2025-10-10T09:35:45Z</published><updated>2025-10-10T09:35:45Z</updated><summary>&lt;p&gt;Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion&amp;rsquo;s next engine (i.e. sdkit v3). For context, see the &lt;a href="https://cmdr2.github.io/notes/2025/10/1760085894/"&gt;design constraints&lt;/a&gt; of the new engine.&lt;/p&gt;
&lt;h2 id="tldr-summary"&gt;tl;dr summary&lt;/h2&gt;
&lt;p&gt;The current state is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. &lt;a href="https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/index.html"&gt;TensorRT-RTX&lt;/a&gt; for NVIDIA, &lt;a href="https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/"&gt;MiGraphX&lt;/a&gt; for AMD, &lt;a href="https://github.com/openvinotoolkit/openvino"&gt;OpenVINO&lt;/a&gt; for Intel.&lt;/li&gt;
&lt;li&gt;Cross-vendor compilers, e.g. &lt;a href="https://tvm.apache.org/"&gt;TVM&lt;/a&gt;, &lt;a href="https://iree.dev/"&gt;IREE&lt;/a&gt; and &lt;a href="https://openxla.org/xla"&gt;XLA&lt;/a&gt;, are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The focus of cross-vendor compilers seems to be either on datacenter hardware, or embedded devices. The performance on desktops and laptops is pretty poor. Mojo doesn&amp;rsquo;t target this category (and doesn&amp;rsquo;t support Windows). Probably because datacenters and embedded devices are currently where the attention (and money) is.&lt;/p&gt;</summary><id>/blog/1760088945/</id></entry><entry><title>Post from Oct 10, 2025</title><link rel="alternate" type="text/html" href="/blog/1760085894/"/><published>2025-10-10T08:44:54Z</published><updated>2025-10-10T08:44:54Z</updated><summary>&lt;p&gt;The design constraints for Easy Diffusion&amp;rsquo;s next engine (i.e. sdkit v3) are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lean: Install size of &amp;lt; 200 MB uncompressed (excluding models).&lt;/li&gt;
&lt;li&gt;Fast: Performance within 10% of the best-possible speed on that GPU for that model.&lt;/li&gt;
&lt;li&gt;Capable: Supports Stable Diffusion 1.x, 2.x, 3.x, XL, Flux, Chroma, ControlNet, LORA, Embedding, VAE. Supports loading custom model weights (from civitai etc), and memory offloading (for smaller GPUs).&lt;/li&gt;
&lt;li&gt;Targets: Desktops and Laptops, Windows/Linux/Mac, NVIDIA/AMD/Intel/Apple.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I think it&amp;rsquo;s possible, using ML compilers like TensorRT-RTX (and similar compilers for other platforms). See: &lt;a href="https://cmdr2.github.io/notes/2025/10/1760088945/"&gt;Some notes on ML compilers&lt;/a&gt;.&lt;/p&gt;</summary><id>/blog/1760085894/</id></entry><entry><title>Post from Sep 01, 2025</title><link rel="alternate" type="text/html" href="/blog/1756713805/"/><published>2025-09-01T08:03:25Z</published><updated>2025-09-01T08:03:25Z</updated><summary>&lt;p&gt;Cleared the backlog of stale issues on ED&amp;rsquo;s github repo. This brought down the number of open issues from ~350 to 74.&lt;/p&gt;
&lt;p&gt;A number of those suggestions and issues are already being tracked on my &lt;a href="https://github.com/users/cmdr2/projects/16/views/1"&gt;task board&lt;/a&gt;. The others had either been fixed, or were really old (i.e. no longer relevant to reply to).&lt;/p&gt;
&lt;p&gt;While I&amp;rsquo;d genuinely have liked to solve all of those unresolved issues, I was on a break from this project for nearly 1.5 years, so unfortunately it is what it is.&lt;/p&gt;</summary><id>/blog/1756713805/</id></entry><entry><title>Post from Aug 25, 2025</title><link rel="alternate" type="text/html" href="/blog/1756113601/"/><published>2025-08-25T09:20:01Z</published><updated>2025-08-25T09:20:01Z</updated><summary>&lt;p&gt;Experimented with TensorRT-RTX (a new library offered by NVIDIA).&lt;/p&gt;
&lt;p&gt;The first step was a tiny toy model, just to get the build and test setup working.&lt;/p&gt;
&lt;p&gt;The reference model in PyTorch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; torch.nn &lt;span style="color:#66d9ef"&gt;as&lt;/span&gt; nn
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;class&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;TinyCNN&lt;/span&gt;(nn&lt;span style="color:#f92672"&gt;.&lt;/span&gt;Module):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;__init__&lt;/span&gt;(self):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; super()&lt;span style="color:#f92672"&gt;.&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;__init__&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;conv &lt;span style="color:#f92672"&gt;=&lt;/span&gt; nn&lt;span style="color:#f92672"&gt;.&lt;/span&gt;Conv2d(&lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;8&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;, stride&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;, padding&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;relu &lt;span style="color:#f92672"&gt;=&lt;/span&gt; nn&lt;span style="color:#f92672"&gt;.&lt;/span&gt;ReLU()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;pool &lt;span style="color:#f92672"&gt;=&lt;/span&gt; nn&lt;span style="color:#f92672"&gt;.&lt;/span&gt;AdaptiveAvgPool2d((&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;fc &lt;span style="color:#f92672"&gt;=&lt;/span&gt; nn&lt;span style="color:#f92672"&gt;.&lt;/span&gt;Linear(&lt;span style="color:#ae81ff"&gt;8&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;4&lt;/span&gt;) &lt;span style="color:#75715e"&gt;# 4-class toy output&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;forward&lt;/span&gt;(self, x):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; x &lt;span style="color:#f92672"&gt;=&lt;/span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;relu(self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;conv(x))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; x &lt;span style="color:#f92672"&gt;=&lt;/span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;pool(x)&lt;span style="color:#f92672"&gt;.&lt;/span&gt;flatten(&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;fc(x)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I ran this on a NVIDIA 4060 8 GB (Laptop) for 10K iterations, on Windows and WSL-with-Ubuntu, with float32 data.&lt;/p&gt;</summary><id>/blog/1756113601/</id></entry><entry><title>Post from Jun 17, 2025</title><link rel="alternate" type="text/html" href="/blog/1750136474/"/><published>2025-06-17T05:01:14Z</published><updated>2025-06-17T05:01:14Z</updated><summary>&lt;p&gt;Development update for Easy Diffusion - It&amp;rsquo;s chugging along in starts and stops. Broadly, there are three tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Maintenance: The past few months have seen increased support for AMD, Intel and integrated GPUs. This includes AMD on Windows. Added support for the new AMD 9060/9070 cards last week, and the new NVIDIA 50xx cards in March.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Flux to the main branch / release v3.5 to stable: Right now, Flux / v3.5 still requires you to enable ED beta first, and then install Forge. Last week I got Flux working in our main engine (with decent rendering speed). It still needs more work to support all the different model formats for Flux. Using Forge was a temporary arrangement, until Flux worked in our main engine.&lt;/p&gt;</summary><id>/blog/1750136474/</id></entry><entry><title>Post from Mar 04, 2025</title><link rel="alternate" type="text/html" href="/blog/1741122446/"/><published>2025-03-04T21:07:26Z</published><updated>2025-03-04T21:07:26Z</updated><summary>&lt;p&gt;Upgraded the default version of Easy Diffusion to Python 3.9. Newer versions of torch don&amp;rsquo;t support Python 3.8, so this became urgent after the release of NVIDIA&amp;rsquo;s 50xx series GPUs.&lt;/p&gt;
&lt;p&gt;I chose 3.9 as a temporary fix (instead of a newer Python version), since it had the fewest package conflicts. The future direction of Easy Diffusion&amp;rsquo;s backend is unclear right now - there are a bunch of possible paths. So I didn&amp;rsquo;t want to spend too much time on this. I also wanted to minimize the risk to existing users.&lt;/p&gt;</summary><id>/blog/1741122446/</id></entry><entry><title>Post from Feb 10, 2025</title><link rel="alternate" type="text/html" href="/blog/1739186837/"/><published>2025-02-10T11:27:17Z</published><updated>2025-02-10T11:27:17Z</updated><summary>&lt;p&gt;Easy Diffusion (and &lt;code&gt;sdkit&lt;/code&gt;) now also support AMD on Windows automatically (using DirectML), thanks to integrating with &lt;a href="https://github.com/easydiffusion/torchruntime/"&gt;torchruntime&lt;/a&gt;. It also supports integrated GPUs (Intel and AMD) on Windows, making Easy Diffusion faster on PCs without dedicated graphics cards.&lt;/p&gt;</summary><id>/blog/1739186837/</id></entry><entry><title>Post from Feb 10, 2025</title><link rel="alternate" type="text/html" href="/blog/1739186602/"/><published>2025-02-10T11:23:22Z</published><updated>2025-02-10T11:23:22Z</updated><summary>&lt;p&gt;Spent the last week or two getting &lt;a href="https://github.com/easydiffusion/torchruntime/"&gt;torchruntime&lt;/a&gt; fully integrated into Easy Diffusion, and making sure that it handles all the edge-cases.&lt;/p&gt;
&lt;p&gt;Easy Diffusion now uses &lt;code&gt;torchruntime&lt;/code&gt; to automatically install the best-possible version of &lt;code&gt;torch&lt;/code&gt; (on the users&amp;rsquo; computer) and support a wider variety of GPUs (as well as older GPUs). And it uses a GPU-agnostic device API, so Easy Diffusion will automatically support additional GPUs when they are supported by &lt;code&gt;torchruntime&lt;/code&gt;.&lt;/p&gt;</summary><id>/blog/1739186602/</id></entry><entry><title>Post from Jan 28, 2025</title><link rel="alternate" type="text/html" href="/blog/1738102652/"/><published>2025-01-28T22:17:32Z</published><updated>2025-01-28T22:17:32Z</updated><summary>&lt;p&gt;Continued to test and fix issues in sdkit, after the change to support DirectML. The change is fairly intrusive, since it replaces direct references to &lt;code&gt;torch.cuda&lt;/code&gt; with a layer of abstraction.&lt;/p&gt;
&lt;p&gt;Fixed a few regressions, and it now passes all the regression tests for CPU and CUDA support (i.e. existing users). Will test for DirectML next, although it will fail (with out-of-memory) for anything but the simplest tests (since DirectML is quirky with memory allocation).&lt;/p&gt;</summary><id>/blog/1738102652/</id></entry><entry><title>Post from Jan 27, 2025</title><link rel="alternate" type="text/html" href="/blog/1738011692/"/><published>2025-01-27T21:01:32Z</published><updated>2025-01-27T21:01:32Z</updated><summary>&lt;p&gt;Worked on adding support for DirectML in sdkit. This allows AMD GPUs and Integrated GPUs to generate images on Windows.&lt;/p&gt;
&lt;p&gt;DirectML seems like it&amp;rsquo;s really inefficient with memory though. So for now it only manages to generate images using SD 1.5. XL and larger models fail to generate, even though my graphics card has 12 GB of VRAM.&lt;/p&gt;</summary><id>/blog/1738011692/</id></entry><entry><title>Post from Jan 22, 2025</title><link rel="alternate" type="text/html" href="/blog/1737566382/"/><published>2025-01-22T17:19:42Z</published><updated>2025-01-22T17:19:42Z</updated><summary>&lt;p&gt;&lt;em&gt;Continued from &lt;a href="https://cmdr2.github.io/notes/2025/01/1737134382/"&gt;Part 1&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Spent a few days figuring out how to compile binary wheels of PyTorch and include all the necessary libraries (ROCm libs or CUDA libs).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; - In Part 2, the compiled PyTorch wheels now include the required libraries (including ROCm). But this isn&amp;rsquo;t over yet. Torch starts now, but adding two numbers with it produces garbage values (on the GPU). There&amp;rsquo;s probably a bug in the included ROCBLAS version, so I might need to recompile ROCBLAS for gfx803 separately. Will tackle that in Part 3 (tbd).&lt;/p&gt;</summary><id>/blog/1737566382/</id></entry><entry><title>Post from Jan 17, 2025</title><link rel="alternate" type="text/html" href="/blog/1737134382/"/><published>2025-01-17T17:19:42Z</published><updated>2025-01-17T17:19:42Z</updated><summary>&lt;p&gt;&lt;em&gt;Continued in &lt;a href="https://cmdr2.github.io/notes/2025/01/1737566382/"&gt;Part 2&lt;/a&gt;, where I figured out how to include the required libraries in the wheel.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Spent all of yesterday trying to compile &lt;code&gt;pytorch&lt;/code&gt; with the compile-time &lt;code&gt;PYTORCH_ROCM_ARCH=gfx803&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; - In Part 1, I compiled wheels for PyTorch with ROCm, in order to add support for older AMD cards like RX 480. I managed to compile the wheels, but the wheel doesn&amp;rsquo;t include the required ROCm libraries. I figured that out in &lt;a href="https://cmdr2.github.io/notes/2025/01/1737566382/"&gt;Part 2&lt;/a&gt;.&lt;/p&gt;</summary><id>/blog/1737134382/</id></entry><entry><title>Post from Jan 13, 2025</title><link rel="alternate" type="text/html" href="/blog/1736779606/"/><published>2025-01-13T14:46:46Z</published><updated>2025-01-13T14:46:46Z</updated><summary>&lt;p&gt;Spent the last few days writing &lt;a href="https://github.com/easydiffusion/torchruntime"&gt;torchruntime&lt;/a&gt;, which will automatically install the correct torch distribution based on the user&amp;rsquo;s OS and graphics card. This package was written by extracting this logic out of Easy Diffusion, and refactoring it into a cleaner implementation (with tests).&lt;/p&gt;
&lt;p&gt;It can be installed (on Win/Linux/Mac) using &lt;code&gt;pip install torchruntime&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The main intention is that it&amp;rsquo;ll be easier for developers to contribute updates (e.g. for newer or older GPUs). It wasn&amp;rsquo;t easy to find or modify this code previously, since it was buried deep inside Easy Diffusion&amp;rsquo;s internals.&lt;/p&gt;</summary><id>/blog/1736779606/</id></entry><entry><title>Post from Jan 04, 2025</title><link rel="alternate" type="text/html" href="/blog/1736020626/"/><published>2025-01-04T19:57:06Z</published><updated>2025-01-04T19:57:06Z</updated><summary>&lt;p&gt;Spent most of the day doing some support work for Easy Diffusion, and experimenting with &lt;a href="https://pypi.org/project/torch-directml/"&gt;torch-directml&lt;/a&gt; for AMD support on Windows.&lt;/p&gt;
&lt;p&gt;From the initial experiments, torch-directml seems to work properly with Easy Diffusion. I ran it on my NVIDIA card, and another user ran it on their AMD Radeon RX 7700 XT.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s 7-10x faster than the CPU, so it looks promising. It&amp;rsquo;s 2x slower than CUDA on my NVIDIA card, but users with NVIDIA cards are not the target audience of this change.&lt;/p&gt;</summary><id>/blog/1736020626/</id></entry><entry><title>Post from Jan 03, 2025</title><link rel="alternate" type="text/html" href="/blog/1735918711/"/><published>2025-01-03T15:38:31Z</published><updated>2025-01-03T15:38:31Z</updated><summary>&lt;p&gt;Spent a few days prototyping a UI for Easy Diffusion v4. Files are at &lt;a href="https://github.com/easydiffusion/files/blob/main/ED4-ui-design/prototype"&gt;this repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The main focus was to get a simple but pluggable UI backed by a reactive data model, and to allow splitting the codebase into individual components (with their own files). And to require only a text editor and a browser for development, i.e. no compilation or nodejs-based tooling.&lt;/p&gt;
&lt;p&gt;I really want something that is easy to understand - for an outside developer and for myself (e.g. if I&amp;rsquo;m returning to a portion of the codebase after a while). And with very little friction to start developing for it.&lt;/p&gt;</summary><id>/blog/1735918711/</id></entry><entry><title>Post from Dec 17, 2024</title><link rel="alternate" type="text/html" href="/blog/1734433390/"/><published>2024-12-17T11:03:10Z</published><updated>2024-12-17T11:03:10Z</updated><summary>&lt;p&gt;Notes on two directions for ED4&amp;rsquo;s UI that I&amp;rsquo;m unlikely to continue on.&lt;/p&gt;
&lt;p&gt;One is to start a desktop app with a full-screen webview (for the app UI). The other is to write the tabbed browser-like shell of ED4 in a compiled language (like Go or C++) and load the contents of the tabs as regular webpages (using webviews). So it would load URLs like &lt;code&gt;http://localhost:9000/ui/image_editor&lt;/code&gt; and &lt;code&gt;http://localhost:9000/ui/settings&lt;/code&gt; etc.&lt;/p&gt;
&lt;p&gt;In the first approach, we would start an empty full-screen webview, and let the webpage draw the entire UI, including the tabbed shell. The only purpose of this would be to start a desktop app instead of opening a browser tab, while being very lightweight (compared to Electron/Tauri style implementations).&lt;/p&gt;</summary><id>/blog/1734433390/</id></entry><entry><title>Post from Dec 14, 2024</title><link rel="alternate" type="text/html" href="/blog/1734205658/"/><published>2024-12-14T19:47:38Z</published><updated>2024-12-14T19:47:38Z</updated><summary>&lt;p&gt;Worked on a few UI design ideas for Easy Diffusion v4. I&amp;rsquo;ve uploaded the work-in-progress mockups at &lt;a href="https://github.com/easydiffusion/files"&gt;https://github.com/easydiffusion/files&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So far, I&amp;rsquo;ve mocked out the design for the outer skeleton. That is, the new tabbed interface, the status bar, and the unified main menu. I also worked on how they would look on mobile devices.&lt;/p&gt;
&lt;p&gt;It gives me a rough idea of the &lt;code&gt;Vue&lt;/code&gt; components that would need to be written, and the surface area that plugins can impact. For example, plugins can add a new menu entry only in the &lt;code&gt;Plugins&lt;/code&gt; sub-menu.&lt;/p&gt;</summary><id>/blog/1734205658/</id></entry><entry><title>Post from Nov 21, 2024</title><link rel="alternate" type="text/html" href="/blog/1732202276/"/><published>2024-11-21T15:17:56Z</published><updated>2024-11-21T15:17:56Z</updated><summary>&lt;p&gt;Spent some more time on the &lt;a href="https://github.com/cmdr2/easy-diffusion4"&gt;v4 experiments&lt;/a&gt; for Easy Diffusion (i.e. C++ based, fast-startup, lightweight). &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; is missing a few features, which will be necessary for Easy Diffusion&amp;rsquo;s typical workflow. I wasn&amp;rsquo;t keen on forking stable-diffusion.cpp, but it&amp;rsquo;s probably faster to work on &lt;a href="https://github.com/cmdr2/stable-diffusion.cpp"&gt;a fork&lt;/a&gt; for now.&lt;/p&gt;
&lt;p&gt;For now, I&amp;rsquo;ve added live preview and per-step progress callbacks (based on a few pending pull-requests on sd.cpp). And protection from &lt;code&gt;GGML_ASSERT&lt;/code&gt; killing the entire process. I&amp;rsquo;ve been looking at the ability to load individual models (like the vae) without needing to reload the entire SD model.&lt;/p&gt;</summary><id>/blog/1732202276/</id></entry><entry><title>Post from Nov 19, 2024</title><link rel="alternate" type="text/html" href="/blog/1732043895/"/><published>2024-11-19T19:18:15Z</published><updated>2024-11-19T19:18:15Z</updated><summary>&lt;p&gt;Spent a few days getting a C++ based version of Easy Diffusion working, using stable-diffusion.cpp. I&amp;rsquo;m working with a fork of stable-diffusion.cpp &lt;a href="https://github.com/cmdr2/stable-diffusion.cpp"&gt;here&lt;/a&gt;, to add a few changes like per-step callbacks, live image previews etc.&lt;/p&gt;
&lt;p&gt;It doesn&amp;rsquo;t have a UI yet, and currently hardcodes a model path. It exposes a RESTful API server (written using the &lt;code&gt;Crow&lt;/code&gt; C++ library), and uses a simple task manager that runs image generation tasks on a thread. The generated images are available at an API endpoint, which serves the binary JPEG/PNG image directly (instead of base64-encoding it).&lt;/p&gt;</summary><id>/blog/1732043895/</id></entry><entry><title>Post from Oct 16, 2024</title><link rel="alternate" type="text/html" href="/blog/1729102225/"/><published>2024-10-16T18:10:25Z</published><updated>2024-10-16T18:10:25Z</updated><summary>&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; - &lt;em&gt;Today, I worked on using stable-diffusion.cpp in a simple C++ program. As a linked library, as well as compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny and fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="part-1-using-sdcpp-as-a-library"&gt;Part 1: Using sd.cpp as a library&lt;/h2&gt;
&lt;p&gt;First, I tried calling the &lt;a href="https://github.com/leejet/stable-diffusion.cpp"&gt;stable-diffusion.cpp&lt;/a&gt; library from a simple C++ program (which just loads the model and renders an image), via dynamic linking. That worked: its performance was the same as the example &lt;code&gt;sd.exe&lt;/code&gt; CLI, and it detected and used the GPU correctly.&lt;/p&gt;</summary><id>/blog/1729102225/</id></entry><entry><title>Post from Sep 04, 2024</title><link rel="alternate" type="text/html" href="/blog/1725463249/"/><published>2024-09-04T15:20:49Z</published><updated>2024-09-04T15:20:49Z</updated><summary>&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;: Explored a possible optimization for Flux with &lt;code&gt;diffusers&lt;/code&gt; when using &lt;code&gt;enable_sequential_cpu_offload()&lt;/code&gt;. It did not work.&lt;/p&gt;
&lt;p&gt;While trying to use Flux (nearly 22 GB of weights) with &lt;code&gt;diffusers&lt;/code&gt; on a 12 GB graphics card, I noticed that it barely used any GPU memory when using &lt;code&gt;enable_sequential_cpu_offload()&lt;/code&gt;. And it was super slow. It turns out that the largest module in Flux&amp;rsquo;s transformer model is around 108 MB, and because diffusers streams modules to the GPU one at a time, the peak VRAM usage never rose above a few hundred MB.&lt;/p&gt;</summary><id>/blog/1725463249/</id></entry></feed>