The Core Idea: Learning to Undo Noise
Diffusion models are the current state of the art in high-quality image generation. The idea, inspired by non-equilibrium thermodynamics, is both elegant and powerful, and it rests on two opposing processes.
Part 1: The Forward Process (Diffusion)
This first part is simple, fixed, and requires no training. Its only purpose is to generate training data for the second part.
- Start with a real image from your dataset (e.g., a photo of a cat).
- In a predefined number of timesteps (e.g., T=1000), you gradually add a small amount of Gaussian noise to the image.
- You repeat this process, adding a little more noise at each step.
- After T steps, the original image is completely transformed into pure, random noise.
This process is a Markov chain: the state of the image at any timestep t depends only on its state at t-1. Because we control the amount of noise added at each step, the process is predictable and mathematically defined; in fact, a closed-form expression lets us jump from the original image x_0 straight to its noisy version x_t for any t, without simulating the intermediate steps.
Analogy: Imagine dropping a single speck of ink into a glass of water. The forward process is like watching the ink slowly and predictably diffuse until the water is uniformly gray. We know exactly how this happens.
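The forward process can be sketched in a few lines. The following is a minimal NumPy sketch assuming a linear beta schedule; the specific values (T=1000, betas from 1e-4 to 0.02) follow common DDPM conventions, and names like `alpha_bar` and `q_sample` are illustrative, not prescribed by the text.

```python
# A minimal sketch of the forward (noising) process, assuming a linear
# beta schedule. Names (betas, alpha_bar, q_sample) are illustrative.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise variance added at each step
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative product over all steps so far

def q_sample(x0, t, rng):
    """Jump directly from a clean image x0 to its noisy version x_t.

    Uses the closed form:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    with eps drawn from a standard Gaussian.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = np.zeros((8, 8))                     # stand-in for a real image
x_early, _ = q_sample(x0, 10, rng)        # still close to the original
x_late, _ = q_sample(x0, T - 1, rng)      # essentially pure Gaussian noise
```

Note how `alpha_bar[T-1]` is nearly zero, which is exactly the claim above: after T steps almost nothing of the original image survives, and x_T is effectively pure noise.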
Part 2: The Reverse Process (Denoising)
This is where the magic and the machine learning happen. The goal is to train a neural network to reverse the diffusion process.
The Training: The network's task is surprisingly simple. It is a denoiser.
- We take a random image from our dataset.
- We randomly pick a timestep, t.
- We use the forward process's closed-form formula to generate the noisy version of our image at timestep t in a single jump.
- We feed this noisy image and the timestep t into our neural network (typically a U-Net architecture).
- The network's job is to predict the noise that was added to the image at that step. Not the original image, just the noise itself.
- We calculate the loss by comparing the network's predicted noise with the actual noise that we added.
- We update the network's weights through backpropagation.
By repeating this millions of times with images at all different stages of noisiness, the network becomes very good at predicting the noise present in an image at any given timestep, which is exactly what the sampling procedure below needs it to do.
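The training loop described above can be sketched as a single step. This is a self-contained NumPy sketch with a zero-returning stub in place of the real U-Net; the schedule values and names (`model`, `training_step`) are illustrative assumptions, and the actual backpropagation step is only indicated in a comment.

```python
# One training step for a noise-prediction diffusion model (sketch).
# `model` is a stand-in for the real U-Net; here it just returns zeros.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(x_t, t):
    # Placeholder for the U-Net: it would take the noisy image and the
    # timestep and output a noise prediction of the same shape.
    return np.zeros_like(x_t)

def training_step(x0):
    t = int(rng.integers(0, T))                  # 1. pick a random timestep
    eps = rng.standard_normal(x0.shape)          # 2. the noise we will add
    x_t = (np.sqrt(alpha_bar[t]) * x0            # 3. closed-form forward jump
           + np.sqrt(1.0 - alpha_bar[t]) * eps)
    eps_pred = model(x_t, t)                     # 4. network predicts the noise
    loss = np.mean((eps_pred - eps) ** 2)        # 5. MSE against the true noise
    return loss                                  # 6. backprop would follow here

loss = training_step(np.zeros((8, 8)))
```

The key design point is step 5: the loss compares predicted noise against the noise actually added, not against the original image. Predicting the noise turns out to be an easier, better-conditioned target for the network.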
Generation (Sampling): Creating New Images
Once the network is trained, we can use it to generate brand new images.
- Start with pure noise: Create a new image tensor filled with random Gaussian noise. This is our starting point, x_T.
- Iterate backwards: We start at the last timestep, t=T, and step down to t=1; the output of the final step is the clean image, x_0.
- Predict and Subtract: In each step t, we feed the current image x_t into our trained network. The network predicts the noise present in x_t. We then use a formula to subtract a scaled portion of this predicted noise from x_t to get a slightly cleaner image, x_{t-1}. (In the standard DDPM sampler, a small amount of fresh Gaussian noise is also re-injected at every step except the last, which keeps the process stochastic.)
- Repeat: We take this new, cleaner image x_{t-1} and feed it back into the network to repeat the process for the next step.
As we iterate backwards from t=T down to t=0, a coherent, high-quality image slowly emerges from the initial random noise, as if sculpting a statue from a block of marble. This iterative refinement is what gives diffusion models their incredible quality.
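The sampling loop above can be sketched as follows. This is a self-contained NumPy sketch using the standard DDPM update rule; the `model` stub stands in for a trained noise predictor, so the output here is not a real image, only a demonstration of the loop's mechanics. The schedule and names are illustrative assumptions.

```python
# The DDPM sampling loop (sketch): start from pure noise and iteratively
# denoise. `model` is a zero-returning stand-in for the trained U-Net.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def model(x_t, t):
    return np.zeros_like(x_t)   # stand-in for the trained noise predictor

def sample(shape=(8, 8)):
    x = rng.standard_normal(shape)                 # x_T: pure Gaussian noise
    for t in range(T - 1, -1, -1):                 # iterate t = T-1 ... 0
        eps_pred = model(x, t)
        # Subtract the scaled predicted noise (the DDPM mean estimate).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(
            alphas[t]
        )
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x                                       # this is x_0

img = sample()
```

With a trained network in place of the stub, each pass through the loop removes a sliver of noise, and the image "emerges" over the course of the T iterations, exactly the sculpting process described above.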