The Core Idea: Learning to Undo Noise

Diffusion models are among the current state of the art in high-quality image generation. The concept, inspired by thermodynamics, is both elegant and powerful. It consists of two opposing processes.

Part 1: The Forward Process (Diffusion)

This first part is simple, fixed, and requires no training. Its only purpose is to generate training data for the second part.

  1. Start with a real image from your dataset (e.g., a photo of a cat).
  2. In a predefined number of timesteps (e.g., T=1000), you gradually add a small amount of Gaussian noise to the image.
  3. You repeat this process, adding a little more noise at each step.
  4. After T steps, the original image is indistinguishable from pure Gaussian noise.

This process is a Markov chain: the state of the image at any timestep t depends only on its state at t-1. Because we control the amount of noise added at each step, the process is predictable and mathematically defined. A convenient consequence is a closed-form shortcut: instead of looping through t small noise additions, we can jump from the original image x_0 straight to its noisy version x_t in a single step.
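The forward process above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming a commonly used linear beta schedule (from 1e-4 to 0.02); the names `forward_diffuse` and `alpha_bars` are my own, not standard API:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise added per step
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal kept up to step t

def forward_diffuse(x0, t, rng):
    """Sample x_t directly from x_0 via the closed-form shortcut:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))     # stand-in for a real image
xt, eps = forward_diffuse(x0, T - 1, rng)
# At t = T-1, alpha_bar is nearly zero: almost no signal is left, only noise.
```

Note how `alpha_bars[-1]` is tiny, which is exactly the "image becomes pure noise" behavior described above.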

Analogy: Imagine dropping a single speck of ink into a glass of water. The forward process is like watching the ink slowly and predictably diffuse until the water is uniformly gray. We know exactly how this happens.

Part 2: The Reverse Process (Denoising)

This is where the magic and the machine learning happen. The goal is to train a neural network to reverse the diffusion process.

The Training: The network's task is surprisingly simple: it's a denoiser.

  1. We take a random image from our dataset.
  2. We randomly pick a timestep, t.
  3. We use the forward process formula to generate the noisy version of our image at that timestep t.
  4. We feed this noisy image and the timestep t into our neural network (typically a U-Net architecture).
  5. The network's job is to predict the total noise that was mixed into the image to produce this noisy version from the original. Not the original image, just the noise itself.
  6. We calculate the loss by comparing the network's predicted noise with the actual noise that we added.
  7. We update the network's weights through backpropagation.

By repeating this millions of times with images at all different stages of noisiness, the network becomes a master at predicting and removing a small amount of noise for any given timestep.
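The seven training steps above can be sketched as a single function. This is a minimal NumPy sketch: `predict_noise` is a placeholder for the real U-Net, and the backpropagation step (step 7) is omitted because it requires a deep-learning framework; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(xt, t):
    """Placeholder for the trained U-Net eps_theta(x_t, t).
    A real model would be a neural network conditioned on t."""
    return np.zeros_like(xt)

def training_step(x0):
    t = rng.integers(0, T)                                # step 2: random timestep
    eps = rng.standard_normal(x0.shape)                   # the noise we add
    xt = (np.sqrt(alpha_bars[t]) * x0
          + np.sqrt(1.0 - alpha_bars[t]) * eps)           # step 3: noisy image
    eps_pred = predict_noise(xt, t)                       # steps 4-5: predict the noise
    loss = np.mean((eps_pred - eps) ** 2)                 # step 6: MSE on the noise
    return loss                                           # step 7 (backprop) omitted

x0 = rng.standard_normal((8, 8))                          # stand-in for a dataset image
loss = training_step(x0)
```

The key design choice this illustrates: the loss compares predicted noise to actual noise, so the network never has to reconstruct the image directly.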

Generation (Sampling): Creating New Images

Once the network is trained, we can use it to generate brand new images.

  1. Start with pure noise: Create a new image tensor filled with random Gaussian noise. This is our starting point, x_T.
  2. Iterate backwards: We start at the last timestep, t=T, and work our way down to t=0.
  3. Predict and Subtract: In each step t, we feed the current image x_t into our trained network. The network predicts the noise component of x_t. We then use a formula to subtract a fraction of this predicted noise from x_t, and add a small amount of fresh noise (at every step except the last), to get a slightly cleaner image, x_{t-1}.
  4. Repeat: We take this new, cleaner image x_{t-1} and feed it back into the network to repeat the process for the next step.

As we iterate backwards from t=T down to t=0, a coherent, high-quality image slowly emerges from the initial random noise, as if sculpting a statue from a block of marble. This iterative refinement is what gives diffusion models their incredible quality.
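Put together, the sampling loop can be sketched as follows. Again `predict_noise` is a stand-in for the trained network (a real run would load trained weights), and the update inside the loop is the standard DDPM reverse step:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(xt, t):
    """Placeholder for the trained U-Net eps_theta(x_t, t)."""
    return np.zeros_like(xt)

def sample(shape):
    x = rng.standard_normal(shape)            # step 1: start from pure noise x_T
    for t in range(T - 1, -1, -1):            # step 2: iterate backwards
        eps = predict_noise(x, t)
        # step 3: subtract a fraction of the predicted noise (DDPM update)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                             # fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x                                  # step 4: the denoised image x_0

img = sample((8, 8))
```

With an untrained placeholder the output is still noise, of course; the sculpting effect only appears once `predict_noise` is a network trained as described above.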