Understanding Diffusion Models

Published on October 15, 2025

A deep dive into how diffusion models work, from the basic principles to the latest advances in text-to-image generation.

Diffusion models have recently taken the world of generative AI by storm. From DALL·E 2 to Stable Diffusion, these models produce stunningly realistic images from simple text prompts. But how do they work? The core idea is surprisingly elegant: start with noise, and learn to reverse the noising process, step by step, until an image emerges.

The Core Idea: Adding Noise

Imagine you have a clear photograph. Now, you add a tiny amount of Gaussian noise to it. The image is slightly distorted. What if you repeat this process hundreds, or even thousands, of times? Eventually, your original photograph becomes indistinguishable from pure, random noise. This is the "forward process," and it's a fixed, straightforward procedure. At each time step, only the image from the previous time step is needed to generate the image at the current one, which makes the forward process a Markov chain.
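To make this concrete, here is a minimal sketch of that step-by-step noising loop in PyTorch. The linear variance schedule, the number of steps, and the random image are illustrative placeholders, not values from any particular paper or codebase.

import torch

def forward_diffuse_stepwise(x0, betas):
    # Walk the Markov chain one step at a time:
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    x = x0
    for beta_t in betas:
        noise = torch.randn_like(x)
        x = torch.sqrt(1.0 - beta_t) * x + torch.sqrt(beta_t) * noise
    return x  # after enough steps, x is close to pure Gaussian noise

# Example: 1,000 steps with a simple linear variance schedule
betas = torch.linspace(1e-4, 0.02, 1000)
image = torch.rand(1, 3, 64, 64)  # a stand-in for a real photograph
noised = forward_diffuse_stepwise(image, betas)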

The Forward Process: A Mathematical View

Mathematically, the forward process is a Markov chain that gradually adds noise to the data. Given a data point $x_0$, we can define each step of the process as:

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right) $$

where $\beta_t$ is a small positive constant defining the variance schedule. A key property of this process is that we can sample $x_t$ at any arbitrary timestep $t$ directly from $x_0$:

$$ q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, \mathbf{I}\right) $$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. This allows us to train the model efficiently without having to iterate through the entire chain.
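Equivalently, using the reparameterization of a Gaussian, a noisy sample at timestep $t$ can be written directly in terms of $x_0$ and a single noise draw. This is the standard identity implied by the formula above:

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$

This is exactly the form used in the training step later in this post: sample $\epsilon$, mix it with $x_0$ according to $\bar{\alpha}_t$, and ask the network to recover $\epsilon$.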

We can also view a diffusion model as a hierarchical variational autoencoder, where each node in the Markov chain is a latent variable corresponding to a particular noise level. Intuitively, because the model only ever has to predict a slightly less noisy version of its input, each individual step is easy to learn, and the original input is reconstructed by chaining many of these small steps together.

The Reverse Process: Learning to Denoise

The magic is in the "reverse process." Here, we train a neural network to undo one step of the noising process. Given a noisy image $x_t$, the network's job is to predict the slightly less noisy image $x_{t-1}$, or more commonly, to predict the noise that was added.
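Formally, the learned reverse step is usually written as a Gaussian whose mean is produced by the network; this is the standard DDPM parameterization, with $\theta$ denoting the network parameters (notation introduced here for clarity):

$$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\right) $$

When the network predicts the added noise $\epsilon_\theta(x_t, t)$ rather than the mean directly, $\mu_\theta$ can be recovered from that prediction in closed form.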

This is where deep learning comes in. We use a network, typically a U-Net architecture, to learn this denoising step. The network takes the noisy image $x_t$ and the timestep $t$ as input and outputs the predicted noise.

# Simplified training step for a noise-prediction diffusion model (PyTorch-style)
import torch
import torch.nn.functional as F

def training_step(original_image, model, optimizer, alphas_cumprod, T):
    # 1. Pick a random timestep for each image in the batch
    t = torch.randint(1, T + 1, (original_image.shape[0],), device=original_image.device)

    # 2. Sample Gaussian noise with the same shape as the input
    noise = torch.randn_like(original_image)

    # 3. Create the noisy image using the closed-form formula:
    #    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    alpha_bar_t = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    noisy_image = alpha_bar_t.sqrt() * original_image + (1.0 - alpha_bar_t).sqrt() * noise

    # 4. Get the model's prediction of the noise
    predicted_noise = model(noisy_image, t)

    # 5. Calculate the loss between the true and the predicted noise
    loss = F.mse_loss(predicted_noise, noise)

    # 6. Update the model weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
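As a usage sketch, the variance schedule can be precomputed once and reused for every batch. The linear schedule below and the names dataloader, model, optimizer, and num_epochs are placeholders, not part of any specific library:

# Precompute the schedule: alpha_bar_t is the running product of (1 - beta_i)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

num_epochs = 10
for epoch in range(num_epochs):
    for batch in dataloader:  # each batch is a tensor of images
        loss = training_step(batch, model, optimizer, alphas_cumprod, T)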

Efficient Training

However, training a model by simulating the Markov chain directly would be inefficient: we would have to step through every timestep of the forward process before performing a single gradient update. This is where the closed-form expression for $q(x_t \mid x_0)$ pays off. Because the forward process adds Gaussian noise with a fixed variance schedule, we can jump straight to any noise level in constant time, pick a random timestep for each training example, and train the model to denoise it, without ever iterating through the chain.

Conclusion

By training a network to perform this simple denoising task at every possible noise level, the model implicitly learns the entire data distribution. To generate a new sample, we simply start with random noise $x_T$ and apply the learned denoising network iteratively for $T$ steps until we arrive at a clean data point $x_0$. It's a powerful and scalable approach that forms the foundation of modern generative modeling.
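For completeness, here is a minimal sketch of that generation loop, using the standard DDPM ancestral-sampling update with $\sigma_t^2 = \beta_t$. It assumes the same model, betas, and alphas_cumprod defined above and is illustrative rather than a production implementation:

@torch.no_grad()
def sample(model, betas, alphas_cumprod, shape):
    # Start from pure Gaussian noise x_T and denoise step by step
    x = torch.randn(shape)
    T = len(betas)
    for t in reversed(range(T)):  # t = T-1, ..., 0 (0-indexed into the schedule)
        t_batch = torch.full((shape[0],), t + 1, dtype=torch.long)
        predicted_noise = model(x, t_batch)

        alpha_t = 1.0 - betas[t]
        alpha_bar_t = alphas_cumprod[t]

        # DDPM mean: subtract the predicted-noise contribution, then rescale
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)

        # Add fresh noise at every step except the last
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # an (approximate) sample from the data distribution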