
AI Diffusion Models Explained Mathematically in Simple Terms

What Are Diffusion Models?

Imagine you have a clear photo of a cat. Now imagine gradually adding static noise to it, like on an old TV, until it becomes pure random noise. Diffusion models learn to reverse this process - they start with random noise and gradually remove it to create a clear image.

The Core Math (Made Simple)

Forward Process: Adding Noise

The forward process adds noise to data in small steps. Mathematically:

x(t) = √(1-β) × x(t-1) + √β × noise

Where:

  • x(t) = image at time step t
  • β = small number (like 0.0001) controlling how much noise to add
  • noise = random static

Think of it like adding a tiny bit of fog to a window each second until you can't see through it.
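
A minimal NumPy sketch of one forward step (the function and variable names here are illustrative, not from any library):

import numpy as np

# One forward noising step: blend the previous image with fresh static.
# `x_prev` is an image as a float array; `beta` is this step's noise level.
def forward_step(x_prev, beta=0.0001, rng=np.random.default_rng()):
    noise = rng.standard_normal(x_prev.shape)   # random static, N(0, 1) per pixel
    return np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * noise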

Reverse Process: Removing Noise

The magic happens when we reverse this process. The model learns:

x(t-1) = 1/√(1-β) × [x(t) - √β × predicted_noise]

The AI learns to predict what noise was added, then subtracts it! (This is the one-step inversion of the forward formula; the exact sampling rule used in practice, which rescales the prediction, is derived in the second half of this post.)
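
As a minimal NumPy sketch of this simplified inversion, mirroring `forward_step` above (`reverse_step` is an illustrative name, not a real API):

import numpy as np

# Undo one noising step, given the model's guess of the noise that was added.
def reverse_step(x_t, predicted_noise, beta=0.0001):
    return (x_t - np.sqrt(beta) * predicted_noise) / np.sqrt(1 - beta)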

Why This Works: The Brilliant Insight

The key insight is that removing a little bit of noise is much easier than generating an entire image from scratch. It's like:

  • Hard: "Draw a cat"
  • Easy: "This fuzzy image has some static on it, clean it up a bit"

By breaking generation into hundreds of tiny "clean up a bit" steps, we make an impossible problem possible.

The Training Process

  1. Take a real image
  2. Pick a random time step t (out of T steps, typically 1000) and add the corresponding amount of noise
  3. Train the AI to predict the noise that was added
  4. Loss function: how wrong was the noise prediction?

Loss = ||actual_noise - predicted_noise||²
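
In code this is just a mean squared error over all pixels (a sketch; `actual_noise` and `predicted_noise` are assumed to be arrays of the same shape):

import numpy as np

loss = np.mean((actual_noise - predicted_noise) ** 2)   # average squared error per pixel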

Generating New Images

Once trained, to generate a new image:

  1. Start with pure random noise
  2. For 1000 steps, ask the model: "What noise should I remove?"
  3. Remove that predicted noise
  4. Repeat until you have a clear image

The Score Function (Advanced but Simple)

The model actually learns (a scaled version of) the "score function" - the gradient of the log of the data distribution:

score = ∇log p(x)

In simple terms: "Which direction should I move to make this look more like a real image?"
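
In fact, for noise-prediction diffusion models the two views coincide: the predicted noise is just a rescaled estimate of the score,

predicted_noise ≈ -√(1 - ᾱₜ) × score

where ᾱₜ is the accumulated noise schedule defined in the second half of this post. Learning to predict noise is therefore the same as learning which direction "more realistic" lies in.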

Why Diffusion Models Are Special

Compared to GANs:

  • More stable training
  • Better mode coverage (more variety)
  • Easier to control

Compared to VAEs:

  • Higher quality images
  • More flexible

Key Mathematical Concepts

1. Markov Chain

Each step only depends on the previous step, not the entire history:

  • x(999) → x(998) → x(997) → ... → x(0)

2. Gaussian Distribution

The noise added is Gaussian (bell curve) shaped:

  • Most values near zero
  • Few extreme values
  • Natural and mathematically convenient

3. Variance Schedule

We control how much noise to add at each step (sketched in code after this list):

  • β₁ = 0.0001 (tiny noise at first)
  • β₁₀₀₀ = 0.02 (more noise later)
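
A sketch of this schedule in NumPy; the linear ramp from 1e-4 to 0.02 over 1000 steps matches the values above (and the original DDPM paper):

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # β grows linearly from tiny to moderate
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # ᾱ_t: how much of the original image survives at step t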

Simple Code Example (Conceptual)

# Training (conceptual - helper functions are illustrative, not a real API)
for image in dataset:
    t = random_timestep()                      # pick a random step in [1, T]
    noise = generate_random_noise()            # N(0, 1) static, same shape as image
    noisy_image = add_noise(image, t, noise)   # jump straight to step t
    predicted_noise = model(noisy_image, t)
    loss = mean_squared_error(noise, predicted_noise)
    update_model(loss)                         # backprop + optimizer step

# Generation
x = random_noise()                             # start from pure static
for t in range(1000, 0, -1):                   # t = 1000, 999, ..., 1
    predicted_noise = model(x, t)
    x = remove_noise(x, predicted_noise, t)
# x is now your generated image!

Real-World Applications

  1. Text-to-Image (DALL-E 2, Stable Diffusion): condition the denoising on text descriptions
  2. Image Editing: start denoising from a partially noised image
  3. Video Generation: apply diffusion in the time dimension too
  4. 3D Generation: diffusion on 3D voxels or point clouds

The Beautiful Mathematics

The diffusion equation comes from physics (heat diffusion):

∂p/∂t = ∇²p

This describes how heat (or in our case, noise) spreads over time. We're essentially reversing heat flow!

Key Takeaways

  1. Diffusion = Gradual Noising + Denoising
  2. Small steps make hard problems easy
  3. The model learns to predict noise, not images
  4. Physics-inspired math makes it work

Why Should You Care?

Diffusion models are behind:

  • AI art generators
  • Photo editing tools
  • Video generation
  • 3D model creation
  • Scientific simulations

They're not just another AI technique - they're a fundamental breakthrough in how we think about generation problems.


Understanding Diffusion Models Mathematically

Introduction

Diffusion models are generative models that progressively transform random noise into meaningful data. They work by first adding noise to the data in a controlled manner and then learning to reverse this process to generate realistic samples.

Gaussian Distribution

A Gaussian (normal) distribution is defined as:

\[ \mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) \]

where:

  • \( x \) is the random variable (vector in high-dimensional space).
  • \( \mu \) is the mean vector, representing the expected value of \( x \).
  • \( \Sigma \) is the covariance matrix, which determines the spread and correlation of \( x \) across dimensions.
  • \( d \) is the number of dimensions.

The covariance matrix \( \Sigma \) determines the shape of the Gaussian distribution. If \( \Sigma = I \) (identity matrix), the distribution is isotropic (same spread in all directions).
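
As a quick illustration in NumPy: with \( \Sigma = I \), sampling is just one independent standard-normal draw per dimension (the image shape here is an arbitrary example):

import numpy as np

# Isotropic Gaussian noise: each pixel/channel is an independent N(0, 1) draw.
rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64, 3))   # a 64×64 RGB "image" of pure noise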

Forward Diffusion Process

In the forward process, we add small amounts of Gaussian noise to an image \( x_0 \) over \( T \) time steps until it becomes pure noise \( x_T \). The transition from step \( t-1 \) to \( t \) follows a Gaussian distribution:

\[ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) \]

where:

  • \( x_t \) is the noisy image at step \( t \).
  • \( \beta_t \) is a small variance term controlling the amount of noise added.
  • \( I \) is the identity matrix, meaning noise is added independently to each pixel.

The cumulative effect of noise at any time \( t \) can be written as:

\[ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \]

where:

  • \( \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s) \) represents the accumulated noise schedule.
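
This closed form is what makes training practical: we can jump straight to any step \( t \) without simulating every intermediate one. A minimal NumPy sketch, assuming `alpha_bars` holds the precomputed \( \bar{\alpha}_t \) values:

import numpy as np

# Sample x_t directly from x_0: x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε
def sample_xt(x0, t, alpha_bars, rng=np.random.default_rng()):
    eps = rng.standard_normal(x0.shape)   # ε ~ N(0, I); also the training target
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps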

Reverse Process (Denoising)

To generate new images, we reverse the diffusion process by predicting and removing noise. The reverse step follows another Gaussian distribution:

\[ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I) \]

where:

  • \( \mu_\theta(x_t, t) \) is the predicted mean of the previous, less noisy image \( x_{t-1} \), estimated by a neural network.
  • \( \sigma_t^2 \) is the variance of the reverse process, typically learned or predefined.

The neural network predicts the noise \( \epsilon_\theta(x_t, t) \), and the denoised mean is computed as:

\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \beta_t \frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} \right) \]
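
A direct translation of this formula into NumPy (assuming `eps_pred` is the network's noise prediction and `betas`, `alpha_bars` come from a precomputed schedule):

import numpy as np

# Denoised mean μ_θ(x_t, t) computed from the predicted noise.
def reverse_mean(x_t, eps_pred, t, betas, alpha_bars):
    beta_t = betas[t]
    coef = beta_t / np.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps_pred) / np.sqrt(1.0 - beta_t)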

Training Objective

The model is trained to predict noise by minimizing the mean squared error (MSE) loss:

\[ L(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left[ ||\epsilon - \epsilon_\theta(x_t, t)||^2 \right] \]

where:

  • \( \epsilon \) is the actual Gaussian noise added to the data.
  • \( \epsilon_\theta(x_t, t) \) is the noise predicted by the model.

This loss function ensures that the model learns to correctly remove noise at each step.
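
Putting the pieces together, a single training step might look like this PyTorch sketch (illustrative only: it assumes `model(x_t, t)` predicts \( \epsilon \) and `alpha_bars` is a precomputed 1-D tensor of \( \bar{\alpha} \) values):

import torch

def training_step(model, optimizer, x0, alpha_bars, T=1000):
    t = torch.randint(0, T, (x0.shape[0],))        # a random step for each image in the batch
    ab = alpha_bars[t].view(-1, 1, 1, 1)           # broadcast ᾱ_t over pixel dimensions
    eps = torch.randn_like(x0)                     # ε ~ N(0, I)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # closed-form forward noising
    loss = torch.mean((eps - model(x_t, t)) ** 2)  # MSE between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()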

Sampling (Generating Images)

To generate an image, we start with pure Gaussian noise \( x_T \) and iteratively apply the denoising process:

\[ x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \quad \text{where } z \sim \mathcal{N}(0, I) \]

This gradually transforms random noise into a structured image.
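
A sketch of the full sampling loop in PyTorch, using the common simple choice \( \sigma_t^2 = \beta_t \) and skipping the noise term on the final step (assumes `betas` and `alpha_bars` are 1-D tensors and `model` is trained):

import torch

@torch.no_grad()
def sample(model, shape, betas, alpha_bars):
    T = len(betas)
    x = torch.randn(shape)                                  # x_T: pure Gaussian noise
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((shape[0],), t)                # same step index for the whole batch
        eps = model(x, t_batch)                             # ε_θ(x_t, t)
        coef = betas[t] / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(1 - betas[t])  # μ_θ(x_t, t)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z                 # x_{t-1} = μ + σ_t·z
    return x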

Summary

  • Forward Process: Adds noise step by step using \( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \).
  • Reverse Process: Uses a neural network to predict noise and gradually remove it.
  • Loss Function: Trains the model to predict noise by minimizing \( L(\theta) = \mathbb{E}[||\epsilon - \epsilon_\theta(x_t, t)||^2] \).
  • Sampling: Starts with noise and applies learned denoising steps to generate images.
