AI Diffusion Models Explained Mathematically in Simple Terms
What Are Diffusion Models?
Imagine you have a clear photo of a cat. Now imagine gradually adding static noise to it, like on an old TV, until it becomes pure random noise. Diffusion models learn to reverse this process - they start with random noise and gradually remove it to create a clear image.
The Core Math (Made Simple)
Forward Process: Adding Noise
The forward process adds noise to data in small steps. Mathematically:
x(t) = √(1-β) × x(t-1) + √β × noise
Where:
- x(t) = image at time step t
- β = small number (like 0.0001) controlling how much noise to add
- noise = random static
Think of it like adding a tiny bit of fog to a window each second until you can't see through it.
Reverse Process: Removing Noise
The magic happens when we reverse this process. The model learns:
x(t-1) = 1/√(1-β) × [x(t) - √β × predicted_noise]
The AI learns to predict what noise was added, then subtracts it!
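To see these two formulas in action, here is a minimal NumPy sketch (a toy 3-pixel "image" stands in for a real one). If the predicted noise exactly matches the noise that was added, the reverse formula recovers the original pixels:

import numpy as np

beta = 0.0001                          # noise level for this single step
x_prev = np.array([0.2, 0.5, 0.8])     # a toy 3-pixel "image"
noise = np.random.randn(3)             # random Gaussian static

# Forward step: x(t) = sqrt(1 - beta) * x(t-1) + sqrt(beta) * noise
x_t = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * noise

# Reverse step with a perfect noise prediction:
# x(t-1) = 1/sqrt(1 - beta) * [x(t) - sqrt(beta) * predicted_noise]
x_recovered = (x_t - np.sqrt(beta) * noise) / np.sqrt(1 - beta)

print(np.allclose(x_recovered, x_prev))   # True: the original pixels come back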
Why This Works: The Brilliant Insight
The key insight is that removing a little bit of noise is much easier than generating an entire image from scratch. It's like:
- Hard: "Draw a cat"
- Easy: "This fuzzy image has some static on it, clean it up a bit"
By breaking generation into hundreds of tiny "clean up a bit" steps, we make an impossible problem possible.
The Training Process
- Take a real image
- Pick a random step t (out of T, typically 1000) and add that step's amount of noise in one shot
- Train the AI to predict exactly which noise was added
- Loss function: How wrong was the noise prediction?
Loss = ||actual_noise - predicted_noise||²
Generating New Images
Once trained, to generate a new image:
- Start with pure random noise
- For 1000 steps, ask the model: "What noise should I remove?"
- Remove that predicted noise
- Repeat until you have a clear image
The Score Function (Advanced but Simple)
The model actually learns the "score function" - the gradient of the log of the data distribution:
score = ∇log p(x)
In simple terms: "Which direction should I move to make this look more like a real image?"
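For the curious, the score and the noise the model is trained to predict are the same object up to a scale factor. Here is a tiny NumPy check, using ᾱ for the accumulated noise level (defined properly in the second half of this post):

import numpy as np

alpha_bar = 0.7                        # accumulated product of the (1 - beta) factors at some step
x0 = np.random.rand(5)                 # toy clean "image"
eps = np.random.randn(5)               # the noise that was added
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Score of the Gaussian q(x_t | x_0): gradient of its log-density with respect to x_t
score = -(x_t - np.sqrt(alpha_bar) * x0) / (1 - alpha_bar)

# The same quantity, written in terms of the noise that was added
print(np.allclose(score, -eps / np.sqrt(1 - alpha_bar)))   # True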
Why Diffusion Models Are Special
Compared to GANs:
- More stable training
- Better mode coverage (more variety)
- Easier to control
Compared to VAEs:
- Higher quality images
- More flexible
Key Mathematical Concepts
1. Markov Chain
Each step only depends on the previous step, not the entire history:
- x(999) → x(998) → x(997) → ... → x(0)
2. Gaussian Distribution
The noise added is Gaussian (bell curve) shaped:
- Most values near zero
- Few extreme values
- Natural and mathematically convenient
3. Variance Schedule
We control how much noise to add at each step:
- β₁ = 0.0001 (tiny noise at first)
- β₁₀₀₀ = 0.02 (more noise later)
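As a rough sketch, a linear schedule with exactly these endpoint values can be built in a few lines of NumPy (this is one common choice; cosine and other schedules are also used in practice):

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # beta_1 ... beta_1000, growing linearly
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # accumulated product, used for one-shot noising

print(betas[0], betas[-1])             # 0.0001  0.02
print(alpha_bar[-1])                   # close to 0: x_T is almost pure noise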
Simple Code Example (Conceptual)
# Training
for image in dataset:
    t = random_timestep()                       # pick a random step between 1 and T
    noise = generate_random_noise()             # Gaussian static, same shape as the image
    noisy_image = add_noise(image, t, noise)    # forward process, applied in one shot
    predicted_noise = model(noisy_image, t)
    loss = mean_squared_error(noise, predicted_noise)
    update_model(loss)                          # take one gradient step on the loss

# Generation
x = random_noise()
for t in reversed(range(1000)):
    predicted_noise = model(x, t)
    x = remove_noise(x, predicted_noise, t)     # one small "clean it up a bit" step
# x is now your generated image!
Real-World Applications
- Text-to-Image (DALL-E 2, Stable Diffusion): condition the denoising on text descriptions
- Image Editing: start denoising from a partially noised image
- Video Generation: apply diffusion in the time dimension too
- 3D Generation: apply diffusion to 3D voxels or point clouds
The Beautiful Mathematics
The diffusion equation comes from physics (heat diffusion):
∂p/∂t = ∇²p
This describes how heat (or in our case, noise) spreads over time. We're essentially reversing heat flow!
Key Takeaways
- Diffusion = Gradual Noising + Denoising
- Small steps make hard problems easy
- The model learns to predict noise, not images
- Physics-inspired math makes it work
Why Should You Care?
Diffusion models are behind:
- AI art generators
- Photo editing tools
- Video generation
- 3D model creation
- Scientific simulations
They're not just another AI technique - they're a fundamental breakthrough in how we think about generation problems.
Understanding Diffusion Models Mathematically
Introduction
Diffusion models are generative models that progressively transform random noise into meaningful data. They work by first adding noise to the data in a controlled manner and then learning to reverse this process to generate realistic samples.
Gaussian Distribution
A Gaussian (normal) distribution is defined as:
\[ \mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) \]
where:
- \( x \) is the random variable (vector in high-dimensional space).
- \( \mu \) is the mean vector, representing the expected value of \( x \).
- \( \Sigma \) is the covariance matrix, which determines the spread and correlation of \( x \) across dimensions.
- \( d \) is the number of dimensions.
The covariance matrix \( \Sigma \) determines the shape of the Gaussian distribution. If \( \Sigma = I \) (identity matrix), the distribution is isotropic (same spread in all directions).
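As a quick sanity check of the formula, the sketch below evaluates this density for an isotropic Gaussian (\( \Sigma = I \)) in NumPy and compares it with SciPy's reference implementation (SciPy is used here only for verification):

import numpy as np
from scipy.stats import multivariate_normal

d = 3
x = np.array([0.5, -1.0, 2.0])
mu = np.zeros(d)
Sigma = np.eye(d)                      # isotropic: same spread in every direction

# The density formula from above, written out explicitly
norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
exponent = -0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
density = norm_const * np.exp(exponent)

print(np.isclose(density, multivariate_normal(mu, Sigma).pdf(x)))   # True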
Forward Diffusion Process
In the forward process, we add small amounts of Gaussian noise to an image \( x_0 \) over \( T \) time steps until it becomes pure noise \( x_T \). The transition from step \( t-1 \) to \( t \) follows a Gaussian distribution:
\[ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) \]
where:
- \( x_t \) is the noisy image at step \( t \).
- \( \beta_t \) is a small variance term controlling the amount of noise added.
- \( I \) is the identity matrix, meaning noise is added independently to each pixel.
The cumulative effect of noise at any time \( t \) can be written as:
\[ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \]
where:
- \( \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s) \) represents the accumulated noise schedule.
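In code, this closed form lets us jump from a clean image to any noise level without simulating every intermediate step. A minimal NumPy sketch, where a random vector stands in for a flattened image:

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)    # \bar{alpha}_t for t = 1..T (0-based array index t-1)

x0 = np.random.rand(64 * 64)           # stand-in for a flattened 64x64 image
t = 500                                # an arbitrary step
eps = np.random.randn(*x0.shape)       # Gaussian noise

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps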
Reverse Process (Denoising)
To generate new images, we reverse the diffusion process by predicting and removing noise. The reverse step follows another Gaussian distribution:
\[ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I) \]
where:
- \( \mu_\theta(x_t, t) \) is the predicted mean of \( x_{t-1} \) given \( x_t \), estimated by a neural network.
- \( \sigma_t^2 \) is the variance of the reverse process, typically learned or predefined.
The neural network predicts the noise \( \epsilon_\theta(x_t, t) \), and the denoised mean is computed as:
\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \beta_t \frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} \right) \]
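Written out in code, this mean is a one-liner once \( \beta_t \) and \( \bar{\alpha}_t \) are known. In the sketch below the true noise stands in for the network output \( \epsilon_\theta(x_t, t) \), purely to show the arithmetic:

import numpy as np

beta_t = 0.01                          # example values for one step
alpha_bar_t = 0.5
x_t = np.random.randn(4)               # a tiny toy noisy "image"
eps_hat = np.random.randn(4)           # stand-in for epsilon_theta(x_t, t)

# mu_theta(x_t, t) = (x_t - beta_t * eps_hat / sqrt(1 - alpha_bar_t)) / sqrt(1 - beta_t)
mu = (x_t - beta_t * eps_hat / np.sqrt(1 - alpha_bar_t)) / np.sqrt(1 - beta_t)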
Training Objective
The model is trained to predict noise by minimizing the mean squared error (MSE) loss:
\[ L(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left[ ||\epsilon - \epsilon_\theta(x_t, t)||^2 \right] \]
where:
- \( \epsilon \) is the actual Gaussian noise added to the data.
- \( \epsilon_\theta(x_t, t) \) is the noise predicted by the model.
This loss function ensures that the model learns to correctly remove noise at each step.
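Here is a hedged sketch of a single training step in PyTorch. The tiny MLP, the flattened 8x8 "images", and appending the timestep as an extra input are all placeholders rather than a real architecture; only the loss mirrors the objective above:

import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Placeholder noise-prediction network: takes (x_t, t) and returns a noise estimate
model = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(16, 64)                        # a batch of 16 flattened 8x8 "images"
t = torch.randint(1, T + 1, (16,))             # a random timestep per sample
eps = torch.randn_like(x0)                     # the actual noise epsilon

ab = alpha_bar[t - 1].unsqueeze(1)             # \bar{alpha}_t for each sample
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # closed-form forward process

# Condition the network on t by appending the scaled timestep as an extra input
eps_pred = model(torch.cat([x_t, t.unsqueeze(1).float() / T], dim=1))

loss = ((eps - eps_pred) ** 2).mean()          # L(theta) = E[ ||eps - eps_theta||^2 ]
optimizer.zero_grad()
loss.backward()
optimizer.step()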
Sampling (Generating Images)
To generate an image, we start with pure Gaussian noise \( x_T \) and iteratively apply the denoising process:
\[ x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \quad \text{where } z \sim \mathcal{N}(0, I) \]
This gradually transforms random noise into a structured image.
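Putting the pieces together, here is a minimal NumPy sketch of the sampling loop. The predict_noise function is a placeholder for the trained network \( \epsilon_\theta(x_t, t) \) (it returns zeros so the loop runs end to end), and \( \sigma_t^2 = \beta_t \) is one common choice for the reverse variance:

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def predict_noise(x, t):
    # Placeholder for the trained network epsilon_theta(x_t, t)
    return np.zeros_like(x)

x = np.random.randn(64 * 64)                    # start from pure Gaussian noise x_T
for t in range(T, 0, -1):
    beta_t = betas[t - 1]
    eps_hat = predict_noise(x, t)
    # Reverse-process mean from the formula above
    mu = (x - beta_t * eps_hat / np.sqrt(1.0 - alpha_bar[t - 1])) / np.sqrt(1.0 - beta_t)
    sigma_t = np.sqrt(beta_t)                   # sigma_t^2 = beta_t
    z = np.random.randn(*x.shape) if t > 1 else 0.0   # no extra noise on the final step
    x = mu + sigma_t * z                        # x_{t-1} = mu_theta + sigma_t * z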
Summary
- Forward Process: Adds noise step by step using \( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \).
- Reverse Process: Uses a neural network to predict noise and gradually remove it.
- Loss Function: Trains the model to predict noise by minimizing \( L(\theta) = \mathbb{E}[||\epsilon - \epsilon_\theta(x_t, t)||^2] \).
- Sampling: Starts with noise and applies learned denoising steps to generate images.