Diffusion Models: Understanding Forward and Reverse Processes

This article provides an in-depth look at the mechanisms of forward and reverse processes in diffusion models, exploring how they are used to train and generate data effectively.
Author: Bahman Moraffah
Estimated Reading Time: 20 min
Published: 2021

Forward Process

Recall that continuous-time diffusion models transform data \( x \) drawn from a distribution \( p_\text{data}(x) \) through a variance-preserving Markov process known as the forward process. The forward process gradually adds noise to the data, indexed by a continuous noise level \( \lambda \in [ \lambda_{\text{min}}, \lambda_{\text{max}}] \) with \( \lambda_{\text{min}} < \lambda_{\text{max}} \), where \( \lambda \) plays the role of a log signal-to-noise ratio,

\[ q(z_\lambda|x) = \mathcal{N}(\alpha_\lambda x, \sigma_\lambda^2 I), \]

where \( \alpha_\lambda^2 = \frac{1}{1 + e^{-\lambda}} \) and \( \sigma_\lambda^2 = 1 - \alpha_\lambda^2 \). For two noise levels \( \lambda < \lambda' \), i.e., moving from the less noisy latent \( z_{\lambda'} \) to the noisier latent \( z_\lambda \), the transition is modeled as

\[ q(z_\lambda|z_{\lambda'}) = \mathcal{N}\left(\left(\frac{\alpha_\lambda}{\alpha_{\lambda'}}\right) z_{\lambda'}, \sigma_{\lambda|\lambda'}^2 I\right), \]

where \( \sigma_{\lambda|\lambda'}^2 = (1 - e^{\lambda-\lambda'})\sigma_\lambda^2 \) and \( \lambda < \lambda' \); that is, the forward process runs in the direction of decreasing \( \lambda \). Conditioned on \( x \), the forward process can be described in reverse as

\[ q(z_{\lambda'}|z_{\lambda}, x) = \mathcal{N}(\tilde{\mu}_{\lambda'|\lambda}(z_{\lambda}, x), \tilde{\sigma}^2_{\lambda'|\lambda} I), \]

where \( \tilde{\mu}_{\lambda'|\lambda}(z_{\lambda}, x) = e^{\lambda - \lambda'}\frac{\alpha_{\lambda'}}{\alpha_{\lambda}} z_{\lambda} + (1 - e^{\lambda - \lambda'})\alpha_{\lambda'} x \) and \( \tilde{\sigma}_{\lambda'|\lambda}^2 = (1 - e^{\lambda-\lambda'})\sigma_{\lambda'}^2 \), with \( \lambda' > \lambda \).
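
As a concrete illustration, the following minimal NumPy sketch implements the forward marginal and the conditional \( q(z_{\lambda'}|z_\lambda, x) \) as written above. The function and variable names (alpha_sigma, sample_forward, forward_posterior) are illustrative choices, not part of the referenced papers.

import numpy as np

def alpha_sigma(lam):
    """Return (alpha_lambda, sigma_lambda) with alpha_lambda^2 = 1 / (1 + e^{-lambda})."""
    alpha2 = 1.0 / (1.0 + np.exp(-lam))
    return np.sqrt(alpha2), np.sqrt(1.0 - alpha2)

def sample_forward(x, lam, rng):
    """Draw z_lambda ~ q(z_lambda | x) = N(alpha_lambda x, sigma_lambda^2 I)."""
    alpha, sigma = alpha_sigma(lam)
    eps = rng.standard_normal(x.shape)
    return alpha * x + sigma * eps, eps

def forward_posterior(z_lam, x, lam, lam_prime):
    """Mean and variance of q(z_{lambda'} | z_lambda, x) for lambda' > lambda."""
    alpha, _ = alpha_sigma(lam)
    alpha_p, sigma_p = alpha_sigma(lam_prime)
    r = np.exp(lam - lam_prime)                    # e^{lambda - lambda'} in (0, 1)
    mean = r * (alpha_p / alpha) * z_lam + (1.0 - r) * alpha_p * x
    var = (1.0 - r) * sigma_p ** 2                 # tilde{sigma}^2_{lambda'|lambda}
    return mean, var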

Reverse Process

The reverse process generates samples through a sequence of Gaussian transitions, starting from standard Gaussian noise, \( p_\theta(z_{\lambda_\text{min}}) = \mathcal{N}(0, I) \), and parameterizes each transition with a neural network,

\[ p_\theta(z_{\lambda'}|z_\lambda) = \mathcal{N}\left(\tilde{\mu}_{\lambda'|\lambda}(z_\lambda, x_\theta(z_\lambda, \lambda)), (\tilde{\sigma}^2_{\lambda'|\lambda})^{1-\nu}(\sigma^2_{\lambda|\lambda'})^\nu I\right), \]

where \( x_\theta(z_\lambda, \lambda) = \frac{z_\lambda - \sigma_\lambda \epsilon_\theta(z_\lambda, \lambda)}{\alpha_\lambda} \) is the neural-network estimate of the original data \( x \) given the noisy observation \( z_\lambda \), and \( \epsilon_\theta(z_\lambda, \lambda) \) is a neural network (or another output head of the same network) that predicts the noise added at the diffusion step characterized by \( \lambda \). The coefficients \( \alpha_\lambda \) and \( \sigma_\lambda \) determine how the data is attenuated and how the noise is scaled in the forward process at noise level \( \lambda \). The variance of the reverse transition from \( z_\lambda \) to \( z_{\lambda'} \), where \( \lambda' > \lambda \) indicates a move toward lower noise, is taken to be a function of the variances at the two noise levels,

\[ \Sigma_{\lambda'|\lambda} = \exp\left((1-\nu) \log \tilde{\sigma}^2_{\lambda'|\lambda} + \nu \log \sigma^2_{\lambda|\lambda'}\right), \]

where \( \Sigma_{\lambda'|\lambda} \) is the interpolated variance, as discussed in [2]. The hyperparameter \( \nu \) weights the contribution of the forward-step variance \( \sigma^2_{\lambda|\lambda'} \) relative to the posterior variance \( \tilde{\sigma}^2_{\lambda'|\lambda} \). It is worth emphasizing that inference runs over an increasing sequence of \( \lambda \) values, moving from high noise toward low noise and, finally, clean data.
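
Continuing the sketch above, a single ancestral step of the reverse process can be written as follows. Here eps_model is a hypothetical callable standing in for \( \epsilon_\theta(z_\lambda, \lambda) \), and the variance interpolation follows the expression for \( \Sigma_{\lambda'|\lambda} \) given above.

def reverse_step(z_lam, lam, lam_prime, eps_model, nu, rng):
    """One step of p_theta(z_{lambda'} | z_lambda) with lambda' > lambda (less noise)."""
    alpha, sigma = alpha_sigma(lam)
    alpha_p, sigma_p = alpha_sigma(lam_prime)
    # x_theta(z_lambda, lambda): estimate of the clean data from the predicted noise.
    x_hat = (z_lam - sigma * eps_model(z_lam, lam)) / alpha
    r = np.exp(lam - lam_prime)
    mean = r * (alpha_p / alpha) * z_lam + (1.0 - r) * alpha_p * x_hat
    var_posterior = (1.0 - r) * sigma_p ** 2       # tilde{sigma}^2_{lambda'|lambda}
    var_forward = (1.0 - r) * sigma ** 2           # sigma^2_{lambda|lambda'}
    # Log-space interpolation of the two variances, weighted by nu.
    var = np.exp((1.0 - nu) * np.log(var_posterior) + nu * np.log(var_forward))
    return mean + np.sqrt(var) * rng.standard_normal(z_lam.shape)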

Training the Model

To train this model, we minimize the discrepancy between the noise added in the forward process and the noise estimated by the model,

\[ \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I), \lambda \sim p(\lambda)} \left[ \| \epsilon_{\theta}(z_{\lambda}, \lambda) - \epsilon \|_2^2 \right], \]

where \( p(\lambda) \) is a distribution over \( [\lambda_\text{min}, \lambda_\text{max}] \), \( \epsilon \) is the noise used in the forward process, \( \epsilon_{\theta}(z_{\lambda}, \lambda) \) is the model's estimate of this noise, and \( z_\lambda = \alpha_\lambda x + \sigma_\lambda \epsilon \). This is a form of denoising score matching that teaches the model to reverse the noise addition accurately. The choice of \( p(\lambda) \) affects the training dynamics and how much emphasis is placed on different noise levels; a specific schedule, such as the cosine noise schedule, modulates which \( \lambda \) values are encountered during training and hence the distribution and variance of the noise levels. With a uniform distribution over \( \lambda \), the objective is proportional to the variational lower bound on the marginal log-likelihood of the latent-variable model.
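
The objective above translates directly into a Monte Carlo training loss. The sketch below assumes, purely for illustration, that \( p(\lambda) \) is uniform on \( [\lambda_\text{min}, \lambda_\text{max}] \) and that eps_model is the same hypothetical network interface used earlier.

def denoising_loss(x_batch, eps_model, lam_min, lam_max, rng):
    """Monte Carlo estimate of E[ || eps_theta(z_lambda, lambda) - eps ||_2^2 ]."""
    # One lambda per example, broadcast over the remaining data dimensions.
    lam = rng.uniform(lam_min, lam_max,
                      size=(x_batch.shape[0],) + (1,) * (x_batch.ndim - 1))
    alpha, sigma = alpha_sigma(lam)
    eps = rng.standard_normal(x_batch.shape)
    z_lam = alpha * x_batch + sigma * eps          # z_lambda = alpha_lambda x + sigma_lambda eps
    return np.mean((eps_model(z_lam, lam) - eps) ** 2)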

Model Inference

Once trained, sampling from the model involves running the reverse process, using the predicted noise to progressively denoise the data. The procedure is closely related to the annealed Langevin dynamics used in score-based models [3]: samples are drawn progressively closer to the data distribution \( p_\text{data}(x) \).
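
Putting the pieces together, sampling amounts to sweeping the reverse step along an increasing grid of \( \lambda \) values, starting from pure noise. This sketch reuses the hypothetical reverse_step and alpha_sigma helpers from above.

def sample(eps_model, shape, lam_grid, nu, rng):
    """Ancestral sampling: start at z_{lambda_min} ~ N(0, I) and sweep lambda upward."""
    z = rng.standard_normal(shape)
    for lam, lam_prime in zip(lam_grid[:-1], lam_grid[1:]):
        z = reverse_step(z, lam, lam_prime, eps_model, nu, rng)
    # Return the clean-data estimate x_theta at the final (least noisy) level.
    alpha, sigma = alpha_sigma(lam_grid[-1])
    return (z - sigma * eps_model(z, lam_grid[-1])) / alpha

For instance, lam_grid could be a uniform grid such as np.linspace(lam_min, lam_max, num_steps); finer grids trade computation for sample quality.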

References

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.

[2] Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." In International conference on machine learning, pp. 8162-8171. PMLR, 2021.

[3] Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).

[4] Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. "Deep unsupervised learning using nonequilibrium thermodynamics." In International conference on machine learning, pp. 2256-2265. PMLR, 2015.