Forward Process
Recall that continuous-time diffusion models transform data \( x \) drawn from a distribution \( p_\text{data}(x) \) through a variance-preserving Markov process known as the forward process. The forward process introduces noise to the data over a continuum of noise levels \( \lambda \in [ \lambda_{\text{min}}, \lambda_{\text{max}}] \), \( \lambda_{\text{min}} < \lambda_{\text{max}} \),
\[ q(z_\lambda|x) = \mathcal{N}(\alpha_\lambda x, \sigma_\lambda^2 I), \]
where \( \alpha_\lambda^2 = \frac{1}{1 + e^{-\lambda}} \) and \( \sigma_\lambda^2 = 1 - \alpha_\lambda^2 \). For transitions between intermediate values of \( \lambda \), i.e., from a less noisy latent to a noisier one, the forward process is modeled as
\[ q(z_\lambda|z_{\lambda'}) = \mathcal{N}\left(\left(\frac{\alpha_\lambda}{\alpha_{\lambda'}}\right) z_{\lambda'}, \sigma_{\lambda|\lambda'}^2 I\right), \]
where \( \sigma_{\lambda|\lambda'}^2 = (1 - e^{\lambda-\lambda'})\sigma_\lambda^2 \) and \( \lambda < \lambda' \), which means the forward process runs in the direction of decreasing \( \lambda \). Conditioning on \( x \), the reverse-direction transition (the posterior of the forward process) is
\[ q(z_{\lambda'}|z_{\lambda}, x) = \mathcal{N}(\tilde{\mu}_{\lambda'|\lambda}(z_{\lambda}, x), \tilde{\sigma}^2_{\lambda'|\lambda} I), \]
where \( \tilde{\mu}_{\lambda'|\lambda}(z_{\lambda}, x) = e^{\lambda - \lambda'}\frac{\alpha_{\lambda'}}{\alpha_{\lambda}} z_{\lambda} + (1 - e^{\lambda - \lambda'})\alpha_{\lambda'} x \) and \( \tilde{\sigma}_{\lambda'|\lambda}^2 = (1 - e^{\lambda-\lambda'})\sigma_{\lambda'}^2 \). Note that this conditional moves from \( z_\lambda \) to \( z_{\lambda'} \) with \( \lambda' > \lambda \), i.e., in the direction of increasing \( \lambda \), which makes it the building block of the reverse process.
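To make the parameterization concrete, here is a minimal NumPy sketch (the helper names are ours and purely illustrative, not drawn from any particular codebase) that computes \( \alpha_\lambda \) and \( \sigma_\lambda \), draws \( z_\lambda \sim q(z_\lambda|x) \), and evaluates the transition and posterior moments defined above:

```python
import numpy as np

def alpha_sigma(lam):
    # alpha_lam^2 = 1 / (1 + e^{-lam}), sigma_lam^2 = 1 - alpha_lam^2
    alpha2 = 1.0 / (1.0 + np.exp(-lam))
    return np.sqrt(alpha2), np.sqrt(1.0 - alpha2)

def sample_forward(x, lam, rng):
    # z_lam ~ q(z_lam | x) = N(alpha_lam * x, sigma_lam^2 * I)
    alpha, sigma = alpha_sigma(lam)
    eps = rng.standard_normal(x.shape)
    return alpha * x + sigma * eps, eps

def forward_transition_var(lam, lam_prime):
    # sigma^2_{lam|lam'} = (1 - e^{lam - lam'}) * sigma_lam^2, for lam < lam'
    _, sigma = alpha_sigma(lam)
    return (1.0 - np.exp(lam - lam_prime)) * sigma ** 2

def posterior_mean_var(z_lam, x, lam, lam_prime):
    # Mean and variance of q(z_{lam'} | z_lam, x), for lam' > lam.
    alpha, _ = alpha_sigma(lam)
    alpha_p, sigma_p = alpha_sigma(lam_prime)
    r = np.exp(lam - lam_prime)                # e^{lam - lam'}, less than 1 when lam' > lam
    mean = r * (alpha_p / alpha) * z_lam + (1.0 - r) * alpha_p * x
    var = (1.0 - r) * sigma_p ** 2             # tilde-sigma^2_{lam'|lam}
    return mean, var
```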
Reverse Process
The reverse process generates samples progressively through Gaussian transitions, starting from standard Gaussian noise, \( p_\theta(z_{\lambda_\text{min}}) = \mathcal{N}(0,I) \), and is parameterized by neural networks,
\[ p_\theta(z_{\lambda'}|z_\lambda) = \mathcal{N}\left(\tilde{\mu}_{\lambda'|\lambda}(z_\lambda, x_\theta(z_\lambda, \lambda)),\; (\tilde{\sigma}^2_{\lambda'|\lambda})^{1-\nu}(\sigma^2_{\lambda|\lambda'})^\nu I\right), \]
where \(x_\theta(z_\lambda, \lambda) = \frac{z_\lambda - \sigma_\lambda \epsilon_\theta(z_\lambda, \lambda)}{\alpha_\lambda}\) is the neural-network estimate of the original data \(x\) given a noisy observation \(z_\lambda\). Here \( \epsilon_\theta(z_\lambda, \lambda) \) is another neural network (or the same network with a different output head) that predicts the noise component added at the diffusion step characterized by \( \lambda \), while \( \alpha_\lambda \) and \( \sigma_\lambda \) are the coefficients that define how the data is attenuated and how the noise scales in the forward process at noise level \( \lambda \). The variance of the reverse transition from \(z_\lambda\) to \(z_{\lambda'}\), where \( \lambda' > \lambda\) indicates a move towards lower noise levels, is taken to be a log-space interpolation between the posterior variance and the forward-transition variance,
\[ \Sigma_{\lambda'| \lambda} = \exp\left((1-\nu) \log \tilde{\sigma}^2_{\lambda'|\lambda} + \nu \log \sigma^2_{\lambda|\lambda'}\right), \]
where \( \Sigma_{\lambda'| \lambda}\) is the interpolated variance used in \( p_\theta(z_{\lambda'}|z_\lambda) \), as discussed in [2]. The hyperparameter \( \nu \) weights how much the posterior variance \( \tilde{\sigma}^2_{\lambda'|\lambda} \) versus the forward-transition variance \( \sigma^2_{\lambda|\lambda'} \) contributes to the reverse transition. It is worth emphasizing that inference runs over an increasing sequence of \( \lambda \), from high noise toward low noise and clean data.
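Putting these pieces together, a single reverse transition can be sketched as below, building on the helpers above; `eps_model(z, lam)` is a placeholder for the trained noise-prediction network \( \epsilon_\theta \):

```python
def reverse_step(z_lam, lam, lam_prime, eps_model, nu, rng):
    # One ancestral step z_lam -> z_{lam'} (lam' > lam) under p_theta(z_{lam'} | z_lam).
    alpha, sigma = alpha_sigma(lam)
    # x-prediction recovered from the noise prediction
    x_hat = (z_lam - sigma * eps_model(z_lam, lam)) / alpha
    mean, var_tilde = posterior_mean_var(z_lam, x_hat, lam, lam_prime)
    var_fwd = forward_transition_var(lam, lam_prime)
    # log-space interpolation of the two variances, weighted by nu
    var = np.exp((1.0 - nu) * np.log(var_tilde) + nu * np.log(var_fwd))
    return mean + np.sqrt(var) * rng.standard_normal(z_lam.shape)
```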
Training the Model
To train this model, we minimize the discrepancy between the noise added in the forward process and the noise estimated by the model,
\[ \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, \lambda\sim p(\lambda)} \left[ \| \epsilon_{\theta}(z_{\lambda}, \lambda) - \epsilon \|_2^2 \right], \]
where \( p(\lambda) \) is a distribution over \([\lambda_\text{min}, \lambda_\text{max} ]\), \(\epsilon\) is the noise used in the forward process, \( \epsilon_{\theta}(z_{\lambda}, \lambda) \) is the model's estimate of this noise, and \(z_\lambda = \alpha_\lambda x + \sigma_\lambda \epsilon\). This is a form of denoising score matching that teaches the model to reverse the noise addition accurately. The choice of \( p(\lambda) \) affects the training dynamics and how much emphasis is placed on different noise levels. A specific schedule, such as the cosine noise schedule, modulates the \( \lambda \) values encountered during training and thereby shapes the distribution and variance of the noise levels. A uniform distribution over \( \lambda \) makes the objective proportional to the variational lower bound on the marginal log-likelihood of the latent-variable model.
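As a rough illustration of this objective (not tied to any particular framework; `eps_model` and `sample_lambda` are placeholders for the noise-prediction network and a sampler for \( p(\lambda) \)), a single Monte Carlo estimate of the loss looks like:

```python
def denoising_loss(x_batch, eps_model, sample_lambda, rng):
    # Monte Carlo estimate of E[ || eps_theta(z_lam, lam) - eps ||_2^2 ]
    lam = sample_lambda(rng)                  # lam ~ p(lambda), e.g. uniform on [lam_min, lam_max]
    alpha, sigma = alpha_sigma(lam)
    eps = rng.standard_normal(x_batch.shape)
    z_lam = alpha * x_batch + sigma * eps     # forward-process sample
    return np.mean((eps_model(z_lam, lam) - eps) ** 2)
```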
Model Inference
Once trained, the model generates samples by running the reverse process, using the predicted noise to progressively denoise the data. This procedure is closely related to annealed Langevin dynamics, with samples drawn progressively closer to the distribution of the original dataset \( p_\text{data}(x)\).
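A complete sampling loop, under the same illustrative assumptions as the sketches above (in practice the final step is often handled specially, e.g., by returning the \( x \)-prediction directly rather than adding noise), might look like:

```python
def sample(eps_model, shape, lambdas, nu, rng):
    # lambdas: increasing grid of noise levels from lambda_min to lambda_max
    z = rng.standard_normal(shape)            # z_{lambda_min} ~ N(0, I)
    for lam, lam_prime in zip(lambdas[:-1], lambdas[1:]):
        z = reverse_step(z, lam, lam_prime, eps_model, nu, rng)
    alpha, sigma = alpha_sigma(lambdas[-1])
    # return the final x-estimate from the least-noisy latent
    return (z - sigma * eps_model(z, lambdas[-1])) / alpha
```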