Gamma Denoising Diffusion Models

The Gamma diffusion model introduces the use of the Gamma distribution for the noise component in the diffusion process rather than just Gaussian noise. This adaptation allows the model to handle data distributions with attributes such as skewness and heavy tails more effectively.
Author: Bahman Moraffah
Estimated Reading Time: 10 min
Published: 2021

In this model, the noise added during the forward process follows a Gamma distribution whose parameters are tuned to match the evolving statistics of the data at each step. During the reverse process, the model estimates and subtracts this Gamma noise to reconstruct the original data from its noisy state. Adding Gamma noise has been shown to improve performance, particularly the speed and quality of generation for both images and speech: the Gamma distribution fits the underlying data distributions more closely, which can lead to faster convergence and higher-quality outputs [1].

Forward Process

In the Gamma diffusion model, the forward process adds Gamma noise in place of Gaussian noise:

\( x_t = \sqrt{1-\beta_t} x_{t-1} + \big(g_t - \mathbb{E}(g_t)\big), \)

where \( g_t \sim \Gamma(k_t, \theta_t) \), with shape \( k_t = \frac{\beta_t}{\bar{\alpha}_t\theta_0^2} \) and scale \( \theta_t = \sqrt{\bar{\alpha}_t}\theta_0 \) (\( \theta_0 \) is a hyperparameter); see Property 1 below. With this choice, \( \text{Var}(g_t - \mathbb{E}(g_t)) = k_t\theta_t^2 = \beta_t \), matching the variance injected by Gaussian noise at the same step. Because Gamma random variables with a common scale are closed under addition (Property 2 below), the forward process admits the closed-form update

\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \big(\bar{g}_t - \mathbb{E}(\bar{g}_t )\big), \)

where \( \bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t) \), \( \mathbb{E}(\bar{g}_t )= \bar{k}_t \theta_t \), and \( \bar{k}_t = \sum_{i=1}^t k_i \).
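The closed-form forward update can be sketched numerically. Below is a minimal NumPy sketch; the linear \( \beta_t \) schedule and the value of \( \theta_0 \) are illustrative assumptions, not the configuration used in [1].

```python
import numpy as np

# Illustrative schedule and hyperparameters (assumptions, not the paper's setup).
T = 1000
theta_0 = 0.001
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Per-step Gamma parameters: theta_t = sqrt(alpha_bar_t) * theta_0 and
# k_t = beta_t / (alpha_bar_t * theta_0^2), so Var(g_t) = k_t * theta_t^2 = beta_t.
thetas = np.sqrt(alpha_bars) * theta_0
ks = betas / (alpha_bars * theta_0**2)
k_bars = np.cumsum(ks)  # bar{k}_t = sum of k_i up to step t

def forward_sample(x0, t, rng):
    """Closed-form forward step: x_t = sqrt(alpha_bar_t) x_0 + (g_bar_t - E[g_bar_t])."""
    g_bar = rng.gamma(shape=k_bars[t], scale=thetas[t], size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + (g_bar - k_bars[t] * thetas[t])
```

One can verify that \( \bar{k}_t\theta_t^2 = 1 - \bar{\alpha}_t \), so by step \( t \) the injected noise carries exactly the variance a Gaussian DDPM would inject.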

Training Objective

The training objective for a diffusion model using Gamma noise mirrors the Gaussian case: it minimizes the error between the predicted and actual noise added at each diffusion step. The loss is the \( L_1 \) norm of the difference between the model's noise estimate \( \epsilon_\theta(x_t, t) \) and the actual noise scaled by its standard deviation \( \sqrt{1 - \bar{\alpha}_t} \) (note that \( \text{Var}(\bar{g}_t) = \bar{k}_t\theta_t^2 = 1 - \bar{\alpha}_t \), so the regression target has unit variance):

\( L(\theta) = \left\|\frac{\bar{g}_t - \mathbb{E}(\bar{g}_t )}{\sqrt{1 - \bar{\alpha}_t}} - \epsilon_\theta(x_t, t)\right\|_1. \)
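A minimal sketch of this objective, assuming the schedule arrays from the forward process are precomputed; eps_model is a placeholder for the trained noise predictor, and all names are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def gamma_ddpm_loss(x0, t, eps_model, k_bars, thetas, alpha_bars, rng):
    """L1 loss between the model output and the standardized Gamma noise."""
    g_bar = rng.gamma(shape=k_bars[t], scale=thetas[t], size=x0.shape)
    noise = g_bar - k_bars[t] * thetas[t]          # zero-mean Gamma noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + noise      # closed-form forward sample
    target = noise / np.sqrt(1.0 - alpha_bars[t])  # unit-variance regression target
    return np.abs(target - eps_model(x_t, t)).mean()
```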

Inference

During inference, the model uses the trained noise predictor to reverse the diffusion process and reconstruct the original data from noisy observations. The update equation follows the standard DDPM ancestral sampling step (a Langevin-style update), with the Gaussian noise term replaced by standardized Gamma noise:

\( x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sigma_t \left(\frac{\bar{g}_t - \mathbb{E}(\bar{g}_t )}{\sqrt{\text{Var}(\bar{g}_t )}}\right). \)

This process iterates from pure noise \( x_T = \bar{g}_T - \mathbb{E}(\bar{g}_T) \), with \( \bar{g}_T \sim \Gamma(\bar{k}_T, \theta_T) \), down to the reconstructed data \( x_0 \). Similar to DDPMs, we can set \( \sigma_t^2 = \beta_t \).
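Putting the update together, the full reverse loop might look like the sketch below, with eps_model standing in for the trained predictor and \( \sigma_t^2 = \beta_t \); the starting noise is centered to zero mean, consistent with the forward parameterization. This is a hypothetical sketch, not the authors' implementation.

```python
import numpy as np

def sample(eps_model, shape, betas, theta_0, rng):
    """Reverse (ancestral) sampling with standardized Gamma noise."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    thetas = np.sqrt(alpha_bars) * theta_0
    k_bars = np.cumsum(betas / (alpha_bars * theta_0**2))

    # Start from pure noise: a centered Gamma(k_bar_T, theta_T) sample.
    g = rng.gamma(shape=k_bars[-1], scale=thetas[-1], size=shape)
    x = g - k_bars[-1] * thetas[-1]
    for t in range(T - 1, 0, -1):
        # Posterior mean, as in a standard DDPM update.
        mean = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
                * eps_model(x, t)) / np.sqrt(alphas[t])
        # Inject standardized Gamma noise scaled by sigma_t = sqrt(beta_t).
        g = rng.gamma(shape=k_bars[t], scale=thetas[t], size=shape)
        noise = (g - k_bars[t] * thetas[t]) / np.sqrt(k_bars[t] * thetas[t] ** 2)
        x = mean + np.sqrt(betas[t]) * noise
    # Final step (t = 0): return the mean with no added noise.
    return (x - (1.0 - alphas[0]) / np.sqrt(1.0 - alpha_bars[0])
            * eps_model(x, 0)) / np.sqrt(alphas[0])
```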

Applications

The use of the Gamma distribution in diffusion models has shown notable improvements in generative tasks for both images and speech compared to traditional Gaussian noise models. In image generation on the CelebA dataset, models incorporating Gamma noise achieved superior FID scores, indicating that the generated images were statistically closer to real images and thus of higher quality. In speech generation, the Gamma noise models outperformed Gaussian models on Perceptual Evaluation of Speech Quality (PESQ) and short-time objective intelligibility (STOI), although they slightly underperformed on Mel-Cepstral Distortion (MCD), likely because the distinct characteristics of Gamma noise affect spectral sharpness. Furthermore, statistical analysis revealed that Gamma noise fits the distribution of differences between original and noise-added data over multiple diffusion steps better than Gaussian noise [1].

Gamma Distribution Properties

[Property 1] Assume \( g \sim \Gamma(k, \theta) \), where \( k \) and \( \theta \) are the shape and scale parameters, respectively. The mean of the Gamma distribution is \( \mu = k\theta \) and its variance is \( \text{Var} = k\theta^2 \). The reparameterization trick gives \( g = \theta z \), where \( z \sim \Gamma(k, 1) \).

[Property 2] Assume \( z_1 \sim \Gamma(k_1, \theta) \) and \( z_2 \sim \Gamma(k_2, \theta) \), then \( z_1 + z_2 \sim \Gamma(k_1 + k_2, \theta) \).
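Property 2 is easy to verify empirically; the quick check below compares the sample moments of a sum of two independent Gammas against the closed-form moments of \( \Gamma(k_1 + k_2, \theta) \). The parameter values are arbitrary.

```python
import numpy as np

# Property 2: Gamma(k1, theta) + Gamma(k2, theta) ~ Gamma(k1 + k2, theta)
# when the two scales agree. Check via sample moments.
rng = np.random.default_rng(0)
k1, k2, theta, n = 2.0, 3.0, 1.5, 200_000

z = rng.gamma(k1, theta, n) + rng.gamma(k2, theta, n)

# Gamma(k1 + k2, theta) has mean (k1 + k2) * theta and variance (k1 + k2) * theta**2.
print(z.mean())  # approx (k1 + k2) * theta = 7.5
print(z.var())   # approx (k1 + k2) * theta**2 = 11.25
```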

References

[1] Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582, 2021.