Introduction
Latent Diffusion Models (LDMs) are a class of generative models that extend the idea of diffusion models to a latent space. Traditional diffusion models denoise a sample drawn from a simple noise distribution (e.g., Gaussian) into a sample from the target data distribution, guided by a learned denoising function. LDMs apply this denoising process in a compressed latent space learned by an autoencoder, which substantially reduces computational cost and makes high-dimensional data such as images tractable.
Overview
The core idea of LDMs is to learn a mapping between the data space and a lower-dimensional latent space. Let \(X\) denote the data space and \(Z\) the latent space. The mapping consists of an encoder \(E: X \rightarrow Z\) and a decoder \(D: Z \rightarrow X\).
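To make the mapping concrete, here is a minimal sketch of the encoder/decoder pair. The linear maps, dimensions, and random projection are purely illustrative assumptions; a real LDM uses a trained convolutional autoencoder.

```python
import numpy as np

# Toy illustration of E: X -> Z and D: Z -> X (hypothetical linear maps,
# not the convolutional autoencoder used in practice).
rng = np.random.default_rng(0)

data_dim, latent_dim = 16, 4          # a 4x compression, chosen arbitrarily
W = rng.standard_normal((latent_dim, data_dim)) / np.sqrt(data_dim)

def encode(x):
    """E: X -> Z, here a simple linear projection."""
    return W @ x

def decode(z):
    """D: Z -> X, here the pseudo-inverse of the projection."""
    return np.linalg.pinv(W) @ z

x = rng.standard_normal(data_dim)
z = encode(x)
x_hat = decode(z)
```

The point is only the shapes: all diffusion happens on the 4-dimensional `z`, never on the 16-dimensional `x`.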
The diffusion process in the latent space is defined by a forward process and a reverse process. The forward process gradually adds noise to a latent variable \(z_0\) to produce a sequence of increasingly noisy latents \(z_1, z_2, \ldots, z_T\). This is modeled by a Markov chain with transition probabilities \(q(z_t|z_{t-1})\).
The reverse process aims to denoise the latent variables back to the original signal. It is defined by a reverse Markov chain with transition probabilities \(p_\theta(z_{t-1}|z_t)\), where \(\theta\) are the parameters of the model learned during training.
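The forward process has a convenient closed form: \(z_t\) can be sampled directly from \(z_0\) without iterating through intermediate steps. The sketch below assumes the standard linear \(\beta\) schedule; the schedule values are illustrative, not tuned settings.

```python
import numpy as np

# Sketch of the forward (noising) process in latent space, using the
# closed-form marginal q(z_t | z_0) with a linear beta schedule.
rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # noise schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(z0, t):
    """Sample z_t ~ q(z_t | z_0):
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

z0 = rng.standard_normal(4)            # a clean latent
zT = q_sample(z0, T - 1)               # near pure Gaussian noise
```

At \(t = T\), `alpha_bars[-1]` is close to zero, so almost no signal from \(z_0\) remains and \(z_T\) is approximately standard Gaussian.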
Training
Training an LDM proceeds in two stages. First, the autoencoder \((E, D)\) is trained to provide a faithful, compressed latent space; its objective is a variational lower bound on the log-likelihood, as in variational autoencoders (VAEs), which decomposes into a reconstruction term and a KL-divergence term. Written as a loss to be minimized:
\[\mathcal{L}(\theta) = -\mathbb{E}_{q(z|x)}[\log p_\theta(x|z)] + \beta \cdot D_{KL}(q(z|x) \,\|\, p(z)),\]
where \(x\) is a data sample, \(z\) is its latent representation, \(p_\theta(x|z)\) is the likelihood of the data given the latents, \(q(z|x)\) is the approximate posterior (encoder), \(p(z)\) is the prior over the latents, and \(\beta\) is a hyperparameter controlling the trade-off between reconstruction fidelity and latent regularization. Second, the diffusion model \(p_\theta(z_{t-1}|z_t)\) is trained in the fixed latent space, typically with the simplified denoising objective of predicting the noise added at a randomly sampled step \(t\).
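The denoising objective for the latent diffusion model itself can be sketched as a mean-squared error on predicted noise. The "network" below is a hypothetical zero-output stand-in, used only to make the loss computation concrete; a real denoiser is a time-conditioned neural network.

```python
import numpy as np

# Sketch of the simplified denoising objective used to train p_theta:
# predict the noise eps added at step t, minimize the squared error.
rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def noise_pred(z_t, t):
    # Placeholder for the learned denoiser eps_theta(z_t, t);
    # in practice this is a (time-conditioned) neural network.
    return np.zeros_like(z_t)

z0 = rng.standard_normal(4)                                # clean latent
t = 500
eps = rng.standard_normal(z0.shape)                        # injected noise
z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

loss = np.mean((eps - noise_pred(z_t, t)) ** 2)            # DDPM-style loss
```

With the placeholder predictor the loss is simply the mean squared noise; training drives it down by making `noise_pred` track `eps`.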
Conditional Generation with LDMs
Encoding: Encode data samples into latent space using an encoder \(E\):
\[ z_0 = E(x) \]
Diffusion: Apply forward diffusion to obtain increasingly noisy latents \(z_1, z_2, \ldots, z_T\), with \(z_T\) approximately Gaussian.
Conditioned Denoising: Train the reverse process \(p_\theta\), optionally conditioned on side information \(c\) (e.g., a text embedding), to denoise latents back toward \(z_0\). At sampling time, each step draws
\[ z_{t-1} \sim p_\theta(z_{t-1}|z_t, c) \]
Decoding: Use decoder \(D\) to generate final samples from denoised latents:
\[ \hat{x} = D(z_0) \]
Optimization: Update parameters \(\theta\) by minimizing loss \(\mathcal{L}(\theta)\) using gradient descent
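The five stages above can be combined into one toy training step. Everything here is a hypothetical stand-in: the linear encoder/decoder, the single-matrix denoiser `theta`, and the learning rate are illustrative assumptions, not the architecture used in practice.

```python
import numpy as np

# End-to-end sketch of one training step following the five stages above.
rng = np.random.default_rng(0)

data_dim, latent_dim, T = 16, 4, 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)
W = rng.standard_normal((latent_dim, data_dim)) / np.sqrt(data_dim)

# Denoiser parameters theta: a single linear map, purely illustrative.
theta = np.zeros((latent_dim, latent_dim))

x = rng.standard_normal(data_dim)

# 1. Encoding: z0 = E(x)
z0 = W @ x
# 2. Diffusion: pick a random step t and noise the latent in closed form
t = int(rng.integers(T))
eps = rng.standard_normal(latent_dim)
z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
# 3. Denoising: predict the injected noise with the current model
eps_hat = theta @ z_t
# 5. Optimization: one gradient step on the mean squared error
lr = 0.1
grad = -2.0 * np.outer(eps - eps_hat, z_t) / latent_dim
theta -= lr * grad
# 4. Decoding (at sampling time): map a denoised latent back to data space
x_hat = np.linalg.pinv(W) @ z0
```

Repeating this step over many samples and random values of \(t\) is, in miniature, the LDM training loop.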