Introduction
Generative models often encounter significant challenges when dealing with data that lie on non-Euclidean manifolds. The primary issue is the manifold mismatch, where the latent space of these models is typically Euclidean. Data with intrinsic geometric and topological properties, such as those found on spheres, tori, or special orthogonal groups, cannot be accurately represented in Euclidean spaces. This mismatch leads to poor representation and significant reconstruction errors, as the models struggle to capture the true underlying structure of the data.
Another critical challenge is sampling and interpolation. In Euclidean spaces, these processes are straightforward, but they do not translate well to manifold spaces. For instance, linear interpolation between points on a sphere often results in paths that do not lie on the sphere, producing unrealistic interpolations and invalid samples. This issue is further compounded by the difficulties in computing the KL-divergence for manifold-valued latent spaces. Standard Euclidean-based approximations for KL-divergence can be inaccurate and lead to suboptimal training and convergence issues.
In particular, traditional VAEs often struggle with topological discrepancies between the manifold structure of the data and the Euclidean structure of the latent space. The Diffusion Variational Autoencoder (\( \Delta \)VAE) uses the transition kernels of Brownian motion on various manifolds, including spheres, tori, and projective spaces, to better model complex data topologies. The model seeks to align the latent space more closely with the inherent structure of the data.
Core Components and Mathematical Formulation
A traditional Variational Autoencoder contains:
- A latent space \( Z \)
- A prior probability distribution \( P_Z \) on \( Z \)
- A family of encoder distributions \( Q_\alpha(Z) \) on \( Z \), parameterized by \( \alpha \)
- A family of decoder distributions \( P_\beta(X) \) on the data space \( X \), parameterized by \( \beta \)
- Encoder and decoder neural networks
The objective is to minimize the negative evidence lower bound (ELBO):
\[ L(x) = -E_{z \sim Q_\alpha(x)} \left[ \log P_\beta(x|z) \right] + D_{KL}(Q_\alpha(x) \parallel P_Z), \]
where the first term, \( -E_{z \sim Q_\alpha(x)} \left[ \log P_\beta(x|z) \right] \), is the reconstruction error (the negative expected log-likelihood of \( x \) under the decoder) and \( D_{KL}(Q_\alpha(x) \parallel P_Z) \) is the Kullback-Leibler divergence between the approximate posterior \( Q_\alpha(x) \) and the prior \( P_Z \).
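To fix notation before moving to manifolds, the following is a minimal sketch of this negative ELBO in the standard Euclidean case, assuming a diagonal-Gaussian encoder with a standard-normal prior and a Bernoulli decoder (function and variable names are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO for a standard Euclidean VAE.

    Assumes Q_alpha(z|x) = N(mu, diag(exp(log_var))), P_Z = N(0, I),
    and a Bernoulli decoder, so the reconstruction term is binary cross-entropy.
    """
    # Reconstruction error: -E_{z ~ Q_alpha(x)}[log P_beta(x|z)], one-sample estimate.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(Q_alpha(x) || P_Z) in closed form for a diagonal Gaussian vs. standard normal.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```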
Geometric Foundations and Diffusion on Manifolds
A Riemannian manifold \((\mathcal{M}, g)\) is a smooth manifold \(\mathcal{M}\) equipped with a Riemannian metric \(g\), a smoothly varying positive-definite inner product on the tangent space at each point. The metric \(g\) allows for the measurement of geometric quantities like lengths, angles, and volumes, which are critical in defining and analyzing Brownian motion.
The probability measure on \( \mathcal{M} \) is defined in terms of the Riemannian volume form. The prior \( P_Z \) over the latent variables \( z \in \mathcal{M} \) is often taken to be the normalized Riemannian volume measure, denoted as \( \text{vol}_\mathcal{M} \). This measure respects the manifold's geometric structure.
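As a concrete instance, when \( \mathcal{M} \) is the unit sphere \( S^{d-1} \subset \mathbb{R}^d \), the normalized volume measure is simply the uniform distribution on the sphere, which can be sampled by normalizing an isotropic Gaussian vector. A minimal sketch (NumPy; the function name is illustrative):

```python
import numpy as np

def sample_uniform_sphere_prior(n_samples, d):
    """Draw samples from the normalized volume measure (uniform distribution) on S^{d-1}.

    An isotropic Gaussian in R^d is rotation-invariant, so normalizing its samples
    to unit length yields the uniform distribution on the unit sphere.
    """
    v = np.random.randn(n_samples, d)
    return v / np.linalg.norm(v, axis=1, keepdims=True)
```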
The encoder in a \( \Delta \)VAE maps input data \( x \in \mathcal{X} \) (data space) to a point on the manifold \( \mathcal{M} \). This is achieved through a neural network function \( f_\theta : \mathcal{X} \rightarrow \mathcal{M} \), parameterized by weights \( \theta \).
Mathematically, the encoder defines a family of probability distributions \( Q_\alpha(z|x) \) over \( \mathcal{M} \) for each input \( x \), where \( \alpha \) is a parameter derived from \( x \) via the encoder network.
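As an illustration of such an encoder, the sketch below assumes \( \mathcal{M} = S^{d-1} \) embedded in \( \mathbb{R}^d \) and maps inputs to a point on the sphere by normalizing the network output, together with a positive scalar that will play the role of the diffusion time \( t \) in the kernels defined later (layer sizes and names are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphereEncoder(nn.Module):
    """Encoder f_theta: X -> S^{d-1}, plus a positive diffusion-time parameter t."""

    def __init__(self, in_dim, latent_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)  # ambient-space output
        self.t_head = nn.Linear(hidden, 1)            # pre-activation for diffusion time

    def forward(self, x):
        h = self.body(x)
        mu = F.normalize(self.mu_head(h), dim=-1)     # project output onto S^{d-1}
        t = F.softplus(self.t_head(h)) + 1e-6         # strictly positive diffusion time
        return mu, t
```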
Brownian motion on \( \mathcal{M} \) is a continuous stochastic process \( \{X_t\}_{t \geq 0} \) with paths that are almost surely continuous, and which possesses the Markov property — the future evolution depends only on the present state, not on the path that got there. On a Riemannian manifold, Brownian motion is also characterized as the diffusion process generated by the Laplace-Beltrami operator.
The Laplace-Beltrami operator \( \Delta \) is the natural extension of the Laplacian to Riemannian manifolds and plays a central role in defining the dynamics of Brownian motion. It is defined using the metric \( g \) as follows:
- Gradient: The gradient of a function \( f: \mathcal{M} \rightarrow \mathbb{R} \), denoted \( \nabla f \), is the vector field on \( \mathcal{M} \) pointing in the direction of the greatest rate of increase of \( f \), with the magnitude of the rate of increase.
- Divergence: The divergence of a vector field \( X \) on \( \mathcal{M} \), denoted \( \text{div}(X) \), measures the rate of change of volume of the flow generated by \( X \).
- Laplace-Beltrami Operator: For a smooth function \( f \), the Laplace-Beltrami operator is given by: \[ \Delta f = \text{div}(\nabla f) \] It represents the divergence of the gradient of \( f \), effectively measuring how much \( f \) deviates locally from its mean value.
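In local coordinates \( (x^1, \dots, x^n) \), with metric components \( g_{ij} \), inverse metric \( g^{ij} \), and \( |g| = \det(g_{ij}) \), this composition of divergence and gradient takes the explicit form
\[ \Delta f = \frac{1}{\sqrt{|g|}} \, \partial_i \left( \sqrt{|g|} \, g^{ij} \, \partial_j f \right), \]
which reduces to the ordinary Euclidean Laplacian when \( g_{ij} = \delta_{ij} \).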
Brownian Motion on Manifolds
Brownian motion on \(\mathcal{M}\) can be characterized as the solution to the stochastic differential equation (SDE):
\[ dX_t = \sqrt{2} dB_t, \]
where \(dB_t\) represents the infinitesimal Brownian increments on the tangent space of \(\mathcal{M}\) at \(X_t\). However, this representation is only formal: the increments must be interpreted in a way that respects the manifold's structure.
A more intrinsic characterization involves the Stratonovich SDE:
\[ dX_t = \sqrt{2} \sum_{i=1}^n V_i(X_t) \circ dB_t^i, \]
where \( \{V_i\} \) are vector fields forming an orthonormal frame around \(X_t\), and \( \circ \) denotes the Stratonovich integration. The \(V_i\) fields ensure that the motion respects the manifold's geometry by adjusting the direction of the Brownian motion according to the curvature and topology.
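In practice, sampling from such an SDE on an embedded manifold is often approximated by a geodesic random walk: take small Gaussian steps, project them onto the tangent space, and follow the exponential map back onto the manifold. A minimal sketch for the unit sphere (NumPy; the step count is an illustrative assumption, and the step scaling matches the \( \sqrt{2}\,dB_t \) convention above):

```python
import numpy as np

def brownian_motion_on_sphere(x0, total_time, n_steps=100):
    """Geodesic random-walk approximation of Brownian motion on S^{d-1}.

    Each step draws a Gaussian increment, projects it onto the tangent space
    at the current point, and follows the exponential map (a great circle).
    """
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    dt = total_time / n_steps
    d = x.shape[0]
    for _ in range(n_steps):
        xi = np.sqrt(2.0 * dt) * np.random.randn(d)   # increment scaled as sqrt(2) dB_t
        v = xi - np.dot(xi, x) * x                    # project onto tangent space at x
        norm_v = np.linalg.norm(v)
        if norm_v > 0:
            # Exponential map on the sphere: move along the great circle in direction v.
            x = np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v
    return x
```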
We introduce Brownian motion on a Riemannian manifold to define the encoder distributions. Its transition density solves the heat equation on the manifold, providing a natural way to define transition kernels for the encoder. For small diffusion times, this density is well approximated by a Gaussian in the geodesic distance:
\[ Q_\alpha(z) = p(t, x, y) \approx \frac{1}{(4\pi t)^{d/2}} \exp \left( -\frac{d(x, y)^2}{4t} \right), \]
where \( d(x, y) \) is the geodesic distance between points \( x \) and \( y \) on the manifold; this expression is exact on Euclidean space and serves as a small-time approximation of the heat kernel on a general manifold.
The transition probability \(p(t, x, y)\), which is the density function of the process being at point \(y\) at time \(t\) starting from \(x\), satisfies the heat equation associated with the Laplace-Beltrami operator:
\[ \frac{\partial p}{\partial t} = \Delta p. \]
The heat kernel on \(\mathcal{M}\) provides a fundamental solution to this heat equation and encodes comprehensive information about the geometry and topology of \(\mathcal{M}\). It is crucial for constructing transition kernels in models like \(\Delta\)VAE, as it helps in defining a mathematically grounded mechanism for sampling and propagating through the latent manifold space.
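To make the transition kernel concrete, the following sketch evaluates the small-time Gaussian approximation above on the unit sphere, using the geodesic distance \( d(x, y) = \arccos(\langle x, y \rangle) \). This is an unnormalized small-time approximation, not the exact spherical heat kernel, and the function name is illustrative:

```python
import numpy as np

def approx_heat_kernel_sphere(x, y, t, dim=2):
    """Small-time Gaussian approximation of the heat kernel p(t, x, y) on S^dim.

    x, y: unit vectors in the ambient space; t: diffusion time.
    Uses the geodesic distance d(x, y) = arccos(<x, y>).
    """
    cos_angle = np.clip(np.dot(x, y), -1.0, 1.0)   # guard against rounding outside [-1, 1]
    geodesic_dist = np.arccos(cos_angle)
    return (4.0 * np.pi * t) ** (-dim / 2.0) * np.exp(-geodesic_dist**2 / (4.0 * t))
```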
Implementation Details:
- Latent Space \( Z \): Chosen to be a Riemannian manifold.
- Prior \( P_Z \): The normalized Riemannian volume measure.
- Encoder Distributions \( Q_\alpha(Z) \): Transition kernels of Brownian motion.
- Decoder Distributions \( P_\beta(X) \): Generally assumed to be Gaussian with mean \( \beta \).
For implementing VAEs with manifold latent spaces, the reparametrization trick is crucial for gradient-based optimization. To define a reparametrization suitable for general submanifolds of Euclidean space, we leverage embeddings of manifolds into Euclidean space (the Whitney and Nash embedding theorems) and construct a reparametrization that works with the geometry of \( \mathcal{M} \). For example, if \( \mathcal{M} \) is a sphere, spherical coordinates and the exponential map can be used for reparametrization; a sketch of the spherical case follows below.
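One possible reparametrization for the sphere, sketched below under the assumption \( \mathcal{M} = S^{d-1} \subset \mathbb{R}^d \), samples ambient Gaussian noise, projects it onto the tangent space at the encoder output, and follows the exponential map, so that gradients flow through both the mean point and the scale (which plays the role of \( \sqrt{2t} \) in the kernel above):

```python
import torch

def reparametrize_sphere(mu, scale):
    """Reparametrized sample on S^{d-1} via the exponential map at mu.

    mu:    (batch, d) unit vectors produced by the encoder.
    scale: (batch, 1) positive standard deviations, e.g. sqrt(2 * t).
    """
    eps = torch.randn_like(mu) * scale                     # ambient Gaussian noise
    v = eps - (eps * mu).sum(dim=-1, keepdim=True) * mu    # project onto tangent space at mu
    norm_v = v.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    # Exponential map on the sphere: follow the great circle in direction v.
    return torch.cos(norm_v) * mu + torch.sin(norm_v) * v / norm_v
```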
In addition, we approximate or analytically compute the KL divergence using the properties of the heat kernel on \( \mathcal{M} \), considering the manifold's curvature and other geometric properties.
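As a sketch of such an approximation, under the small-time Gaussian form of the heat kernel given earlier and with the normalized volume measure as the prior, the KL term reduces to the negative differential entropy of the kernel plus a volume constant (curvature corrections are ignored here):
\[ D_{KL}\big(Q_\alpha(x) \,\|\, P_Z\big) = \int_\mathcal{M} q \log q \, d\mathrm{vol} + \log \mathrm{vol}(\mathcal{M}) \;\approx\; \log \mathrm{vol}(\mathcal{M}) - \frac{d}{2} \log(4\pi e t), \]
so the KL penalty decreases as the diffusion time \( t \) grows and the encoder distribution spreads out toward the uniform prior; curvature-dependent correction terms can be included for larger \( t \).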