###### tags: `one-offs` `transport` `bregman`
# Bregman Cost Functions in Optimal Transport
**Overview**: In this note, I present a problem in Optimal Transport which involves Bregman divergences, and sketch its solution.
## Problem
Consider a measure transport problem with cost given by a Bregman divergence, i.e.
\begin{align}
c\left(x,y\right)=F\left(y\right)-F\left(x\right)-\left\langle \nabla F\left(x\right),y-x\right\rangle
\end{align}
with $F$ e.g. strictly convex and $C^{2}$, so that one is aiming to solve
\begin{align}
\min_{\gamma\in\Gamma\left(\mu,\nu\right)}\left\{ C\left(\gamma\right):=\int\gamma\left(\mathrm{d}x,\mathrm{d}y\right)\cdot c\left(x,y\right)\right\} .
\end{align}
**Task**: Show by a (nonlinear) change of coordinates that this is equivalent to solving an $L^{2}$ OT problem for another pair of probability measures.
**Bonus Task**: Deduce a Brenier-type structure theorem for the $c$-optimal coupling between the measures (being appropriately generous with assumptions, etc.).
## Solution
Write $\hat{x}=\nabla F\left(x\right)$, $\hat{\mu}=\left(\nabla F\right)_{\#}\mu$, so that $c\left(x,y\right)=:\hat{c}\left(\hat{x},y\right)=F^{*}\left(\hat{x}\right)+F\left(y\right)-\left\langle \hat{x},y\right\rangle$.
Similarly, translate between $\Gamma\left(\mu,\nu\right)$ and $\Gamma\left(\hat{\mu},\nu\right)$ through the map $\nabla F\otimes\mathrm{Id}$, writing $\int\gamma\left(\mathrm{d}x,\mathrm{d}y\right)\cdot c\left(x,y\right) = \int\hat{\gamma}\left(\mathrm{d}\hat{x},\mathrm{d}y\right)\cdot\hat{c}\left(\hat{x},y\right)$.
Note now that
\begin{align}
\hat{c}\left(\hat{x},y\right) &=F^{*}\left(\hat{x}\right)+F\left(y\right)-\left\langle \hat{x},y\right\rangle \\
&=\left(F^{*}\left(\hat{x}\right)-\frac{1}{2}\left|\hat{x}\right|_{2}^{2}\right)+\left(F\left(y\right)-\frac{1}{2}\left|y\right|_{2}^{2}\right)+\frac{1}{2}\left|x-y\right|_{2}^{2},
\end{align}
and so
\begin{align}
C\left(\gamma\right)=\hat{C}\left(\hat{\gamma}\right)=&\hat{\mu}\left(F^{*}-\frac{1}{2}\left|\cdot\right|_{2}^{2}\right)+\nu\left(F-\frac{1}{2}\left|\cdot\right|_{2}^{2}\right)\\
&+\int\hat{\gamma}\left(\mathrm{d}\hat{x},\mathrm{d}y\right)\cdot\left(\frac{1}{2}\left|x-y\right|_{2}^{2}\right).
\end{align}
Noting that the first two terms on the right-hand side are independent of the chosen coupling, one sees that the cost is minimised by taking $\hat{\gamma}$ to be the $L^{2}$-optimal transport coupling between $\hat{\mu}$ and $\nu$. As such, naive 'Bregman-ising' of the optimal transport seems not to lead to essentially new problems.
For the bonus problem, note that the Brenier theorem says that $L^{2}$-optimal transport from $\hat{\mu}$ to $\nu$ can be attained through a map of the form $T\left(x\right)=\nabla\Phi\left(x\right)$, for some convex $\Phi$. Noting that $\hat{\mu}$ is obtained from $\mu$ by transport under the map $\nabla F$, it follows that the Bregman-optimal mapping from $\mu$ to $\nu$ takes the form $\hat{T}\left(x\right)=\left(\nabla\Phi\circ\nabla F\right)\left(x\right)$, and we conclude.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**