Wasserstein Distance and Path Coupling

# Wasserstein Distance and Path Coupling Consider a finite metric space $(\Omega, \rho)$. Let $\Delta(\Omega)$ be the set of probability distributions on $\Omega$ (which can be viewed as a simplex of dimension $|\Omega|-1$). For $\mu,\nu\in\Delta(\Omega)$ let $\Pi(\mu,\nu)$ be the set of couplings of $\mu$ and $\nu$, i.e. probabilities distributions on $\Omega\times\Omega$ with marginals $\mu$ and $\nu$. ::: spoiler *Equivalent definition of coupling* $\Pi(\mu,\nu)$ is also defined as the collection of random variables $(X,Y)$ on some probability space taking values in $\Omega\times\Omega$, such that $X$ has distribution $\mu$ and $Y$ has distribution $\nu$. To see the equivalence, let $(X,Y)$ be such a pair of random variables, then their joint law is in $\Pi(\mu,\nu)$ by definition. Conversely, given any $q\in\Pi(\mu,\nu)$, the identity function from the probability space $(\Omega\times\Omega, q)$ tp $\Omega\times\Omega$ gives such a pair of random variables. ::: ## Transportation metric / Wasserstein Distance The metric $\rho$ on $\Omega$ induces a metric on $\Delta(\Omega)$, called the transportation metric or Wasserstein distance: $$W_1(\mu,\nu):=\inf_{q\in\Pi(\mu,\nu)} \mathbb{E}_{(X,Y)\sim q} [\rho(X,Y)] = \inf_{q\in\Pi(\mu,\nu)} \sum_{x\in\Omega}\sum_{y\in\Omega} \rho(x,y)q(x,y).$$ Note that when $\rho(x,y):=\mathbb{1}_{x\neq y}$ is the discrete metric, the associated Wasserstein distance $W_1(\mu,\nu)=\|\mu-\nu\|_{\text{TV}}$ is just the total variation distance. :::spoiler *Why is it called the transportation metric?* Suppose we are given a unit of some material spread across $n$ locations where $\mu\in\Delta([n])$ specifies the proportion of the material at each site, and our goal is to relocate the material so that the configuration follows another given distribution $\nu\in\Delta([n])$. Moving a unit from site $i$ to $j$ incurs a cost of $\rho(i,j)$, and we want to achieve the transportation from $\mu$ to $\nu$ with minimal cost (hence the name optimal transport). For $i,j\in[n]$ let $p(i,j)$ be the proportion of material at site $i$ to be moved to site $j$, which must satisfy $$\nu(j)=\sum_{i=1}^n \mu(i)P(i,j).$$ The total cost when using $p$ is $$\sum_{i=1}^n\sum_{j=1}^n \rho(i,j)\mu(i)p(i,j).$$ Since $q(i,j):=\mu(i)p(i,j)\in\Pi(\mu,\nu)$ the problem is equivalent to finding $q(i,j)\in\Pi(\mu,\nu)$ that minimizes $\mathbb{E}_{(i,j)\sim q} [\rho(i,j)]$. ::: When infimum in the definition is attained by some $q_\star\in\Pi(\mu,\nu)$, $q_\star$ or equivalently its corresponding random variable pair $(X_\star, Y_\star)$ is called an $\rho$-optimal coupling of $\mu$ and $\nu$. There always exist some optimal coupling, since $\Pi(\mu,\nu)\subseteq \mathbb{R}^{|\Omega|^2}$ is a closed subset of the $|\Omega|^2-1$ dimensional simplex and hence is compact, and the function $$q\in \Pi(\mu,\nu)\mapsto \sum_{x,y\in\Omega}\rho(x,y)q(x,y)$$ is continuous. We now verify that $W_1$ is indeed a metric on $\Delta(\Omega)$. It is clearly nonnegative, and $W_1(\mu,\mu)=0$ for all $\mu$ since one can the coupling $(X,X)$ for $X\sim\mu$. If $W_1(\mu,\nu)=0$, let $q$ be an optimal coupling, so that $\mathbb{E}_q [\rho(X,Y)]=0$, so $\rho(X,Y)=0$ $q$-a.s. and hence $X=Y$ $q$-a.s., which implies $\mu=\nu$. That $W_1$ is symmetric follows from $\rho$ being symmetric. To see the triangle inequality, for $\mu,\nu,\eta\in\Delta(\Omega)$ let $p\in\Pi(\mu,\nu)$ and $q\in\Pi(\nu,\eta)$ be optimal couplings. Define the probability distribution $r$ on $\Omega^3$ by $$r(x,y,z):=\frac{p(x,y)q(y,z)}{\nu(y)},$$ so that the projections onto the $(x,y)$ and $(y,z)$ coordinates are $p$ and $q$ repsectively, and that the projection onto the $(x,z)$ coordinate is a coupling of $\mu$ and $\eta$. Now for $(X,Y,Z)\sim r$ we have $\rho(X,Z)\leq \rho(X,Y)+\rho(Y,Z)$ and hence by optimality $$W_1(\mu,\eta)\leq \mathbb{E}[\rho(X,Z)]\leq W_1(\mu,\nu)+W_1(\nu,\eta).$$ ## Wasserstein Distance Induced by Path Metrics Suppose we are given a connected undirected graph $(\Omega, E)$ where each edge $(x,y)\in E$ is associated with a length $l(x,y)\geq 1$. For any $x,y\in \Omega$ define the metric $\rho(x,y)$ to be the length of the shortest path from $x$ to $y$. Note that $\rho(x,y)\geq \mathbb{1}_{x\neq y}$ and hence for any pair $(X,Y)$ we have $$ \mathbb{P}(X\neq Y) = \mathbb{E}[\mathbb{1}_{X\neq Y}] \leq \mathbb{E}\rho(X,Y),$$ and minimizing over all couplings we get $$\|\mu-\nu\|_{\text{TV}} \leq W_1(\mu,\nu).$$ ## Path Coupling Suppose in addition to the weighted graph structure $(\Omega, E, l, \rho)$ above, we are given a Markov chain $P$ with state space $\Omega$ (which need not be related to the edge structure). The following theorem tells us that in order to control the Wasserstein distance induced by a path metric $\rho$, it suffices to construct couplings for neighboring pairs $(x,y)\in E$. ### Theorem 14.6 in [1] Suppose that for each $(x,y)\in E$ there exists a coupling $(X_1,Y_1)$ of the distributions $P(x,\cdot)$ and $P(y,\cdot)$ such that $$\mathbb{E}[\rho(X_1,Y_1)] \leq e^{-\alpha}\rho(x,y),$$ where $\alpha>0$ is some universal constant. Then for any $\mu,\nu\in\Delta(\Omega)$ we have $$W_1(\mu P, \nu P) \leq e^{-\alpha} W_1(\mu,\nu),$$ namely advancing the chain results in the exponential decay of $W_1$. :::spoiler *Remark* As an immediate corollary, we have $$d(t;x):=\|P^t(x,\cdot)-\pi \|_{\text{TV}}\leq e^{-\alpha t}\text{diam}(\Omega),$$ where $\pi$ is the unique stationary distribution (which exists due to the exponential decay of $W_1$ under $P$ and the fact that $\|\cdot\|_{\text{TV}}$ is complete) and $\text{diam}(\Omega):=\max_{x,y}\rho(x,y)$. In particular, we only assume that $(\Omega, E, l)$ is undirected, connected and $l(e)\geq 1$ for all $e\in E$ and NOT the irreducibility of $P$, so that $\pi$ could be supported on a proper subset of $\Omega$. ::: :::spoiler *Proof* For $x\in\Omega$ let $p_x:=P(x,\cdot)$. By assumption, for any $(a,b)\in E$ we have $W_1(p_a, p_b)\leq e^{-\alpha} l(a,b)$. For any $x,y\in\Omega$ choose a minimzing path from $x$ to $y$ and thus $$W_1(p_x, p_y) \leq e^{-\alpha}\rho(x,y).$$ To finish the proof we need to construct a coupling between $\mu P$ and $\nu P$. Let $\eta$ be an optimal coupling of $\mu$ and $\nu$ and let $\theta_{x,y}$ be an optimal coupling of $p_x$ and $p_y$. Then $$\theta:=\sum_{x,y\in\Omega} \eta(x,y)\theta_{x,y}$$ is a coupling of $\mu P$ and $\nu P$. Therefore $$\begin{align} W_1(\mu P, \nu P) & \leq \sum_{v,w\in\Omega} \rho(v,w)\theta(v,w) = \sum_{v,w\in\Omega} \sum_{x,y\in\Omega} \rho(v,w)\theta_{x,y}(v,w) \eta(x,y) \\ & \leq e^{-\alpha}\sum_{x,y\in\Omega} \rho(x,y)\eta(x,y) = e^{-\alpha} W_1(\mu,\nu). \end{align} $$ ::: # References [1] [Levin, David A., and Yuval Peres. Markov chains and mixing times. Vol. 107. American Mathematical Soc., 2017.](https://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf)