---
title: Kraskov
breaks: true
slideOptions:
  theme: white
  slideNumber: "c/t"
  center: false
  transition: none
  keyboard: true
  width: "93%"
  height: "100%"
---

<style>
/* basic design */
.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6, .reveal section, .reveal table, .reveal li, .reveal blockquote, .reveal th, .reveal td, .reveal p {
  font-family: 'Meiryo UI', 'Source Sans Pro', Helvetica, sans-serif, 'Helvetica Neue', 'Helvetica', 'Arial', 'Hiragino Sans', 'ヒラギノ角ゴシック', YuGothic, 'Yu Gothic';
  text-align: left;
  line-height: 1.6;
  letter-spacing: normal;
  text-shadow: none;
  word-wrap: break-word;
  color: #444;
}
.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 {font-weight: bold;}
.reveal h1, .reveal h2, .reveal h3 {color: #008080;} /* #2980b9 */
.reveal th {background: #DDD;}
.reveal section img {background:none; border:none; box-shadow:none; max-width: 95%; max-height: 95%;}
.reveal blockquote {width: 90%; padding: 0.5vw 3.0vw;}
.reveal table {margin: 1.0vw auto;}
.reveal code {line-height: 1.2;}
.reveal p, .reveal li {padding: 0vw; margin: 0vw;}
.reveal .box {margin: -0.5vw 1.5vw 2.0vw -1.5vw; padding: 0.5vw 1.5vw 0.5vw 1.5vw; background: #EEE; border-radius: 1.5vw;}
/* table design */
.reveal table {background: #f5f5f5;}
.reveal th {background: #444; color: #fff;}
.reveal td {position: relative; transition: all 300ms;}
.reveal tbody:hover td { color: transparent; text-shadow: 0 0 3px #aaa;}
.reveal tbody:hover tr:hover td {color: #444; text-shadow: 0 1px 0 #fff;}
/* blockquote design */
.reveal blockquote {
  width: 90%;
  padding: 0.5vw 0 0.5vw 6.0vw;
  font-style: italic;
  background: #f5f5f5;
}
.reveal blockquote:before {
  position: absolute;
  top: 0.1vw;
  left: 1vw;
  content: "\f10d";
  font-family: FontAwesome;
  color: #2980b9;
  font-size: 3.0vw;
}
/* font size */
.reveal h1 {font-size: 5.0vw;}
.reveal h2 {font-size: 4.0vw;}
.reveal h3 {font-size: 2.8vw;}
.reveal h4 {font-size: 2.6vw;}
.reveal h5 {font-size: 2.4vw;}
.reveal h6 {font-size: 2.2vw;}
.reveal section, .reveal table, .reveal li, .reveal blockquote, .reveal th, .reveal td, .reveal p {font-size: 2.2vw;}
.reveal code {font-size: 1.6vw;}
/* new color */
.red {color: #EE6557;}
.blue {color: #16A6B6;}
/* split slide */
#right {left: -18.33%; text-align: left; float: left; width: 50%; z-index: -10;}
#left {left: 31.25%; text-align: left; float: left; width: 50%; z-index: -10;}
/* figure alignment */
img[alt$=">"] {float: right;}
img[alt$="<"] {float: left;}
img[alt$="><"] {
  display: block;
  max-width: 100%;
  height: auto;
  margin: auto;
  float: none!important;
}
</style>

<style>
/* specific design */
.reveal h2 {
  padding: 0 1.5vw;
  margin: 0.0vw 0 2.0vw -2.0vw;
  border-left: solid 1.2vw #008080; /* #2980b9 */
  /* border-bottom: solid 0.8vw #d7d7d7; */
}
</style>

# KSG mutual information estimator

<br><br>

#### 2020.03.16

### Shunsuke Kamiya

---

## Contents

* Mutual information
* Kozachenko-Leonenko entropy estimator
* Kraskov-Stögbauer-Grassberger MI estimator

---

## Goal

* **<font color="teal">Mutual information</font>** between **continuous** variables
$$
I(X;Y) = \int \int dx\,dy\, P(x,y) \log \frac{P(x,y)}{P(x)P(y)}
$$
* Estimating MI from a finite sample is not always easy.
* How can we estimate MI?
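---

## Warm-up: why naive estimation is hard

Below is a small illustrative sketch, not part of the original slides, comparing the closed-form MI of a correlated bivariate Gaussian, $I = -\frac{1}{2}\log(1-\rho^2)$, with a naive 2-D histogram plug-in estimate. NumPy is assumed; the correlation $\rho = 0.8$, the sample size, and the bin count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 2000

# Sample a correlated bivariate Gaussian, whose MI is known in closed form.
x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n).T
exact_mi = -0.5 * np.log(1.0 - rho ** 2)            # in nats

# Naive plug-in estimate: bin the data, normalize, sum p * log(p / (px * py)).
pxy, _, _ = np.histogram2d(x, y, bins=20)
pxy /= pxy.sum()
px = pxy.sum(axis=1, keepdims=True)                 # marginal of X
py = pxy.sum(axis=0, keepdims=True)                 # marginal of Y
mask = pxy > 0
hist_mi = np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask]))

print(f"closed form : {exact_mi:.3f} nats")
print(f"histogram   : {hist_mi:.3f} nats (biased, bin-dependent)")
```

The histogram value shifts with the choice of bins, which is the inaccuracy that motivates the $k$-nearest-neighbor approach on the following slides.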
---

## Conventional methods

* Cumulant expansions in ICA – very crude
* Entropy maximization in ICA – very crude
* Histogram method – inaccurate

$$
\boldsymbol{\downarrow}
$$

* Kraskov et al. used the $k$-nearest-neighbor method to overcome these problems

---

## 3H estimator

* Decompose MI into entropies
$$
I(X,Y) = H(X) + H(Y) - H(X,Y)
$$
* Each entropy can be estimated with $k$-nearest-neighbor distances

---

## Kozachenko-Leonenko entropy estimator

Before going into MI: an entropy estimator based on $k$-nearest-neighbor distances was proposed in 1987.

<div style="text-align: left;padding : 10px; border: solid 3px teal;">

$$
\hat{H}(X) = -\psi(k) + \psi(N) + \log c_D + \frac{D}{N}\sum_{i=1}^N \log \epsilon(i)
$$

</div>
<br>

* $X$: random variable with dimension $D$
* $\psi$: digamma function, $\psi(z) = d\log \Gamma(z)/dz$
* $N$: number of data points
* $c_D$: volume of the unit ball in dimension $D$
* $\epsilon(i)$: twice the distance from $x_i$ to its $k$-th nearest neighbor

---

## Details on KL estimator

* $X$: **continuous** random variable with values in some metric space (i.e. there is a distance $||x-x'||$)
* **<font color="teal">Shannon's (differential) entropy</font>**, where $\mu(x)$ is the probability density:
$$
H(X) = - \int dx\, \mu(x) \log \mu(x) \tag{1}
$$
* We want to estimate $H(X)$ from a random sample $(x_1, x_2, ..., x_N)$
* Equation (1) is (minus) the average of $\log \mu(x)$. So, given unbiased estimators $\widehat{\log \mu(x)}$, we obtain an unbiased estimator of the entropy
$$
\hat{H}(X) = -N^{-1}\sum_{i=1}^N \widehat{\log\mu(x_i)}
$$

---

To construct the estimate $\widehat{\log \mu(x_i)}$, we define

* $P_k(\epsilon)$: probability density for the distance from $x_i$ to its $k$-th nearest neighbor to be $\epsilon/2$
* $P_k(\epsilon)\, d\epsilon$: the chance that, around $x_i$, there are
    * $k-1$ points strictly inside the $\epsilon$-ball (radius $\epsilon/2$)
    * one point at distance $r \in [\epsilon/2, \epsilon/2+d\epsilon/2]$
    * the remaining $N-k-1$ points outside the $\epsilon$-ball
* $p_i(\epsilon)$: the probability mass of the $\epsilon$-ball centered at $x_i$ (the probability that a given point falls inside it), $p_i = \int_{||\xi - x_i||<\epsilon/2}d\xi\,\mu(\xi)$

---

Then one has
$$
P_k(\epsilon)d\epsilon = \frac{(N-1)!}{(k-1)!(N-k-1)!} \cdot dp_i(\epsilon)\cdot p_i(\epsilon)^{k-1}\cdot (1-p_i(\epsilon))^{N-k-1}
$$
or, writing $p_i(\epsilon) = p_i$,
$$
P_k(\epsilon) = \frac{(N-1)!}{(k-1)!(N-k-1)!} \frac{dp_i}{d\epsilon} p_i^{k-1} (1-p_i)^{N-k-1}
$$

* One can check that this is normalized by applying the beta-function formula (note that as $\epsilon$ runs from $0$ to $\infty$, $p_i$ runs from $0$ to $1$)
$$
\small
\begin{eqnarray}
\int_0^{\infty} d\epsilon P_k(\epsilon) &=& \frac{(N-1)!}{(k-1)!(N-k-1)!} B(k, N-k) \\
&=& \frac{(N-1)!}{(k-1)!(N-k-1)!} \frac{(k-1)!(N-k-1)!}{(N-1)!} = 1
\end{eqnarray}
$$

---

Now, assuming $\mu(x)$ to be constant over the entire $\epsilon$-ball, one gets
$$
p_i \sim c_D\epsilon(i)^D \mu(x_i) \tag{2}
$$

* $D$: dimension of $x$
* $\epsilon(i)$: twice the distance from $x_i$ to its $k$-th nearest neighbor
* $c_D$: volume of the $D$-dimensional unit ball ($c_D=1$ for the maximum norm, $c_D = \frac{\pi ^{D/2}}{\Gamma(1+D/2)\cdot 2^D}$ for the Euclidean norm)

<br>

To obtain an unbiased estimate of the mean of $\log \mu(x)$, we take the logarithm of equation (2).

---

Taking the $\log$ of (2) gives
$$
\log \mu(x_i) \sim \log p_i - D \cdot \log \epsilon(i)-\log c_D
$$

<!-- Its mean over $i = 1,2, \cdots , N$ (with $\epsilon$ a function of $i$) is
$$
\textrm{E}(\log \mu(x) ) \sim \textrm{E}(\log p_i) - D \cdot \textrm{E}(\log \epsilon(i))-\log c_D
$$
the left hand side is **none other than the entropy estimator** we wanted -->

<br>

*<font color="red">My question: why can we mix the expectation with the sample mean on both sides? &rarr; by averaging over the $N$ samples, this value becomes an estimate of the <b>differential</b> entropy</font>*
<br><br>

---

Lastly, one can compute the expectation of $\log p_i(\epsilon)$ as follows
<font color="red">(presumably making an approximation for $\log p_i$)</font>

$$
\begin{eqnarray}
&&\mathbb{E}_{P_k(\epsilon)}(\log p_i) = \int_0^\infty d\epsilon P_k(\epsilon)\log p_i(\epsilon) \\
&=& k \binom{N-1}{k} \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\log p \\
&=& \psi(k) - \psi(N) \tag{*} \\
&=& -(H_{N-1} - H_{k-1}) = -\sum_{l=k} ^{N-1} \frac{1}{l}
\end{eqnarray}
$$

where $H_{n}$ is the $n$-th harmonic number, $\displaystyle H_n = \sum_{l=1} ^n \frac{1}{l}$.
<br>(Equation (\*) is proved in the Appendix.)

---

Substituting this result gives
$$
\log \mu(x_i) \sim \psi(k) - \psi(N) - D \cdot \log \epsilon(i)-\log c_D
$$

Finally, averaging over $i=1,\cdots,N$ gives the **Kozachenko-Leonenko entropy estimator**:
<br><br>

<div style="text-align: left;padding : 10px; border: solid 3px teal;">
<u><b> KL estimator (repeated)</b></u>
<br/>

$$
\hat{H}(X) = -\psi(k) + \psi(N) + \log c_D + D \langle \log \epsilon_k (i) \rangle_{i=1,\cdots,N}
$$

where $\epsilon_k (i)$ is twice the distance from $x_i$ to its $k$-th nearest neighbor.
</div>
<br>

* KL estimator: $p_i$ is approximated as $p_i \sim \exp(\psi(k))/\exp(\psi(N))$
* This differs slightly from the ordinary $k$-nearest-neighbor method, which approximates $p_i$ as $p_i \sim k/N$
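---

## Sketch: KL estimator in code

A minimal NumPy/SciPy sketch of the boxed estimator above, not part of the original slides. It uses the maximum norm, so $\log c_D = 0$; the function name `kl_entropy`, the choice $k=3$, and the Gaussian sanity check are my own illustrative choices, and duplicate data points (which would give $\epsilon(i)=0$) are assumed absent.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko entropy estimate in nats (maximum-norm version)."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    tree = cKDTree(x)
    # k + 1 because the closest "neighbor" returned is the query point itself.
    dist, _ = tree.query(x, k=k + 1, p=np.inf)
    eps = 2.0 * dist[:, -1]            # epsilon(i) = twice the k-NN distance
    # -psi(k) + psi(N) + log c_D + (D/N) * sum_i log eps(i), with log c_D = 0
    return -digamma(k) + digamma(n) + d * np.mean(np.log(eps))

# Sanity check: a standard 1-D Gaussian has H = 0.5 * log(2*pi*e) ~ 1.419 nats.
rng = np.random.default_rng(1)
print(kl_entropy(rng.standard_normal((5000, 1)), k=3))
```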
---

## KSG Mutual information estimator

With the same basic idea ($k$ nearest neighbors) and some modifications, one obtains an estimator for mutual information.

<div style="text-align: left;padding : 10px; border: solid 3px teal;">

$$
\hat{I}(X;Y) = \psi(k) - \langle \psi(n_x+1) + \psi(n_y+1) \rangle_N + \psi(N)
$$

</div>
<br>

* $n_x(i)$: number of points $x_j$ whose distance from $x_i$ is strictly less than $\epsilon(i)/2$, where $\epsilon(i)/2$ is the distance from $(x_i, y_i)$ to its $k$-th nearest neighbor in the joint space of $X$ and $Y$
* $n_y(i)$: same as above with $x \rightarrow y$

---

## Details on KSG MI estimator

We extend the KL estimator to the joint space by combining the two variables $X$ and $Y$ into $Z = (X,Y)$
$$
\hat{H}(X,Y) = -\psi(k) + \psi(N) + \log (c_{D_X}c_{D_Y}) + \frac{D_X+D_Y}{N}\sum_{j=1}^N \log \epsilon _k(j)
$$

*<font color="red">My question: does $c_{D_{X,Y}}=c_{D_X}c_{D_Y}$ hold only when $c_D$ is one-dimensional?</font>*

---

To obtain $I(X;Y)$, we substitute this into
$$
I(X,Y) = H(X) + H(Y) - H(X,Y)
$$

<!-- However, to do this we would have to use different distance scales in the joint & marginal spaces. The distance to the k-th neighbor in the joint space is larger than the distances to the neighbors in the marginal spaces. The bias results from the nonuniformity of the density. Since this effect depends on the k-th neighbor distances, the biases in $\hat{H}(X)$, $\hat{H}(Y)$, and $\hat{H}(X,Y)$ would be very different and would thus not cancel. -->

To keep the bias small, we arrange for the last terms $\frac{D}{N}\sum_{j=1}^N \log \epsilon_k(j)$ to cancel among the three entropy estimates.
<br><br>

Suppose the Chebyshev (max-norm) distance $\epsilon_k/2$ from a point $z_i = (x_i, y_i)$ to its $k$-th nearest neighbor is realized by the difference of the $x$ coordinates, and let that neighbor be the $k_x(i)$-th nearest neighbor of $x_i$ in the $x$ coordinate. Then $k_x(i) = n_x(i) +1$ if there are altogether $n_x(i)$ points in the range from $x=x_i - \epsilon_k/2$ to $x=x_i + \epsilon_k/2$.

![image alt >](https://i.imgur.com/v1us4au.png)

---

Then the KL estimator can be re-written as
$$
\hat{H}(X) = -\psi(n_x(i)+1) + \psi(N) + \log c_{D_X} + \frac{D_X}{N}\sum_{j=1}^N \log \epsilon_k(j)
$$

Averaging over $i = 1,\cdots,N$ gives
$$
\hat{H}(X) = -\langle \psi(n_x(i)+1) \rangle _{i=1,\cdots,N} + \psi(N) + \log c_{D_X} + D_X \langle \log \epsilon_k(j) \rangle_{j = 1,\cdots,N}
$$

<font color="red"><i> My question: I don't quite get why we have to take an average; maybe to reduce the error when $k$ is large, i.e. when $n_y(i) \rightarrow \infty$?? </i> </font><br>

---

$\hat{H}(Y)$: although $\epsilon/2$ is not exactly the distance to the $(n_y(i)+1)$-st nearest neighbor in the $y$ coordinate, we treat it as if it were (Kraskov argues this approximation becomes valid as $n_y(i)\rightarrow \infty$, and thus also as $N \rightarrow \infty$)
$$
\hat{H}(Y) \sim-\langle \psi(n_y(i)+1) \rangle _{i=1,\cdots,N} + \psi(N) + \log c_{D_Y} + D_Y \langle \log \epsilon_k(j) \rangle_{j = 1,\cdots,N}
$$

Substituting these into $I = H(X)+H(Y)-H(X,Y)$ gives:
<br><br>

<div style="text-align: left;padding : 10px; border: solid 3px teal;">
<u><b> KSG MI estimator (repeated)</b></u>
<br/>

$$
\hat{I}(X;Y) = \psi(k) - \langle \psi(n_x+1) + \psi(n_y+1) \rangle_N + \psi(N)
$$

</div>

---

There are two approximations in the KSG estimator:

1. $p_i \sim c_D\epsilon(i)^D\mu(x_i)$
2. Using the same distance scale $\epsilon$ in the joint and marginal coordinates

<br><br>

Although the errors in $\hat{H}(X)$, $\hat{H}(Y)$, and $\hat{H}(X,Y)$ do not cancel exactly, this is better than using different length scales in the three estimates.

<!-- KL estimator can be inaccurate when
* the **dimensionality** is high
* the density in the region inside the $\epsilon$-ball ($||\xi - x_i||<\epsilon/2$) is highly **non-uniform** (with high variability) -->
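---

## Sketch: KSG estimator in code

A minimal sketch of the boxed estimator above, not part of the original slides, assuming SciPy. The function name `ksg_mi`, the choice $k=3$, and the Gaussian sanity check are my own; the requirement "distance strictly less than $\epsilon(i)/2$" is approximated by shrinking the search radius by a tiny constant.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """KSG mutual-information estimate in nats (max norm in all spaces)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    z = np.hstack([x, y])                                   # joint space Z = (X, Y)

    # epsilon(i)/2 = distance from z_i to its k-th neighbor in the joint space
    eps_half = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]

    # n_x(i), n_y(i): points strictly within eps(i)/2 in each marginal space
    x_tree, y_tree = cKDTree(x), cKDTree(y)
    nx = np.array([len(x_tree.query_ball_point(x[i], eps_half[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(y_tree.query_ball_point(y[i], eps_half[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])

    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Sanity check against the closed-form Gaussian MI, -0.5 * log(1 - rho^2).
rng = np.random.default_rng(2)
rho = 0.8
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=2000)
print(ksg_mi(xy[:, :1], xy[:, 1:], k=3), -0.5 * np.log(1 - rho ** 2))
```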
---

## Application 1. Gomez-Herrero's entropy combination

Let $V$ denote a random variable; an **entropy combination** is defined as
$$
C(V_{\mathcal{L}_1}, \cdots ,V_{\mathcal{L}_p}) = \sum_{i=1}^p s_i H(V_{\mathcal{L}_i}) - H(V)
$$

With this framework, one can apply the KL estimator not only to MI but also to **conditional MI** and **transfer entropy**/**conditional transfer entropy** (a code sketch of the conditional-MI case follows on the next slide).
$$
\hat{I}(X;Y|Z) = \psi(k) - \langle \psi(k_{xz}(n)) + \psi(k_{zy}(n)) - \psi(k_{z}(n)) \rangle_n \\
\hat{T}(X;Y) = \psi(k) - \langle \psi(k_{wx}(n)) + \psi(k_{xy}(n)) - \psi(k_{x}(n)) \rangle_n \\
\hat{T}(X;Y|Z) = \psi(k) - \langle \psi(k_{wxz}(n)) + \psi(k_{xzy}(n)) - \psi(k_{xz}(n)) \rangle_n
$$
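---

## Sketch: conditional MI from neighbor counts

A minimal sketch, not part of the original slides, of the conditional-MI case. It uses the Frenzel-Pompe form $\psi(k) - \langle \psi(n_{xz}+1) + \psi(n_{yz}+1) - \psi(n_{z}+1) \rangle$, one instance of the entropy-combination idea above; the neighbor counts $n_{\cdot}$ play the role of the $k_{\cdot}(n)$ in the previous slide's formulas, and the strict-inequality handling uses the same radius-shrinking approximation as before. The function names `cmi_knn` and `_count_within` and the choice $k=3$ are my own. Transfer entropy can then be obtained by applying the same combination to suitably lagged coordinates.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def _count_within(points, radii):
    """Points strictly within radii[i] (max norm) of points[i], excluding i itself."""
    tree = cKDTree(points)
    return np.array([len(tree.query_ball_point(points[i], radii[i] - 1e-12, p=np.inf)) - 1
                     for i in range(len(points))])

def cmi_knn(x, y, z, k=3):
    """Conditional MI I(X;Y|Z) in nats via k-NN neighbor counts (Frenzel-Pompe form)."""
    x, y, z = (np.asarray(a, dtype=float).reshape(len(a), -1) for a in (x, y, z))
    xyz = np.hstack([x, y, z])
    # eps(i)/2: distance to the k-th neighbor in the full joint space (max norm)
    eps_half = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    n_xz = _count_within(np.hstack([x, z]), eps_half)
    n_yz = _count_within(np.hstack([y, z]), eps_half)
    n_z = _count_within(z, eps_half)
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1))
```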
---

## Application 2. Gao's MI in discrete-continuous mixture

Gao et al. (2017) extend this $k$-nearest-neighbor approach to mutual information between variables that are mixtures of discrete and continuous components.

---

## Appendix. Proof of equation (\*)

We want to prove equation (\*), i.e.
$$
k \binom{N-1}{k} \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\log p = \psi(k) - \psi(N)
$$

Let us write
$$
I_{N,k} = \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\log p
$$

We first find a recurrence relating $I_{N,k}$ to $I_{N,k+1}$.

---

Integrating by parts,
$$
\begin{eqnarray}
&& I_{N,k} = \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\frac{d}{dp}(p\log p - p) \\
&=& \left[ p^{k-1}(1-p)^{N-k-1} (p\log p - p) \right]_0^1 - \int_0^1 dp \frac{d}{dp}\left( p^{k-1}(1-p)^{N-k-1} \right)(p\log p - p) \\
&=& - \int_0^1 dp \{ (k-1) p^{k-2}(1-p)^{N-k-1} - (N-k-1) p^{k-1}(1-p)^{N-k-2} \} (p\log p - p)\\
&=& - (k-1)I_{N,k} + (N-k-1)I_{N,k+1} + (k-1)B(k,N-k) \\
&& \hskip 13em - (N-k-1)B(k+1,N-k-1) \tag{A1}
\end{eqnarray}
$$
where $B(a,b)$ is the beta function.

---

Using the identity
$$
xB(x,y+1) = yB(x+1,y)
$$
with $x=k$ and $y=N-k-1$, we get $B(k+1,N-k-1) = \frac{k}{N-k-1}B(k,N-k)$, so
$$
\begin{eqnarray}
&&(k-1)B(k,N-k) - (N-k-1)B(k+1,N-k-1) \\
&=& (k-1)B(k,N-k) - (N-k-1)\cdot \frac{k}{N-k-1} \cdot B(k,N-k) \\
&=& -B(k,N-k)
\end{eqnarray}
$$

Substituting this into (A1) gives
$$
I_{N,k} = - (k-1)I_{N,k} + (N-k-1)I_{N,k+1} - B(k,N-k)
$$

---

Thus
$$
k I_{N,k} = (N-k-1)I_{N,k+1} - B(k,N-k)
$$
and one can prove (\*) by mathematical induction (the base case $k=1$ can be checked directly).

**Induction step.** Fix $N$, assume (\*) holds for $k$, i.e. $k\binom{N-1}{k}I_{N,k} = \psi(k)-\psi(N) = -\sum_{l=k}^{N-1}\frac{1}{l}$, and consider $k \rightarrow k+1$:
$$
\begin{eqnarray}
&&(k+1) \binom{N-1}{k+1} I_{N,k+1} \\
&=&(k+1) \binom{N-1}{k+1} \frac{1}{N-k-1} \left( kI_{N,k} + B(k,N-k) \right) \\
&=& \binom{N-1}{k} \left( -\frac{\sum_{l=k}^{N-1}\frac{1}{l}}{\binom{N-1}{k}}+ B(k,N-k) \right) \\
&=& -\sum_{l=k}^{N-1}\frac{1}{l} + \frac{1}{k} = -\sum_{l=k+1}^{N-1}\frac{1}{l} = \psi(k+1) - \psi(N)
\end{eqnarray}
$$
where we used $(k+1)\binom{N-1}{k+1} = (N-k-1)\binom{N-1}{k}$ and $\binom{N-1}{k}B(k,N-k) = \frac{1}{k}$.

---

## References

* Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. *Physical Review E*, 69, 066138.
* Gomez-Herrero, G., et al. (2010).
* Gao, W., Kannan, S., Oh, S., & Viswanath, P. (2017). Estimating Mutual Information for Discrete-Continuous Mixtures.
* Takashina (2017). 離散・連続混合の相互情報量の推定 (Estimating Mutual Information for Discrete-Continuous Mixtures), slides on SlideShare.
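---

## Backup: numerical check of equation (\*)

A small numerical check of equation (\*), not part of the original slides, assuming SciPy; $N=20$ and $k=3$ are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import comb, digamma

N, k = 20, 3

# Left-hand side: k * C(N-1, k) * integral_0^1 p^(k-1) (1-p)^(N-k-1) log(p) dp
integral, _ = quad(lambda p: p ** (k - 1) * (1 - p) ** (N - k - 1) * np.log(p), 0, 1)
lhs = k * comb(N - 1, k) * integral

# Right-hand side: psi(k) - psi(N)
rhs = digamma(k) - digamma(N)

print(lhs, rhs)   # the two values agree to numerical precision
```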
{"metaMigratedAt":"2023-06-15T05:16:33.134Z","metaMigratedFrom":"YAML","title":"Kraskov","breaks":true,"slideOptions":"{\"theme\":\"white\",\"slideNumber\":\"c/t\",\"center\":false,\"transition\":\"none\",\"keyboard\":true,\"width\":\"93%\",\"height\":\"100%\"}","contributors":"[{\"id\":\"6c59db73-7c46-4cee-b4fd-a484c59e2ca5\",\"add\":21103,\"del\":6159}]"}