---
title: Kraskov
breaks: true
slideOptions:
  theme: white
  slideNumber: "c/t"
  center: false
  transition: none
  keyboard: true
  width: "93%"
  height: "100%"
---

<style>
/* basic design */
.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6, .reveal section, .reveal table, .reveal li, .reveal blockquote, .reveal th, .reveal td, .reveal p {
  font-family: 'Meiryo UI', 'Source Sans Pro', Helvetica, sans-serif, 'Helvetica Neue', 'Helvetica', 'Arial', 'Hiragino Sans', 'ヒラギノ角ゴシック', YuGothic, 'Yu Gothic';
  text-align: left;
  line-height: 1.6;
  letter-spacing: normal;
  text-shadow: none;
  word-wrap: break-word;
  color: #444;
}
.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 {font-weight: bold;}
.reveal h1, .reveal h2, .reveal h3 {color: #008080;} /* #2980b9 */
.reveal th {background: #DDD;}
.reveal section img {background:none; border:none; box-shadow:none; max-width: 95%; max-height: 95%;}
.reveal blockquote {width: 90%; padding: 0.5vw 3.0vw;}
.reveal table {margin: 1.0vw auto;}
.reveal code {line-height: 1.2;}
.reveal p, .reveal li {padding: 0vw; margin: 0vw;}
.reveal .box {margin: -0.5vw 1.5vw 2.0vw -1.5vw; padding: 0.5vw 1.5vw 0.5vw 1.5vw; background: #EEE; border-radius: 1.5vw;}
/* table design */
.reveal table {background: #f5f5f5;}
.reveal th {background: #444; color: #fff;}
.reveal td {position: relative; transition: all 300ms;}
.reveal tbody:hover td { color: transparent; text-shadow: 0 0 3px #aaa;}
.reveal tbody:hover tr:hover td {color: #444; text-shadow: 0 1px 0 #fff;}
/* blockquote design */
.reveal blockquote {
  width: 90%;
  padding: 0.5vw 0 0.5vw 6.0vw;
  font-style: italic;
  background: #f5f5f5;
}
.reveal blockquote:before {
  position: absolute;
  top: 0.1vw;
  left: 1vw;
  content: "\f10d";
  font-family: FontAwesome;
  color: #2980b9;
  font-size: 3.0vw;
}
/* font size */
.reveal h1 {font-size: 5.0vw;}
.reveal h2 {font-size: 4.0vw;}
.reveal h3 {font-size: 2.8vw;}
.reveal h4 {font-size: 2.6vw;}
.reveal h5 {font-size: 2.4vw;}
.reveal h6 {font-size: 2.2vw;}
.reveal section, .reveal table, .reveal li, .reveal blockquote, .reveal th, .reveal td, .reveal p {font-size: 2.2vw;}
.reveal code {font-size: 1.6vw;}
/* new color */
.red {color: #EE6557;}
.blue {color: #16A6B6;}
/* split slide */
#right {left: -18.33%; text-align: left; float: left; width: 50%; z-index: -10;}
#left {left: 31.25%; text-align: left; float: left; width: 50%; z-index: -10;}
/* figure alignment */
img[alt$=">"] {float: right;}
img[alt$="<"] {float: left;}
img[alt$="><"] {
  display: block;
  max-width: 100%;
  height: auto;
  margin: auto;
  float: none!important;
}
</style>

<style>
/* specific design */
.reveal h2 {
  padding: 0 1.5vw;
  margin: 0.0vw 0 2.0vw -2.0vw;
  border-left: solid 1.2vw #008080; /* #2980b9 */
  /* border-bottom: solid 0.8vw #d7d7d7; */
}
</style>

# KSG mutual information estimator

<br><br>

#### 2020.03.16

### Shunsuke Kamiya

---

## Contents

* Mutual information
* Kozachenko-Leonenko entropy estimator
* Kraskov-Stögbauer-Grassberger MI estimator

---

## Goal

* **<font color="teal">Mutual information</font>** between **continuous** variables
$$
I(X;Y) = \int \int dx\,dy\, P(x,y) \log \frac{P(x,y)}{P(x)P(y)}
$$
* Estimating MI from a finite sample is not always easy.
* How can we estimate MI?
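---

## Warm-up: why naive estimation is hard

Below is a small illustrative sketch, not part of the original slides, comparing the closed-form MI of a correlated bivariate Gaussian, $I = -\frac{1}{2}\log(1-\rho^2)$, with a naive 2-D histogram plug-in estimate. NumPy is assumed; the correlation $\rho = 0.8$, the sample size, and the bin count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 2000

# Sample a correlated bivariate Gaussian, whose MI is known in closed form.
x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n).T
exact_mi = -0.5 * np.log(1.0 - rho ** 2)            # in nats

# Naive plug-in estimate: bin the data, normalize, sum p * log(p / (px * py)).
pxy, _, _ = np.histogram2d(x, y, bins=20)
pxy /= pxy.sum()
px = pxy.sum(axis=1, keepdims=True)                 # marginal of X
py = pxy.sum(axis=0, keepdims=True)                 # marginal of Y
mask = pxy > 0
hist_mi = np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask]))

print(f"closed form : {exact_mi:.3f} nats")
print(f"histogram   : {hist_mi:.3f} nats (biased, bin-dependent)")
```

The histogram value shifts with the choice of bins, which is the inaccuracy that motivates the $k$-nearest-neighbor approach on the following slides.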
---

## Conventional methods

* Cumulant expansions in ICA – very crude
* Entropy maximization in ICA – very crude
* Histogram method – inaccurate

$$
\boldsymbol{\downarrow}
$$

* Kraskov et al. used the $k$-nearest-neighbor method to overcome these problems

---

## 3H estimator

* Decompose MI into entropies
$$
I(X,Y) = H(X) + H(Y) - H(X,Y)
$$
* Each entropy can be estimated with $k$-nearest-neighbor distances

---

## Kozachenko-Leonenko entropy estimator

Before going into MI: an entropy estimator based on $k$-nearest-neighbor distances was proposed in 1987.

<div style="text-align: left;padding : 10px; border: solid 3px teal;">

$$
\hat{H}(X) = -\psi(k) + \psi(N) + \log c_D + \frac{D}{N}\sum_{i=1}^N \log \epsilon(i)
$$

</div>
<br>

* $X$: random variable with dimension $D$
* $\psi$: digamma function, $\psi(z) = d\log \Gamma(z)/dz$
* $N$: number of data points
* $c_D$: volume of the unit ball in dimension $D$
* $\epsilon(i)$: twice the distance from $x_i$ to its $k$-th nearest neighbor

---

## Details on KL estimator

* $X$: **continuous** random variable with values in some metric space (i.e. there is a distance $||x-x'||$)
* **<font color="teal">Shannon's (differential) entropy</font>**, where $\mu(x)$ is the probability density:
$$
H(X) = - \int dx\, \mu(x) \log \mu(x) \tag{1}
$$
* We want to estimate $H(X)$ from a random sample $(x_1, x_2, ..., x_N)$
* Equation (1) is (minus) the average of $\log \mu(x)$. So, given unbiased estimators $\widehat{\log \mu(x)}$, we obtain an unbiased estimator of the entropy
$$
\hat{H}(X) = -N^{-1}\sum_{i=1}^N \widehat{\log\mu(x_i)}
$$

---

To construct the estimate $\widehat{\log \mu(x_i)}$, we define

* $P_k(\epsilon)$: probability density for the distance from $x_i$ to its $k$-th nearest neighbor to be $\epsilon/2$
* $P_k(\epsilon)\, d\epsilon$: the chance that, around $x_i$, there are
    * $k-1$ points strictly inside the $\epsilon$-ball (radius $\epsilon/2$)
    * one point at distance $r \in [\epsilon/2, \epsilon/2+d\epsilon/2]$
    * the remaining $N-k-1$ points outside the $\epsilon$-ball
* $p_i(\epsilon)$: the probability mass of the $\epsilon$-ball centered at $x_i$ (the probability that a given point falls inside it), $p_i = \int_{||\xi - x_i||<\epsilon/2}d\xi\,\mu(\xi)$

---

Then one has
$$
P_k(\epsilon)d\epsilon = \frac{(N-1)!}{(k-1)!(N-k-1)!} \cdot dp_i(\epsilon)\cdot p_i(\epsilon)^{k-1}\cdot (1-p_i(\epsilon))^{N-k-1}
$$
or, writing $p_i(\epsilon) = p_i$,
$$
P_k(\epsilon) = \frac{(N-1)!}{(k-1)!(N-k-1)!} \frac{dp_i}{d\epsilon} p_i^{k-1} (1-p_i)^{N-k-1}
$$

* One can check that this is normalized by applying the beta-function formula (note that as $\epsilon$ runs from $0$ to $\infty$, $p_i$ runs from $0$ to $1$)
$$
\small
\begin{eqnarray}
\int_0^{\infty} d\epsilon P_k(\epsilon) &=& \frac{(N-1)!}{(k-1)!(N-k-1)!} B(k, N-k) \\
&=& \frac{(N-1)!}{(k-1)!(N-k-1)!} \frac{(k-1)!(N-k-1)!}{(N-1)!} = 1
\end{eqnarray}
$$

---

Now, assuming $\mu(x)$ to be constant over the entire $\epsilon$-ball, one gets
$$
p_i \sim c_D\epsilon(i)^D \mu(x_i) \tag{2}
$$

* $D$: dimension of $x$
* $\epsilon(i)$: twice the distance from $x_i$ to its $k$-th nearest neighbor
* $c_D$: volume of the $D$-dimensional unit ball ($c_D=1$ for the maximum norm, $c_D = \frac{\pi ^{D/2}}{\Gamma(1+D/2)\cdot 2^D}$ for the Euclidean norm)

<br>

To obtain an unbiased estimate of the mean of $\log \mu(x)$, we take the logarithm of equation (2).

---

Taking the $\log$ of (2) gives
$$
\log \mu(x_i) \sim \log p_i - D \cdot \log \epsilon(i)-\log c_D
$$

<!-- Its mean over $i = 1,2, \cdots , N$ (with $\epsilon$ a function of $i$) is
$$
\textrm{E}(\log \mu(x) ) \sim \textrm{E}(\log p_i) - D \cdot \textrm{E}(\log \epsilon(i))-\log c_D
$$
the left hand side is **none other than the entropy estimator** we wanted -->

<br>

*<font color="red">My question: why can we mix the expectation with the sample mean on both sides? &rarr; by averaging over the $N$ samples, this value becomes an estimate of the <b>differential</b> entropy</font>*
<br><br>

---

Lastly, one can compute the expectation of $\log p_i(\epsilon)$ as follows
<font color="red">(presumably making an approximation for $\log p_i$)</font>

$$
\begin{eqnarray}
&&\mathbb{E}_{P_k(\epsilon)}(\log p_i) = \int_0^\infty d\epsilon P_k(\epsilon)\log p_i(\epsilon) \\
&=& k \binom{N-1}{k} \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\log p \\
&=& \psi(k) - \psi(N) \tag{*} \\
&=& -(H_{N-1} - H_{k-1}) = -\sum_{l=k} ^{N-1} \frac{1}{l}
\end{eqnarray}
$$

where $H_{n}$ is the $n$-th harmonic number, $\displaystyle H_n = \sum_{l=1} ^n \frac{1}{l}$.
<br>(Equation (\*) is proved in the Appendix.)

---

Substituting this result gives
$$
\log \mu(x_i) \sim \psi(k) - \psi(N) - D \cdot \log \epsilon(i)-\log c_D
$$

Finally, averaging over $i=1,\cdots,N$ gives the **Kozachenko-Leonenko entropy estimator**:
<br><br>

<div style="text-align: left;padding : 10px; border: solid 3px teal;">
<u><b> KL estimator (repeated)</b></u>
<br/>

$$
\hat{H}(X) = -\psi(k) + \psi(N) + \log c_D + D \langle \log \epsilon_k (i) \rangle_{i=1,\cdots,N}
$$

where $\epsilon_k (i)$ is twice the distance from $x_i$ to its $k$-th nearest neighbor.
</div>
<br>

* KL estimator: $p_i$ is approximated as $p_i \sim \exp(\psi(k))/\exp(\psi(N))$
* This differs slightly from the ordinary $k$-nearest-neighbor method, which approximates $p_i$ as $p_i \sim k/N$
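---

## Sketch: KL estimator in code

A minimal NumPy/SciPy sketch of the boxed estimator above, not part of the original slides. It uses the maximum norm, so $\log c_D = 0$; the function name `kl_entropy`, the choice $k=3$, and the Gaussian sanity check are my own illustrative choices, and duplicate data points (which would give $\epsilon(i)=0$) are assumed absent.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko entropy estimate in nats (maximum-norm version)."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    tree = cKDTree(x)
    # k + 1 because the closest "neighbor" returned is the query point itself.
    dist, _ = tree.query(x, k=k + 1, p=np.inf)
    eps = 2.0 * dist[:, -1]            # epsilon(i) = twice the k-NN distance
    # -psi(k) + psi(N) + log c_D + (D/N) * sum_i log eps(i), with log c_D = 0
    return -digamma(k) + digamma(n) + d * np.mean(np.log(eps))

# Sanity check: a standard 1-D Gaussian has H = 0.5 * log(2*pi*e) ~ 1.419 nats.
rng = np.random.default_rng(1)
print(kl_entropy(rng.standard_normal((5000, 1)), k=3))
```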
---

## KSG Mutual information estimator

With the same basic idea ($k$ nearest neighbors) and some modifications, one obtains an estimator for mutual information.

<div style="text-align: left;padding : 10px; border: solid 3px teal;">

$$
\hat{I}(X;Y) = \psi(k) - \langle \psi(n_x+1) + \psi(n_y+1) \rangle_N + \psi(N)
$$

</div>
<br>

* $n_x(i)$: number of points $x_j$ whose distance from $x_i$ is strictly less than $\epsilon(i)/2$, where $\epsilon(i)/2$ is the distance from $(x_i, y_i)$ to its $k$-th nearest neighbor in the joint space of $X$ and $Y$
* $n_y(i)$: same as above with $x \rightarrow y$

---

## Details on KSG MI estimator

We extend the KL estimator to the joint space by combining the two variables $X$ and $Y$ into $Z = (X,Y)$
$$
\hat{H}(X,Y) = -\psi(k) + \psi(N) + \log (c_{D_X}c_{D_Y}) + \frac{D_X+D_Y}{N}\sum_{j=1}^N \log \epsilon _k(j)
$$

*<font color="red">My question: does $c_{D_{X,Y}}=c_{D_X}c_{D_Y}$ hold only when $c_D$ is one-dimensional?</font>*

---

To obtain $I(X;Y)$, we substitute this into
$$
I(X,Y) = H(X) + H(Y) - H(X,Y)
$$

<!-- However, to do this we would have to use different distance scales in the joint & marginal spaces. The distance to the k-th neighbor in the joint space is larger than the distances to the neighbors in the marginal spaces. The bias results from the nonuniformity of the density. Since this effect depends on the k-th neighbor distances, the biases in $\hat{H}(X)$, $\hat{H}(Y)$, and $\hat{H}(X,Y)$ would be very different and would thus not cancel. -->

To keep the bias small, we arrange for the last terms $\frac{D}{N}\sum_{j=1}^N \log \epsilon_k(j)$ to cancel among the three entropy estimates.
<br><br>

Suppose the Chebyshev (max-norm) distance $\epsilon_k/2$ from a point $z_i = (x_i, y_i)$ to its $k$-th nearest neighbor is realized by the difference of the $x$ coordinates, and let that neighbor be the $k_x(i)$-th nearest neighbor of $x_i$ in the $x$ coordinate. Then $k_x(i) = n_x(i) +1$ if there are altogether $n_x(i)$ points in the range from $x=x_i - \epsilon_k/2$ to $x=x_i + \epsilon_k/2$.

![image alt >](https://i.imgur.com/v1us4au.png)

---

Then the KL estimator can be re-written as
$$
\hat{H}(X) = -\psi(n_x(i)+1) + \psi(N) + \log c_{D_X} + \frac{D_X}{N}\sum_{j=1}^N \log \epsilon_k(j)
$$

Averaging over $i = 1,\cdots,N$ gives
$$
\hat{H}(X) = -\langle \psi(n_x(i)+1) \rangle _{i=1,\cdots,N} + \psi(N) + \log c_{D_X} + D_X \langle \log \epsilon_k(j) \rangle_{j = 1,\cdots,N}
$$

<font color="red"><i> My question: I don't quite get why we have to take an average; maybe to reduce the error when $k$ is large, i.e. when $n_y(i) \rightarrow \infty$?? </i> </font><br>

---

$\hat{H}(Y)$: although $\epsilon/2$ is not exactly the distance to the $(n_y(i)+1)$-st nearest neighbor in the $y$ coordinate, we treat it as if it were (Kraskov argues this approximation becomes valid as $n_y(i)\rightarrow \infty$, and thus also as $N \rightarrow \infty$)
$$
\hat{H}(Y) \sim-\langle \psi(n_y(i)+1) \rangle _{i=1,\cdots,N} + \psi(N) + \log c_{D_Y} + D_Y \langle \log \epsilon_k(j) \rangle_{j = 1,\cdots,N}
$$

Substituting these into $I = H(X)+H(Y)-H(X,Y)$ gives:
<br><br>

<div style="text-align: left;padding : 10px; border: solid 3px teal;">
<u><b> KSG MI estimator (repeated)</b></u>
<br/>

$$
\hat{I}(X;Y) = \psi(k) - \langle \psi(n_x+1) + \psi(n_y+1) \rangle_N + \psi(N)
$$

</div>

---

There are two approximations in the KSG estimator:

1. $p_i \sim c_D\epsilon(i)^D\mu(x_i)$
2. Using the same distance scale $\epsilon$ in the joint and marginal coordinates

<br><br>

Although the errors in $\hat{H}(X)$, $\hat{H}(Y)$, and $\hat{H}(X,Y)$ do not cancel exactly, this is better than using different length scales in the three estimates.

<!-- KL estimator can be inaccurate when
* the **dimensionality** is high
* the density in the region inside the $\epsilon$-ball ($||\xi - x_i||<\epsilon/2$) is highly **non-uniform** (with high variability) -->
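---

## Sketch: KSG estimator in code

A minimal sketch of the boxed estimator above, not part of the original slides, assuming SciPy. The function name `ksg_mi`, the choice $k=3$, and the Gaussian sanity check are my own; the requirement "distance strictly less than $\epsilon(i)/2$" is approximated by shrinking the search radius by a tiny constant.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """KSG mutual-information estimate in nats (max norm in all spaces)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    z = np.hstack([x, y])                                   # joint space Z = (X, Y)

    # epsilon(i)/2 = distance from z_i to its k-th neighbor in the joint space
    eps_half = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]

    # n_x(i), n_y(i): points strictly within eps(i)/2 in each marginal space
    x_tree, y_tree = cKDTree(x), cKDTree(y)
    nx = np.array([len(x_tree.query_ball_point(x[i], eps_half[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(y_tree.query_ball_point(y[i], eps_half[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])

    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Sanity check against the closed-form Gaussian MI, -0.5 * log(1 - rho^2).
rng = np.random.default_rng(2)
rho = 0.8
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=2000)
print(ksg_mi(xy[:, :1], xy[:, 1:], k=3), -0.5 * np.log(1 - rho ** 2))
```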
---

## Application 1. Gomez-Herrero's entropy combination

Let $V$ denote a random variable; an **entropy combination** is defined as
$$
C(V_{\mathcal{L}_1}, \cdots ,V_{\mathcal{L}_p}) = \sum_{i=1}^p s_i H(V_{\mathcal{L}_i}) - H(V)
$$

With this framework, one can apply the KL estimator not only to MI but also to **conditional MI** and **transfer entropy**/**conditional transfer entropy** (a code sketch of the conditional-MI case follows on the next slide).
$$
\hat{I}(X;Y|Z) = \psi(k) - \langle \psi(k_{xz}(n)) + \psi(k_{zy}(n)) - \psi(k_{z}(n)) \rangle_n \\
\hat{T}(X;Y) = \psi(k) - \langle \psi(k_{wx}(n)) + \psi(k_{xy}(n)) - \psi(k_{x}(n)) \rangle_n \\
\hat{T}(X;Y|Z) = \psi(k) - \langle \psi(k_{wxz}(n)) + \psi(k_{xzy}(n)) - \psi(k_{xz}(n)) \rangle_n
$$
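---

## Sketch: conditional MI from neighbor counts

A minimal sketch, not part of the original slides, of the conditional-MI case. It uses the Frenzel-Pompe form $\psi(k) - \langle \psi(n_{xz}+1) + \psi(n_{yz}+1) - \psi(n_{z}+1) \rangle$, one instance of the entropy-combination idea above; the neighbor counts $n_{\cdot}$ play the role of the $k_{\cdot}(n)$ in the previous slide's formulas, and the strict-inequality handling uses the same radius-shrinking approximation as before. The function names `cmi_knn` and `_count_within` and the choice $k=3$ are my own. Transfer entropy can then be obtained by applying the same combination to suitably lagged coordinates.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def _count_within(points, radii):
    """Points strictly within radii[i] (max norm) of points[i], excluding i itself."""
    tree = cKDTree(points)
    return np.array([len(tree.query_ball_point(points[i], radii[i] - 1e-12, p=np.inf)) - 1
                     for i in range(len(points))])

def cmi_knn(x, y, z, k=3):
    """Conditional MI I(X;Y|Z) in nats via k-NN neighbor counts (Frenzel-Pompe form)."""
    x, y, z = (np.asarray(a, dtype=float).reshape(len(a), -1) for a in (x, y, z))
    xyz = np.hstack([x, y, z])
    # eps(i)/2: distance to the k-th neighbor in the full joint space (max norm)
    eps_half = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    n_xz = _count_within(np.hstack([x, z]), eps_half)
    n_yz = _count_within(np.hstack([y, z]), eps_half)
    n_z = _count_within(z, eps_half)
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1))
```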
---

## Application 2. Gao's MI in discrete-continuous mixture

Gao et al. (2017) extend this $k$-nearest-neighbor approach to mutual information between variables that are mixtures of discrete and continuous components.

---

## Appendix. Proof of equation (\*)

We want to prove equation (\*), i.e.
$$
k \binom{N-1}{k} \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\log p = \psi(k) - \psi(N)
$$

Let us write
$$
I_{N,k} = \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\log p
$$

We first find a recurrence relating $I_{N,k}$ to $I_{N,k+1}$.

---

Integrating by parts,
$$
\begin{eqnarray}
&& I_{N,k} = \int_0^1 dp\, p^{k-1}(1-p)^{N-k-1}\frac{d}{dp}(p\log p - p) \\
&=& \left[ p^{k-1}(1-p)^{N-k-1} (p\log p - p) \right]_0^1 - \int_0^1 dp \frac{d}{dp}\left( p^{k-1}(1-p)^{N-k-1} \right)(p\log p - p) \\
&=& - \int_0^1 dp \{ (k-1) p^{k-2}(1-p)^{N-k-1} - (N-k-1) p^{k-1}(1-p)^{N-k-2} \} (p\log p - p)\\
&=& - (k-1)I_{N,k} + (N-k-1)I_{N,k+1} + (k-1)B(k,N-k) \\
&& \hskip 13em - (N-k-1)B(k+1,N-k-1) \tag{A1}
\end{eqnarray}
$$
where $B(a,b)$ is the beta function.

---

Using the identity
$$
xB(x,y+1) = yB(x+1,y)
$$
with $x=k$ and $y=N-k-1$, we get $B(k+1,N-k-1) = \frac{k}{N-k-1}B(k,N-k)$, so
$$
\begin{eqnarray}
&&(k-1)B(k,N-k) - (N-k-1)B(k+1,N-k-1) \\
&=& (k-1)B(k,N-k) - (N-k-1)\cdot \frac{k}{N-k-1} \cdot B(k,N-k) \\
&=& -B(k,N-k)
\end{eqnarray}
$$

Substituting this into (A1) gives
$$
I_{N,k} = - (k-1)I_{N,k} + (N-k-1)I_{N,k+1} - B(k,N-k)
$$

---

Thus
$$
k I_{N,k} = (N-k-1)I_{N,k+1} - B(k,N-k)
$$
and one can prove (\*) by mathematical induction (the base case $k=1$ can be checked directly).

**Induction step.** Fix $N$, assume (\*) holds for $k$, i.e. $k\binom{N-1}{k}I_{N,k} = \psi(k)-\psi(N) = -\sum_{l=k}^{N-1}\frac{1}{l}$, and consider $k \rightarrow k+1$:
$$
\begin{eqnarray}
&&(k+1) \binom{N-1}{k+1} I_{N,k+1} \\
&=&(k+1) \binom{N-1}{k+1} \frac{1}{N-k-1} \left( kI_{N,k} + B(k,N-k) \right) \\
&=& \binom{N-1}{k} \left( -\frac{\sum_{l=k}^{N-1}\frac{1}{l}}{\binom{N-1}{k}}+ B(k,N-k) \right) \\
&=& -\sum_{l=k}^{N-1}\frac{1}{l} + \frac{1}{k} = -\sum_{l=k+1}^{N-1}\frac{1}{l} = \psi(k+1) - \psi(N)
\end{eqnarray}
$$
where we used $(k+1)\binom{N-1}{k+1} = (N-k-1)\binom{N-1}{k}$ and $\binom{N-1}{k}B(k,N-k) = \frac{1}{k}$.

---

## References

* Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. *Physical Review E*, 69, 066138.
* Gomez-Herrero, G., et al. (2010).
* Gao, W., Kannan, S., Oh, S., & Viswanath, P. (2017). Estimating Mutual Information for Discrete-Continuous Mixtures.
* Takashina (2017). 離散・連続混合の相互情報量の推定 (Estimating Mutual Information for Discrete-Continuous Mixtures), slides on SlideShare.
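---

## Backup: numerical check of equation (\*)

A small numerical check of equation (\*), not part of the original slides, assuming SciPy; $N=20$ and $k=3$ are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import comb, digamma

N, k = 20, 3

# Left-hand side: k * C(N-1, k) * integral_0^1 p^(k-1) (1-p)^(N-k-1) log(p) dp
integral, _ = quad(lambda p: p ** (k - 1) * (1 - p) ** (N - k - 1) * np.log(p), 0, 1)
lhs = k * comb(N - 1, k) * integral

# Right-hand side: psi(k) - psi(N)
rhs = digamma(k) - digamma(N)

print(lhs, rhs)   # the two values agree to numerical precision
```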
{"metaMigratedAt":"2023-06-15T05:16:33.134Z","metaMigratedFrom":"YAML","title":"Kraskov","breaks":true,"slideOptions":"{\"theme\":\"white\",\"slideNumber\":\"c/t\",\"center\":false,\"transition\":\"none\",\"keyboard\":true,\"width\":\"93%\",\"height\":\"100%\"}","contributors":"[{\"id\":\"6c59db73-7c46-4cee-b4fd-a484c59e2ca5\",\"add\":21103,\"del\":6159}]"}