---
tags: information theory
---
# Basic Properties of Mutual Information
## Definition
### Mutual Information
$X$, $Y$: random variables
Define the mutual information between $X$ and $Y$, denoted by $I(X:Y)$, as
$$
I(X:Y) = H(X) - H(X|Y)
$$
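As a quick numerical sanity check, here is a minimal Python sketch of this definition, computing $I(X:Y) = H(X) - H(X|Y)$ for a small joint pmf. It assumes numpy is available; the helper name `H` and the toy distribution `pxy` are arbitrary choices, not something fixed by the notes.
```python
import numpy as np

def H(p):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy joint pmf p(x, y): rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])

px = pxy.sum(axis=1)   # marginal of X
py = pxy.sum(axis=0)   # marginal of Y

# H(X|Y) = sum_y p(y) H(X | Y=y)
HX_given_Y = sum(py[y] * H(pxy[:, y] / py[y]) for y in range(pxy.shape[1]))

print(f"I(X:Y) = {H(px) - HX_given_Y:.4f} bits")
```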
### Conditional Mutual Information
$X$, $Y$, $Z$: random variables
Then we define the conditional mutual information $I(X:Y|Z)$ as
$$
\begin{align}
I(X:Y|Z) &:= H(X|Z) - H(X|Y,Z)
\notag \\
& = \sum_{z} P(z) I(X:Y|Z=z)
\end{align}
$$
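The two expressions in this definition can be checked against each other numerically. The sketch below is a minimal Python check on a random joint pmf; the alphabet sizes, seed, and helper names (`H`, `mi`) are arbitrary choices.
```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) pmf."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Random joint pmf p(x, y, z) on a 2 x 3 x 2 alphabet.
pxyz = rng.random((2, 3, 2))
pxyz /= pxyz.sum()

pxz = pxyz.sum(axis=1)          # p(x, z)
pyz = pxyz.sum(axis=0)          # p(y, z)
pz  = pxyz.sum(axis=(0, 1))     # p(z)

# First form: I(X:Y|Z) = H(X|Z) - H(X|Y,Z), using H(A|B) = H(A,B) - H(B).
form1 = (H(pxz) - H(pz)) - (H(pxyz) - H(pyz))

# Second form: I(X:Y|Z) = sum_z P(z) I(X:Y|Z=z).
def mi(pab):
    """I(A:B) = H(A) + H(B) - H(A,B) for a 2-D joint pmf."""
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

form2 = sum(pz[z] * mi(pxyz[:, :, z] / pz[z]) for z in range(pxyz.shape[2]))

print(form1, form2)   # agree up to floating-point error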
### Remark
$$
I(X:Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) = I(Y:X)
$$
Similarly,
$$
I(X:Y|Z) = I(Y:X|Z)
$$
## Basic Properties
### Proposition 1
$$
I(X:Y) \ge 0
$$
and
$$
I(X:Y) \le H(X)
$$
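A quick empirical check of Proposition 1 is sketched below in Python: for many randomly generated joint pmfs, $I(X:Y)$ stays between $0$ and $H(X)$. The alphabet sizes, number of trials, and tolerance are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(1)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

for _ in range(1000):
    pxy = rng.random((4, 5))
    pxy /= pxy.sum()                      # random joint pmf p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    I = H(px) + H(py) - H(pxy)            # I(X:Y)
    assert -1e-12 <= I <= H(px) + 1e-12   # 0 <= I(X:Y) <= H(X)

print("0 <= I(X:Y) <= H(X) held in every trial")
```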
### Proposition 2
$I(X:Y|Z) = 0$ if and only if $X$ and $Y$ are conditionally independent given $Z$, that is, $X-Z-Y$ [forms a Markov chain](https://hackmd.io/40iNuBaSTrmnjdgwWIfb1g#Corollary2).
#### proof
$(\Rightarrow)$ If $I(X:Y|Z) = 0$, then every term of $\sum_{z} P(z) I(X:Y|Z=z)$ must vanish, since each $I(X:Y|Z=z) \ge 0$; and $I(X:Y|Z=z) = 0$ forces $X$ and $Y$ to be independent given $Z=z$ (the equality case of the nonnegativity of mutual information).
$(\Leftarrow)$ Suppose $X-Z-Y$, so that $p(y|X=x, Z=z) = p(y|Z=z)$. Then
$$
\begin{align}
I(X:Y|Z) &= I(Y:X|Z)
\notag \\
&=
H(Y|Z) - H(Y|X,Z)
\notag \\
&=
H(Y|Z) - \sum_{x, z} p(x, z) H(Y|X=x, Z=z)
\notag\\
&=
H(Y|Z) + \sum_{x, z} p(x,z) \sum_{y} \underbrace{p(y|X=x, Z=z)}_{p(y|Z=z)} \log p(y|X=x, Z=z)
\notag \\
&=H(Y|Z) + \sum_{x,z} p(x,z) \sum_y p(y|Z=z) \log p(y|Z = z)
\notag \\
&=
H(Y|Z) - H(Y|Z) = 0
\end{align}
$$
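To see Proposition 2 numerically, the Python sketch below builds a joint pmf with the Markov structure $X-Z-Y$, i.e. $p(x,y,z) = p(z)p(x|z)p(y|z)$, and checks that $I(X:Y|Z)$ vanishes up to rounding. The alphabet sizes, seed, and helper names are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(2)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cmi(pxyz):
    """I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), axes ordered (x, y, z)."""
    return (H(pxyz.sum(axis=1)) + H(pxyz.sum(axis=0))
            - H(pxyz) - H(pxyz.sum(axis=(0, 1))))

# X - Z - Y: p(x, y, z) = p(z) p(x|z) p(y|z).
pz   = rng.dirichlet(np.ones(3))              # p(z)
px_z = rng.dirichlet(np.ones(4), size=3)      # row z is p(x|z)
py_z = rng.dirichlet(np.ones(4), size=3)      # row z is p(y|z)
pxyz = np.einsum('z,zx,zy->xyz', pz, px_z, py_z)

print(cmi(pxyz))   # ~0, as Proposition 2 predicts for X - Z - Y
```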
### Theorem 1 (Chain Rule)
$$
\begin{align}
I(X:Y^n) = I(X: Y_1,\ldots, Y_n) = \sum_{i=1}^{n} I(Y_i:X|Y^{i-1})
\end{align}
$$
where $Y^{i-1} := (Y_1, \ldots, Y_{i-1})$ and the $i = 1$ term is understood as $I(Y_1:X)$.
#### proof
$$
\begin{align}
I(X: Y^n)
&=
I(Y^n:X)
\notag \\
&=
H(Y^n) - H(Y^n|X)
\notag \\
&=
H(Y_n | Y^{n-1}) + H(Y_{n-1} | Y^{n-2}) + \cdots + H(Y_2 | Y_1) + H(Y_1)
\notag \\
& - H(Y_n | Y^{n-1}, X) - H(Y_{n-1}|Y^{n-2}, X) - \cdots - H(Y_2 | Y_1, X) - H(Y_1|X)
\notag \\
&= I(Y_n:X |Y^{n-1}) + I(Y_{n-1} :X | Y^{n-2}) + \cdots + I(Y_2:X|Y_1) + I(Y_1:X)
\end{align}
$$
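The chain rule can be verified numerically for $n = 3$. In the Python sketch below, axis 0 of the array `p` plays the role of $X$ and axes 1-3 play $Y_1, Y_2, Y_3$; each mutual-information term is written as a combination of joint entropies. The binary alphabets and seed are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(3)

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) pmf."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Random joint pmf p(x, y1, y2, y3) on binary alphabets.
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def Hm(*axes):
    """Entropy of the marginal of p on the listed axes."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    return H(p.sum(axis=drop)) if drop else H(p)

# Left-hand side: I(X : Y_1, Y_2, Y_3).
lhs = Hm(0) + Hm(1, 2, 3) - Hm(0, 1, 2, 3)

# Right-hand side: I(Y_1:X) + I(Y_2:X|Y_1) + I(Y_3:X|Y_1,Y_2).
rhs = (Hm(0) + Hm(1) - Hm(0, 1)) \
    + (Hm(0, 1) + Hm(1, 2) - Hm(0, 1, 2) - Hm(1)) \
    + (Hm(0, 1, 2) + Hm(1, 2, 3) - Hm(0, 1, 2, 3) - Hm(1, 2))

print(lhs, rhs)   # equal up to floating-point error
```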
### Theorem 2 (Data Processing Inequality)
For a Markov chain $X-Y-Z$, that is, $P_{X,Y,Z} = P_X P_{Y|X} P_{Z|Y}$, we have
$$
I(X:Y) \ge I(X:Z)
$$
#### proof
Since $X-Y-Z$ is a Markov chain, we have $I(X:Z|Y) = 0$ (Proposition 2). Hence the chain rule gives
$$
I(X:Y,Z) = I(X:Z|Y) + I(X:Y) = I(X:Y)
$$
and
$$
I(X:Y,Z) = I(X:Y|Z) + I(X:Z)
$$
Therefore,
$$
I(X:Y) = I(X:Y|Z) + I(X:Z) \ge I(X:Z)
$$
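A numerical illustration of Theorem 2 is sketched below in Python: construct a Markov chain $X-Y-Z$ by multiplying $p(x)$, $p(y|x)$, $p(z|y)$ and compare $I(X:Y)$ with $I(X:Z)$. The alphabet sizes and seed are arbitrary choices.
```python
import numpy as np

rng = np.random.default_rng(4)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pab):
    """I(A:B) = H(A) + H(B) - H(A,B) for a 2-D joint pmf."""
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

# Markov chain X - Y - Z: p(x, y, z) = p(x) p(y|x) p(z|y).
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(4), size=3)   # row x is p(y|x)
pz_y = rng.dirichlet(np.ones(3), size=4)   # row y is p(z|y)
pxyz = np.einsum('x,xy,yz->xyz', px, py_x, pz_y)

print(mi(pxyz.sum(axis=2)), mi(pxyz.sum(axis=1)))   # I(X:Y) >= I(X:Z)
```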
### Exercise
For $Z= g(Y)$ being a deterministic function of $Y$, we have
$$
H(Y) \ge H(Z)
$$
and
$$
I(X:Y) \ge I(X:Z)
$$
#### proof
Hint: Use the fact that $-(p_1 + p_2) \log (p_1 + p_2) \le -(p_1 \log p_1 + p_2 \log p_2)$.
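Beyond the hint, both claims are easy to check numerically. The Python sketch below takes a random joint pmf for $(X, Y)$ and a non-injective map $g$ of my own choosing, then compares the entropies and mutual informations directly.
```python
import numpy as np

rng = np.random.default_rng(5)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pab):
    """I(A:B) = H(A) + H(B) - H(A,B) for a 2-D joint pmf."""
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

# Random joint pmf p(x, y) with |Y| = 4, and a non-injective g: Y -> Z.
pxy = rng.random((3, 4))
pxy /= pxy.sum()
g = np.array([0, 0, 1, 1])      # g(0)=g(1)=0, g(2)=g(3)=1

# Joint pmf p(x, z) for Z = g(Y): merge the columns of p(x, y) according to g.
pxz = np.zeros((3, 2))
for y, z in enumerate(g):
    pxz[:, z] += pxy[:, y]

print(H(pxy.sum(axis=0)), H(pxz.sum(axis=0)))   # H(Y) >= H(Z)
print(mi(pxy), mi(pxz))                         # I(X:Y) >= I(X:Z)
```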
### Exercise
$$
X_1 - X_2 - X_3 - X_4 \Rightarrow I(X_1:X_4) \le I(X_2:X_3)
$$
#### proof idea
From Theorem 2, since $X_2 - X_3 - X_4$ is a Markov chain, we have
$$
I(X_2:X_3) \ge I(X_2:X_4)
$$
Thus, we only need to show that $I(X_2:X_4) \ge I(X_1:X_4)$. By Theorem 2 again, it suffices to prove that $X_1 - X_2 - X_4$ is a Markov chain; that is,
$$
P(X_4|X_2, X_1) = P(X_4|X_2)
$$
To prove this, you can follow [this](https://hackmd.io/40iNuBaSTrmnjdgwWIfb1g#Theorem-1).
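For a concrete check of this exercise, the Python sketch below builds a four-step Markov chain from random transition kernels and compares $I(X_1:X_4)$ with $I(X_2:X_3)$; the binary alphabets and seed are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(6)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pab):
    """I(A:B) = H(A) + H(B) - H(A,B) for a 2-D joint pmf."""
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

# Markov chain X1 - X2 - X3 - X4 on binary alphabets:
# p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3).
p1   = rng.dirichlet(np.ones(2))
p2_1 = rng.dirichlet(np.ones(2), size=2)
p3_2 = rng.dirichlet(np.ones(2), size=2)
p4_3 = rng.dirichlet(np.ones(2), size=2)
p = np.einsum('a,ab,bc,cd->abcd', p1, p2_1, p3_2, p4_3)

print(mi(p.sum(axis=(1, 2))), mi(p.sum(axis=(0, 3))))   # I(X1:X4) <= I(X2:X3)
```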
#### Remark
For a communication system in which a message $W$ is encoded as $X^n$, sent through a channel that outputs $Y^n$, and decoded as $\tilde W$, the chain $W-X^n-Y^n-\tilde W$ is Markov, so $I(W:\tilde W)$ is bounded by $I(X^n:Y^n)$; this bound leads to the notion of channel capacity (defined later).
### Proposition 3
For a Markov Chain $X-Y-Z$, we have $I(X:Y|Z) \le I(X:Y)$
#### proof
$$
\begin{align}
I(X:Y|Z)
&=
H(X|Z) - H(X|Y,Z)
\notag \\
&= H(X|Z) - H(X|Y) \quad (\text{since } X-Y-Z \text{ implies } H(X|Y,Z) = H(X|Y))
\notag \\
& \le
H(X) - H(X|Y) = I(X:Y) \quad (\text{conditioning reduces entropy})
\end{align}
$$
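Proposition 3 can be checked on the same kind of randomly generated Markov chain $X-Y-Z$ as in the earlier sketches; here $I(X:Y|Z)$ is written as a combination of joint entropies. The sizes, seed, and names are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(7)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Markov chain X - Y - Z: p(x, y, z) = p(x) p(y|x) p(z|y).
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(3), size=3)
pz_y = rng.dirichlet(np.ones(3), size=3)
pxyz = np.einsum('x,xy,yz->xyz', px, py_x, pz_y)

pxy, pxz, pyz = pxyz.sum(axis=2), pxyz.sum(axis=1), pxyz.sum(axis=0)
pz = pxyz.sum(axis=(0, 1))

I_XY   = H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)   # I(X:Y)
I_cond = H(pxz) + H(pyz) - H(pxyz) - H(pz)                  # I(X:Y|Z)
print(I_cond, I_XY)   # I(X:Y|Z) <= I(X:Y)
```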
#### Remark
For a Markov chain $X-Y-Z$, conditioning on $Z$ can only reduce the mutual information between $X$ and $Y$. However, this is not true in general: for instance, if $X$ and $Y$ are independent fair bits and $Z = X \oplus Y$, then $I(X:Y) = 0$ while $I(X:Y|Z) = 1$.
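The XOR example above is easy to verify directly; the following Python sketch does so by building the joint pmf of $(X, Y, Z)$ explicitly.
```python
import numpy as np

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# X, Y independent fair bits, Z = X XOR Y.
pxyz = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        pxyz[x, y, x ^ y] = 0.25

pxy, pxz, pyz = pxyz.sum(axis=2), pxyz.sum(axis=1), pxyz.sum(axis=0)
pz = pxyz.sum(axis=(0, 1))

I_XY   = H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)   # = 0 bits
I_cond = H(pxz) + H(pyz) - H(pxyz) - H(pz)                  # = 1 bit
print(I_XY, I_cond)
```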