# Opening the black box of Deep Neural Networks via Information
Paper: https://arxiv.org/pdf/1703.00810
Repo: https://github.com/karanprime/IB_Mala (needs cleaning up)
# Goals and open questions
- [ ] Verify MI properties (Karan)
- [ ] Verify some invariances
- [ ] e.g. rotations of X or rotations of Y should not change MI
- [ ] Verify numerical values of MI on artificial datasets with known entropy
- [ ] e.g. X - an $n$-dimensional standard normal vector, Y - the number of coordinates of X that are positive (see the sketch after this list)
- [ ] Verify numerical values of MI on discrete datasets with known MI
- [ ] Verify whether GMMs fit the distribution "well" (Bartek)
- [x] e.g. split the data into train and test parts; the log-likelihood should be similar for both
- [ ] Evaluate MI as a method of hyperparameter selection
- [ ] On artificial datasets with known solution
- [ ] e.g. an X vector with a fraction $p_1$ of useful features concatenated with a fraction $p_2$ of random-noise features (the correct solution should be $p_1=1, p_2=0$)
- [ ] Other, less trivial datasets
- [ ] On artificial datasets without a trivial solution
- [ ] Correlate MI score with NN performance
- [ ] On some real datasets
- [ ] MALA: X=Bispectrum descriptors, Y=LDOS
- [ ] Correlate MI score with NN performance
- [ ] Other real world dataset
- [ ] Evaluate MI on MNIST images and on untrained/partially trained/fully trained embeddings (Petr)
- [ ] How can we use the Information Plane to guide training?
- [ ] Reproduce Information Plane (Karan)
- [ ] How can we use gradient SNR to guide training?
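For the artificial-dataset item above (X an $n$-dimensional standard normal vector, Y the number of positive coordinates of X), the ground-truth value is available in closed form: Y is a deterministic function of X, so $I(X; Y) = H(Y)$, and $Y \sim \mathrm{Binomial}(n, 0.5)$. A minimal sanity-check sketch (not code from the repo; uses NumPy/SciPy):

```python
import numpy as np
from scipy.stats import binom

# X ~ N(0, I_n); Y = number of positive coordinates of X.
# Y is a deterministic function of X, so I(X; Y) = H(Y) with Y ~ Binomial(n, 0.5).
n, n_samples = 10, 100_000
rng = np.random.default_rng(0)

X = rng.standard_normal((n_samples, n))
Y = (X > 0).sum(axis=1)

# Analytic ground truth (in nats): H(Y) = -sum_k p_k log p_k, p_k = Binomial(n, 0.5) pmf
p = binom.pmf(np.arange(n + 1), n, 0.5)
true_mi = -np.sum(p * np.log(p))

# Plug-in estimate from the sampled Y, for comparison with any MI estimator
freq = np.bincount(Y, minlength=n + 1) / n_samples
plug_in = -np.sum(freq[freq > 0] * np.log(freq[freq > 0]))
print(f"analytic I(X; Y) = {true_mi:.4f} nats, plug-in estimate = {plug_in:.4f} nats")
```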
# Results
## Properties verified:
- Invariant to scaling
- GMMs fit the data well (the train and test log-likelihood curves overlap)
## Some plots




## Notation
- $B$ - bispectrum vector
- $D$ - LDOS vector
- $T_i$ - activation vector of a particular layer
- $I(X; Y)$ - mutual information
- $H(X)$ - entropy
## Definitions and notes
$$I(X; Y) = D_{KL}(p(x, y) \parallel p(x)p(y)) = \int_{X} \int_{Y} p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)} \right) dy\, dx$$
To evaluate $p(x, y)$, $p(x)$, and $p(y)$ we need to assume (and fit) some distribution for $X$ and $Y$.
The integrals $\int_{X} dx$ and $\int_{Y} dy$ require (in the case of Monte Carlo estimation) sampling from that distribution.
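As a concrete illustration of this Monte Carlo route, here is a minimal sketch that models $p(x, y)$, $p(x)$ and $p(y)$ with scikit-learn `GaussianMixture` objects (the function name and hyperparameters are illustrative, not the repo's implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_mutual_information(X, Y, n_components=5, n_mc=50_000, seed=0):
    """Monte Carlo estimate of I(X; Y) in nats.

    X, Y: 2-D arrays of shape (n_samples, d_x) and (n_samples, d_y).
    Fits GMMs to p(x, y), p(x) and p(y), then averages
    log p(x, y) - log p(x) - log p(y) over samples drawn from the fitted joint.
    """
    XY = np.hstack([X, Y])
    gmm_xy = GaussianMixture(n_components, covariance_type="full", random_state=seed).fit(XY)
    gmm_x = GaussianMixture(n_components, covariance_type="full", random_state=seed).fit(X)
    gmm_y = GaussianMixture(n_components, covariance_type="full", random_state=seed).fit(Y)

    samples, _ = gmm_xy.sample(n_mc)
    sx, sy = samples[:, : X.shape[1]], samples[:, X.shape[1]:]
    log_ratio = (gmm_xy.score_samples(samples)
                 - gmm_x.score_samples(sx)
                 - gmm_y.score_samples(sy))
    return float(log_ratio.mean())
```

Samples are drawn from the fitted joint, so the estimate is the MI of the fitted model; its quality therefore depends directly on how well the GMMs fit the data (hence the fit checks above).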
### Information plane:
Pairs $(I(X; T_i), I(T_i; Y))$, one per layer $T_i$, plotted on a plane
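A minimal plotting sketch (the function name is illustrative), assuming the per-layer MI values have already been estimated, e.g. with the Monte Carlo estimator sketched above:

```python
import matplotlib.pyplot as plt

def plot_information_plane(mi_xt, mi_ty, labels=None):
    """Plot the per-layer points (I(X; T_i), I(T_i; Y)) on the information plane.

    mi_xt, mi_ty: sequences of MI estimates, one entry per layer T_i.
    labels: optional layer names to annotate the points with.
    """
    plt.plot(mi_xt, mi_ty, "o-")
    if labels is not None:
        for x, y, lab in zip(mi_xt, mi_ty, labels):
            plt.annotate(lab, (x, y))
    plt.xlabel("I(X; T_i)")
    plt.ylabel("I(T_i; Y)")
    plt.title("Information plane")
    plt.show()
```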

# Open Questions:
- Can we use $I(B, T_i) \approx I(B, D)$ as a decision rule informing us we have enough data?
- Can we calculate the entropy of the LDOS vector $H(D)$?
- How to use the gradient SNR?
- To guide training
- To guide hyperparameter search
- Is the diffusion phase useful/important?
- How to use the coordinates in the Information Plane to guide training?
# Closed questions
- How to compute Mutual Information?
- Naftali Tishby discretized the activation variables, which was possible due to the small size of the layers
- How to scale to bigger networks?
- Model activations as a Gaussian?
- The KL divergence between Gaussians can be computed in closed form
- Diagonal covariance matrix?
- A full covariance matrix is infeasible in high dimensions
- PCA, then "full" cov matrix
- PCA scales as $O(n^3)$
- Iterative PCA computes only the most significant components
- The input (bispectrum descriptors) and output (LDOS) can be modelled as full-covariance Gaussians thanks to their "small" dimensionality ($p < 300$)
- Discretize activations, but in a smart way?
- Clustering?
- Gaussian Mixture
- KL-divergence between GMs intractable analytically
- Estimation of the KL divergence is possible with Monte Carlo sampling (see the sketch after this list)
- How to use the coordinates in the Information Plane?
- To guide hyperparameter search
- Can we test which descriptor dimensions are informative about the LDOS?
- Would evaluating $I(B_i, D)$ make sense?
- Maybe we could evaluate $I(B, D)$ while increasing the dimension of $B$.
- A descriptor is optimal when $I(B, D)$ saturates at $H(D)$ (since $I(B, D) \le H(D)$)
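For the Monte Carlo estimate of the KL divergence between two fitted GMMs mentioned in the list above, a minimal sketch using scikit-learn's `GaussianMixture` (not the repo's code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_kl_divergence(gmm_p: GaussianMixture, gmm_q: GaussianMixture, n_mc: int = 100_000) -> float:
    """Monte Carlo estimate of D_KL(p || q) between two fitted GMMs, in nats.

    D_KL(p || q) = E_{x ~ p}[log p(x) - log q(x)], approximated by averaging
    the log-density difference over samples drawn from p.
    """
    samples, _ = gmm_p.sample(n_mc)
    return float(np.mean(gmm_p.score_samples(samples) - gmm_q.score_samples(samples)))
```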
## GMMs
- **Gaussian Mixture Models (GMMs)** are probabilistic models that represent a distribution as a combination of multiple Gaussian distributions.
- Each component $k$ in the mixture has its own mean $\mu_k$, covariance matrix $\Sigma_k$, and weight $\pi_k$.
- The probability density function of a GMM is given by:
$$
p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)
$$
where:
- $K$ is the number of Gaussian components.
- $\pi_k$ are the mixing coefficients such that $\sum_{k=1}^{K} \pi_k = 1$.
- $\mathcal{N}(x | \mu_k, \Sigma_k)$ is the Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$.
- As $K \rightarrow \infty$, a GMM can approximate any continuous probability density arbitrarily well
- **Relevance to Mutual Information**:
- GMMs can be used to model the distributions of neural network activations; a minimal fit-quality check (train vs. test log-likelihood) is sketched below.
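A minimal sketch of the train/test fit check mentioned above, with a placeholder `activations` array standing in for the layer activations of interest:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Placeholder activations of shape (n_samples, n_features); in practice these
# would be the recorded activations of a particular layer T_i.
activations = np.random.default_rng(0).standard_normal((5000, 16))
train, test = train_test_split(activations, test_size=0.5, random_state=0)

gmm = GaussianMixture(n_components=10, covariance_type="full", random_state=0).fit(train)

# GaussianMixture.score returns the mean log-likelihood per sample; if the GMM
# generalises (i.e. fits the distribution rather than the sample), the train
# and test values should be close.
print(f"train log-likelihood per sample: {gmm.score(train):.3f}")
print(f"test  log-likelihood per sample: {gmm.score(test):.3f}")
```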