# Opening the black box of Deep Neural Networks via Information
Paper: https://arxiv.org/pdf/1703.00810
Repo: https://github.com/karanprime/IB_Mala (needs cleaning up)
# Goals and open questions
- [ ] Verify MI properties (Karan)
- [ ] Verify some invariances
- [ ] e.g. rotations of X or rotations of Y should not change MI
- [ ] Verify numerical values of MI on artificial datasets with known entropy
- [ ] e.g. X - an $n$-dimensional standard normal vector, Y - the number of coordinates of X that are positive (see the sketch after this list)
- [ ] Verify numerical values of MI on discrete datasets with known MI
- [ ] Verify whether GMMs fit the distribution "well" (Bartek)
- [x] e.g. split the data into train and test parts; the log-likelihood should be similar for both
- [ ] Evaluate MI as a method of hyperparameter selection
- [ ] On artificial datasets with known solution
- [ ] e.g. an X vector with a fraction $p_1$ of useful features concatenated with a fraction $p_2$ of random-noise features (the correct solution should be $p_1=1, p_2=0$)
- [ ] Other, less trivial datasets
- [ ] On artificial datasets without a trivial solution
- [ ] Correlate MI score with NN performance
- [ ] On some real datasets
- [ ] MALA: X=Bispectrum descriptors, Y=LDOS
- [ ] Correlate MI score with NN performance
- [ ] Other real world dataset
- [ ] Evaluate MI on MNIST images and on untrained/partially trained/fully trained embeddings (Petr)
- [ ] How can we use the Information Plane to guide training?
- [ ] Reproduce Information Plane (Karan)
- [ ] How can we use gradient SNR to guide training?
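For the artificial-dataset item above (X an $n$-dimensional standard normal vector, Y the number of positive coordinates of X), the ground-truth value is available in closed form: Y is a deterministic function of X, so $I(X; Y) = H(Y)$, and $Y \sim \mathrm{Binomial}(n, 0.5)$. A minimal sanity-check sketch (not code from the repo; uses NumPy/SciPy):

```python
import numpy as np
from scipy.stats import binom

# X ~ N(0, I_n); Y = number of positive coordinates of X.
# Y is a deterministic function of X, so I(X; Y) = H(Y) with Y ~ Binomial(n, 0.5).
n, n_samples = 10, 100_000
rng = np.random.default_rng(0)

X = rng.standard_normal((n_samples, n))
Y = (X > 0).sum(axis=1)

# Analytic ground truth (in nats): H(Y) = -sum_k p_k log p_k, p_k = Binomial(n, 0.5) pmf
p = binom.pmf(np.arange(n + 1), n, 0.5)
true_mi = -np.sum(p * np.log(p))

# Plug-in estimate from the sampled Y, for comparison with any MI estimator
freq = np.bincount(Y, minlength=n + 1) / n_samples
plug_in = -np.sum(freq[freq > 0] * np.log(freq[freq > 0]))
print(f"analytic I(X; Y) = {true_mi:.4f} nats, plug-in estimate = {plug_in:.4f} nats")
```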
# Results
## Properties verified:
- Invariant to scaling
- GMMs fit the data well (the train and test log-likelihood curves overlap)
## Some plots




## Notation
- $B$ - bispectrum vector
- $D$ - LDOS vector
- $T_i$ - activation vector of a particular layer
- $I(X; Y)$ - mutual information
- $H(X)$ - entropy
## Definitions and notes
$$I(X; Y) = D_{KL}(p(x, y) \parallel p(x)p(y)) = \int_{X} \int_{Y} p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)} \right) dy\, dx$$
To evaluate $p(x, y)$, $p(x)$, and $p(y)$ we need to assume (and fit) some distribution for $X$ and $Y$.
The integrals $\int_{X} dx$ and $\int_{Y} dy$ require (in the case of Monte Carlo estimation) sampling from that distribution.
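As a concrete illustration of this Monte Carlo route, here is a minimal sketch that models $p(x, y)$, $p(x)$ and $p(y)$ with scikit-learn `GaussianMixture` objects (the function name and hyperparameters are illustrative, not the repo's implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_mutual_information(X, Y, n_components=5, n_mc=50_000, seed=0):
    """Monte Carlo estimate of I(X; Y) in nats.

    X, Y: 2-D arrays of shape (n_samples, d_x) and (n_samples, d_y).
    Fits GMMs to p(x, y), p(x) and p(y), then averages
    log p(x, y) - log p(x) - log p(y) over samples drawn from the fitted joint.
    """
    XY = np.hstack([X, Y])
    gmm_xy = GaussianMixture(n_components, covariance_type="full", random_state=seed).fit(XY)
    gmm_x = GaussianMixture(n_components, covariance_type="full", random_state=seed).fit(X)
    gmm_y = GaussianMixture(n_components, covariance_type="full", random_state=seed).fit(Y)

    samples, _ = gmm_xy.sample(n_mc)
    sx, sy = samples[:, : X.shape[1]], samples[:, X.shape[1]:]
    log_ratio = (gmm_xy.score_samples(samples)
                 - gmm_x.score_samples(sx)
                 - gmm_y.score_samples(sy))
    return float(log_ratio.mean())
```

Samples are drawn from the fitted joint, so the estimate is the MI of the fitted model; its quality therefore depends directly on how well the GMMs fit the data (hence the fit checks above).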
### Information plane:
Pairs $(I(X; T_i), I(T_i; Y))$, one per layer $T_i$, plotted on a plane
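A minimal plotting sketch (the function name is illustrative), assuming the per-layer MI values have already been estimated, e.g. with the Monte Carlo estimator sketched above:

```python
import matplotlib.pyplot as plt

def plot_information_plane(mi_xt, mi_ty, labels=None):
    """Plot the per-layer points (I(X; T_i), I(T_i; Y)) on the information plane.

    mi_xt, mi_ty: sequences of MI estimates, one entry per layer T_i.
    labels: optional layer names to annotate the points with.
    """
    plt.plot(mi_xt, mi_ty, "o-")
    if labels is not None:
        for x, y, lab in zip(mi_xt, mi_ty, labels):
            plt.annotate(lab, (x, y))
    plt.xlabel("I(X; T_i)")
    plt.ylabel("I(T_i; Y)")
    plt.title("Information plane")
    plt.show()
```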

# Open Questions:
- Can we use $I(B, T_i) \approx I(B, D)$ as a decision rule informing us we have enough data?
- Can we calculate the entropy of the LDOS vector $H(D)$?
- How to use the gradient SNR?
- To guide training
- To guide hyperparameter search
- Is the diffusion phase useful/important?
- How to use the coordinates in the Information Plane to guide training?
# Closed questions
- How to compute Mutual Information?
- Naftali Tishby discretized the activation variables, which was possible due to the small size of the layers
- How to scale to bigger networks?
- Model activations as a Gaussian?
- The KL divergence between Gaussians can be computed in closed form
- Diagonal covariance matrix?
- A full covariance matrix is infeasible in high dimensions
- PCA, then "full" cov matrix
- PCA scales as $O(n^3)$
- Iterative PCA computes only the most significant components
- The input (bispectrum descriptors) and output (LDOS) can be modelled as full-covariance Gaussians thanks to their "small" dimensionality ($p < 300$)
- Discretize activations, but in a smart way?
- Clustering?
- Gaussian Mixture
- KL-divergence between GMs intractable analytically
- Estimation of the KL divergence is possible with Monte Carlo sampling (see the sketch after this list)
- How to use the coordinates in the Information Plane?
- To guide hyperparameter search
- Can we test which descriptor dimensions are informative about the LDOS?
- Would evaluating $I(B_i, D)$ make sense?
- Maybe we could evaluate $I(B, D)$ while increasing the dimension of $B$.
- A descriptor is optimal when $I(B, D)$ saturates at $H(D)$ (since $I(B, D) \le H(D)$)
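For the Monte Carlo estimate of the KL divergence between two fitted GMMs mentioned in the list above, a minimal sketch using scikit-learn's `GaussianMixture` (not the repo's code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_kl_divergence(gmm_p: GaussianMixture, gmm_q: GaussianMixture, n_mc: int = 100_000) -> float:
    """Monte Carlo estimate of D_KL(p || q) between two fitted GMMs, in nats.

    D_KL(p || q) = E_{x ~ p}[log p(x) - log q(x)], approximated by averaging
    the log-density difference over samples drawn from p.
    """
    samples, _ = gmm_p.sample(n_mc)
    return float(np.mean(gmm_p.score_samples(samples) - gmm_q.score_samples(samples)))
```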
## GMMs
- **Gaussian Mixture Models (GMMs)** are probabilistic models that represent a distribution as a combination of multiple Gaussian distributions.
- Each component $k$ in the mixture has its own mean $\mu_k$, covariance matrix $\Sigma_k$, and weight $\pi_k$.
- The probability density function of a GMM is given by:
$$
p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)
$$
where:
- $K$ is the number of Gaussian components.
- $\pi_k$ are the mixing coefficients such that $\sum_{k=1}^{K} \pi_k = 1$.
- $\mathcal{N}(x | \mu_k, \Sigma_k)$ is the Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$.
- As $K \rightarrow \infty$, a GMM can approximate any continuous probability density arbitrarily well
- **Relevance to Mutual Information**:
- GMMs can be used to model the distributions of neural network activations; a minimal fit-quality check (train vs. test log-likelihood) is sketched below.
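A minimal sketch of the train/test fit check mentioned above, with a placeholder `activations` array standing in for the layer activations of interest:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Placeholder activations of shape (n_samples, n_features); in practice these
# would be the recorded activations of a particular layer T_i.
activations = np.random.default_rng(0).standard_normal((5000, 16))
train, test = train_test_split(activations, test_size=0.5, random_state=0)

gmm = GaussianMixture(n_components=10, covariance_type="full", random_state=0).fit(train)

# GaussianMixture.score returns the mean log-likelihood per sample; if the GMM
# generalises (i.e. fits the distribution rather than the sample), the train
# and test values should be close.
print(f"train log-likelihood per sample: {gmm.score(train):.3f}")
print(f"test  log-likelihood per sample: {gmm.score(test):.3f}")
```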