# Revision

We thank the reviewers for their helpful comments, and we appreciate their consideration of our responses and the increased scores. As suggested, we uploaded a revised draft based on the reviewers' comments. The major change is the addition of experimental results addressing the reviewers' concerns. Other changes are highlighted in blue.

# General response

We are grateful for the thoughtful reviews and appreciate all reviewers' detailed and thorough points. This excerpt from reviewer `QgTn`'s review captures the primary motivation and contribution of this work:

> - This is a useful result to add to the literature on characterizing the distribution of the features of a deep network with random weights.
> - This is the first non-asymptotic Gaussian approximation for finite-width deep neural networks with non-linear activation.
> - Extends previously known cases to include non-linear activations (sub-linear and odd function as well as ReLU)
> - Validity of Theorem 7, 8 extends previously known results in interesting dimensions (either to finite width or incorporating non-linearity).

We will highlight that the established entropy bound has theoretical and practical applications:

- The following experiments demonstrate that entropy at initialization influences learning with gradient descent.
- The following experiments show that entropy is determined by the neural architecture and is not a direct consequence of random weights.
- Self-normalizing activations, which enhance the training of deep nets, are designed for high-entropy representations [Klambauer, Unterthiner, Mayr, and Hochreiter].
- Gaussian representations have been leveraged in theoretical studies of neural networks such as [De Palma, G., Kiani, B. T., and Lloyd, S. (2019)].

Below, we show what the entropy bound entails in both the theoretical and the practical sense.

**Experiments**

The main rationale for quantifying the differential entropy of representations is to capture uncertainty. Note that $H_\ell\in R^{d\times n}$ comprises the representations of the $n$ samples in the mini-batch at layer $\ell$. We experimentally demonstrate the role of entropy in learning neural networks for binary classification.

**Settings** We consider binary classification on $\{(x_i,y_i)\}_{i\le N}$ with inputs $x_i\in R^d$ and labels $y_i\in\{0,1\}$, processed in mini-batches of size $n$. To map the last layer to a binary output, we multiply the last layer by a $2\times d$ fully connected layer and take the arg-max:
$$Q:=\left(\mathrm{argmax}_{j\in\{0,1\}} H^{out}_{ji}\right)_{i\le n},\qquad H^{out}=W^{out} H_\ell, \qquad W^{out} \in R^{2\times d},\ W^{out}_{ij}\sim N(0,1).$$
Given the model outputs $q_1,\dots,q_N$ with sample distribution $p(j):=\frac1N\#\{i:q_i=j\}$, we define the entropy of the predicted labels (model entropy) as
$$Ent_{model}:=-\sum_{j\in \{0,1\}} p(j)\log_2 p(j).$$

**Experiment 1: Neural architecture and entropy** We can state a remarkable dichotomy between an MLP without batch normalization and a BN-MLP (MLP with batch norm):

- **MLP**: In a vanilla MLP, for sufficiently large $n,d$, Lemma 3 of [Kohler et al.] proves that the stable rank of the representations converges to $1$, which yields $\lim_{\ell\to \infty} Ent_{MLP} = 0$.
- **BN-MLP**: Invoking Theorem 8 and Lemma 9 for BN-MLP, the eigenvalues of $H_\ell^\top H_\ell$ concentrate around the eigenvalues of $C_*$ with high probability for sufficiently large $n,d, \ell$. For odd activations, Proposition 5 implies $\lim_{\ell\to\infty} Ent_{BN-MLP}=\Theta(1)$ with high probability.

We experimentally validate the above difference, using the measurement sketched below.
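For concreteness, here is a minimal sketch of how $Ent_{model}$ can be measured for a randomly initialized (BN-)MLP following the recurrence above. This is illustrative numpy code rather than the exact script behind Tables 1-3; in particular, the function names, the random projection that lifts the inputs to the working width, and the numerical constants are our own choices.

```python
import numpy as np

def batch_norm(H):
    # Feature-wise normalization over the mini-batch: [BN(H)]_{i.} = (H_{i.} - mean) / std
    mu = H.mean(axis=1, keepdims=True)
    sd = H.std(axis=1, keepdims=True) + 1e-12
    return (H - mu) / sd

def model_entropy(X, depth, width, use_bn=True, act=lambda z: np.maximum(z, 0.0), seed=0):
    """Run the random (BN-)MLP chain on a mini-batch X of shape (d0, n) and return Ent_model."""
    rng = np.random.default_rng(seed)
    d0, n = X.shape
    H = rng.standard_normal((width, d0)) @ X / np.sqrt(d0)  # random lift of inputs to working width d
    for _ in range(depth):
        Z = act(batch_norm(H)) if use_bn else act(H)
        W = rng.standard_normal((width, width))
        H = W @ (Z / np.sqrt(width))                        # H_{l+1} = W_{l+1} (F(BN(H_l)) / sqrt(d))
    W_out = rng.standard_normal((2, width))                 # random 2 x d readout layer
    q = (W_out @ H).argmax(axis=0)                          # predicted label per sample in the batch
    p = np.bincount(q, minlength=2) / n                     # empirical distribution of predictions
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

For example, comparing `model_entropy(X, depth=15, width=500, use_bn=True)` against `use_bn=False` on a mini-batch of $0/1$ MNIST digits corresponds to the kind of comparison reported in the two columns of Table 1.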
We use the $0,1$ digits from the MNIST training dataset to adhere to the binary classification setting. The following results ($n=32$, $d=500$, $F=ReLU$, averaged over $50$ independent runs) demonstrate vanishing entropy for the vanilla MLP and high entropy for the BN-MLP:

**Table 1**

|$\ell$|$Ent_{MLP}$|$Ent_{\text{BN-MLP}}$|
|--|--|--|
|2|0.67|0.82|
|5|0.52|0.82|
|10|0.18|0.82|
|15|0.05|0.85|

Consistent with the reports of (Bjorck et al., 2018), batch normalization improves entropy at initialization for all activations ($d=200$, $n=32$, $\ell=20$, averaged over $20$ independent runs):

**Table 2**

|$F(x)$|$Ent_{MLP}$|$Ent_{\text{BN-MLP}}$|
|--|--|--|
|$x$|0.91|0.99|
|$\tanh(x)$|0.80|0.97|
|$ReLU(x)$|<0.01|0.96|

**Experiment 2: Entropy and backpropagation** Previous works (Bjorck et al., 2018; Kohler et al., 2020) connect batch normalization layers to enhanced optimization. We can view the optimization from the perspective of label uncertainty by tracking the model entropy during optimization with SGD (averaged over $5$ independent runs for $\ell=15$, $F=ReLU$, $d=200$, $n=16$, and learning rate $1e-3$). Remarkably, the vanilla network forces SGD to spend several iterations increasing the label entropy from its initial value of $0$, while the MLP with BN starts from a much higher $0.75$. Observe that the convergence rate correlates well with the entropy of the representations.

**Table 3**

|epoch|$Ent_{MLP}$|$Acc_{MLP}$|$Ent_{\text{BN-MLP}}$|$Acc_{\text{BN-MLP}}$|
|--|--|--|--|--|
|0|<0.01|0.52|0.75|0.52|
|2|<0.01|0.55|0.87|0.68|
|4|<0.01|0.53|0.97|0.86|
|6|0.37|0.63|0.99|0.94|
|8|0.98|0.87|1.00|0.97|
|10|1.00|0.98|1.00|0.99|

We thank the reviewers again for their thoughtful and excellent feedback. We will add these experimental validations, together with the experiments in the individual responses below, to the main text.

# Reviewer vV9e

> ... the paper purports to describe the effect of depth in random nets, but it's really about batchnorm. The title of the paper, and the first two sentences of the abstract, are quite misleading! ...

Let us express our main reasons for highlighting the depth:

1. The second sentence of the abstract is not a claim; instead, it states a broad aim of understanding neural networks, and the following sentence stating our results narrows it down to "neural networks equipped with batch normalization."
2. Entropy maximization relies on the interplay between depth and batch normalization. Notably, a shallow network with batch normalization does not achieve the maximum entropy, as substantiated by the experimental evidence in Table 1.

> A second major point of presentation is that the authors choose entropy as their object of study. Why?

Let us articulate the reasons for this narrative:

1. This excerpt from reviewer `QgTn`'s review captures the motivation for our entropy formulation well: `Applying well-known fundamental principle of entropy maximization (using variational principle) in statistical physics to deep neural networks is novel and interesting perspective`
2. By measuring model uncertainty via entropy, we can better understand the behavior of the distributions as well as the training of MLP and BN-MLP. It is not immediately clear how one can deduce these from Gaussianity alone.
3. Entropy maximization encapsulates all of our results in the unified notion of maximum entropy, which entails a Gaussian matrix with independent elements (Lemma 6).

> .. what you mean by "entropy" of representations early!

Thank you for the pointer; we will clarify the theorem statement and the following paragraph.
> - Figure 1 needs more explanation

Details will be added to the caption.

> "significantly enhances training" (page 2) should be made more specific

We will add a reference to experimental validations, as well as to those presented in Section 5 of [Ref 309].

> - As written, the definition of alpha-contractivity makes it sound like the condition is required to hold for all alpha in (0,1).

Thank you; we will revise the definition to remove the ambiguity.

> the paper's centered on one technical result regarding entropy. It's not obvious why it matters,

The implications are covered in the general response.

> ... measure of Gaussianity increasing with depth, and show that it varies with, width, and batch size as promised!

Please see the experiments in the response to reviewer `QgTn`.

> Explain more about what is $\alpha$ ... and how network parameters affect it.

The experiments (Tables 4 and 5) in the response to reviewer `QgTn` show how $\alpha$ depends on the network width and the sample size. We will add these experiments and elaborate on this dependency.

> Discuss the scale of the weights at init ...

The scaling is important only for networks without batch normalization; the output of a batch normalization layer is invariant to the scaling of the weights. More precisely, the chain of representations obeys
$$H_{\ell+1} = W_{\ell+1}\left(\frac{1}{\sqrt{d}} F\circ BN(H_\ell)\right), \qquad [BN(H)]_{i.} = \frac{H_{i.} - mean[H_{i.}]}{\sqrt{Var(H_{i.})}}.$$
Observe that scaling $W_{\ell}$ by $\beta$ scales $H_\ell$ by $\beta$ but does not affect the distribution of $BN(H_\ell)$ and $H_{\ell+1}$, because of the BN layer:
$$mean[\beta H_{i.}] = \beta\, mean[H_{i.}],\quad Var(\beta H_{i.}) = \beta^2\, Var(H_{i.}).$$
Our definition of maximum entropy and our careful scalings ensure that the scaling of the weights does not influence the result.

> When I back-propagate ... [do] I expect the gradients to explode?

Please see **Table 3** in the general response. This experiment shows the effect of backpropagation on entropy: entropy increases with the epochs. If the representations have low entropy at initialization, more backpropagation steps are needed to increase the entropy.

> In Section ... I don't see how the subsequent analysis explains any of the listed behaviors. How does it?

For clarity, [215-216] will be changed to "We demonstrate how our analysis extends the role of batch normalization in avoiding rank collapse to non-linear activations."

> Why is the 1/D scaling in Def 1 obviously right?

The choice of scaling by $1/D$ is arbitrary, and one can restate all entropy-related results by multiplying both sides of the inequalities by $nd$. The $nd$ factor implies that a larger batch size and a wider network have more capacity for entropy. By formulating the theorem in terms of normalized entropy, we intended to highlight the core dynamics instead of the trivial dependency of entropy on size.

> What evidence can you provide for something like this happening in realistic settings?

This is covered by the experimental validations in the general response.

> ... authors' claims seem too big ... states that their work "[inspires] the design of neural architectures"...

We will rephrase it to "from the perspective of entropy can guide the choice of width, depth, batch size, as well as activation functions" to limit its scope.

# Reviewer m29v

> ## Strengths And Weaknesses:
> I have checked the technical details in parts and it seems sound.
> This is a useful result to add to the literature on characterizing the distribution of the features of a deep network with random weights.
> Please see the questions below.
> ## Questions:
> 1. In Theorem 7, how does the entropy of the features depend upon the entropy of the inputs? i.e., if we were to take inputs that are not very entropic, then Theorem 7 still predicts that the entropy of the features is near maximal. Is this accurate?

Yes, this is what the theorem states. We validate this experimentally in the response to reviewer `QgTn`.

> 2. To expand upon the previous question, the entropy of the representation is the same as that of the inputs if weights are a deterministic quantity. The weights are considered to be random in this paper, and therefore the entropy has to increase as successive layers add more randomness to the features, right? This is a general phenomenon: the entropy H(X_n) increases for any finite-state Markov chain X_1, X_2, ... X_n with an arbitrary initial distribution if the stationary distribution is uniform.

The increasing entropy is due not only to the randomness of the weights but also to the batch normalization layers. Without batch normalization, entropy decreases with depth as all outputs become aligned in the same direction (see the experimental results in **Table 1** of the general response and the reference in line 306). Indeed, the chain of representations without batch normalization also has a stationary distribution, but that distribution has low entropy. We provide a non-asymptotic approximation of the stationary distribution for networks with batch normalization, and the established entropy bound follows from this approximation.

> In other words, I think the authors should do a better job of conveying why this result is meaningful. If the weights are random, then the network is not a good representation, so one cannot say that the "representation is being optimized implicitly, or that it obeys the principle of maximum entropy at initialization" as it is done in the abstract (if we initialize the activations to be Gaussian, then they also obey the principle of maximum entropy...).

It is impossible to initialize the hidden representations independently, since they must obey the recurrence imposed by the layers; it is the non-linear transformations that attain the maximum entropy for the hidden representations in all layers. Even when the weights are Gaussian, the distribution of the representations is not necessarily Gaussian. For example, it is known that an MLP with Gaussian weights and without normalization layers aligns all representations in the same direction as depth grows (see **Table 1** in the general response for experimental validation). Also, the output of a linear neural network with random Gaussian weights is determined by a product of random matrices, which is not Gaussian. We stress that the implicit optimization of entropy for representations differs from the explicit optimization of the training loss with gradient descent. Despite their stochasticity, stochastic processes may admit a variational principle that characterizes their dynamics over time. For example, Ito processes admit a variational principle as they optimize a particular potential [Jordan 1998].

> 4. I would encourage the authors to reconsider the narrative on Lines 176-187. The IB principle is about learning representations that are relevant to the task, saying that deep networks obey the IB because the entropy of the features increases with depth is not accurate if one never talks about mutual information with respect to the task. Any random map would obey the IB if we think like this!
We will state that the IB principle coincides with the maximum entropy principle when the labels are hidden. At the beginning of training, we have not yet used the labels, so it is natural to obey the maximum entropy principle.

> 5. It would be useful to see whether these results hold during training (where weights are not random), or how far they hold after initialization.

The experiments in **Table 3** of the general response show that the entropy increases with the gradient descent iterations. Starting from low-entropy outputs wastes many epochs on increasing the entropy.

# Reviewer QgTn

We thank the reviewer for their thorough and detailed comments.

> The significance of the main claim is unclear .... Related to the previous point, the authors should expand more why knowing that the entropy maximizes as depth increases is meaningful or useful for deep neural networks.

We will highlight the following applications of the non-asymptotic Gaussian approximation:

1. Gaussian representations are used for the design of self-normalizing activations. [Klambauer, Unterthiner, Mayr, and Hochreiter. "Self-normalizing neural networks." Advances in Neural Information Processing Systems] design novel activations assuming the representations are Gaussian. This careful design of activations effectively accelerates the learning of deep neural networks. Our non-asymptotic bounds can be incorporated into this design to enhance self-normalizing activations; for example, one can design an activation that depends on the network width and depth by leveraging the established non-asymptotic Gaussian approximation.
2. In the general response, we elaborate on the connection between entropy and training with gradient descent. The experiments in the general response illustrate that entropy influences the performance of gradient descent.
3. Gaussian representations have been leveraged in theoretical studies of neural networks. For example, a recent study establishes the implicit bias of random networks toward simple functions [De Palma, G., Kiani, B. T., and Lloyd, S. (2019). Random deep neural networks are biased towards simple functions]. That study relies on the theoretical regime of infinite width; our bounds enable such studies to use an explicit Gaussian approximation for neural networks with the finite widths used in practice.

> In p5 (l176-187) puts entropy maximization in perspective of the information bottleneck principle. The interpretation here is that irrelevant input information is being removed, but the entropy maximization does not know any "relevance" without training. To the reviewer, one can remove information from inputs in many ways (just add large noise) but it's not clear why increasing depth is in any sense preferrable/important ways, and current entropy maximization principle doesn't seem to provide explanation in that regard.

We will clarify our argument in [l176-187]: at the starting point of training, the information bottleneck principle coincides with the maximum entropy principle because no label information has been used.

> Missing few notable references ...

Thank you very much for the references, which will enrich our literature review.

> Assumption 1.

We will add more intuition about this assumption. The assumption is met when the Markov chain can explore all sets of non-zero measure; the Gaussian weight matrices enable the hidden representations to explore. Our rebuttal experiments (Tables 4 and 5 below) also confirm that the contraction with depth holds in practice, as sketched next.
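For illustration, the following minimal numpy sketch shows how such a contraction check can be run: it simulates the chain of representations and tracks $\delta_\ell = \|H_\ell^\top H_\ell - \beta_F I_n\|_F$, returning the layer-to-layer ratios as a crude empirical contraction factor. The function name and defaults are our own, and `beta_F` is only a placeholder; the values reported in Tables 4 and 5 below estimate $\beta_F$ from Eq. 9 of the Appendix.

```python
import numpy as np

def contraction_check(d=500, n=32, depth=40, beta_F=1.0, act=np.tanh, seed=0):
    """Track delta_l = ||H_l^T H_l - beta_F * I_n||_F along the BN-MLP chain and return
    the per-layer ratios delta_{l+1} / delta_l as a crude empirical contraction factor."""
    rng = np.random.default_rng(seed)
    # Correlated start (cf. the Table 4 setup): columns of H_0 are small perturbations of one vector w
    w = rng.standard_normal(d)
    H = w[:, None] + 1e-4 * rng.standard_normal((d, n))
    deltas = []
    for _ in range(depth):
        mu = H.mean(axis=1, keepdims=True)
        sd = H.std(axis=1, keepdims=True) + 1e-12
        Z = act((H - mu) / sd)                                # F o BN(H_l), feature-wise normalization
        H = rng.standard_normal((d, d)) @ (Z / np.sqrt(d))    # H_{l+1} = W_{l+1} (F(BN(H_l)) / sqrt(d))
        deltas.append(np.linalg.norm(H.T @ H - beta_F * np.eye(n)))
    deltas = np.asarray(deltas)
    return deltas[1:] / deltas[:-1]
```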
> Lack of empirical validation and the only demonstration does not directly support the main thesis.

We agree with the reviewer that experimental validations can further clarify the main contribution. Therefore, we have added more experiments, which will be included in the supplementary material. We experimentally validate Lemma 9, Theorem 8, and Proposition 4. We initialize an MLP with the hyperbolic tangent activation $F(x)=\tanh(x)$. Since $F$ is odd, we can combine Lemma 9, Theorem 7, and Proposition 4 to conclude that
$$\| H_\ell^\top H_\ell - \beta_F I_n\| = O\left(\exp(-\ell)+\frac{n^3}{\sqrt{d}}\right),$$
where $\beta_F$ is defined in (Appendix) Eq. 9. To validate this, we run the Markov chain of representations and compute $\| H_\ell^\top H_\ell - \beta_F I_n \|_F$. The following table confirms the above bound: the distance rapidly decays with depth $\ell$ to a constant proportional to $d^{-1/2}$.

**Table 4**

|$\ell$|d=50|d=500|d=5000|
|--|--|--|--|
|1|4.53|3.35|3.7|
|20|1.36|0.62|0.33|
|40|1.30|0.34|0.12|

We use the following setting for the above results:

- Numbers are averaged over $10$ independent runs.
- $H_0$: We start the Markov chain with $H_0$ whose columns are perturbations of a random vector $w$, i.e., $H_0[:,i] = w + 0.0001\cdot N(0,I_d)$. Hence the columns of $H_0$ are highly correlated. Lemma 9 indicates that they become decorrelated with depth; we start with this specific structure to observe the decorrelation.
- $\beta_F$: We estimate $\beta_F$ using Eq. 9 in the Appendix, taking the expectations in this equation over $10^5$ samples.

We repeated the above experiment for a network with $\sin$ activations and obtained similar results, shown in the following table.

**Table 5**

|$\ell$|d=50|d=500|d=5000|
|--|--|--|--|
|1|4.|4.6|4.2|
|20|1.5|0.6|0.4|
|40|1.5|0.4|0.1|

> The paper needs to improve clarity a bit.

We will improve the clarity following your helpful comments.

**Questions**

> For given architecture / input data: does current theory provide a way to construct covariance of approximate Gaussian in large depth (and finite width)? Recall NNGP allows us to compute covariance analytically due to infinite width convergence. In that case, one could utilize the entropy maximization principle as ways of obtaining a novel Gaussian process. In practice, how would one estimate \alpha (contraction factor) or \lambda_1 (smallest eigenvalue of C_*)?

It is possible to compute $C_*$ both in closed form and numerically. Propositions 4 and 5 provide a closed-form expression for the covariance matrix:

- Odd activations: $C_*$ is a multiple of the identity, with the factor determined by the activation through the integral in (Appendix) Eq. 9.
- ReLU: $C_*$ is a matrix whose diagonal elements are all equal and whose off-diagonal elements are all equal.

For other activations, one can numerically run the fixed-point iteration in Definition 3 to obtain the covariance matrix $C_*$. As the reviewer remarked, entropy maximization with depth is a novel, non-trivial approach to constructing a Gaussian process. The experiments in Tables 4 and 5 above show how $\alpha$ depends on the network width and the sample size; we will add these experiments and elaborate on this dependency.

> The result is valid for large depth, any width, dataset size and ReLU or (sub-linear+odd) activations. Does the known NNGP limit of infinite-width arbitrary depth case be recovered from current analysis (convergence rate etc)?
Yes. Our results not only recover the NNGP limit but also give an explicit dependency of the width on the number of samples required for Gaussian outputs in the limit of infinite width and samples (width $>$ samples$^3$ in Theorem 8). Furthermore, our results hold even in the regime of infinite depth, whereas NNGP requires a finite depth and a width increasing with depth [see the reference in line 349].

> In the infinite-width mean field analysis, different "phase" appears depending on weight/bias initialization variances (different phase diagrams exist for different non-linearities). The reviewer is curious about W_std, b_std dependence on Theorem 1, and whether different convergence appears.

A phase transition occurs for networks without batch normalization. However, our result is independent of the variance of the weights, since batch normalization layers are scaling invariant due to the normalization by the standard deviation. We will highlight this stark difference in the paper.

**Why scaling invariant?** Recall that the chain of representations obeys
$$H_{\ell+1} = W_{\ell+1}\left(\frac{1}{\sqrt{d}} F\circ BN(H_\ell)\right), \qquad [BN(H)]_{i.} = \frac{H_{i.} - mean[H_{i.}]}{\sqrt{Var(H_{i.})}}.$$
Observe that scaling $W_{\ell}$ by $\beta$ scales $H_\ell$ by $\beta$ but does not affect the distribution of $BN(H_\ell)$ and $H_{\ell+1}$, because of the BN layer:
$$mean[\beta H_{i.}] = \beta\, mean[H_{i.}],\quad Var(\beta H_{i.}) = \beta^2\, Var(H_{i.}).$$
Our definition of maximum entropy and our careful scalings ensure that the scaling of the weights does not influence the result.

# Reviewer kMLa

We thank the reviewer for their positive review.

> My main issue with the paper is that i am not sure how to parse the theoretical results and evaluate their importance to learning.

We will discuss the following applications of entropy maximization:

- Gaussian representations are used for the design of self-normalizing activations. [Klambauer, Unterthiner, Mayr, and Hochreiter. "Self-normalizing neural networks." Advances in Neural Information Processing Systems] design novel activations assuming the representations are Gaussian. This careful design of activations effectively accelerates the learning of deep neural networks. Our non-asymptotic bounds can be incorporated into this design to enhance self-normalizing activations; for example, one can design an activation that depends on the network width and depth by leveraging the established non-asymptotic Gaussian approximation.
- Lemma 9 proves that representations with high entropy remain full rank. The rank of the hidden representations is shown to influence learning performance with gradient descent [Ref in line 306]: without batch normalization layers, the rank of the representations collapses to one, meaning that all outputs become aligned in the same direction, which dramatically slows training. [Ref in line 306] also establishes the link between rank and the gradient of the training loss: the gradient direction becomes independent of the inputs when the rank of the representations collapses.
- Gaussian representations have been leveraged in theoretical studies of neural networks. For example, a recent study establishes the implicit bias of random networks toward simple functions [De Palma, G., Kiani, B. T., and Lloyd, S. (2019). Random deep neural networks are biased towards simple functions]. This paper relies on the theoretical regime of infinite width.
Our bounds enable such studies to use an explicit Gaussian approximation for neural networks with the finite widths used in practice.
- Variational autoencoders and diffusion models [Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli 2015] are optimized to transport inputs to a Gaussian distribution. Here, we prove that increasing depth implicitly implements this transport to a Gaussian.

> The findings of the authors are limited to a simple setting, namely MLPs with batchnorm at initialization. The effect of batchnorm on backpropagation, an essential component of learning with gradient descent is left unexplored.

Our experiment in the general response (**Table 3**) addresses this concern. The literature on random networks does not theoretically study the influence of backpropagation on representations (Matthews et al., 2018; Schoenholz et al., 2016; Bahri et al., 2020; Pennington & Worah, 2017), and neither do the following papers:

- Hanin, Boris, and Mihai Nica. "Products of many large random matrices and gradients in deep neural networks." Communications in Mathematical Physics 376, no. 1 (2020): 287-322.
- De Palma, G., Kiani, B. T., and Lloyd, S. (2019). Random deep neural networks are biased towards simple functions. Advances in Neural Information Processing Systems.
- Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2019). Deep convolutional networks as shallow Gaussian processes.
- Pennington, J., Schoenholz, S., and Ganguli, S. (2018). The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics.