# UNDERSTANDING WEIGHT-MAGNITUDE HYPERPARAMETERS IN TRAINING BINARY NETWORKS: A Compact Review
This review was written by:
Bas Rutteman - 5429439 b.rutteman@student.tudelft.nl
## 1. Introduction
For this blog post, a short and compact review was written of the research paper "*Understanding weight-magnitude hyperparameters in training binary networks*" by Quist et al. [1]. The aim of this review is to offer another point of view on the research paper, in order to make its theory easier to comprehend. Furthermore, this review aims to clarify the paper's role so far in the field of neural networks. This was done by looking at the research on which the paper builds, while also searching for research that has already applied or used any of the paper's findings. Additionally, a SWOT analysis was performed.
## 2. Problem formulation of the paper
Generally, the values of a neural network's (NN) hyperparameters are tuned carefully, since these parameters influence the magnitude of the real-valued weights during and after training. However, for a particular class of networks, binary neural networks (BNNs), weight magnitude is absent, which feeds the assumption that these hyperparameters do not make any meaningful contribution.
Because of this, a more thorough insight into the mechanism of training BNNs with respect to their hyperparameters is valuable. The main point of view used in the paper for gaining this insight is supplied by the theoretical work of Helwegen et al. [2]. In that research, latent real-valued weight optimization is approximated using a variable called the 'accumulated gradient'. More about this research can be found in section 5.
## 3. Theory behind binary neural networks [1][4]
Binary neural networks (BNNs) do not use weights of arbitrary magnitude; instead, each weight takes one of two binary values, -1 or +1. Due to this binary nature, the gradient of the sign function is zero almost everywhere, so the weights cannot be updated directly during regular training of such a network. Because of this, these networks are generally trained using latent real-valued weights. These latent weights do have a magnitude and were assumed to be comparable to the ordinary weights seen in any MLP. [1][4]
### 3.1. Theory and mathematics behind training BNN's
A network based on the BNN technique is generally trained with two specific techniques: stochastic gradient descent (SGD) and momentum.
SGD is the vanilla training technique used to update the weights of an NN through backpropagation, using stochastically chosen subsets or *batches* of the data each iteration. Momentum is a first-order filter technique that tracks the exponential moving average (EMA) of the gradients; it makes sure noisy updates have less effect on subsequent weight updates. The formulas for SGD (1) and momentum (2) can be found below. It should be noted that the SGD formula incorporates 'weight decay' through the weight-decay factor; this factor pulls the weights towards increasingly smaller values, which generally improves the generalization capabilities of the network.
$w_{i} = w_{i-1}-\varepsilon(m_{i}+\lambda w_{i-1})$ (1)
In the formula for SGD (1), $w_{i}$ represents the latent weight at iteration $i$, $w_{i-1}$ represents the latent weight at iteration $i-1$, $\varepsilon$ represents the learning rate, $m_{i}$ represents the momentum variable at iteration $i$ and $\lambda$ represents the weight-decay factor.
$m_{i} = (1-\gamma)m_{i-1}+\gamma\Delta_{\theta i}$ (2)
In the formula for momentum (2), $m_{i}$ represents the momentum value at iteration $i$, $m_{i-1}$ represents the momentum at iteration $i-1$, $\gamma$ represents the adaptivity rate, and $\Delta_{\theta i}$ represents the gradient with respect to the binary weight at iteration $i$.
After the latent weights have been updated with SGD, the binary weights used in the forward pass are obtained with the sign function (3) below, in which $\theta_{i}$ is the binary weight at iteration $i$:
$\theta_{i} = sign(w_{i})$ (3)
For this research, however, the theory of Helwegen et al. [2] was adopted. This theory states that the latent-weight updates can be interpreted as updates of accumulated gradients, which yields a similar formula. In the associated formula (4), $g_{i}$ represents the accumulated gradient at iteration $i$ and $g_{i-1}$ represents the accumulated gradient at iteration $i-1$.
$g_{i} = g_{i-1}-\varepsilon(m_{i}+\lambda g_{i-1})$ (4)
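To make the interplay between formulas (1) to (4) concrete, the sketch below performs one optimizer step for a single binary layer. It is a minimal NumPy illustration of the equations as written above, not the authors' implementation; all function and variable names are chosen for this blog post.

```python
import numpy as np

def bnn_update_step(g_prev, m_prev, grad, eps, lam, gamma):
    """One optimizer step for a binary layer, following formulas (1)-(4).

    g_prev : accumulated gradients / latent weights from iteration i-1
    m_prev : momentum buffer from iteration i-1
    grad   : gradient of the loss w.r.t. the binary weights (Delta_theta_i)
    eps    : learning rate, lam : weight-decay factor, gamma : adaptivity rate
    """
    m = (1 - gamma) * m_prev + gamma * grad   # momentum EMA, formula (2)
    g = g_prev - eps * (m + lam * g_prev)     # accumulator update with weight decay, formulas (1)/(4)
    theta = np.sign(g)                        # binary weights for the forward pass, formula (3)
    return g, m, theta
```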
## 4. Main contribution of the paper [1]
Because the latent weights can be approximated as gradient accumulators, some BNN hyperparameters can be reinterpreted, according to the paper. This yields a better understanding of the effect of tuning these hyperparameters (section 4.1) and thus enables more goal-oriented optimization. Furthermore, the proposed reinterpretation may also simplify hyperparameter tuning for these types of BNN systems altogether (section 4.2). These two aspects are the main contributions of the paper.
### 4.1. Novel interpretation of BNN hyperparameters
The BNN hyperparameters examined in this research are: weight initialization, learning rate, weight decay, learning rate decay and momentum. The main findings with respect to these hyperparameters can be found in the subsections of this paragraph.
#### 4.1.1. Weight initialization
The weight initialization can be reinterpreted due to the approximation of the latent weights as accumulated gradients. Since at the first iteration no gradient has been accumulated yet, it is natural to set all initial accumulated gradients to $g_{0} = 0$. However, to ensure the binary weights $\theta_{0}$ do not all become zero (since $sign(0) = 0$), a stochastic sign function is used, which randomly assigns each weight a value of either -1 or +1.
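A minimal sketch of this initialization scheme (names of our own choosing) could look as follows:

```python
import numpy as np

def stochastic_sign_init(shape, seed=0):
    """g_0 = 0 (nothing accumulated yet) combined with a stochastic sign:
    each initial binary weight is drawn uniformly from {-1, +1}."""
    rng = np.random.default_rng(seed)
    g0 = np.zeros(shape)                          # no accumulated gradient at iteration 0
    theta0 = rng.choice([-1.0, 1.0], size=shape)  # stochastic sign instead of sign(0)
    return g0, theta0
```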
#### 4.1.2. Learning rate and weight decay
After some standard mathematical manipulation of formulas (1) and (2), the paper arrives at the following formula for the accumulated-gradient updates (5):
$g_{i} = (1-\alpha)g_{i-1}+\alpha m_{i}$ (5)
Here $\alpha$ is the product of $\varepsilon$ and $\lambda$. From this formula it becomes clear that the main contribution of the weight decay and the learning rate is that their product acts as the adaptivity rate of an EMA filter.
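Written as code, the reinterpreted update (5) is nothing more than a single EMA step (an illustrative snippet, not the paper's code):

```python
def ema_accumulator_update(g_prev, m, alpha):
    """Formula (5): the accumulated gradient is an exponential moving average
    of the momentum signal, with adaptivity rate alpha = eps * lam."""
    return (1 - alpha) * g_prev + alpha * m
```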
#### 4.1.3. Learning rate decay
Since the learning rate now only enters the update function as a factor that scales the EMA, learning rate decay can be designated as $\alpha$-decay. This decay determines how fast the effective window size of the EMA increases over training: a lower value means the window size increases less quickly and thus the network converges less rapidly.
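The paper's experiments use a cosine schedule for this $\alpha$-decay (see section 7.2). As an illustration only, assuming a standard cosine schedule with hypothetical start and end values, such a schedule could look like this:

```python
import math

def cosine_alpha_decay(step, total_steps, alpha_start=1e-2, alpha_end=0.0):
    """Hypothetical cosine decay of the adaptivity rate alpha. A smaller alpha
    averages over a longer window, so decaying alpha slowly widens the
    effective EMA window as training progresses."""
    cos_term = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return alpha_end + (alpha_start - alpha_end) * cos_term
```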
#### 4.1.4. Momentum
Because both the momentum formula (2) and the accumulated-gradient formula (5) can be perceived as an EMA filter, substituting (2) into (5) turns the optimizer for a BNN into a second-order linear infinite impulse response (IIR) filter. The resulting formula for each iteration of the updated accumulated gradient is:
$g_{i} = \alpha \gamma \Delta_{\theta i}-(\alpha + \gamma -2) g_{i-1} - (\alpha - 1)(\gamma - 1)g_{i-2}$ (6)
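Spelled out in code, formula (6) is a direct transcription of this recursion (an illustrative sketch; names are ours):

```python
def second_order_update(g_prev, g_prev2, grad, alpha, gamma):
    """Formula (6): a second-order linear IIR low-pass filter on the raw
    gradients, obtained by chaining the two EMA filters (2) and (5)."""
    return (alpha * gamma * grad
            - (alpha + gamma - 2) * g_prev
            - (alpha - 1) * (gamma - 1) * g_prev2)
```

Applying the two first-order filters (2) and (5) one after the other each iteration produces the same sequence of accumulated gradients.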
#### 4.1.5. Results from evaluations
In order to evaluate the effects of the reinterpreted hyperparameters, the accuracy and the flip ratio (the fraction of binary weights that change sign per update; a minimal sketch of this metric follows the list below) were measured under varying conditions of the network (presence of clipping and/or scaling, varying $\alpha$ values, varying learning rate, with and without $\alpha$-decay, and with and without zero initialization). These effects were evaluated using a BiRealNet-20 architecture on the CIFAR-10 dataset. The main findings of the evaluations were the following:
1. The second-order filter of gradient accumulation is preferable to a first-order filter, since it filters high-frequency noise while also adapting quickly to recent changes. A first-order filter forces the designer to trade off between these two properties.
2. A higher $\varepsilon$ and a lower $g_{0}$ are insensitive to scaling and yield similar flip ratios. A too small $\varepsilon$ or too large $g_{0}$, however, does not reach the same flip ratios. For sufficiently large ratios, scaling both $\varepsilon$ and $g_{0}$ has no effect on training.
3. When using magnitude-dependent networks with clipping and no scaling, the learning rate must be tuned carefully so that not all weights are pushed outside the clipping region, while still yielding sensible scaling. With the initialization set to zero, however, this problem disappears.
4. A too large $\alpha$ causes too many binary weight flips per update, which makes convergence hard. A too small $\alpha$ makes the network converge too quickly, which may lead to sub-optimal performance.
5. Using $\alpha$-decay, as opposed to not using it, leads to better convergence.
6. $\alpha$-decay ensures that the flip ratio declines towards the end of the training procedure and thus ensures better convergence.
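As promised above, a minimal sketch (our own naming, not the paper's code) of the flip-ratio metric used in these evaluations:

```python
import numpy as np

def flip_ratio(theta_prev, theta_new):
    """Fraction of binary weights that changed sign between two consecutive updates."""
    return float(np.mean(theta_prev != theta_new))
```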
### 4.2. Less complex hyperparameter tuning
Because of the reinterpretation of the hyperparameters, a comparable system can be reduced from seven hyperparameters to only the three described in the previous sections ($\alpha$, $\gamma$ and the $\alpha$-decay). This is achieved by dropping the learning rate, weight initialization, clipping and scaling parameters.
The paper argues that this makes hyperparameter optimization of such BNN systems easier and less computationally expensive. The rationale is that, since there are fewer hyperparameters to tune jointly, finding the optimal combination is a less complex task.
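As a purely illustrative back-of-the-envelope example: with, say, five candidate values per hyperparameter, an exhaustive grid over seven hyperparameters contains $5^7 = 78125$ configurations, whereas a grid over the remaining three ($\alpha$, $\gamma$ and the $\alpha$-decay schedule) contains only $5^3 = 125$.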
## 5. Research the paper builds upon
The paper in question cites a fair number of works. However, a few are recognizable as the main incentives for the research performed.
First of all, according to the paper, the theoretical discrepancy between certain optimization techniques and the nature of BNNs has been noted multiple times in recent years. Furthermore, these studies also showed the importance of tuning these optimization techniques with prudence. The papers cited to support these claims are Liu et al. [3] and Martinez et al. [4].
The paper of Liu et al. [3] investigated whether the common practice of using the Adam algorithm to optimize a BNN classifier is actually the best option. This widely adopted practice was questioned, since a paper by Wilson et al. [5] empirically showed that Adam finds less optimal minima than regular SGD with momentum for real-valued NNs. It therefore seemed better to use standard SGD with momentum instead. The paper reasoned that with SGD, the magnitude of the weight updates follows the magnitude of the gradient during training. However, in BNNs there is a high chance that these gradients are zero, which makes it hard to update weights in the case of a bad initialization or a local minimum. Adam, on the other hand, turned out to rescale these updates based on previous gradients, which made sure the weights could still be updated.
The other paper cited is the one by Martinez et al. [4]. In this paper, the prevailing binarization method for binary convolutional neural networks (BCNNs) is questioned in terms of the performance it yields. It is argued that this "direct binarization approach" leads to a high quantization error, which in turn results in low accuracy. Two optimization techniques are proposed to solve this issue. The first technique enforces a loss constraint during training, so that the output of a binary convolution better matches the output of the real-valued convolution in the corresponding layer. This makes more sense than standard backpropagation alone, since backpropagation has proved to be much less effective for BNNs than for real-valued networks. The second technique, called data-driven channel re-scaling, shows how to boost the representational capability of a binary neural network with only a small overhead in operations; it uses the full-precision activation signal, right before the binarization operation, to compute the re-scaling factors.
These papers have thus also shown that there is a general discrepancy between training BNNs and training real-valued networks with regard to their optimization parameters. This implies a possibly even larger research gap with respect to other related parameters for other researchers to fill.
It should be noted that the paper also mentions other works that have noted the discrepancy between theory and practice with regard to BNNs, such as the research of Tang et al. [7] and Hu et al. [6]. Tang et al. focused on why BNNs perform poorly or fail when trained on larger datasets. Hu et al. were concerned with the fact that contemporary binarization methods applied to 1x1 convolutions cause substantially greater accuracy degradation.
The paper also mentions, in its related-works section, a myriad of other machine learning papers describing different techniques to train BNNs.
The main concept of interpreting the latent weights differently for BNNs is based on the research of Helwegen et al. [2]. In this research it is argued that latent weights cannot be seen as equivalent to real-valued weights, but rather as so-called accumulated gradients. This boils down to the following mathematical expression (7).
$\tilde{w} = \mathrm{sign}(\tilde{w}) \cdot \left| \tilde{w} \right| = w_{bin} \cdot m, \quad w_{bin} \in \left\{ -1, 1 \right\}, \; m \in \left[0, \infty \right)$ (7)
This expression (7) states that a latent weight $\tilde{w}$ is simply its corresponding binary weight $w_{bin}$ multiplied by the magnitude of the latent weight, $m$. This means that as training goes on and the latent weight grows in magnitude, an increasingly large counteracting gradient is necessary to make the binary weight flip. Updating the latent weights thus has an increasingly stabilizing effect, which allows the flip rate to decay over training and thus promotes convergence.
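This "inertia" effect of the decomposition can be illustrated with a small numerical sketch (purely illustrative, not taken from [2]):

```python
import numpy as np

def flips_after_update(w_latent, update):
    """Expression (7): a latent weight only changes its binary value when an
    update overcomes its accumulated magnitude |w| (its 'inertia')."""
    w_bin, m = np.sign(w_latent), np.abs(w_latent)   # decomposition w = w_bin * m
    return np.sign(w_latent + update) != w_bin       # True where the sign flipped

# A weight of magnitude 0.1 flips under a -0.2 update; one of magnitude 1.5 does not.
print(flips_after_update(np.array([0.1, 1.5]), np.array([-0.2, -0.2])))  # [ True False]
```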
Besides the theoretical explanation, the paper reports a few results from experimental evaluation. First of all, it was found that varying the learning rate can have the same effect as varying the initialization scaling. Second of all, learning rate decay over the training steps reduces the influence of noise during training, since it increases the accumulated gradient even more. [2]
## 6. Research that builds on this paper
A Google Scholar search sorted by relevance showed that, as of the date of writing this blog post, the research paper has not yet been cited by any other studies. This is most likely due to the short time the paper has been publicly available, namely since March 2023.
## 7. SWOT analysis
In order to evaluate the contribution of the research paper, a supplementary SWOT analysis was performed.
### 7.1. Strengths
The paper has many strengths, which is to be expected for a more theoretical, analytical paper.
First of all, the paper takes a very fundamental approach to explaining the mechanism behind training BNNs. Concepts of training in general and of BNNs in particular are concisely explained. This allows for a more thorough understanding of the process, as the assumptions and approximations follow more logically because the definitions are well demarcated.
Additionally, the paper builds on a lot of recent and relevant research that aligns well with the formulated problem (Helwegen et al. [2], Liu et al. [3], Martinez et al. [4]). It fills the research gap of additional investigation into specific hyperparameters, while resting on the foundation of previous work that shows its necessity and rationale.
Additionally, the paper validates the performance of a system that applies the approximation of latent-weight optimization as a second-order linear infinite impulse response filter. This was done by training a Bi-RealNet-20 architecture on both the CIFAR-10 and ImageNet datasets and evaluating its performance. For both datasets it was shown that accuracy values similar to the state of the art were achieved. This is a real strength, since it shows the technique is also viable in practice and not just in theory.
Lastly, the experimental results provide a very clear picture of the role of each of the three hyperparameters. This is done with clear graphs that not only show the two important metrics (accuracy and flip ratio), but also compare non-magnitude weight networks with networks that have regular magnitude weights.
### 7.2. Weaknesses
The paper itself mentions two weaknesses of the research in its discussion. The third weakness is one observed by the writer of this blog post.
The first of these is that the filtering-based optimizer is only tested on two datasets and one specific architecture. It would have benefited the experimental value to apply it to more datasets and/or architectures, to validate the optimizer's performance under a variety of circumstances. [1]
Secondly, the paper does not, in its own words, "provide understanding on why optimizing BNNs with second-order low pass filters works as well as it does." It shows the influence of the hyperparameters when varied, but it does not go into detail about what actually happens in the feature space while training these network types. [1]
Another weakness of the paper is the lack of a graph showing the effect of multiple $\alpha$-decay schedules. In the research, only cosine $\alpha$-decay and no $\alpha$-decay were tested. It would have been insightful, and the experimental proof would have been more solid, if the effect of $\alpha$-decay had been tested for a greater variety of schedules.
### 7.3. Opportunities
The paper offers a couple of opportunities.
Firstly, the paper provides another step in understanding the training of BNNs with latent weights. It might therefore also be an incentive for additional research in this branch of machine learning. This is highly desirable, since these types of networks have proved very promising due to their low memory consumption when applied in smaller devices. One type of additional research, as suggested by the paper, is trying to understand why the second-order filter works as well as it does. [1]
Secondly, since the paper provides a better understanding of what happens during BNN training, future BNN hyperparameter tuning can be done in a more goal-oriented and efficient way. The paper reduces the original seven hyperparameters by more than half, to three, which makes optimization procedures such as hyperparameter grid searches much less computationally burdensome.
### 7.4. Threats
It appeared hard to find any threats with regard to the research paper. However, one possible threat could be identified.
One threat is that researchers may simply build new algorithms on top of this research's results, without checking the scientific reasons for why they work. This might lead to blunt assumptions in novel networks based on this theory, resulting in worse-than-optimal performance.
This would be similar to the way latent real-valued weights were long deemed analogous to actual real-valued weights, which is exactly why the line of research of the paper under review is a contribution in the first place.
## 8. Discussion
This review tried to make the research paper "*Understanding weight-magnitude hyperparameters in training binary networks*" by Quist et al. [1] easier to comprehend. This was done by also diving into the sources on which the paper is built. By combining the theory of these sources and the research itself, the aim was to provide an additional point of view.
Additionally, the role of this research paper in its own branch of machine learning was assessed. This was done by looking at the research papers the paper builds upon and by trying to find research that has already built upon the paper. It became apparent that the research paper provides a good contribution to the field of machine learning research. Not only does it add valuable results to the promising area of BNNs, it also opens possibilities for future research which can further strengthen the knowledge about these types of networks. It turned out, however, that no contemporary research has taken advantage of the results or theories of this study yet: a Google Scholar search sorted by relevance yielded zero citing papers.
Furthermore, a SWOT analysis was performed on the research paper. This analysis showed both the clarity with which the paper is written and the value of its results, along with the opportunities that come with them. It should be noted that the research paper does come with a few weaknesses, most of which were already indicated by the researchers themselves. Apart from the small threat of its results being implemented too eagerly in upcoming research, there is not much to worry about with regard to the results of the paper.
## 9. References
[1] **"Understanding weight-magnitude hyperparameters in training binary networks"**. Joris Quist, Yunqiang Li and Jan van Gemert. *ICLR, 2023.*
[2] **"Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization"**. Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng and Roeland Nusselder. *NeurIPS, 2019.*
[3] **"How Do Adam and Training Strategies Help BNNs Optimization?"**. Zechun Liu, Zhiqiang Shen, Shichao Li, Koen Helwegen, Dong Huang and Kwang-Ting Cheng. *ICML, 2021.*
[4] **"Training Binary Neural Networks with Real-to-Binary Convolutions"**. Brais Martinez, Jing Yang, Adrian Bulat and Georgios Tzimiropoulos. *ICLR, 2020.*
[5] **"The Marginal Value of Adaptive Gradient Methods in Machine Learning"**. A. C. Wilson, R. Roelofs, M. Stern, N. Srebro and B. Recht. *Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.*
[6] **"Elastic-Link for Binarized Neural Networks"**. Jie Hu, Ziheng Wu, Vince Tan, Zhilin Lu, Mengze Zeng and Enhua Wu. *Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 942–950, 2022.*
[7] **"How to Train a Compact Binary Neural Network with High Accuracy?"**. Wei Tang, Gang Hua and Liang Wang. *AAAI, pp. 2625–2631, 2017.*