## Further response to Reviewer hQV7

We would like to clarify that ImageNet-100 is a subset of the ImageNet-1k dataset. Due to substantial computational demands and time constraints, conducting experiments on the entire ImageNet-1k dataset is challenging during the rebuttal. To further address the reviewer's concern about the ImageNet-1k dataset, we train ResNet-50 on it using $\frac{1}{3}$ of the training dataset and test the model on the complete test dataset. We perform $2$ runs of $90$ epochs with a batch size of $512$, otherwise keeping the same parameters as for the ImageNet-100 experiments. The results below are consistent with ImageNet-100 and indicate the positive impact of BiSAM on ImageNet-1k as well. In addition, we will consider adding results on the entire training dataset to the paper.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Top 1 | $66.62 \pm 0.43$ | $66.64 \pm 0.80$ | $\mathbf{66.71 \pm 0.21}$ |
| Top 5 | $87.68 \pm 0.25$ | $87.68 \pm 0.46$ | $\mathbf{87.75 \pm 0.19}$ |

We would also like to kindly remind the reviewer that we have already provided comprehensive empirical results across multiple datasets and tasks, including 1) base image classification on CIFAR10, CIFAR100, and ImageNet100; 2) the noisy-labels task; 3) the combination of BiSAM with ASAM, ESAM (in the supplementary material), and [mSAM](https://openreview.net/forum?id=wLdmZwvAND&noteId=FN3lmeOGLB); 4) [transfer learning](https://openreview.net/forum?id=wLdmZwvAND&noteId=wJ1DKW9hpI) on the Oxford-flowers and Oxford-IIIT Pets datasets. These experiments consistently demonstrate BiSAM's superior performance over SAM while maintaining the same computational complexity.

We summarize **all** the weaknesses mentioned in your review and how we addressed them:

| Weakness | How did we address it? |
|:---:|:---:|
| Experiments on ImageNet | ✅ Performed new experiments with ImageNet-100 and a subset of ImageNet-1k, with positive results |
| NLP experiments | ❌ Unfortunately, not feasible during the rebuttal |
| Improve the quality of writing / correct grammar | ✅ Done for the camera-ready version |
| Consistent style of references | ✅ Done for the camera-ready version |
| Complexity calculations | ✅ Performed timing of both SAM and BiSAM, confirming similar time per iteration |
| Code implementation | ✅ Provided code in the rebuttal |

Given that we have worked towards resolving all but one of the weaknesses you mention in your review, and some of them can be easily improved further for the final version (grammar/references/code), we kindly ask you to reconsider your score.

Best regards,
Authors

## Happy to provide further clarification

### Reviewer 5PcM

Dear Reviewer 5PcM,

We appreciate the constructive feedback from the reviewer. During the rebuttal, we have taken the following actions:
- Provided an example illustrating the failure of maximizing the upper bound while the lower bound approach succeeds.
- Explained that BiSAM consistently outperforms SAM in mean accuracy across various neural network architectures, with BiSAM (-log) displaying lower standard deviation than both SAM and SGD.

We would appreciate it if you could provide your feedback on our previous responses.

Best regards,
Authors

### Reviewer vzxj

Dear Reviewer vzxj,

We appreciate the constructive feedback from the reviewer. During the rebuttal, we have taken the following actions:
- Expanded the scope of datasets and architectures. Specifically, we demonstrated empirical results on ImageNet100 and finetuned a pretrained ViT architecture on the Oxford-flowers and Oxford-IIIT Pets datasets.
- Clarified that the results in the supplementary material already combine BiSAM with Efficient SAM and Adaptive SAM, and additionally showed the combination of BiSAM with mSAM in the rebuttal.

We would appreciate it if you could provide feedback on our previous responses.

Best regards,
Authors

### Reviewer hQV7

Dear Reviewer hQV7,

We appreciate the constructive feedback from the reviewer. During the rebuttal, we have taken the following actions:
- Presented empirical results on ImageNet100.
- Clarified the comparable computational complexity between BiSAM and SAM, and reported practical timings.
- Provided a sketch of the code implementation of the BiSAM algorithm.

We would appreciate it if you could provide feedback on our previous responses.

Best regards,
Authors

## Empirical results of ViT architecture

### Reviewer vzxj

To further validate the effectiveness of BiSAM, we conduct experiments on both additional architectures and datasets. In particular, we use the pretrained ViT-B/16 checkpoint from [Visual Transformers](https://arxiv.org/abs/2006.03677) and finetune the model on the Oxford-flowers and Oxford-IIIT Pets datasets. We use AdamW as the base optimizer with no weight decay, under a linear learning rate schedule and gradient clipping with global norm 1. We set the peak learning rate to $1e-4$ and the batch size to $512$, and run $500$ steps with $100$ warmup steps. Note that for the Flowers dataset we choose $\mu=4$ for BiSAM(-log) and $\mu=20$ for BiSAM(tanh), and for the Pets dataset we set $\mu=6$ for BiSAM(-log) and $\mu=20$ for BiSAM(tanh). The results in the table indicate that BiSAM benefits transfer learning.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Flowers | $98.79 \pm 0.07$ | $98.87 \pm 0.08$ | $\mathbf{98.93 \pm 0.15}$ |
| Pets | $93.66 \pm 0.48$ | $93.81 \pm 0.45$ | $\mathbf{94.15 \pm 0.24}$ |

We hope our rebuttal and this further response about ViT address your main concerns, and we hope you will consider increasing your score. Thank you!

### Reviewer FXZ7

Thank you for your reply and for increasing your score!

To further validate the effectiveness of BiSAM, we conduct experiments on both additional architectures and datasets. In particular, we use the pretrained ViT-B/16 checkpoint from [Visual Transformers](https://arxiv.org/abs/2006.03677) and finetune the model on the Oxford-flowers and Oxford-IIIT Pets datasets. We use AdamW as the base optimizer with no weight decay, under a linear learning rate schedule and gradient clipping with global norm 1. We set the peak learning rate to $1e-4$ and the batch size to $512$, and run $500$ steps with $100$ warmup steps. Note that for the Flowers dataset we choose $\mu=4$ for BiSAM(-log) and $\mu=20$ for BiSAM(tanh), and for the Pets dataset we set $\mu=6$ for BiSAM(-log) and $\mu=20$ for BiSAM(tanh). The results in the table indicate that BiSAM benefits transfer learning.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Flowers | $98.79 \pm 0.07$ | $98.87 \pm 0.08$ | $\mathbf{98.93 \pm 0.15}$ |
| Pets | $93.66 \pm 0.48$ | $93.81 \pm 0.45$ | $\mathbf{94.15 \pm 0.24}$ |

In addition, we will add these new results (including ImageNet100, mSAM, and ViT) to the paper. Thanks for your suggestions!

## General response

**Write a general response summarizing (in bullet form) the additional experiments and their conclusions**

We are thankful to the reviewers for their feedback. Below, we address all comments and questions raised. In particular, we would like to highlight that:
- We have already considered the orthogonality between BiSAM and other variants of SAM.
The combination of BiSAM with Efficient SAM and Adaptive SAM is provided in the supplementary material.
- We have provided an additional experiment integrating mSAM within BiSAM as suggested by Reviewer FXZ7. It further validates that existing variants of SAM can be incorporated within BiSAM.
- We have shown empirical results on ImageNet100 which indicate that BiSAM can enhance generalization performance in image classification tasks beyond CIFAR.

---

We thank the reviewers for their suggestions about expanding the experiments. We would like to highlight that, even with limited resources compared to large organizations such as Google, we evaluate our method across 5 models and 2 datasets, covering SGD, SAM, and BiSAM as well as several variants of SAM and BiSAM (**see the supplementary material for ASAM and ESAM**). To further validate the performance of BiSAM, we ran the following additional experiments:

- **ImageNet100 dataset.** We show the Top1 and Top5 maximum test accuracies for SAM and BiSAM using ResNet-50 trained for 100 epochs. Note that we use $\mu=10$ for BiSAM(-log); the other hyperparameters are the same as in the CIFAR100 experiments. We find that BiSAM achieves better accuracy than SAM, except that BiSAM (-log) gets slightly worse Top1 accuracy, although within the statistical precision of SAM. This indicates that BiSAM can enhance generalization performance in image classification tasks beyond CIFAR.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Top 1 | 83.97 $\pm$ 0.09 | $\mathbf{84.36 \pm 0.38}$ | 83.95 $\pm$ 0.02 |
| Top 5 | 96.27 $\pm$ 0.14 | $\mathbf{96.41 \pm 0.10}$ | 96.37 $\pm$ 0.08 |

- **mSAM.** We integrate mSAM [[Behdin et al., 2022]](https://arxiv.org/pdf/2212.04343.pdf) within BiSAM as suggested by Reviewer FXZ7. mSAM modifies batch sizes and distributes them among GPUs, while it does not modify the loss function or the optimization objective. Similar to other variants of SAM, the modification proposed in mSAM can therefore be used together with the modification of the loss function proposed by BiSAM (ours). To verify this, we conducted experiments using ResNet-56 and WRN-28-2 on the CIFAR-100 dataset, setting $m=4$ (with a micro batch size of 32). The results for mSAM and the two mBiSAM variants (tanh and -log) are presented in the table below. We can see that mBiSAM also outperforms mSAM. Combined with the observations for ASAM and ESAM in the supplementary material, this clearly demonstrates that the incorporation of these techniques enhances generalization, thus supporting our assertion of orthogonality.

| | mSAM | mBiSAM(tanh) | mBiSAM(-log) |
|:---:|:---:|:---:|:---:|
| ResNet-56 | 75.13 $\pm$ 0.25 | $\mathbf{75.35 \pm 0.23}$ | 75.15 $\pm$ 0.39 |
| WRN-28-2 | 79.79 $\pm$ 0.27 | 79.86 $\pm$ 0.33 | $\mathbf{79.96 \pm 0.17}$ |

**References**

[Behdin et al., 2022] Kayhan Behdin, Qingquan Song, Aman Gupta, David Durfee, Ayan Acharya, Sathiya Keerthi, and Rahul Mazumder. Improved deep neural network generalization using m-sharpness-aware minimization. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.

## Reviewer 5PcM (score: 6)

We thank the reviewer for their valuable feedback and address all remaining concerns below:

> Q1. The argument against maximizing an upper error bound as done in SAM is not fully convincing. That is, since the other player tries to make the upper bound as tight as possible, it may not matter. The argument would benefit from a figure, like Figure 1, that visually shows a scenario where it is obvious that maximizing the upper bound is erroneous.
> Please provide a figure showing a concrete scenario where it is obvious that SAM makes a mistake by maximizing an upper bound in the inner maximization of eq. 1.

A1. To make our argument more convincing, we will include the following **new example** in our paper. Consider two possible vectors of logits:
- ``Option A``: $(1/K + \delta, 1/K - \delta, \ldots, 1/K)$
- ``Option B``: $(0.5 - \delta, 0.5 + \delta, 0, 0, \ldots, 0)$

where $\delta$ is a small positive number, and assume the first class is the correct one, so we compute the cross-entropy with respect to the vector $(1, 0, \ldots, 0)$. For ``Option A``, the cross-entropy is $-\log(1/K + \delta)$, which tends to **infinity** as $K\to \infty$ and $\delta \to 0$. For ``Option B``, the cross-entropy is $-\log(0.5-\delta)$, which is a small constant. For instance, with $K=100$ and $\delta=0.001$, the cross-entropy of ``Option A`` is $-\log(0.011) \approx 4.5$ while that of ``Option B`` is only $-\log(0.499) \approx 0.7$. Hence, an adversary that maximizes an upper bound like the cross-entropy would always choose ``Option A``. However, note that ``Option A`` **never** leads to a maximizer of the 0-1 loss, since the predicted class is the correct one (zero loss). In contrast, ``Option B`` always achieves the maximum of the 0-1 loss (the loss is equal to one), even though it has low (i.e., constant) cross-entropy. This illustrates why maximizing an upper bound like the cross-entropy may provide a weak weight perturbation. Unfortunately, this subtlety cannot be summarized in one plot.

If, in contrast, we maximize a lower bound like $\tanh(\max_{j \neq y}z_j - z_y)$, where $z$ is the vector of logits (we call the quantity inside the tanh the negative margin), we obtain: the negative margin for ``Option A`` is $-2\delta$, whereas the negative margin for ``Option B`` is $2 \delta$, so an adversary maximizing a lower bound would choose ``Option B``, which correctly leads to a maximization of the 0-1 loss.

Finally, note that maximizing an upper bound is **never** done in the optimization literature, as it does not provide any improvement guarantees (this should be clear). In contrast, maximizing a lower bound always ensures improvement; the same surrogate-bound principle is the reason why, for example, gradient descent converges for objectives with Lipschitz gradients.

> Q2. Please explain why the BiSAM is better than SAM accuracy-wise despite the high standard deviations. Would it be better to repeat the experiments 10 times rather than just 3 to more clearly see the difference between the algorithms? On the other hand, the means are consistently higher across the various neural network architectures.

A2. As the reviewer points out, the average accuracy is consistently higher for BiSAM across the various neural network architectures. Concerning the standard deviation, it is actually lower for BiSAM (-log) on average than for both SGD and SAM. Additionally, if we average over all models we effectively have 15 runs, over which BiSAM still outperforms SAM on average (and under which the variance reduces). We agree that more independent runs are always desirable, but this was not possible under our computational constraints.

> Q3. The lower bounds seem to be quite loose from Figure 1.

A3. The bound is not particularly loose, as it is comparable to the corresponding upper bounds such as the hinge loss, the logistic loss, etc. (see e.g. [loss functions for classification](https://en.wikipedia.org/wiki/Loss_functions_for_classification#/media/File:BayesConsistentLosses2.jpg)).

> Q4. The English language is sometimes a bit peculiar/vague:
> - On line 129, what is meant by DenseNet?
> - On line 166, what is meant by "some particular modifications should be made"?
- "DenseNet" was a typo and should have been "neural network". We will correct it in the final version. - The "particular modifications" refers to the Taylor expansion and the use of stochastic feedback that is described in remaining of the paragraph (starting from the subsequent sentence on line 166 until line 172). > Q5. Two typos on line 95 and 107. A5. We thank the reviewer points out 2 typos. On line 95, it indeed should be capital $K$. On line 107, it should refer to Eq.4 rather than Eq.7. We will revise them in the paper. ## Reviewer vzxj (score: 6) We thank the reviewer for their valuable feedback and address all remaining concerns below: > Q1. One major limitation in the scope is apparent: BiSAM is still only applicable to classification tasks. Enabling SAM beyond classification and to more general loss functions in SFT of LLM is much impactful than improving it further in the classification tasks. A1. Extensions beyond classification is definitely an interesting future direction. However, we do not agree that this should stop the community from improving classification itself, as it is still one of the predominant losses considered in ML. Since the reviewer mentioned NLP, it is maybe worth mentioning that classification is also a central objective for language tasks: The next-word prediction used in pre-training of LLMs is a classification task and downstream tasks such as GLUE predominantly consists of classification tasks as well. > Q2. Compared to the original SAM paper, which conducted extensive experiments on classification datasets and included ViT SFT. The evaluation in this paper appears less than adequate. The dataset and the architecture presented in the experiment section is quite limited. A2. Indeed more experimental evaluation cannot hurt. Unfortunately, we do not have the same computational budget as Google, who we would like to point out also *did not consider ViT in the original SAM paper*. Given the computational constraints, our hope is that good **theoretical arguments** and **illustrative experiments** can convince the reader to try our proposed algorithm. With that said, we do conduct extensive experiments across 5 models, 2 dataset and cover 10 different methods (see the supplementary material for variants of ASAM and ESAM) while providing standard deviation for all runs. In addition, we provide: - empirical results on ImageNet100 (in the table below) Note that $\mu=10$ is set for BiSAM(-log) here. - one more variant of SAM combining BiSAM with mSAM (shown in the next question) | | SAM | BiSAM(tanh) | BiSAM(-log) | |:---:|:---:|:---:|:---:| | Top 1 | 83.97 $\pm$ 0.09 | $\mathbf{84.36 \pm 0.38}$ | 83.95 $\pm$ 0.02 | | Top 5 | 96.27 $\pm$ 0.14 | $\mathbf{96.41 \pm 0.10}$ | 96.37 $\pm$ 0.08 | > Q3. The authors claim that BiSAM is orthogonal to other variants/techniques for SAM. However, no experiment result is presented to support the claim. Seemingly orthogonal techniques sometimes actually clash with each other in practice. Thus, I think some experiments testing the combined performance or ablation of the techniques are needed. A3. We already provide experiments in the supplementary material which combines BiSAM with Efficient SAM and Adaptive SAM, and the results show that the incorporation of these techniques enhances generalization. In addition, we integrate mSAM within BiSAM as suggested by Reviewer FXZ7. We conducted experiments using Resnet-56 and WRN-28-2 on the CIFAR-100 dataset, setting $m=4$ (with a micro batch size of 32). 
The results for mSAM and the two mBiSAM variants (tanh and -log) are presented in the table below. These results clearly demonstrate that the incorporation of these techniques enhances generalization, thus supporting our assertion of orthogonality.

| | mSAM | mBiSAM(tanh) | mBiSAM(-log) |
|:---:|:---:|:---:|:---:|
| ResNet-56 | 75.13 $\pm$ 0.25 | $\mathbf{75.35 \pm 0.23}$ | 75.15 $\pm$ 0.39 |
| WRN-28-2 | 79.79 $\pm$ 0.27 | 79.86 $\pm$ 0.33 | $\mathbf{79.96 \pm 0.17}$ |

> Q4. Side question: Even for classification tasks, recent advances are shifting away from the classical losses like cross-entropy. For example, the contrastive learning losses are yielding some new SOTA models, can SAM be extended to improve that line of works?

A4. That is definitely an interesting direction worth investigating, but it is beyond the scope of this work.

<!-- A4. [Supervised Contrastive Learning](https://arxiv.org/pdf/2004.11362.pdf) -->

## Reviewer hQV7 (score: 4)

We thank the reviewer for their valuable feedback and address all remaining concerns below:

> Q1. The reviewer suggests that the authors conduct some experiments on the ImageNet dataset to further validate the performance of the authors' proposed optimizer on larger datasets.

A1. To further validate the performance of BiSAM, we conduct experiments on the ImageNet100 dataset. We show the Top1 and Top5 maximum test accuracies for both SAM and BiSAM using ResNet-50 trained for 100 epochs. Note that we use $\mu=10$ for BiSAM(-log) and otherwise use the same hyperparameters as for the experiments on CIFAR100. We find that both Top1 and Top5 accuracy are improved, which suggests that BiSAM can enhance generalization performance in image classification tasks beyond the CIFAR dataset.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Top 1 | 83.97 $\pm$ 0.09 | $\mathbf{84.36 \pm 0.38}$ | 83.95 $\pm$ 0.02 |
| Top 5 | 96.27 $\pm$ 0.14 | $\mathbf{96.41 \pm 0.10}$ | 96.37 $\pm$ 0.08 |

> Q2. The reviewer suggests the authors add some NLP experiments to verify the superiority of the proposed optimizer on different tasks.

A2. Indeed, more experimental evaluation cannot hurt. However, given the time and computational constraints, our hope is that good **theoretical arguments** and **illustrative experiments** can convince the reader to try our proposed algorithm. As you mention, our proposed modification is **simple**, so it should be trivial for researchers in other fields like NLP to try it out (we provide pseudocode further down and intend to make the code available for the final version).

> Q3. The reviewer suggests authors use a language editor to improve the quality of writing and to correct some grammatical errors. The references should have a consistent style, since some titles only have the first letter capitalized, and some have the first letter of each word.

A3. We will correct this in the updated version of the manuscript.

> Q4. The authors mention that the improved method and SAM have similar complexity, but no specific complexity-related calculations are given in the manuscript or relevant comparative experiments are given in the experimental section, so please give an explanation or add relevant derivations or experiments to the manuscript to increase the credibility.

A4. This can be seen from the fact that the only change in BiSAM is the loss function used for the *ascent* step.
By visual inspection of this loss function, its forward pass has the same complexity as that of vanilla SAM: we use the same logits but only change the final loss function. Hence, the complexity should remain the same. We report the per-epoch timings on CIFAR10 from our experiments. Note that the training times on CIFAR10 and CIFAR100 are roughly the same.

| Model | SAM (cross entropy) | BiSAM (logsumexp) |
|:---:|:---:|:---:|
| DenseNet-121 | 58s | 64s |
| ResNet-56 | 23s | 30s |
| VGG19-BN | 10s | 16s |
| WRN-28-2 | 19s | 21s |
| WRN-28-10 | 65s | 71s |

The relatively small computational overhead (10% in the best cases) is most likely due to cross entropy being heavily optimized in PyTorch. There is no apparent reason why logsumexp should be slower, so we expect that the gap can be made to effectively disappear if logsumexp is given similar attention. In fact, it has been [pointed out before](https://ben.bolte.cc/logsumexp) that logsumexp in particular is not well-optimized in PyTorch.

<!-- **Nevertheless for further clarification we have run some empirical timings of the two inner loops of SAM (cross entropy) vs BiSAM (logsumexp).** This experiment was conducted with a fixed number of classes (100) and varying batch sizes. The results of 10k gradient computations are tabulated below:
| Batch size | SAM (cross entropy) | BiSAM (logsumexp) |
|:---:|:---:|:---:|
| 128 | 2.48s | 3.91s |
| 1024 | 2.16s | 3.94s |
| 8192 | 2.63s | 3.97s | -->

<!-- These timings consistently demonstrate that the ratio between the two inner loops remains relatively stable, supporting the assertion of comparable complexities. Additionally, it's notable that BiSAM (logsumexp) exhibits improved efficiency with larger batch sizes. -->

> Q5. It would be beneficial if the authors could provide information on the availability of the code implementation for the BiSAM optimizer.

A5. In the implementation, the optimizer is defined similarly to SAM. The only necessary modification is to replace the loss function used for computing the perturbation (carried out by `first_step` in the pseudocode below). By building on an [existing SAM implementation](https://github.com/dydjw9/Efficient_SAM/blob/main/utils/sam.py), the implementation of the BiSAM algorithm only requires a few lines:

```python
import math

import torch


def perturbation_loss_tanh(pred, targets, mu=10, alpha=0.1):
    # Shift the logits so that the logit of the true class is zero,
    # then apply the tanh-based lower bound inside a logsumexp.
    for i in range(len(targets)):
        pred[i] = pred[i] - pred[i][targets[i]]
    out = torch.logsumexp(mu * torch.tanh(alpha * pred), 1)
    return out


def perturbation_loss_log(pred, targets, mu=1, alpha=0.1):
    # Same shift as above, followed by the -log-based lower bound.
    for i in range(len(targets)):
        pred[i] = pred[i] - pred[i][targets[i]]
    out = torch.logsumexp(
        mu * torch.nn.functional.softplus(
            pred - torch.log(torch.tensor(math.e - 1)), beta=-1), 1)
    return out


def bisam(inputs, targets):
    # `model`, `optimizer` (a SAM-style optimizer exposing first_step/second_step),
    # `args` and `smooth_crossentropy` are defined as in the SAM code base above.
    predictions = model(inputs)
    if args.optim == 'bisam_log':
        loss_fct = perturbation_loss_log
    elif args.optim == 'bisam_tanh':
        loss_fct = perturbation_loss_tanh

    # Perturbation (ascent) step with the BiSAM perturbation loss
    loss = loss_fct(predictions, targets, args.mu, args.alpha)
    loss.mean().backward()
    optimizer.first_step(zero_grad=True)

    # Update (descent) step with the standard smoothed cross-entropy loss
    smooth_crossentropy(model(inputs), targets, smoothing=args.label_smoothing).mean().backward()
    optimizer.second_step(zero_grad=True)
```

## Reviewer FXZ7 (score: 3)

We thank the reviewer for their valuable feedback and address all remaining concerns below:

> Q1. I think the biggest issue with the paper is that the numerical experiments are rather limited. The authors only discuss CIFAR and classical architectures.
> This is while much of the literature in image classification nowadays is focused on ImageNet data, and transformer architectures.

A1. Indeed, more experimental evaluation cannot hurt. However, given the time and computational constraints, our hope is that good **theoretical arguments** and **illustrative experiments** can convince the reader to try our proposed algorithm. As you mention, the main idea behind BiSAM **is a clever observation and makes intuitive sense.** Regarding the architectures, we have performed experiments on DenseNet-121, ResNet-56, WRN-28-2, and WRN-28-10, which still achieve performance not far from the state of the art and are still considered workhorses in computer vision tasks.

To broaden our evaluation, we conducted additional experiments on the ImageNet100 dataset using ResNet-50 and WRN-28-2. The table below shows the Top1 and Top5 maximum test accuracies for SAM and BiSAM trained for 100 epochs. Note that we use $\mu=10$ for BiSAM(-log); the other hyperparameters are the same as in the CIFAR100 experiments.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Top 1 | 83.97 $\pm$ 0.09 | $\mathbf{84.36 \pm 0.38}$ | 83.95 $\pm$ 0.02 |
| Top 5 | 96.27 $\pm$ 0.14 | $\mathbf{96.41 \pm 0.10}$ | 96.37 $\pm$ 0.08 |

We find that BiSAM achieves better accuracy than SAM, except that BiSAM (-log) gets slightly worse Top1 accuracy, although within the statistical precision of SAM. These results indicate that BiSAM offers enhanced generalization in image classification tasks, extending beyond the CIFAR domain.

> Q2. This is ignoring all NLP models and datasets.

A2. Due to computational constraints we needed to scope the work somehow, and we believe the vision domain alone provides sufficient motivation. This follows several existing works that only consider vision applications, such as SAM, ESAM, LookSAM, and FisherSAM. With that said, our modification is simple, so it should not be hard to try our method on NLP tasks. However, such experiments require a lot of computational resources and are beyond the scope of the current work.

> Q3. The author also ignore some baselines in the experiments and literature review. mSAM [1,2,3] should be discussed in the paper. In the experiments, GSAM or mSAM should be included as baselines as they have been shown to improve upon SAM consistently.

A3. We have added a reference to mSAM in the related work. We remark that, similar to other variants of SAM, mSAM does not modify the loss function or the optimization objective; rather, it modifies the batch sizes and how they are distributed among GPUs. Hence, the modification proposed in mSAM can be used together with the modification of the loss function proposed by BiSAM (ours). As such, we consider mSAM to not compete directly with our method. To verify this, we test ResNet-56 and WRN-28-2 on the CIFAR-100 dataset with $m=4$ (micro batch size 32). The results in the table below show that mBiSAM indeed has better performance.

| | mSAM | mBiSAM(tanh) | mBiSAM(-log) |
|:---:|:---:|:---:|:---:|
| ResNet-56 | 75.13 $\pm$ 0.25 | $\mathbf{75.35 \pm 0.23}$ | 75.15 $\pm$ 0.39 |
| WRN-28-2 | 79.79 $\pm$ 0.27 | 79.86 $\pm$ 0.33 | $\mathbf{79.96 \pm 0.17}$ |

Similar to mSAM, GSAM does not compete directly with BiSAM either, and we expect that combining BiSAM with GSAM would improve the performance further. However, we have to leave this for future work due to time and resource constraints.
Nonetheless, it is worth noting that we have already tried combining BiSAM with **ASAM and ESAM**, and we find that our new formulation also improves generalization when used together with these two extensions of SAM (see the supplementary material).

# Experiments

We thank the reviewers for their suggestions about expanding the experiments. We would like to highlight that, even with limited resources compared to large organizations such as Google, we evaluate our method across 5 models and 2 datasets, covering SGD, SAM, and BiSAM as well as several variants of SAM and BiSAM (**see the supplementary material for ASAM and ESAM**). To further validate the performance of BiSAM, we ran the following additional experiments:

- **ImageNet100 dataset.** We show the Top1 and Top5 maximum test accuracies for SAM and BiSAM using ResNet-50 trained for 100 epochs. Note that we use $\mu=10$ for BiSAM(-log); the other hyperparameters are the same as in the CIFAR100 experiments. We find that BiSAM achieves better accuracy than SAM, except that BiSAM (-log) gets slightly worse Top1 accuracy, although within the statistical precision of SAM. This indicates that BiSAM can enhance generalization performance in image classification tasks beyond CIFAR.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Top 1 | 83.97 $\pm$ 0.09 | $\mathbf{84.36 \pm 0.38}$ | 83.95 $\pm$ 0.02 |
| Top 5 | 96.27 $\pm$ 0.14 | $\mathbf{96.41 \pm 0.10}$ | 96.37 $\pm$ 0.08 |

- **mSAM.** We integrate mSAM [[Behdin et al., 2022]](https://arxiv.org/pdf/2212.04343.pdf) within BiSAM as suggested by Reviewer FXZ7. mSAM modifies batch sizes and distributes them among GPUs, while it does not modify the loss function or the optimization objective. Similar to other variants of SAM, the modification proposed in mSAM can therefore be used together with the modification of the loss function proposed by BiSAM (ours). To verify this, we conducted experiments using ResNet-56 and WRN-28-2 on the CIFAR-100 dataset, setting $m=4$ (with a micro batch size of 32). The results for mSAM and the two mBiSAM variants (tanh and -log) are presented in the table below. We can see that mBiSAM also outperforms mSAM. Combined with the observations for ASAM and ESAM in the supplementary material, this clearly demonstrates that the incorporation of these techniques enhances generalization, thus supporting our assertion of orthogonality.

| | mSAM | mBiSAM(tanh) | mBiSAM(-log) |
|:---:|:---:|:---:|:---:|
| ResNet-56 | 75.13 $\pm$ 0.25 | $\mathbf{75.35 \pm 0.23}$ | 75.15 $\pm$ 0.39 |
| WRN-28-2 | 79.79 $\pm$ 0.27 | 79.86 $\pm$ 0.33 | $\mathbf{79.96 \pm 0.17}$ |

## ViT-B/16

### CIFAR dataset

We conduct additional experiments on the ViT architecture. In particular, we use the pretrained ViT-B/16 checkpoint from [Visual Transformers](https://arxiv.org/abs/2006.03677) and finetune the model on CIFAR10. We use AdamW as the base optimizer with a $1e-4$ learning rate and finetune for $15$ epochs. Note that we use $\mu=2$ for BiSAM(-log) and $\mu=15$ for BiSAM(tanh). The results in the table indicate that BiSAM has similar performance to SAM in transfer learning.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| CIFAR10 | $98.75 \pm 0.09$ | $98.76 \pm 0.12$ | $98.76 \pm 0.08$ |

### Oxford dataset

We conduct additional experiments on the ViT architecture. In particular, we use the pretrained ViT-B/16 checkpoint from [Visual Transformers](https://arxiv.org/abs/2006.03677) and finetune the model on the Oxford-flowers [Nilsback & Zisserman, 2008] and Oxford-IIIT Pets [Parkhi et al., 2012] datasets.
We use AdamW as the base optimizer with no weight decay, under a linear learning rate schedule and gradient clipping with global norm 1. We set the peak learning rate to $1e-4$ and the batch size to $512$, and run $500$ steps with $100$ warmup steps. Note that for the Flowers dataset we choose $\mu=4$ for BiSAM(-log) and $\mu=20$ for BiSAM(tanh), and for the Pets dataset we set $\mu=6$ for BiSAM(-log) and $\mu=20$ for BiSAM(tanh). The results in the table indicate that BiSAM benefits transfer learning.

| | SAM | BiSAM(tanh) | BiSAM(-log) |
|:---:|:---:|:---:|:---:|
| Flowers | $98.79 \pm 0.07$ | $98.87 \pm 0.08$ | $\mathbf{98.93 \pm 0.15}$ |
| Pets | $93.66 \pm 0.48$ | $93.81 \pm 0.45$ | $\mathbf{94.15 \pm 0.24}$ |

**References**

[Nilsback & Zisserman, 2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.

[Parkhi et al., 2012] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505. IEEE, 2012.

## Time

Batch size = 128, number of classes = 100, looped 10k times:

| | SAM (cross entropy) | BiSAM (logsumexp) |
|:---:|:---:|:---:|
| PyTorch | 2.40s | 3.96s |
| TensorFlow | 3.25s | 2.34s |
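For reference, below is a minimal sketch of how such a timing comparison could be reproduced in PyTorch. The helper names (`time_loss`, `logsumexp_perturbation_loss`) and the exact setup (forward plus backward passes on random logits, CPU by default) are illustrative assumptions and not the script behind the numbers above; the logsumexp loss is a vectorized variant of the `perturbation_loss_tanh` pseudocode given earlier.

```python
import time

import torch
import torch.nn.functional as F


def time_loss(loss_fn, logits, targets, repeats=10_000):
    """Time `repeats` forward+backward passes of `loss_fn` on fixed inputs."""
    start = time.perf_counter()
    for _ in range(repeats):
        logits_ = logits.clone().requires_grad_(True)
        loss_fn(logits_, targets).mean().backward()
    return time.perf_counter() - start


def logsumexp_perturbation_loss(logits, targets, mu=10.0, alpha=0.1):
    # tanh-based perturbation loss: subtract the true-class logit from
    # every logit, then take logsumexp of the bounded margins.
    shifted = logits - logits.gather(1, targets.unsqueeze(1))
    return torch.logsumexp(mu * torch.tanh(alpha * shifted), dim=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    batch_size, num_classes = 128, 100  # setting used in the table above
    logits = torch.randn(batch_size, num_classes)
    targets = torch.randint(num_classes, (batch_size,))

    t_ce = time_loss(lambda p, y: F.cross_entropy(p, y, reduction="none"),
                     logits, targets)
    t_lse = time_loss(logsumexp_perturbation_loss, logits, targets)
    print(f"cross entropy: {t_ce:.2f}s, logsumexp perturbation: {t_lse:.2f}s")
```

As the PyTorch/TensorFlow discrepancy in the table suggests, such micro-benchmarks are sensitive to how well each primitive is optimized in the framework, so absolute numbers will vary across devices and library versions.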