## Comment to AC
Dear ACs and SACs,
We want to extend our heartfelt gratitude to the reviewers once more for their invaluable time and insightful feedback. We are glad that our responses have addressed the majority of the questions raised.
Regarding the follow-up queries from Reviewer 8GCQ, while the main translation-related issues raised by the reviewer are not relevant to our paper, we are confident that our point-by-point justifications adequately clarify why the defined translation does not alter our objective or analyses. Additionally, regarding the claimed absence of normalization or restriction effects on the logit, which could further bolster the rationale of our design over others, we reiterate the findings in our manuscript showing how these effects are already achieved by our training objectives.
Given that we have not received further communication from Reviewer 8GCQ during the final-round discussion, we hope our detailed discussion and summary will be fairly considered when you make the final decision.
Thank you once again for your dedication and effort.
Sincerely,
The authors
****************
Is there a new concern from Reviewer 8GCQ?
Dear ACs,
We noticed that Reviewer 8GCQ updated his rating one week after the discussion deadline. Given that we do not have access to view his new comments or to clarify them, we would be grateful if the ACs could inform us of his new concern and give us the chance to respond.
If not, we hope this particular circumstance will be taken into account when you make the final decision.
****************
Dear ACs and SACs,
We sincerely appreciate the feedback from Reviewer 8GCQ and thank the AC for granting us the chance to respond.
Regarding the concerns from Reviewer 8GCQ, we would like to clarify once more, and we hope our response can still be taken into account.
> Novelty, compared with $|| z - z^E ||$
A: We have addressed this issue in our general response. In particular, we compare with the mentioned $|| z - z^E ||$ in Sec. 5.3 (Table 4) of the manuscript and illustrate the large performance gap between the two ideas. We provide our analysis of the reason in our initial response to Reviewer 8GCQ: (1) $|| z - z^E ||$ cannot use more information compared with $||z - q^E||$; we note the reviewer does not dispute this analysis; (2) $|| z - z^E ||$ cannot mine hard samples from the expert compared to ours, as we cannot tell whether a sample is easy or hard for the expert by only observing $z^E$; we note the reviewer does not dispute this analysis either.
> Method, the translation problem compared with $\mathcal{H}(q, q^E)$ and $|| z - z^E ||$
A: We have indeed discussed the defined translation problem and explained why it should not be a problem: including a learnable scalar in the classifier is a common design in existing deep networks, which has been considered since the very beginning of the deep network era. Meanwhile, we want to clarify that $|| z - z^E ||$ is not translation invariant either, because $|| (z+c \textbf{1}) - z^E || \neq || z - z^E ||$.
> Analysis, monotonicity of the rescaling factor with respect to $q^E$
A: We want to explain why we observe $\mathcal{F}$ and $\mathcal{F}'$ to be mostly negative. First, for $\mathcal{F} = 1 - \alpha \frac{z_{\ast} - q_{\ast}^E}{1 - q_{\ast}}$, we have $z_{\ast} \gg q_{\ast}^E$ (explained to the reviewer in our final-round response, and we notice the reviewer does not dispute it). Given that $q_{\ast}$ will approximate the label 1 after training, $1 - q_{\ast}$ approximates 0. Thus, as long as $\alpha$ is not infinitely small, we have $\mathcal{F} < 0$.
Secondly, we observe $\mathcal{F}' = 1 - \alpha \frac{1 - \sum_{c \neq \ast} z_{c} - q_{\ast}^E}{1 - q_{\ast}} \approx 1 - \alpha \frac{z_{\ast} - q_{\ast}^E}{1 - q_{\ast}} = \mathcal{F}$. This is because our logit regularization term $||z - q^E||$ enforces $z$ to be similar to $q^E$, so $\sum z \approx \sum q^E = 1$. Note that, as mentioned, the main loss $\mathcal{H}(\text{softmax}(z), y)$ and the logit regularization term serve the same purpose (both encourage the ground-truth logit to be the largest), so fulfilling $z \approx q^E$ does not conflict with the main loss. In other words, enforcing $z \approx q^E$ satisfies both the main classification loss and the logit regularization.
Since both $\mathcal{F}$ and $\mathcal{F}'$ are mostly negative, their absolute values remain monotonic with respect to $q_{\ast}^E$ (larger in magnitude for smaller $q_{\ast}^E$), so the hard-sample mining effect holds.
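For illustration, with hypothetical values chosen only to make the point (not taken from our experiments), e.g., $\alpha = 0.1$, $z_{\ast} = 5$, $q_{\ast}^E = 0.8$, and $q_{\ast} = 0.95$:
$$\mathcal{F} = 1 - \alpha \frac{z_{\ast} - q_{\ast}^E}{1 - q_{\ast}} = 1 - 0.1 \times \frac{5 - 0.8}{1 - 0.95} = 1 - 8.4 = -7.4 < 0.$$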
We sincerely apologize for the late reply; due to system settings in OpenReview, we did not receive a timely notification regarding the late update of the review from Reviewer 8GCQ. We hope our response will be fairly considered when you make the decision.
We sincerely thank you for your dedication and effort.
Best regards,
The authors
## Summary of Discussions
Dear ACs and SAC,
We sincerely thank all reviewers for their invaluable efforts and dedication in improving our work. We are glad that all reviewers are positive after an engaging discussion phase. We provide a summary of discussions in the following.
## Discussion with Reviewer w9Py
> The initial question is mostly about scaling our method with a larger ViT-based backbone.
Regarding this concern, we conduct further experiments with existing methods using a ViT-Base backbone that is roughly 8 times larger. The results show that our method obtains promising results when scaled up. We are glad that the reviewer agrees these results help further validate the effectiveness of our method.
## Discussion with Reviewer 8GCQ
> Initial questions are mostly about unfamiliarity with the DG background, concepts in the related work, misunderstandings regarding details of our method, and the rationale for designing a new KD scheme rather than using existing solutions despite the superior performance of our new design.
Our response provides a concise overview of the DG background and additional details of our method. The rationale for designing the new KD scheme is given: our design is adapted from the MSE loss, inspired by the pioneering idea [c] that is adapted from the cross-entropy loss. Deeper analyses are conducted (based on the analysis in the manuscript) to justify why the new design outperforms existing solutions: existing ideas struggle to leverage the advantages inherent in our KD design (using more information and mining hard samples from the experts). We are glad that most questions raised by the reviewer have been addressed satisfactorily, and that the reviewer gained a better understanding of our method and gave a positive rating.
The main concerns raised during the discussion are summarized as follows.
> The reviewer believes that a regularization term should be translation invariant (according to the reviewer, translation is defined as adding a learnable scalar to the output logit vector of the model), and views our new KD scheme as unintuitive compared to existing ideas because it violates this so-called translation invariance, despite its promising performance.
We respond that, first, the so-called translation invariance is unnecessary when designing a training objective; a counterexample is that an objective using the MSE loss (enforcing similarity between the output logit and the label) is not translation invariant, yet performs as well as the translation-invariant cross-entropy loss [a, b]. Second, the mentioned existing KD idea (enforcing similarity between logits from the teacher and student) is also translation variant, so it is unreasonable to deem our scheme unintuitive compared to others for this reason.
> The reviewer believes that the implementation of MSE loss should be enforcing similarity between the output probabilities and label, thus an intuitive KD scheme adjusted from MSE loss should be probability regularization (enforcing the probabilities from the target and experts to be similar).
We respond that the proper implementation of the MSE loss should enforce similarity between the output logit (not the probability) and the label [a] (and we gladly find the reviewer is persuaded). Thus our logit regularization KD scheme is a proper adaptation of the MSE loss.
> The reviewer believes that the translation (defined by the reviewer) should affect our analyses.
We respond that including a learnable scalar for translation does not fundamentally change the original training objective, and therefore our analyses remain valid.
Meanwhile, while the translation introduced by the reviewer is interesting, we must point out that it is not part of our framework, and we cautiously assert that it is irrelevant to our main task, as there is no clear evidence that such a translation can improve generalization. We thus kindly suggest that this work be evaluated without regard to the so-called translation-related issues.
> The reviewer believes that our new KD scheme is unintuitive as it lacks normalization or restriction of the logit vector.
We respond that the restriction effect is already achieved by the main classification loss together with our logit regularization term: the former encourages the ground-truth logit to keep increasing (and the non ground-truth logits to keep decreasing), while the latter encourages the logit to stay within the range of a probability, so the final logit is balanced between these two restriction effects (as mentioned in the first analysis in Sec. 3.2 of the manuscript).
[a] Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks, in ICLR'21.
[b] Understanding Square Loss in Training Overparametrized Neural Network Classifiers, in NeurIPS'22.
[c] Distilling the Knowledge in a Neural Network, Hinton et al.
## Discussion with Reviewer Qpvc
> The initial concerns are mainly about the training cost and the improvements in the DomainBed benchmark.
We respond that the training cost is inevitably larger than the baseline, as the method follows the KD pipeline in which both teacher and student are involved; the training cost of our method is roughly doubled compared to the baseline, while the test cost is the same as the baseline. Meanwhile, we explain that an average improvement of 2.8pp over the baseline is actually promising on the DomainBed benchmark due to its rigorous evaluation protocol. We are glad that the reviewer finds our response satisfactory.
## Discussion with Reviewer rg4W
> The main concerns are mainly about analyzing the effectiveness of existing arts with similar training resources, comparing with the Model Soup aggregation scheme, and the importance of experts.
We have conducted experiments using the suggested settings by the reviewer. Experimental findings align with the results from the similar analyses in our manuscript. We are glad that the reviewer is satisfied with these analyses.
Meanwhile, although novelty is raised as an issue, we find the reviewer does not provide any similar ideas or papers supporting this claim. We provide a thorough comparison with existing ideas in our general response to detail the differences between our method and existing ideas; we kindly suggest that the ACs and SAC check our response regarding this issue.
****************************************
## General Response
We thank all reviewers for their invaluable time and thoughtful suggestions for improving our work. We are glad to find that all reviewers agree that the experiments are extensive and the results are impressive. It is also inspiring that the simplicity of our method is acknowledged by Reviewers w9Py and rg4W, and that the analyses are regarded as convincing and extensive by Reviewers w9Py, 8GCQ, and Qpvc.
We have collected common questions from reviewers and answer each of them as follows:
> Training and testing procedures with pseudocode.
We provide PyTorch-style pseudocode for the training and testing procedures below; it will be included in our revised manuscript.
```python
# M: total number of source domains
# K: total number of classes
# alpha: weight parameter
# lr and weight_decay: set according to the DomainBed benchmark
# ResNet, Adam, cross_entropy, softmax, and MSELoss follow their standard (PyTorch) definitions

# Initialization steps
network = [None] * (M + 1)  # M experts and 1 target model
params = []
for i in range(M + 1):
    network[i] = ResNet()
    params.append({"params": network[i].parameters()})
optimizer = Adam(params, lr=lr, weight_decay=weight_decay)

# Training steps
def train(minibatches):  # Each training batch in DomainBed contains samples from M domains
    loss = 0
    all_x = torch.cat([x for x, y in minibatches])  # All samples from the M domains
    all_y = torch.cat([y for x, y in minibatches])  # All labels from the M domains
    for i in range(M):  # Training the M experts
        xi, yi = minibatches[i][0], minibatches[i][1]  # Data pair from the i-th domain
        z_Ei = network[i](xi)  # The i-th expert only uses data from the corresponding domain
        # Note that the softmax is folded into cross_entropy in PyTorch, so the following line
        # should be cross_entropy(z_Ei, yi) in the actual implementation
        loss += cross_entropy(softmax(z_Ei), yi)  # Eq. (1) in the manuscript
        qE = softmax(z_Ei) if i == 0 else torch.cat((qE, softmax(z_Ei)), 0)  # Concat probabilities from experts
    z = network[-1](all_x)  # Logit from the target model: T(x) in Ln 146 in the manuscript
    loss += cross_entropy(softmax(z), all_y)  # Classification loss for the target model: L_cla in the manuscript
    loss += alpha * MSELoss(z, qE.detach())  # Logit regularization for the target model: L_guid in the manuscript
    # Updating the experts and the target model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Test step
def test(test_samples):
    return network[-1](test_samples)
```
> Computational Cost
Requiring extra training cost is inevitable in existing knowledge distillation-based arts, as the training process involves both the teacher and student networks. For our method, the training cost is roughly doubled compared to the baseline ERM, regardless of the number of domains. This is because each training sample consumes two forward passes: one for the target model and one for the corresponding expert. This has been stated in Ln 180 - 184. Notably, during the inference stage, our method consumes the same computational cost as ERM, since we only use the target model.
> Differences from existing frameworks
We illustrate the differences between our method and existing frameworks as follows:
- Traditional knowledge distillation (KD).
Although our logit regularization term falls within the scope of KD, it differs from existing schemes as it takes the soft label from the teachers as guidance in a regression manner, while existing schemes typically use the soft label for entropy computation. Through extensive experiments on different tasks and theoretical analysis, we validate the effectiveness of our new design and its superiority over existing KD schemes (Sec. 5.3).
- Label smoothing.
One beneficial effect of our logit regularization term is that it regularizes the probability of the target model within a smaller range. We detail in Sec. 5.2 that, compared to label smoothing, our method does not require heuristic settings for determining the probabilities, and it brings hard-sample mining, which leads to better performance.
- Mixture-of-Experts (MoE).
Aggregation strategies are the main difference between our method and existing MoE schemes. Rather than using a heuristic average or complex meta-learning aggregations, we use a simple and intuitive aggregation with our logit regularization (differences are detailed in Ln 126-148). We also conduct experiments to validate the effectiveness of our aggregation design, as shown in Sec. 5.4 and C.
- Further contributions.
We provide more insights for improving generalization based on the analysis in Sec. 3.2. First, we present a free lunch that combines the cross-entropy and MSE losses for model training (a minimal sketch of one possible form of this objective is given after this list). We show in Sec. 5.1 that this free lunch obtains competitive results without additional computational cost or extra hyper-parameter tuning compared to the baseline. Second, we find that hard samples from the experts reside mostly in the mixed region of different domains, which helps generalization more than hard samples from other models (as shown in Sec. B.4). This finding may inspire future research on mining hard samples to improve generalization.
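Below is a minimal sketch of one plausible form of this combined objective; it assumes the MSE term is taken between the logit vector and the one-hot label (following the square-loss formulation discussed in our responses), and the exact form used in the paper may differ.
```python
import torch.nn.functional as F

# A minimal sketch (assumed form, not necessarily the exact objective in the paper) of the
# "free lunch" mentioned above: the standard cross-entropy loss plus an MSE term between the
# logit vector z and the one-hot encoding of the label y, with no extra networks involved.
def free_lunch_loss(z, y, num_classes):
    y_onehot = F.one_hot(y, num_classes).float()
    return F.cross_entropy(z, y) + F.mse_loss(z, y_onehot)
```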
Detailed questions are answered below. All experiments in the rebuttal are conducted with the same experimental settings introduced in our paper, where the DomainBed benchmark is utilized. As the DomainNet dataset takes much more time to evaluate, we conduct experiments on the remaining four datasets (i.e., PACS, VLCS, OfficeHome, and TerraIncognita). These evaluations will be included in our revised manuscript.
*****************************************
# Response to Reviewer w9Py
We thank the reviewer for appreciating the simplicity and effectiveness of our method; your positive comment on our theoretical analysis is also inspiring.
> Q1: The backbone networks considered in the work are relatively small (ResNet). How would the method scale with large networks (for ex: a very large vision transformer model)? Can the authors comment on this?
A1: Regarding your concern about scaling our method to a larger ViT backbone, we conduct further experiments using a ViT-Base model, which has nearly 8 times more parameters than the ResNet-18 model used in the manuscript (86M vs. 11M). The comparison results are listed below (results from the ResNet model are in parentheses).
| | PACS | VLCS | Office | Terra | Avg |
| :-------- | :----: | :----: | :----: | :----: | :----: |
|ERM |86.2 $\pm$ 0.1 (79.8 $\pm$ 0.4)|78.2 $\pm$ 0.3 (75.8 $\pm$ 0.2)|74.7 $\pm$ 0.1 (60.6 $\pm$ 0.2)|44.1 $\pm$ 0.3 (38.8 $\pm$ 1.0)|70.8 (63.8)|
|CORAL|86.4 $\pm$ 0.3 (81.7 $\pm$ 0.0) |79.1 $\pm$ 0.6 (75.5 $\pm$ 0.4) |75.8 $\pm$ 0.2 (62.4 $\pm$ 0.4) |45.0 $\pm$ 0.7 (41.4 $\pm$ 1.8)|71.6 (65.3)|
|SD|87.0 $\pm$ 0.3 (81.9 $\pm$ 0.3) | 79.0 $\pm$ 0.3 (75.5 $\pm$ 0.4) |75.6 $\pm$ 0.1 (62.9 $\pm$ 0.2) | 45.6 $\pm$ 0.6 (42.0 $\pm$ 1.0)|71.8 (65.6)|
|Ours|87.6 $\pm$ 0.3 (82.4 $\pm$ 0.1)|79.4 $\pm$ 0.3 (76.2 $\pm$ 0.1) |75.8 $\pm$ 0.2 (63.2 $\pm$ 0.1)|47.4 $\pm$ 0.5 (46.3 $\pm$ 0.5)|72.6 (67.0)|
We note that our method consistently improves the baseline with both backbones, and it also obtains better results than current leading arts when using the larger network. These results indicate that our method remains effective when the backbone is scaled up.
****************
# Response to Reviewer 8GCQ
We thank the reviewer for the invaluable suggestions for improving our paper. Raised questions are answered below.
> W1: The notation is mathematically inaccurate or unconventional and occasionally inconsistent. For instance, the dependency of loss functions $\mathcal{L}_{\text{cla}}$ and $\mathcal{L}_{\text{guid}}$ to the source and target distribution and the parameters of each model is not specified which makes it confusing for the reader.
- A1_part1: Dependency of loss functions.
As the DG task has no access to the target data during training, all losses introduced in the manuscript, including $\mathcal{L}_i$ (which is the classification loss for training the $i$-th expert), $\mathcal{L}_{cla}$ and $\mathcal{L}_{guid}$ (which are the classification loss and logit regularization for training the target model) depend on the source distribution. Specifically, $\mathcal{L}_i$ will depend on data from the $i$-th source domain; $\mathcal{L}_{cla}$ and $\mathcal{L}_{guid}$ will depend on all source domains.
- A1_part2: Parameters of each model.
For the $i$-th expert $E_i$, its parameter can be obtained with $E_i = \arg\min\limits_{E_i} \mathcal{L}_i = \arg\min\limits_{E_i} \mathcal{H}(\text{softmax}(E_i(x^i)), y^i)$, s.t. $(x^i, y^i) \in \mathcal{D}_i$.
For the target model $T$, its parameter can be obtained with $T = \arg\min\limits_{T} \mathcal{L}_{cla} + \sum_i\mathcal{L}_{guid} ^i= \arg\min\limits_{T} \mathcal{H}(\text{softmax}(T(x)), y) + \sum_i \Vert T(x^i) - \text{softmax}(E_i(x^i))\Vert^2$, s.t. $(x, y) \in \mathcal{D}$ and $(x^i, y^i) \in \mathcal{D}_i$.
We will revise this part of the manuscript for better clarity.
> W2: The rescaling factors $\mathcal{F}$ and $\mathcal{F}'$, as discussed in Section 3.2, lack proper introduction. Moreover, in the same section, in the last paragraph $\mathcal{F}$ and $\mathcal{F}'$ are mistakenly denoted as F and F′.
- A2_part1: Proper introduction of the rescaling factors.
The rescaling factor is defined as the ratio of gradients from two loss functions: the gradient from the base loss function is the denominator, and the gradient from a modified loss function is the numerator. Intuitively, for a specific training sample, a large rescaling factor indicates that the modified loss magnifies the gradient compared to the base loss; this gradient magnification means the model update places more emphasis on that sample.
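For concreteness, a minimal sketch of how the ground-truth rescaling factor used below follows from this definition, assuming the base loss is $\mathcal{H}(\text{softmax}(\mathbf{z}), \mathbf{y})$ and the modified loss adds $\frac{\alpha}{2}\Vert \mathbf{z} - \mathbf{q}^E\Vert^2$, with gradients taken w.r.t. the ground-truth logit $z_{\ast}$:
$$\mathcal{F} = \frac{\partial_{z_{\ast}}\left[\mathcal{H}(\text{softmax}(\mathbf{z}), \mathbf{y}) + \tfrac{\alpha}{2}\Vert \mathbf{z} - \mathbf{q}^E\Vert^2\right]}{\partial_{z_{\ast}}\,\mathcal{H}(\text{softmax}(\mathbf{z}), \mathbf{y})} = \frac{(q_{\ast} - 1) + \alpha (z_{\ast} - q_{\ast}^E)}{q_{\ast} - 1} = 1 - \alpha \frac{z_{\ast} - q_{\ast}^E}{1 - q_{\ast}}.$$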
We will include the brief presentation in our manuscript as suggested.
- A2_part2: Mistakenly denoted $\mathcal{F}$ and $\mathcal{F}'$ as F and F′.
We thank the reviewer for pointing this out; we will correct it in our revised version.
> W3: The presentation of the training procedure is vague and can be improved by including a pseudocode for LMoE.
A3: We have provided the pseudocode in our general response and will add it into our revised manuscript.
> W4: It should be noted that this method is not a ``Mixture-of-Experts (MoE)" approach. The aggregation of the expert models is uniform and does not depend on the input data. I believe LMoE is not an appropriate name for this method.
A4: We thank the reviewer for the suggestion. As the overall method involves guiding a target model with different experts, we will revise the title as: "LFE: Learning from Experts for Domain Generalization".
> W5: It would enhance the presentation of the paper to cite the baseline methods directly in Table 1.
A5: We thank the reviewer for the suggestion. We want to clarify that our method is developed based on the common baseline ERM, which only uses the classification loss $\mathcal{L}_{cla}$ for the target model. Results from the baseline have been included in the 12th row in Table 1.
> Q1: It is not clear whether the output of the target model, $T(\cdot)$, is the one-hot vector, the predicted probability vector, or the score vector associated with different classes.
A1: As stated in Ln 146, $T(\cdot)$ outputs the logit vector $z \in \mathbb{R}^{K}$ of an input sample, and the corresponding predicted probability vector $q \in \mathbb{R}^{K}$ can be computed by applying softmax on $z~~ (\text{i.e.,}~~q = \text{softmax}(z))$.
> Q2_part1: Is there any specific rationale behind selecting $\Vert z - q^E\Vert$?
- A2_part1: Rationale for using $\Vert z - q^E\Vert$.
A common idea in KD is to use the output of the teacher as a soft label for the student. For classification, the cross-entropy loss $\mathcal{H}(q, y)$ is widely used in the literature. An intuitive revision of $\mathcal{H}(q, y)$ to achieve distillation is thus to replace the ground-truth label $y$ with $q^E$ in an entropy manner (i.e., $\mathcal{H}(q, q^E)$). Nevertheless, recent studies [a, b] suggest that the MSE loss $\Vert z - y \Vert$ performs as well as the cross-entropy loss in classification tasks. Correspondingly, a distillation scheme motivated by the MSE loss can replace the ground-truth label $y$ with $q^E$ in a regression manner, which leads to our logit regularization term: $\Vert z - q^E \Vert$. Accordingly, the introduced logit regularization term is still a reasonable KD framework.
To the best of our knowledge, this work represents a pioneering effort in exploring this new distillation scheme, which vouches for the novelty of our work. Extensive experiments and theoretical analyses are conducted to validate the effectiveness of this new scheme, which we are glad the reviewer also considers an advantage.
[a] Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks, in ICLR'21
[b] Understanding Square Loss in Training Overparametrized Neural Network Classifiers, in NeurIPS'22
> Q2_part2: I believe that $\mathcal{H}(q, q^E)$ or $\Vert z - z^E\Vert$ would be more intuitive despite the superior performance of LMoE.
- A2_part2: Why not use existing KD schemes.
According to our analysis in Sec. 3.2, these two terms have difficulties achieving the two beneficial effects from our logit regularization term.
First, our logit regularization ensures that more information is used for the target model by implicitly regularizing the probabilities within a much smaller range. This is hard to achieve by either $\mathcal{H}(q, q^E)$, where the probability $q$ lies in a similar range as in ERM because the label $y$ and the soft label $q^E$ are all in $[0, 1]$, or $\Vert z - z^E\Vert$, which does not provide a range-regularization effect for the probability.
Second, our logit regularization enables the target model to mine hard samples from the experts by magnifying gradients from samples for which the expert has low confidence. This effect is also difficult to achieve with the two distillation terms. Specifically, for $\mathcal{H}(q, q^E)$, the corresponding rescaling factors can be represented as $\mathcal{F}'_{\mathcal{H}(q, q^E)} = \mathcal{F}_{\mathcal{H}(q, q^E)} = 1 - \alpha \frac{q_{\ast} - q_{\ast}^E}{1 - q_{\ast}}$. As $q_{\ast} - q_{\ast}^E \ll z_{\ast} - q_{\ast}^E$, this leads to much smaller rescaling factors than our design, resulting in a compromised hard-sample mining effect.
For $\Vert z - z^E\Vert$, a counterexample can be obtained by analyzing the corresponding ground-truth rescaling factor: $\mathcal{F}_{\Vert z - z^E\Vert} = 1 - \alpha \frac{z_{\ast} - z_{\ast}^E}{1 - q_{\ast}}$. This term cannot ensure that hard samples from the experts receive larger rescaling values, since the identification of hard samples is unrelated to $z_{\ast}^E$: even if $z_{\ast}^E$ is small, the sample may still be an easy one for the expert because the other $z_c^E$ can be even smaller.
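To make the comparison concrete, below is a minimal PyTorch-style sketch (illustrative only, not our official implementation) of the three terms discussed above; `z` and `z_E` denote batched logits from the target model and an expert, and the variable names and mean reduction are our assumptions.
```python
import torch
import torch.nn.functional as F

# A minimal sketch contrasting the three distillation terms discussed above.
def kd_terms(z, z_E, alpha=1.0):
    q_E = F.softmax(z_E, dim=1).detach()                                  # q^E from the expert
    ce_kd = torch.mean(torch.sum(-q_E * F.log_softmax(z, dim=1), dim=1))  # H(q, q^E)
    logit_mse = F.mse_loss(z, z_E.detach())                               # ||z - z^E||
    ours = alpha * F.mse_loss(z, q_E)                                     # ||z - q^E|| (our logit regularization)
    return ce_kd, logit_mse, ours
```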
We thank the reviewer for the effort in improving our paper. The above analyses will be included in our revised manuscript.
> Novelty
Please refer to our clarification regarding comparisons with existing ideas in the general response.
****************
Dear Reviewer 8GCQ,
Thanks again for your insightful suggestions and comments. As the deadline for discussion approaches, we want to ensure that we address any remaining uncertainties or questions you may have.
We have thoroughly studied your comments and have made efforts to provide additional clarifications in our previous responses, particularly concerning the representation of our methodology.
We sincerely hope that you find our explanations and further details satisfactory and that they contribute to a clearer understanding of our training objective. Please do not hesitate to contact us if there are other clarifications or experiments we can offer.
Once again, we appreciate your time and attention to our work.
Best regards,
The Authors
****************
We sincerely thank the reviewer for the in-depth feedback.
According to your definition of scale, we note that when $c\neq 1$, $\mathrm{Softmax}(\mathbf{z}\_r) = \frac{\exp({\mathbf{z}\_r})}{\sum\_{k=1}^{K} \exp(\mathbf{z}\_k)} \neq \frac{\exp(c {\mathbf{z}\_r})}{\sum\_{k=1}^{K} \exp(c \mathbf{z}\_k)} = \mathrm{Softmax}(c \mathbf{z}\_r)$ (i.e., $\mathrm{Softmax}(\mathbf{z}) \neq \mathrm{Softmax}(c \mathbf{z})$). In other words, the $\mathrm{Softmax}(\cdot)$ function is not scale invariant. We thus only consider the case where the translation is applied: $\hat{\mathbf{z}} = \mathbf{z} + c \mathbf{1}$, s.t. $c \in \mathbb{R}$ and $\mathbf{z} \in \mathbb{R}^d$, where $\mathbf{1} \in \mathbb{R}^d$ is the all-one vector.
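For completeness, the translation invariance of the softmax function (which we rely on again in our later response regarding the learnable scalar $c$) follows in one step:
$$\mathrm{Softmax}(\mathbf{z} + c \mathbf{1})\_r = \frac{\exp(\mathbf{z}\_r + c)}{\sum\_{k=1}^{K} \exp(\mathbf{z}\_k + c)} = \frac{e^{c}\exp(\mathbf{z}\_r)}{e^{c}\sum\_{k=1}^{K} \exp(\mathbf{z}\_k)} = \mathrm{Softmax}(\mathbf{z})\_r.$$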
We respond to further comments below,
> C1: Adding bias to the output of the models will change the regularization term but does not change the classification error. This is counterintuitive as the regularization term does not satisfy scale and translational invariance. Although $||\mathbf{z} - \mathbf{z}^E||$ does not satisfy translation invariance, the hyperparameter $\alpha$ can capture the proper scale.
A1: We appreciate the reviewer for offering a new perspective for analyzing our method. First, we want to highlight that, akin to $||\mathbf{z} - \mathbf{q}^E||$, using $||\hat{\mathbf{z}} - \mathbf{q}^E||$ will not undermine the primary discrimination ability, because the translation preserves the relative ordering of the elements. Similar to $\mathbf{z}\_{\ast}$, $\hat{\mathbf{z}}\_{\ast}$ will be enforced to approximate the largest value in $\mathbf{q}^E$ (i.e., $\mathbf{q}^E\_{\ast}$, which approximates the ground-truth $\mathbf{y}\_{\ast}$, assuming the expert performs well on the corresponding domain). This indicates that both regularization terms encourage the model to make good predictions (as briefly mentioned in Ln 203 - 205). Thus, adding a bias to the output will not diminish the model performance, even though the value of the regularization term may vary with the translation.
Second, as with $||\mathbf{z} - \mathbf{z}^E||$, the hyperparameter $\alpha$ (note that for the experiments with both $||\mathbf{z} - \mathbf{z}^E||$ and $||\mathbf{z} - \mathbf{q}^E||$, $\alpha$ is randomly selected from a large range $[0.01, 10]$, as noted in Ln 244) can be used with $|| \mathbf{z} - \mathbf{q}^E||$ to capture the proper translation. As such, $||\mathbf{z} - \mathbf{q}^E||$ is as intuitive as $||\mathbf{z} - \mathbf{z}^E||$.
Considering the current lack of clear evidence supporting the necessity of adhering to the mentioned translation invariance for designing regularization terms, we respectfully hold a different perspective on this matter.
> C2: This lack of invariance might affect your reasoning regarding rescaling factors. For instance, assuming that we add a bias term to the models, the rescaling factor changes but in the other two KD approaches rescaling factors remain unchanged.
A2: We would like to highlight that the rescaling factor is a tool used to analyze whether hard samples from the experts (i.e., samples with low confidence from the expert) are given greater emphasis in the target model (i.e., larger rescaling values). We want to point out that the translation of $\mathbf{z}$ does not affect our analysis in Sec. 3.2, where we show that our logit regularization term helps the target model mine hard samples from the experts.
Denoting $\mathcal{F}$ and $\mathcal{F}'$ as the ground-truth and non ground-truth rescaling factors respectively, for our logit regularization term $||\mathbf{z} - \mathbf{q}^E||$ we have $\mathcal{F}\_{||\mathbf{z} - \mathbf{q}^E||} = 1 - \alpha \frac{z\_{\ast} - q\_{\ast}^E}{1 - q\_{\ast}}$ and $\mathcal{F}'\_{||\mathbf{z} - \mathbf{q}^E||} = 1 - \alpha \frac{1 - \sum\_{c \neq \ast} z\_{c} - q\_{\ast}^E}{1 - q\_{\ast}}$. When adding a constant scalar $c$ to $\mathbf{z}$ for all samples (i.e., replacing $\mathbf{z}$ with $\hat{\mathbf{z}}$), we observe the same phenomenon as in the manuscript: both $\mathcal{F}$ and $\mathcal{F}'$ are strictly monotonic with respect to $q\_{\ast}^E$, and since they are mostly negative, their magnitudes are larger when $q\_{\ast}^E$ is smaller (which corresponds to hard samples for the expert), so the target model can still emphasize these hard samples.
Please also note that implementations of the linear layer already take a bias term into account, indicating $\mathbf{z} \triangleq \mathbf{z} + c~\textbf{1}$ when $c$ is a learnable scalar. In conclusion, our analysis in Sec. 3.2 holds despite the translation of $\mathbf{z}$ with arbitrary $c$.
As for $\mathcal{H}(\mathbf{q}, \mathbf{q}^E)$, we show that its hard-sample mining effect is compromised compared to $||\mathbf{z} - \mathbf{q}^E||$, and for $||\mathbf{z} - \mathbf{z}^E||$, we show its ineffectiveness in mining hard samples. Details are provided in A3 for C3.
> C3: I don't quite understand your reasoning for why $||z - q^E||$ is better than $\mathcal{H}(q, q^E)$ and $||z - z^E||$ in terms of the rescaling factor. I would appreciate it if the authors could explain it more clearly.
A3: We can revisit the rescaling factors of these two KD schemes as follows,
For $\mathcal{H}(\mathbf{q}, \mathbf{q}^E)$, $\mathcal{F}'\_{\mathcal{H}(q, q^E)} = \mathcal{F}\_{\mathcal{H}(q, q^E)} = 1 - \alpha \frac{q\_{\ast} - q\_{\ast}^E}{1 - q\_{\ast}}$. Without loss of generality, we have $\mathbf{q}\_{\ast} \ll \mathbf{z}\_{\ast}$ and $q\_{\ast} - q\_{\ast}^E \ll z\_{\ast} - q\_{\ast}^E$, which leads to $|\mathcal{F}\_{\mathcal{H}(q, q^E)}| \ll |\mathcal{F}\_{||\mathbf{z} - \mathbf{q}^E||}|$ and $|\mathcal{F}'\_{\mathcal{H}(q, q^E)}| \ll |\mathcal{F}'\_{||\mathbf{z} - \mathbf{q}^E||}|$, given that the rescaling factor is mostly negative and $\mathcal{F}\_{||\mathbf{z} - \mathbf{q}^E||} \approx \mathcal{F}'\_{||\mathbf{z} - \mathbf{q}^E||}$ as observed in Figure 2 (e). This means that even if the expert has low confidence for the sample (i.e., $\mathbf{q}\_{\ast}^E$ is small), the emphasis from using $\mathcal{H}(\mathbf{q}, \mathbf{q}^E)$ will always be smaller than that from using $||\mathbf{z} - \mathbf{q}^E ||$, resulting in a compromised hard-sample mining effect.
For $||\mathbf{z} - \mathbf{z}^E||$, $\mathcal{F}\_{||\mathbf{z} - \mathbf{z}^E||} = 1 - \alpha \frac{z\_{\ast} - z\_{\ast}^E}{1 - q\_{\ast}}$ and $\mathcal{F}'\_{||\mathbf{z} - \mathbf{z}^E||} = 1 - \alpha \frac{\sum\_{c \neq \ast} z\_{c}^E - \sum\_{c \neq \ast} z\_{c}}{1 - q\_{\ast}} \approx 1 - \alpha \frac{- \sum\_{c \neq \ast} z\_{c} - z\_{\ast}^E}{1 - q_{\ast}}$ (assuming $\sum\_c z_c^E = 0$, similar to [c]). We can observe that both terms are unrelated to the hardness of samples, because unlike $\mathbf{q}_{\ast}^E$, $\mathbf{z}\_{\ast}^E$ cannot be used to determine the confidence of the expert: for a hard sample, the corresponding $\mathbf{z}\_{\ast}^E$ can be smaller or larger than that of an easy one. We thus conclude that $||\mathbf{z} - \mathbf{z}^E||$ cannot help mine hard samples from the experts.
> C4: I believe in [a] they suggest that MSE loss $||\mathbf{q} - \mathbf{y}||$ (not $||\mathbf{z} - \mathbf{y}||$) performs as well as cross-entropy loss $\mathcal{H}(\mathbf{y}, \mathbf{q})$. This can be translated to $||\mathbf{q} - \mathbf{q}^E||$ rather than $||\mathbf{z} - \mathbf{q}^E||$.
A4: Please note that their implementation of the MSE loss is actually $||\mathbf{z} - \mathbf{y}||$. The detail can be found in the first paragraph of Sec. 5 in [a]; we quote the exact sentences here: "$\textbf{No softmax}$. The widely accepted pipeline for modern neural classification tasks trained with the crossentropy loss contains the last softmax layer before calculating the loss. When training with the square loss that layer needs to be removed as it appears to impede optimization."
Thus, the KD counterpart translated from [a] should be our logit regularization term $||\mathbf{z} - \mathbf{q}^E||$, rather than $||\mathbf{q} - \mathbf{q}^E||$.
> C5: It cannot be guaranteed to work for all models as the regularization term does not satisfy translational and scale invariance.
A5: First, we want to clarify that we have conducted experiments with both CNN and ViT (please see our response for Reviewer w9Py), and the results can validate the effectiveness of our design.
Second, as explained in A1 for C1, we believe the mentioned translation invariance may be irrelevant to model prediction, because a translation-variant regularization term can also encourage the model to make a good prediction as long as the largest value corresponds to the ground-truth label. This is further validated by the objective $||\mathbf{z} - \mathbf{y}||$, which is translation variant according to your definition but has been shown to work well in various vision and natural language tasks [a].
[c] Distilling the Knowledge in a Neural Network, Hinton et al.
****************
We sincerely appreciate the prompt feedback.
First and foremost, we want to emphasize that with the suggested training scheme (including a learnable scalar for the logits while discarding it in prediction), the overall objective can be represented as $\mathcal{H}(\text{softmax}(\mathbf{z}), \mathbf{y}) + \frac{\alpha}{2} || \mathbf{z} + c \mathbf{1} - \mathbf{q}^E||^2$. Denoting $\hat{\mathbf{z}} = \mathbf{z} + c \mathbf{1}$, the above objective is the same as $\mathcal{H}(\text{softmax}(\hat{\mathbf{z}}), \mathbf{y}) + \frac{\alpha}{2} || \hat{\mathbf{z}} - \mathbf{q}^E||^2$, due to the translation invariance of the softmax function. We note that the latter objective is exactly our training objective (Eq. (3) in our manuscript), because, given that both $\mathbf{z}$ and $c$ are learnable, learning $\mathbf{z} + c \mathbf{1}$ is equivalent to learning $\hat{\mathbf{z}}$. Thus, the inclusion of $c$ should not affect our analysis.
Nevertheless, your concerns may also be eased by the following explanation:
>concern 2: I believe the inclusion of a scalar $c$ affects the optimization dynamic as you cannot control $c$.
We can simply analyze the gradients of $||\mathbf{z} - \mathbf{q}^E||^2$ and $||\mathbf{z} + c \mathbf{1} - \mathbf{q}^E||^2$ w.r.t. $\mathbf{z}$ to examine whether the inclusion of a scalar $c$ prevents $\mathbf{z}\_{\ast}$ from corresponding to $\mathbf{q}^E\_{\ast}$.
For the base term, we have $\frac{\partial ||\mathbf{z} - \mathbf{q}^E||^2}{\partial \mathbf{z}} = 2(\mathbf{z} - \mathbf{q}^E)$: we obtain $\mathbf{z} = \mathbf{q}^E$ when minimizing $||\mathbf{z} - \mathbf{q}^E||^2$. In this situation, the basis of discrimination is not compromised, as $\mathbf{z}\_{\ast}$ corresponds to the largest value $\mathbf{q}^E\_{\ast}$ (and likewise for $\mathbf{z}\_{k}$ and $\mathbf{q}^E\_{k}$, $\forall k \neq \ast$).
For the other term, we have $\frac{\partial ||\mathbf{z} + c \mathbf{1} - \mathbf{q}^E||^2}{\partial \mathbf{z}} = 2(\mathbf{z} + c \mathbf{1} - \mathbf{q}^E)$: we obtain $\mathbf{z} = \mathbf{q}^E - c \mathbf{1}$ when minimizing this term. Given that $c$ is a scalar, the largest value in $\mathbf{z}$ still corresponds to the largest value in $\mathbf{q}^E$ ($\mathbf{z}\_{\ast}$ to $\mathbf{q}^E\_{\ast}$). Thus, although the inclusion of $c$ can affect the learning process, it does not undermine the primary discrimination ability, since in both terms $\mathbf{z}\_{\ast}$ is encouraged to approximate the largest value $\mathbf{q}^E\_{\ast}$ (and likewise for $\mathbf{z}\_{k}$, $\forall k \neq \ast$).
> concern 3: I believe adding a bias term $c$ to the model will affect the rescaling factor. Specifically, we have $\mathcal{F} = 1 - \alpha \frac{z_{\ast} + c - q\_{\ast}^E}{1 - q_{\ast}}$ which is monotone with respect to $q\_{*}^E$ however $|\mathcal{F}|$ is not monotone anymore.
A3: We can observe that although including $c$ changes the value, $\mathcal{F}$ is still monotonically increasing w.r.t. $q\_{\ast}^E$. As training $\mathbf{z} + c \mathbf{1}$ is equivalent to training $\mathbf{z}$, we will still observe the same phenomenon as in Figure 2 (e), where most values of $\mathcal{F}$ are smaller than $0$; in this regime $|\mathcal{F}|$ remains monotone, and our analysis holds with the inclusion of $c$.
Meanwhile, we would like to underscore that while the suggested translation is interesting, it is not currently part of our design, and we cautiously assert that it is irrelevant to our task, as there is no clear evidence that such a translation is beneficial for generalization. As such, with full respect, we suggest that the reviewer judge our work based on its relevant content.
> concern 5: $q_{\ast} \ll z\_{\ast}$ is incorrect. $z^*$ can take on any negative or positive real number if there are no normalization or restriction conditions applied to it.
A5: Please note that we also have the classification loss $\mathcal{H}(\text{softmax}(\mathbf{z}), \mathbf{y})$ in the training objective, which restricts $\mathbf{z}$ by encouraging $\mathbf{z}\_{\ast} \gg \mathbf{z}\_{k}, \forall k \neq \ast$. Given that it is natural to assume $\sum\_k \mathbf{z}\_{k} = 0$ (similar to [c]), we obtain $z\_{\ast} \gg 0$ (suppose $z\_{\ast} \approx N$ in this case, where $N$ is a large positive number) and $\mathbf{z}\_{k} \ll 0, \forall k \neq \ast$, which is an implicit restriction on $\mathbf{z}$. While also being regularized by the logit regularization term, $z\_{\ast}$ will be balanced within $[q\_{\ast}, N]$; we thus say that, without loss of generality, we have $\mathbf{q}\_{\ast} \ll \mathbf{z}\_{\ast}$, which supports our analysis in A3.
> concern 6: While I still find the choice of regularization unintuitive, I believe normalizing or restricting z values could address this issue.
A6: $\mathbf{z}$ has already been restricted by the main classification loss; please see our response in A5 for concern 5.
****************
# Response to Reviewer Qpvc
We thank the reviewer for the positive comments on our experiments and theoretical analysis. Raised questions are answered below.
> W1: LMoE requires training experts on source domains and a target model simultaneously. Its efficiency, particularly in terms of time costs compared with SOTA algorithms, needs further analysis and demonstration.
- A1_part1: Computational cost
Please refer to our clarification in the general response.
- A1_part2: Training time comparison
As per your suggestion, we compare the average training time (TT, in minutes) of one trial on the PACS dataset for different methods and list the results below. Note that in DomainBed, some methods may use fewer update steps for their main networks, smaller training batches, or fewer backward samples than the ERM method, and thus require less training time than ERM. Inherited from the KD framework, our method requires more training time; this is acceptable as our method does not require extra inference cost compared to the baseline. We will include the training time comparisons in our revised manuscript.
| Method | TT | Method | TT | Method | TT |
| :----- | :-: | :----- | :-: | :----- | :-: |
| DANN| 17 | GroupDRO | 24 | MixStyle | 25 |
| VREx| 17 | CORAL | 24 | SD | 25 |
| Fishr | 17 | CDANN | 24 |CondCAD | 26 |
| IRM | 18 | SagNet | 24 | MIRO | 31 |
| Mixup | 18 | CAD | 24 | MLDG | 32 |
| MTL| 18 | ARM | 24 | Ours | 38 |
| MMD| 18 | RSC | 25 | Fish | 52 |
| ERM (Baseline)| 24 | SelfReg | 25 | ITTA| 62|
> L1: LMoE trains experts on source domains, but it appears that all experts share the same weight. However, since source domains may vary in importance, balancing different losses requires further analysis and discussion.
A1: Please note that the different experts do not share weights in our implementation, so loss balancing is unnecessary. Please also refer to the pseudocode of our training process in the general response.
> L2: As shown in Table 1, LMoE achieves only a slight advantage over other methods in classification tasks.
A2: We want to clarify that, under the rigorous evaluation protocol in DomainBed, improving the baseline ERM by an average of 2.7pp (58.1 vs. 60.8) is not a slight advantage. The DomainBed benchmark conducts $3 \times 20$ runs for each method (3 trials, each with 20 sets of randomly selected hyper-parameters), and the performance in a target domain is the average of the 3 trials, each with its best-performing hyper-parameter setting. In other words, all compared methods are equipped with the most suitable hyper-parameter settings (i.e., lr, batch size, dropout, etc.) for evaluation. We note that the previous SOTA methods (i.e., ITTA, SD, CORAL) can barely improve the baseline by 2pp, and almost half of existing methods are outperformed by the baseline ERM. Besides, our method improves ERM by nearly 8pp on the challenging TerraIncognita dataset. These results demonstrate the favorable improvements from our method.
****************
Dear Reviewer Qpvc,
We hope this finds you well.
Thanks again for your insightful comments. As the deadline for discussion approaches, we want to ensure that we address any remaining uncertainties or questions you may have.
We have thoroughly studied your comments and have made efforts to provide additional clarifications and experiments in our previous responses, particularly concerning the training time comparisons with other arts.
We sincerely hope that you find our additional clarifications satisfactory and that they contribute to a clearer understanding of our framework. Please do not hesitate to contact us if there are other clarifications or experiments we can offer.
Once again, we appreciate your time and attention to our work.
Best regards,
The Authors
****************
# Response to Reviewer rg4W
We appreciate the constructive feedback from the reviewer; the mentioned analyses will be included in our revised manuscript.
> W1: Their method needs to train K+1 experts during training, which can require a lot of computation. I think if the existing methods are allowed to use such a large computation, their performance can improve and the effectiveness of this method will be reduced.
- A1_part1: Computational usage.
Please refer to the computational cost in our general response.
- A1_part2: Existing methods with large computation resources.
Please note that the existing Meta-DMoE method also uses a similar training pipeline that assigns data from different domains to the corresponding experts (detailed in Ln 135 - 147). Although it leverages more computational resources than our method (because it involves training an extra transformer via meta-learning), its performance is inferior to ours (detailed in Sec. C and Table 7).
Following your suggestion, we also conduct experiments implementing some leading arts (i.e., CORAL, SD) and the baseline ERM with K+1 times the model size. Specifically, the feature extractor of the enlarged method contains K+1 branches, each with the same pretrained ResNet backbone; we concatenate the final outputs from the different branches and feed them to a classifier to obtain the final result (a minimal sketch of this multi-branch setting is given after the table below). Note that in this setting, a sample goes through K+1 forward passes for both training and inference, which is more than in our LMoE design. Results are shown below. We note that, compared to the results from the original models (in brackets), naively expanding the model size with the same pretrained knowledge does not improve performance. The reason may be that a well-pretrained small backbone can already saturate on the limited training data (as shown in Table 5, ERM achieves more than 0.96 accuracy on the source domains), so larger backbones are unnecessary on these datasets.
| | PACS | VLCS | Office | Terra | Avg |
| :-------- | :----: | :----: | :----: | :----: | :----: |
|ERM |79.8 $\pm$ 0.4 (79.8 $\pm$ 0.4)|75.7 $\pm$ 0.2 (75.8 $\pm$ 0.2)|60.8 $\pm$ 0.2 (60.6 $\pm$ 0.2)|39.5 $\pm$ 1.3 (38.8 $\pm$ 1.0)|64.0 (63.8)|
|CORAL|81.9 $\pm$ 0.1 (81.7 $\pm$ 0.0) |75.7 $\pm$ 0.4 (75.5 $\pm$ 0.4) |62.7 $\pm$ 0.2 (62.4 $\pm$ 0.4) |41.8 $\pm$ 0.4 (41.4 $\pm$ 1.8)|65.5 (65.3)|
|SD|82.2 $\pm$ 0.4 (81.9 $\pm$ 0.3) |76.0 $\pm$ 0.2 (75.5 $\pm$ 0.4) |62.6 $\pm$ 0.1 (62.9 $\pm$ 0.2) |41.7 $\pm$ 1.1 (42.0 $\pm$ 1.0)|65.6 (65.6)|
|Ours|82.4 $\pm$ 0.1| 76.2 $\pm$ 0.1|63.2 $\pm$ 0.1 |46.3 $\pm$ 0.5|67.0|
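For reference, a minimal sketch of the enlarged multi-branch baseline described above (an assumed structure for illustration; `make_pretrained_backbone` is a hypothetical helper returning a pretrained feature extractor):
```python
import torch
import torch.nn as nn

# A minimal sketch (assumed structure, not the exact rebuttal code) of the enlarged baseline:
# K+1 pretrained branches whose features are concatenated and fed to a single classifier, so
# each sample takes K+1 forward passes in both training and inference.
class MultiBranchBaseline(nn.Module):
    def __init__(self, num_branches, feat_dim, num_classes):
        super().__init__()
        self.branches = nn.ModuleList([make_pretrained_backbone() for _ in range(num_branches)])
        self.classifier = nn.Linear(num_branches * feat_dim, num_classes)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]  # K+1 forward passes per sample
        return self.classifier(torch.cat(feats, dim=1))
```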
> W2_part1: It is reasonable to compare it with Model Soup (MS), which can obtain a single model by aggregating the weights of training multiple different models.
A2_part1: We conduct experiments by training $K$ models on different domains and aggregating them to form a target model via the suggested weight-combination scheme (a minimal sketch of this averaging is given after the table below). Average aggregation is adopted, as the prior distribution of the target data is unavailable. We use two settings to evaluate the effectiveness of the MS strategy: one with a shared classifier for all models (i.e., MS_S) and another with different classifiers for different models (i.e., MS_D). Results are listed below. We note that both Model Soup strategies fail to improve the baseline, which is similar to the observation from averaging the outputs of different experts (Sec. 5.4). This is because the average aggregation is unrealistic in practice: the unknown target distribution may not be a simple average of the sources. Given that the combination weights are difficult to determine, we turn to the proposed simple LMoE.
| | PACS | VLCS | Office | Terra | Avg |
| :-------- | :----: | :----: | :----: | :----: | :----: |
|ERM |79.8 $\pm$ 0.4|75.8 $\pm$ 0.2|60.6 $\pm$ 0.2|38.8 $\pm$ 1.0|63.8|
|MS_S|76.5 $\pm$ 0.6 |73.4 $\pm$ 0.8|56.1 $\pm$ 0.6 |31.9 $\pm$ 1.2|59.5|
|MS_D|72.7 $\pm$ 0.8| 71.8 $\pm$ 0.7|47.1 $\pm$ 1.0 |27.3 $\pm$ 1.6|54.7|
|Ours|82.4 $\pm$ 0.1| 76.2 $\pm$ 0.1|63.2 $\pm$ 0.1 |46.3 $\pm$ 0.5|67.0|
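For reference, a minimal sketch of the uniform weight averaging assumed for the MS baselines above (illustrative only; the handling of the classifier differs between MS_S and MS_D):
```python
import copy

# A minimal sketch (assumed) of the uniform weight averaging behind the Model Soup baselines:
# the parameters of the K domain-wise models are averaged into a single target model.
def average_state_dicts(state_dicts):
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
    return avg

# Usage (hypothetical): target.load_state_dict(average_state_dicts([m.state_dict() for m in models]))
```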
> W2_part2: It is not very clear if we really need an expert for each domain. An alternative way is independently training K models with all domains and applying distillation. Such analysis is crucial to argue the novelty of the approach.
A2_part2: We would like to point out that if we use all domains to train the K models, they will converge to the same final model because they share the same loss function and the same input. To examine the mentioned idea, we train a single model as the teacher with inputs from all source domains and conduct the same distillation for a target model (i.e., Single_distil). Results are listed below. We note that this strategy can also improve the baseline, because it benefits from the two effects revealed in Sec. 3.2 (i.e., using more information and mining hard samples from the teacher). This vouches for the effectiveness of our logit-regularization distillation scheme.
However, this strategy generally performs worse than LMoE. The reason, as we analyze in Sec. B.4, is that hard samples from the experts contain more ambiguous data located in the mixed region of two domains than those from a model trained with all source domains. According to our analysis, such ambiguous data are more domain-agnostic and can better help generalization. This explains why experts are necessary in our framework. Please also refer to Sec. B.3 and B.4 for more visual and experimental evidence.
| | PACS | VLCS | Office | Terra | Avg |
| :-------- | :----: | :----: | :----: | :----: | :----: |
|ERM | 79.8 $\pm$ 0.4|75.8 $\pm$ 0.2|60.6 $\pm$ 0.2|38.8 $\pm$ 1.0|63.8|
|Single_distil|81.8 $\pm$ 0.3 | 75.9 $\pm$ 0.6 |62.6 $\pm$ 0.2|44.2 $\pm$ 0.7|66.1|
|Ours|82.4 $\pm$ 0.1|76.2 $\pm$ 0.1 |63.2 $\pm$ 0.1| 46.3 $\pm$ 0.5|67.0|
> W3: Novelty.
A3: Please refer to the clarification regarding comparisons with existing ideas in our general response.
> Q1: I could not understand how $q^{E}$ is actually computed.
A1: As stated in Line 155, $q^{E}$ represents the probabilities from the experts for all source samples, obtained by concatenating the output probabilities from the corresponding experts (i.e., $q^{E_i}$) along the batch dimension. Please also see the pseudocode in our general response for details.
****************
Dear Reviewer rg4W,
Thanks again for your insightful feedback. As the deadline for discussion approaches, we want to ensure that we address any remaining uncertainties or questions you may have.
We have thoroughly studied your comments and have made efforts to provide additional experiments and analyses in our previous responses. Regarding your main concerns (i.e., the effectiveness of existing arts with similar training resources, the comparison with the Model Soup aggregation scheme, and the importance of experts), we have conducted experiments using the exact settings suggested. Our findings align with the results from similar analyses in our manuscript, which further validates the effectiveness of our design.
We sincerely hope that you find our additional clarifications and experiments satisfactory. Please do not hesitate to contact us if there are other clarifications or experiments we can offer.
Once again, we appreciate your time and attention to our work.
Best regards,
The Authors
****************