# **Reviewer jc8P**
**Rebuttal:**
Dear reviewer, we would like to thank you for your time, effort, and thorough review! We are encouraged that the review highlights the following strengths of our work:
(1) Addresses an interesting and important issue for the machine translation community by providing in-depth study and extensive analyses;
(2) The experimental setting is grounded and the results confirm the claims being made;
(3) The main findings of the paper are summarized and possible future directions are identified;
(4) The novel dataset is an important contribution.
We have responded to your comments below and updated our revision accordingly. If you have additional questions or suggestions, we would be happy to answer them and incorporate them into the paper.
---
> **Reasons To Reject 1, 2, and 4.** Some of the key concepts used in the paper are not properly introduced. Some relevant results are reported in the Appendix. Can this part be moved in the Appendix?
Thank you for offering this guidance. We have improved the presentation of these concepts accordingly in our revision. We really appreciate your feedback, and we believe the paper is now stronger than the previous version. If you have further questions or guidance, we will gladly address them and incorporate them into our updated version.
---
> **Reasons To Reject 3.** The vocabulary overlap part is an important contribution to the paper, but its setting is not clear. In Section 5.2.2 lines 509-513, it is not clear what correlation is computed and it seems to refer only to the source language. This part should be clarified and improved.
We thank you for the feedback. As outlined in line 495, we measure vocabulary overlap in the same way throughout the paper, i.e., as the extent of shared subwords between the source and target languages.
When studying whether a higher overlap correlates with improved ZS performance (lines 503-509), we fix the target language. For example, if the target language is German, the source side covers the other 39 languages; we then compute the correlation between these 39 overlap scores and the corresponding 39 ZS performance scores.
When incorporating the vocabulary overlap feature together with the English-centric translation capability (En_performance) in the regression experiment (Table 6), there is no need to fix the target language.
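For concreteness, below is a minimal, runnable sketch of this correlation computation. All data are illustrative placeholders, and the Jaccard-style overlap is one plausible reading of "extent of shared subwords", not necessarily the paper's exact formula.
```python
from scipy.stats import pearsonr, spearmanr

def vocab_overlap(src_vocab: set, tgt_vocab: set) -> float:
    # Assumed Jaccard-style measure of shared subwords between two languages;
    # the paper's exact normalization may differ.
    return len(src_vocab & tgt_vocab) / len(src_vocab | tgt_vocab)

# Placeholder subword vocabularies and ZS scores (illustrative only).
subwords = {
    "de": {"_ein", "ung", "sch", "_und"},
    "nl": {"_een", "ung", "sch", "_en"},
    "sv": {"_en", "ing", "sk", "_och"},
    "ru": {"_один", "ция", "_и"},
}
zs_performance = {("nl", "de"): 24.1, ("sv", "de"): 18.6, ("ru", "de"): 11.3}

# Fix the target language (German) and vary the source language;
# in the paper this runs over the other 39 source languages.
sources = ["nl", "sv", "ru"]
overlaps = [vocab_overlap(subwords[src], subwords["de"]) for src in sources]
zs_scores = [zs_performance[(src, "de")] for src in sources]

r, _ = pearsonr(overlaps, zs_scores)      # linear association
rho, _ = spearmanr(overlaps, zs_scores)   # rank association
```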
---
> **Reasons To Reject 5.** The MT evaluation are all string-matching metrics. If the use of COMET could make the findings more robust.
Thank you for this advice. We have conducted COMET evaluations and re-run all analyses on COMET's supported languages (35/41, including English). **In sum, the COMET-based findings for all sections are consistent with our previous analyses, which we believe makes our paper more solid.** Below we provide the main evidence, which we have already added to the revision.
#### Table 10: Resource-level analysis based on COMET scores. This result is consistent with Table 3 on page 5. Our main conclusion in this part remains the same: the resource level of the target language has a stronger effect on ZS translation quality.
| | | Target | | | | | |
|------ |------|:----------:|-------|-------|-------|-------|-------|
| | | En | High | Med | Low | e-Low | Avg |
| Source| En | - | 0.84 | 0.83 | 0.79 | 0.70 | 0.79 |
| | High | 0.84 | 0.68 | 0.70 | 0.61 | 0.51 | 0.63 |
| | Med | 0.81 | 0.66 | 0.67 | 0.60 | 0.50 | 0.61 |
| | Low | 0.77 | 0.62 | 0.64 | 0.54 | 0.46 | 0.56 |
| |e-Low | 0.72 | 0.58 | 0.60 | 0.53 | 0.44 | 0.54 |
| | Avg | 0.78 | 0.64 | 0.65 | 0.57 | 0.48 | 0.59 |
#### Table 11: Analysis of zero-shot performance considering data size and English-centric performance, based on COMET scores. The result is consistent with Table 4 on page 6. Our main conclusion in this part remains the same: out-of-English translation quality correlates with zero-shot performance the most. We also observe that the data-size features explain the zero-shot variations even less well (low R-square, high MAE and RMSE scores).
| | | Src_size | Tgt_size | Src->en | en->Tgt |
|-----------------|----------------|-----------|----------|------------------|----------|
| Correlation | Pearson | 0.25 | 0.43 | 0.53 | 0.66 |
| | Spearman | 0.31 | 0.47 | 0.46 | 0.60 |
| | | Data-size | En-centric perf. |
|-----------------|----------------|----------------|----------------------|
| Regression | R-square | 4.40% | 69.53% |
| | MAE | 11.08 | 7.35 |
| | RMSE | 13.67 | 9.04 |
#### Table 12: Prediction of ZS performance using En-centric performance, vocabulary overlap, and linguistic properties. We present the results based on the mT-large model and COMET scores in this table. This result is consistent with Table 6 on page 7. Our main conclusion in this section remains the same: vocabulary overlap and linguistic features can further explain the ZS variations.
| ID | Features | R-square | MAE | RMSE |
|-------|---------------------------|----------|---------|---------|
| 4 | En_performance | 69.53% | 11.08 | 13.67 |
| 5 | 4 + Vocab-Sim | 77.22% | 6.87 | 8.49 |
| 6 | 5 + Linguistic-features | 79.51% | 6.63 | 8.30 |
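As a minimal sketch of how such regression numbers can be produced (the placeholder data, variable names, and feature construction are illustrative assumptions, not our exact scripts):
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder data standing in for our real features and targets:
# X has one row per zero-shot direction, with columns such as En->Tgt and
# Src->En supervised scores (and optionally vocabulary overlap, etc.);
# y is the zero-shot score of each direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

print(f"R-square: {r2_score(y, pred):.2%}")
print(f"MAE:  {mean_absolute_error(y, pred):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, pred)):.2f}")
```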
---
> **Question 1.** Page 3 lines 238-240: What does this sentence mean?
This sentence means that, for each resource level, we control the number of languages and the number of training sentences to be the same.
* Is the number of languages in each resource level the same?
  * Yes, 10 languages for each resource level.
* Is the number of languages in each language family the same?
  * Yes, 8 languages for each family.
* Is the number of training sentences in each resource level the same?
  * Yes, e.g., 5M for high-resource, 1M for medium-resource, etc.
---
> **Question 2.** Page 4 lines 266-269: Is the cleaning step performed before or after selecting the data in the EC40 dataset? If it is performed after the creation of the dataset, how can you guarantee that the data distribution is not modified?
The cleaning step is performed before selecting the data for the EC40 dataset. This means all the numbers of training sentences reported in Table 7 (page 12) are post-cleaning counts, so our models are trained on exactly the numbers we report in Table 7. Please also check the steps we list below.
Download the data from OPUS -> Perform the data cleaning step -> Sample data from the cleaned set -> Learn vocabulary and train models
---
> **Question 3.** What are the numerator and denominator in the CV equation?
The Coefficient of Variation $CV = \frac{\sigma}{\mu}$ is defined as the ratio of the standard deviation (numerator: $\sigma$) to the mean (denominator: $\mu$) of zero-shot performance.
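As a quick illustration (the scores below are made-up placeholders, not our results):
```python
import numpy as np

# Zero-shot scores across several translation directions (illustrative only).
zs_scores = np.array([21.3, 4.7, 18.9, 9.2, 15.4])

cv = zs_scores.std() / zs_scores.mean()  # CV = sigma / mu
print(f"CV = {cv:.2f}")  # a higher CV means larger relative variation
```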
---
> **Question 4.** Page 7, why do the authors use RMSE in Table 6 but only R-square and MAE in Table 4? What is the advantage of adding RMSE in Table 6?
We add RMSE to our experiments, rather than using MAE alone, mainly for two reasons (illustrated in the short sketch after the tables below):
* Sensitivity to large errors: RMSE gives more weight to larger errors due to the squaring operation.
* Outlier impact: because RMSE squares the differences, it is more sensitive to outliers than MAE.
Therefore, we report both metrics in Table 6 for a more comprehensive analysis. We appreciate your advice; we further present the RMSE for Table 4 (see below) and have updated this in the revision to make the presentation more consistent.
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;text-align:center;} /* Added text-align:center; for center alignment */
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;text-align:center;} /* Added text-align:center; for center alignment */
.tg .tg-0pky{border-color:inherit;text-align:center;vertical-align:top} /* Added text-align:center; for center alignment */
.tg .tg-0lax{text-align:center;vertical-align:top} /* Added text-align:center; for center alignment */
</style>
<table class="tg-0pky">
<thead>
<tr>
<td class="tg-0pky" colspan="2" rowspan="2">Metric/Features</td>
<th class="tg-0pky" colspan="2">Data-size</th>
<th class="tg-0pky" colspan="2">En-centric perf.</th>
</tr>
<tr>
<td class="tg-0pky">Src_size</td>
<td class="tg-0pky">Tgt_size</td>
<td class="tg-0pky">Src->en</td>
<td class="tg-0pky">en->Tgt</td>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky" rowspan="3">Regression</td>
<td class="tg-0pky">R-square</td>
<td class="tg-0pky" colspan="2">32.54%</td>
<td class="tg-0pky" colspan="2">61.34%</td>
</tr>
<tr>
<td class="tg-0pky">MAE</td>
<td class="tg-0pky" colspan="2">5.47</td>
<td class="tg-0pky" colspan="2">4.70</td>
</tr>
<tr>
<td class="tg-0pky">RMSE</td>
<td class="tg-0pky" colspan="2">6.62</td>
<td class="tg-0pky" colspan="2">5.59</td>
</tr>
</tbody>
</table>
| | | Data-size | En-centric perf. |
|--------------|---------------------|-----------|----------|
|Regression | R-square | 32.54% | 61.34% |
| | MAE | 5.47 | 4.70 |
| | RMSE | 6.62 | 5.59 |
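To make the MAE/RMSE distinction concrete, here is a minimal sketch with made-up residuals (not our experimental values): two error profiles with identical MAE, where the one containing an outlier yields a clearly higher RMSE.
```python
import numpy as np

def mae(err):
    return np.mean(np.abs(err))

def rmse(err):
    return np.sqrt(np.mean(err ** 2))

uniform = np.array([2.0, 2.0, 2.0, 2.0])   # evenly spread errors
outlier = np.array([0.5, 0.5, 0.5, 6.5])   # same MAE, but one large error

print(mae(uniform), rmse(uniform))  # 2.0, 2.0
print(mae(outlier), rmse(outlier))  # 2.0, ~3.28
```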
# **Reviewer 2RvR**
**Rebuttal:**
Dear reviewer, we would like to thank you for your time, effort, and thorough review! We are encouraged that the review highlights the following strengths of our work:
1. Constructs a large multilingual translation dataset.
2. Explores the influence of several factors on zero-shot translation performance.
3. Conducts extensive experiments.
We have responded to your comments below and updated our revision accordingly. If you have additional questions or suggestions, we would be happy to answer them and incorporate them into the paper.
---
> **Reasons To Reject 1.** What is the difference between Sacrebleu and SpBleu in the experiments? And are their scores in different translation directions comparable?
Thank you for the question. First, it is important to note that our paper does not involve cross-metric comparisons. Second, our approach focuses on providing a comprehensive analysis, and we incorporate various metrics to ensure a thorough evaluation. Specifically, we use the following metrics: ChrF++ (character level), SpBleu (tokenized sub-word level), SacreBleu (detokenized word level), and COMET (representation level).
Our intention in employing these multiple metrics is to ensure a holistic assessment of our experiments. Notably, our findings are consistent across all these metrics, which lends robustness to our conclusions. For further analysis based on COMET, we kindly direct you to our response to *"Reasons To Reject 5"* raised by reviewer jc8P, where more detailed insights are provided.
---
> **Reasons To Reject 2.** Why EC40 is balanced, while EC40's scale is from 50K to 5M and the Script of most languages is Latin?
Thank you for raising this query. We pointed out that EC40 is balanced in terms of resource levels and language families (lines 218-221), not in the number of training sentences across the whole dataset or in the type of scripts. Please see Sec. A.1 in the Appendix for more details. Specifically, EC40 is carefully balanced across various scales (lines 234-240):
1. For each resource level, the number of languages is the same (10 languages).
2. For each resource level, the number of training sentences is balanced (lines 238-240), e.g., 5M for all high-resource languages.
3. For each language family, we incorporate 8 languages (lines 236-238).
---
> **Reasons To Reject 3.** Why the m2m-100 model could be evaluated directly, without any fine-tuning? No gap between itself train set and EC40?
Thank you for your question. **Our goal in introducing the m2m-100 models is to investigate whether the phenomenon of high variation in ZS performance holds across models.** Specifically, m2m-100 is a multilingual translation model trained on 7.5B parallel sentences covering 2,200 training directions, whereas our EC40 contains only 66M sentences. Moreover, m2m-100 is a strong translation model, achieving SOTA results on multiple translation directions, including both supervised and zero-shot ones [1].
**An additional noteworthy point is the alignment of domain factors.** Our validation and test sets encompass data from diverse domains, a characteristic shared with the m2m-100 model's training data, which includes multi-domain content such as CCMatrix. This contributes to a smaller domain mismatch between the m2m-100 model and our test set, further justifying the direct evaluation approach.
---
> **Reasons To Reject 4.** What is referred to in the CV metric for zero-shot in Table 1? And why the highest score is noted by underline?
Thank you for your inquiry. **In Table 1, CV refers to the Coefficient of Variation ($CV = \frac{\sigma}{\mu}$), which quantifies the magnitude of variation in zero-shot translation performance compared to supervised translation directions.** The purpose of this metric is outlined in lines 349-355 of the paper. In Table 1 specifically, the CV scores are computed for several MT metrics, namely Sacrebleu, Chrf++, and Spbleu, under different settings, such as the En→X directions.
In Table 1, the highest CV score for a given metric and setting is underlined. This visual emphasis draws attention to the translation scenarios where the variation among zero-shot directions is more pronounced than among supervised ones, highlighting our finding of high variation.
---
> **Reasons To Reject 5.** What is the inner relationship among these aspects (English-centric capacity, vocabulary overlap, linguistic properties, and off-target issue)
Thank you for this question. **While investigating correlations among these factors could indeed provide valuable insights, the primary objective of our study is to uncover their relationships with our target variable: zero-shot translation ability.** The main focus of the paper is to shed light on how these factors explain the high variation in zero-shot translation performance.
Moreover, we discuss one intuitively strong relationship here: the connection between vocabulary overlap and linguistic properties. For instance, languages that share the same writing system might exhibit a higher vocabulary overlap due to lexical similarities. This is explicitly explored in the paper (lines 382-488), which highlights that vocabulary overlap serves as a more fundamental indicator of surface-level similarity than more intricate linguistic metrics such as language family or typology distance.
---
> **Reasons To Reject 6.** Why switch the metric to 'Sacrebleu' in Sec.5.4?
Thank you for your question. We assume you mean why we use Sacrebleu in Sec. 5.3 rather than Sec. 5.4? Please let us know if this is not the case.
We utilized Sacrebleu in Sec. 5.3 to align with the setup employed by [2] (which hypothesized that the off-target issue is the root cause of low ZS performance). We further provide the measurements based on Chrf++ and SpBleu below.
| | Sacrebleu | Chrf++ | SpBleu|
|-----------|-----------|--------| ------|
| Spearman | -0.07 | -0.07 | -0.08 |
| Pearson | -0.06 | -0.03 | -0.07 |
---
References:
[1] Fan, Angela, et al. "Beyond english-centric multilingual machine translation."
[2] Zhang, Biao, et al. "Improving massively multilingual neural machine translation and zero-shot translation."
# **Reviewer 3cCd**
**Rebuttal:**
Dear reviewer, we would like to thank you for your time, effort, and thorough review! We are encouraged that the review highlights the following strengths of our work:
1. Designs extensive and comprehensive experiments to delve into the phenomenon of high variation in zero-shot MT.
2. The conclusions are general regardless of the resource level of the languages.
3. The new dataset is suitable for academic research.
4. Proposes a new point of view: that the off-target problem is a symptom.
We have responded to your comments below and updated our revision accordingly. If you have additional questions or suggestions, we would be happy to answer them and incorporate them into the paper.
---
> **Reasons To Reject 1.** The three factors of variations are intuitive and have already been observed in previous works. No novel phenomena were found.
We respectfully hold a different perspective on this matter. To the best of our knowledge, this paper represents the first attempt to comprehensively and systematically study the impact of these three factors on zero-shot translation quality in depth. **Our approach distinctly sets us apart from previous research, and we invite you to assess the recent relevant works that we have collated for comparison (all of these references are also introduced in our paper).**
| Paper | Task | Descriptions |
| -------------------------------------------------------- | ----------------- | --------------------------------------------------------------------------------------------------------- |
| Zhang et al. (2020) | Zero-Shot NMT | Investigates how modeling capacity affects ZS performance |
| Wu et al. (2021); Raganato et al. (2021); Wang et al. (2021) | Zero-Shot NMT | MNMT models are prone to forget the language label information |
| Chen et al. (2022); Gu et al. (2019); Tang et al. (2021); Wang et al. (2021) | Zero-Shot NMT | Examines the impact of model initialization on ZS performance |
| Pan et al. (2021); Ji et al. (2020); Wu et al. (2021) | Zero-Shot NMT | Recognize the inconsistency of semantic representations across different languages |
| Aharoni et al. (2019) | Zero-Shot NMT | Suggests that adding more languages to an MNMT system improves zero-shot performance more for close language pairs |
| Pires et al. (2019) | Cross-lingual NLU | Concludes that transfer is more successful for languages with high lexical overlap and typological similarity |
| Lauscher et al. (2020) | Cross-lingual NLU | Concludes that transfer is more successful for syntactically or phonologically similar languages |
**We would like to reiterate the novelty and contributions of our research concerning the three factors, and we wholeheartedly invite you to substantiate your claims with references for a more profound discussion.**
1. Our identification of the considerable influence of out-of-English translation quality on zero-shot variation is novel, supported by detailed insights in lines 566-574. We further provide insight into how to use this finding to improve zero-shot NMT in Section 5.4.
2. We found that performance on close language pairs (with the same linguistic properties) is significantly better than on distant pairs regardless of resource level, and that this phenomenon is especially pronounced in smaller models. We further stress the importance of focusing efforts on distant language pairs rather than exclusively concentrating on resource-rich pairs, as prior ZS-NMT research has done (lines 575-586).
3. We consistently observed that vocabulary overlap plays a significant role in explaining zero-shot variations; encouraging greater cross-lingual transfer and knowledge sharing via better vocabulary sharing has the potential to enhance zero-shot translation (lines 587-600).
---
> **Question 1.** Why do translations from/to medium-resource languages have better quality than that from/to high-resource languages according to Table 3?
Thank you for raising this query. This phenomenon is also observed in other En-centric MNMT experiments, as evidenced by the studies conducted by [1] and [2].
Furthermore, our own experiments have revealed that certain medium-resource languages, such as Afrikaans, demonstrate the highest performance across all metrics. This exceptional performance of Afrikaans contributes significantly to raising the overall average score within the medium-resource group. This observation is consistent with earlier research, which highlights Afrikaans' extensive lexical and syntactic borrowing from English, as well as its close linguistic affiliations with multiple languages neighboring English [3].
---
> **Reasons To Reject 2 and Question 2.** High variation is not the cause for low average zero-shot performance. Their relationship and the root cause of low performance are not discussed in depth. While the paper notes that high variation is not the sole cause of low average zero-shot performance, the relationship between these factors and the root cause of low performance is not discussed in depth. Could the authors provide further insights into the underlying causes of low performance and how they relate to the factors of variation?
Thank you for your thoughtful feedback. We acknowledge the universally recognized challenge that zero-shot performance consistently falls behind supervised performance across deep learning. The community as a whole is deeply engaged in understanding this issue and actively working to mitigate it. In the context of multilingual machine translation, **recent research has focused on seeking explanations for the causes of overall low ZS performance and on enhancing it.** Please refer to the literature and descriptions we present in *Reasons To Reject 1* and Sec. 2.2 of the paper.
**In contrast, our study introduces a fresh perspective within zero-shot NMT: the presence of high variation.** This implies that ZS directions do not uniformly suffer from low performance; some directions exhibit decent results. Specifically, certain ZS directions show promising performance, with approximately 70% of ZS-condition-1 achieving SpBleu scores above 20 (see Figure 1 on page 1). In stark contrast, merely 10% of the remaining directions achieve comparable SpBleu scores. This intriguing phenomenon prompted us to further investigate the underlying factors driving this extensive variation in zero-shot NMT.
We believe our investigation of high variation in zero-shot performance adds an important layer of insight to the discourse surrounding zero-shot NMT, providing a perspective complementary to understanding the root causes of overall poor zero-shot performance. Drawing from our findings, we identified three key factors responsible for driving these variations and subsequently offered insights to enhance zero-shot NMT performance. Notably, our conclusions emphasize resource-lean and distant language pairs as a focus for improving zero-shot performance. Crucially, this insight aligns with, rather than contradicts, existing efforts aimed at improving performance on resource-rich language pairs, while also addressing the often-overlooked aspect of language relatedness.
---
References:
[1] Zhang, Biao, et al. "Improving massively multilingual neural machine translation and zero-shot translation."
[2] Yang, Yilin, et al. "Improving multilingual translation by representation and gradient regularization."
[3] Zhou, Zhong, and Alex Waibel. "Family of origin and family of choice: Massively parallel lexiconized iterative pretraining for severely low resource text-based translation."
---
Dear Area Chair,
Thank you for overseeing the review process for our submission to EMNLP. We are writing to provide feedback on some particular comments made by Reviewer 3cCd.
> Reasons To Reject 1: The three factors of variations are intuitive and have already been observed in previous works. No novel phenomena were found.
While we deeply appreciate the reviewer's insights, we wish to draw attention to the following observations:
1. This particular comment appears to be in conflict with the guidelines outlined in the ACL/EMNLP review policy (Section 5.A), which emphasizes the importance of specificity in reviews; for example, missing references should be explicitly mentioned to provide clarity.
2. The comment also falls under the category outlined in Section 5.B of the review policy, where reviews might unintentionally rely on heuristics such as "not novel" or "not surprising" without providing substantial evidence or references to support these claims.
Furthermore, we would like to highlight that Reviewer jc8P held a divergent opinion on our work. They found our findings insightful, indicating a difference in perspective regarding the novelty and contributions of our research.
With this context in mind, we kindly request your consideration of these points when reaching a final decision. We believe that our work holds valuable contributions for the field, and we sincerely hope that the review process can reflect the comprehensive range of feedback provided by all reviewers. Thank you for your time and consideration; we appreciate your dedication to ensuring a fair and thorough review process.