# [Universal Neural Machine Translation for Extremely Low Resource Languages](https://aclanthology.org/N18-1032.pdf)
## Problem
Despite the success of training multi-lingual NMT systems, there are two main challenges in leveraging them for zero-resource languages:
* **Lexical-level Sharing:** Conventionally, a multilingual NMT model has a vocabulary that represents the union of the vocabularies of all source languages. Therefore, words from different languages do not actually share the same embedding space, since each word has its own representation. This does not pose a problem for languages with a sufficiently large amount of data, yet it is a major limitation for extremely low-resource languages, since most of their vocabulary items will not have enough training examples, if any, to obtain reliably trained representations.
* **Sentence-level Sharing:** It is also crucial for low-resource languages to share source-sentence representations with similar languages. For example, if a low-resource language shares syntactic order with a high-resource language, it should be feasible for it to share such a representation with that language. It is also important to utilize monolingual data to learn such representations, since a low- or zero-resource language may have only monolingual resources.
## Key Idea
* Utilize a multi-lingual neural translation system to share lexical- and sentence-level representations across multiple source languages into one target language
* The lexical sharing is represented by a universal word-level representation where various words from all source languages share the same underlying representation
* The sentence-level sharing is represented by a model of language experts which enables low-resource languages to utilize the sentence representation of the higher resource languages
## Approach
* We propose a Universal NMT system that is focused on the scenario where minimal parallel sentences are available. As shown in Fig. 2 of the paper, we introduce two components to extend the conventional multi-lingual NMT system (Johnson et al., 2017): Universal Lexical Representation (ULR) and Mixture of Language Experts (MoLE), to enable word-level and sentence-level sharing, respectively.
* We propose a novel representation for multi-lingual embedding where each word from any language is represented as a probabilistic mixture of universal-space word embeddings. In this way, semantically similar words from different languages naturally have similar representations. Our method achieves this by utilizing a discrete (but probabilistic) “universal token space” and then learning the embedding matrix for these universal tokens directly within our NMT training.
### Universal Lexical Representation (ULR)
* It is not straightforward to have a universal representation for all languages.
* One potential approach is to use a shared source vocabulary, but this is not adequate since it assumes significant surface-form overlap to be able to generalize between high-resource and low-resource languages.
* Alternatively, we could train monolingual embeddings in a shared space and use these as the input to our MT system.
* However, since these embeddings are trained on a monolingual objective, they will not be optimal for an NMT objective.
* If we simply allow them to change during NMT training, then this will not generalize to the low-resource language where many of the words are unseen in the parallel data.
* Therefore, our goal is to create a shared embedding space which
* is trained towards NMT rather than a monolingual objective,
* is not based on lexical surface forms, and
* will generalize from the high-resource languages to the low-resource language.
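Concretely, the ULR lookup can be written as follows. This is a hedged reconstruction of what the example below calls Equation 5: E<sup>K</sup> holds fixed embeddings of the universal tokens, E<sup>U</sup> is the universal embedding matrix trained with the NMT objective, E<sup>Q</sup>(x) is the pre-trained monolingual embedding of a source word x, O is a per-language projection into the key space, and τ is a temperature. The paper's exact similarity may include an additional learned transformation, omitted here.

```math
q(u_j \mid x) = \frac{\exp\big(E^K(u_j) \cdot O\,E^Q(x) / \tau\big)}{\sum_{j'} \exp\big(E^K(u_{j'}) \cdot O\,E^Q(x) / \tau\big)},
\qquad
e^{\mathrm{ULR}}(x) = \sum_j q(u_j \mid x)\, E^U(u_j)
```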
#### An Example
* To give a concrete example, imagine that our
* Target language is English (En)
* Our high-resource auxiliary source languages are Spanish (Es) and French (Fr)
* Our low-resource source language is Romanian (Ro).
* En is also used for the universal token set.
* We assume we have 10M+ parallel Es-En and Fr-En sentences, and only a few thousand Ro-En sentences.
* We also have millions of mono-lingual sentences in each language.
* We first train word2vec embeddings on mono-lingual corpora from each of the four languages.
* We next align the Es-En, Fr-En, and Ro-En parallel corpora and extract a seed dictionary of a few hundred words per language, e.g., `gato → cat`, `chien → dog`.
* We then learn three matrices `O1`, `O2`, `O3` to project the Es, Fr, and Ro embeddings (E<sup>Q1</sup>, E<sup>Q2</sup>, E<sup>Q3</sup>) into En (E<sup>K</sup>) based on these seed dictionaries.
* At this point, the mixture weights from Equation 5 (sketched above) should produce reasonable alignments between the source languages and En, e.g., `q(horse|magar) = 0.5`, `q(donkey|magar) = 0.3`, `q(cow|magar) = 0.2`, where `magar` is the Ro word for `donkey`.
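Below is a minimal NumPy sketch of this projection-and-mixture step. The function names, the least-squares fit for the projection, and the toy random data are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def learn_projection(E_query_seed, E_key_seed):
    """Fit a projection O so that O @ e_query ≈ e_key for seed dictionary pairs.

    E_query_seed: (n_seed, d) monolingual embeddings of seed source words.
    E_key_seed:   (n_seed, d) embeddings of their English translations.
    Solved here by ordinary least squares (an assumption; the paper may
    constrain O differently, e.g., to be orthogonal).
    """
    X, *_ = np.linalg.lstsq(E_query_seed, E_key_seed, rcond=None)  # E_query_seed @ X ≈ E_key_seed
    return X.T  # O, so that O @ e_query lives in the key (En) space

def ulr_embedding(e_query, O, E_key, E_universal, tau=0.05):
    """Universal Lexical Representation of one source word.

    e_query:     (d,)   pre-trained monolingual embedding of the word.
    E_key:       (V, d) fixed embeddings of the V universal (En) tokens.
    E_universal: (V, d) trainable universal embedding matrix (learned in NMT).
    tau:         temperature (0.05 in the experiments).
    """
    scores = E_key @ (O @ e_query) / tau  # similarity to each universal token
    q = np.exp(scores - scores.max())
    q /= q.sum()                          # q(u_j | x), a probability mixture
    return q @ E_universal                # weighted sum of universal embeddings

# Toy usage with random vectors standing in for word2vec embeddings.
rng = np.random.default_rng(0)
d, V, n_seed = 8, 100, 20
E_key = rng.normal(size=(V, d))
E_universal = rng.normal(size=(V, d))
seed_src = rng.normal(size=(n_seed, d))
seed_tgt = rng.normal(size=(n_seed, d))
O = learn_projection(seed_src, seed_tgt)
e_ro = rng.normal(size=d)                               # e.g., embedding of the Ro word "magar"
print(ulr_embedding(e_ro, O, E_key, E_universal).shape)  # (8,)
```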
### MoLE
* Having paved the way for a universal embedding representation, it is crucial to have a language-sensitive module for the encoder that helps model language structures, which may vary between languages. We propose a Mixture of Language Experts (MoLE) to model the sentence-level universal encoder. As shown in the architecture figure below, an additional mixture-of-experts module is added after the last layer of the encoder.
* Similar to Shazeer et al. (2017), we have a set of expert networks and a gating network to control the weight of each expert. More precisely, we have a set of expert networks f1(h), ..., fK(h), where each expert is a two-layer feed-forward network that reads the output hidden states h of the encoder. The output h′ of the MoLE module is a weighted sum of these experts, replacing the encoder's representation, where a one-layer feed-forward network g(h) is used as a gate to compute scores for all the experts. In our case, we create one expert per auxiliary language.
* We do not update the MoLE module when training on a sentence from the low-resource language. Intuitively, this allows us to represent each token in the low-resource language as a context-dependent mixture of the auxiliary-language experts.
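Below is a minimal NumPy sketch of such a MoLE layer: K two-layer feed-forward experts plus a one-layer gate, combined as a weighted sum per token. The parameter shapes, the tanh nonlinearity, and the softmax gate are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoLE:
    def __init__(self, d_model, n_experts, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One two-layer feed-forward network per auxiliary-language expert.
        self.W1 = rng.normal(scale=0.02, size=(n_experts, d_model, d_hidden))
        self.W2 = rng.normal(scale=0.02, size=(n_experts, d_hidden, d_model))
        # One-layer gating network g(h) scoring the experts per token.
        self.Wg = rng.normal(scale=0.02, size=(d_model, n_experts))

    def __call__(self, h):
        """h: (seq_len, d_model) encoder hidden states -> (seq_len, d_model)."""
        # Expert outputs f_k(h) for each expert k.
        experts = np.einsum('td,kdh->kth', h, self.W1)
        experts = np.einsum('kth,khd->ktd', np.tanh(experts), self.W2)
        # Gate weights over experts, computed per token.
        gate = softmax(h @ self.Wg)                  # (seq_len, n_experts)
        # Weighted sum of expert outputs replaces the encoder representation.
        return np.einsum('tk,ktd->td', gate, experts)

# Toy usage: 5 encoder states of size 512, one expert each for Es and Fr.
mole = MoLE(d_model=512, n_experts=2, d_hidden=512)
h = np.random.default_rng(1).normal(size=(5, 512))
print(mole(h).shape)  # (5, 512)
```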
## Architecture
* Figure: An illustration of the proposed architecture of the ULR and MoLE. Shaded parts are trained within the NMT model while unshaded parts are not changed during training.
* We implement an attention-based neural machine translation model which consists of a one-layer bidirectional RNN encoder and a two-layer attention-based RNN decoder. All RNNs have 512 LSTM units (Hochreiter and Schmidhuber, 1997). The dimensions of both the source and target embedding vectors are set to 512, and the universal embeddings have the same dimensionality. For a fair comparison, the same architecture is also utilized for training both the vanilla and multi-lingual NMT systems. For the multilingual experiments, 1 to 5 auxiliary languages are used. When training with the universal tokens, the temperature τ is fixed to 0.05 for all the experiments.
* All the models are trained to maximize the log-likelihood using the Adam optimizer (Kingma and Ba, 2014) for 1 million steps on the mixed dataset with a batch size of 128. The dropout rate for both the encoder and the decoder is set to 0.4. We have open-sourced an implementation of the proposed model.
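For quick reference, the hyperparameters above can be collected into a single configuration sketch; the key names are invented for illustration, while the values come from the text.

```python
# Hypothetical config dict summarizing the reported settings.
CONFIG = {
    "encoder": {"type": "bidirectional LSTM", "layers": 1, "units": 512},
    "decoder": {"type": "attention LSTM", "layers": 2, "units": 512},
    "embedding_dim": 512,        # source, target, and universal embeddings
    "ulr_temperature": 0.05,     # tau for the universal-token softmax
    "auxiliary_languages": "1 to 5",
    "optimizer": "Adam",
    "training_steps": 1_000_000,
    "batch_size": 128,
    "dropout": 0.4,              # encoder and decoder
}
```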
## Experiments
We extensively study the effectiveness of the proposed methods by evaluating on three “almost-zero-resource” language pairs with varying auxiliary languages. The vanilla single-source NMT and the multi-lingual NMT models are used as baselines.
### Dataset
* Table 1: Statistics of the available parallel resource in our experiments. All the languages are translated to English.
### Preprocessing
* All the data (parallel and mono-lingual) have been tokenized and segmented into subword symbols using byte-pair encoding (BPE) (Sennrich et al., 2016b).
* We use sentences of length up to 50 subword symbols for all languages. For each language, a maximum number of 40,000 BPE operations are learned and applied to restrict the size of the vocabulary.
* In the multilingual setting, we concatenate the vocabularies of all source languages, with a special "language marker" appended to each word so that there is no embedding sharing based on surface form.
* Thus, we avoid sharing the representations of words that have similar surface forms but different meanings in different languages.
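Below is a minimal Python sketch of this vocabulary construction: after BPE segmentation (e.g., with subword-nmt), a language marker is attached to every subword so that identical surface forms from different source languages become distinct vocabulary entries. The marker format and function names are assumptions for illustration.

```python
def mark_language(subword_tokens, lang):
    """Append a language marker to each subword token (hypothetical format)."""
    return [f"{tok}@{lang}@" for tok in subword_tokens]

def build_multilingual_vocab(corpora):
    """corpora: dict mapping language code -> iterable of BPE-segmented sentences."""
    vocab = set()
    for lang, sentences in corpora.items():
        for sent in sentences:
            vocab.update(mark_language(sent.split(), lang))
    return sorted(vocab)

# Toy usage: the surface form "son" yields two distinct vocabulary entries.
corpora = {
    "es": ["ellos son ami@@ gos"],  # "@@" marks a BPE continuation (subword-nmt style)
    "fr": ["son chien"],
}
print(build_multilingual_vocab(corpora))
# ['ami@@@es@', 'chien@fr@', 'ellos@es@', 'gos@es@', 'son@es@', 'son@fr@']
```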
## Results
* Figure: BLEU score vs. corpus size
* Figure: BLEU score vs. unknown tokens
* Table: Scores over variant source languages (6k sentences for Ro & Lv, and 10k for Ko). “Multi” means the multi-lingual NMT baseline.
* Table: BLEU scores evaluated on the test set (6k), compared with ULR and MoLE. “Vanilla” is the standard NMT system trained only on the Ro-En training set.
## Conclusion
* In this paper, we propose a new universal machine translation approach that enables sharing resources between high-resource languages and extremely low-resource languages.
* Our approach achieves 23 BLEU on Romanian-English WMT2016 using a tiny parallel corpus of 6k sentences, compared to the 18 BLEU of a strong multi-lingual baseline system.