# Language-Agnostic BERT Sentence Embedding
Tuesday, August 18, 2020
Posted by Yinfei Yang and Fangxiaoyu Feng, Software Engineers, Google Research
A multilingual embedding model is a powerful tool that encodes text from different languages into a shared embedding space, enabling it to be applied to a range of downstream tasks, such as text classification and clustering, while also leveraging semantic information for language understanding. Existing approaches for generating such embeddings, like LASER or m-USE, rely on parallel data, mapping a sentence from one language directly to another language in order to encourage consistency between the sentence embeddings. While these multilingual approaches yield good overall performance across a number of languages, they often underperform on high-resource languages compared to dedicated bilingual models, which can leverage approaches like translation ranking tasks with translation pairs as training data to obtain more closely aligned representations. Further, due to limited model capacity and the often poor quality of training data for low-resource languages, it can be difficult to extend multilingual models to support a larger number of languages while maintaining good performance.
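To make the idea of a shared embedding space concrete, the sketch below encodes sentences in different languages with a multilingual sentence encoder and compares them by cosine similarity. It is a minimal illustration, assuming the community sentence-transformers port of LaBSE ("sentence-transformers/LaBSE" on the Hugging Face Hub) is available; any multilingual sentence encoder with a similar interface would behave the same way.

```python
# Minimal sketch: embed sentences from different languages into one shared
# vector space and compare them with cosine similarity.
# Assumption: the community sentence-transformers port of LaBSE
# ("sentence-transformers/LaBSE") is installed and downloadable.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "The weather is lovely today.",        # English
    "Das Wetter ist heute wunderbar.",     # German translation of the above
    "The stock market fell sharply.",      # Unrelated English sentence
]

# Each sentence becomes a fixed-size vector in the shared embedding space.
embeddings = model.encode(sentences)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cross-lingual translations should score higher than unrelated sentences.
print("EN vs. DE translation:", cosine(embeddings[0], embeddings[1]))
print("EN vs. unrelated EN:  ", cosine(embeddings[0], embeddings[2]))
```

Because every language is mapped into the same space, downstream components such as nearest-neighbor retrieval or a single classifier can be reused across languages without training a separate model per language.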
