# Token Compression Across Language Families
The goal is to pick, within one language family, a weaker (lower-resource) language and a stronger (higher-resource) language, then compare translating the weaker language into the stronger language of the same family against translating it directly into English: which yields better tokenization efficiency and better semantic fidelity?
## Cloud Translation API
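Translations are produced with the Google Cloud Translation API (v2) through the official `google-cloud-translate` Python client. A minimal sketch of a single translation call, assuming the same service-account key file (`translation_key.json`) used in the full script below:

```python
import html
import os

from google.cloud import translate_v2 as translate

# Authenticate with a service-account key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "translation_key.json"

client = translate.Client()
result = client.translate("Bună dimineața!", target_language="en")
# The API returns HTML-escaped text, so unescape it before use.
print(html.unescape(result["translatedText"]))  # e.g. "Good morning!"
```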

### Semantic Similarity Computation
#### Model: XLM-R (Cross-lingual Language Model - RoBERTa)
#### Metric: CosSim (Cosine Similarity)
Compare the semantic similarity of each translated result against the original sentence.
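A minimal sketch of the similarity check, using the same XLM-R sentence-transformers checkpoint as the full script below (the Romanian/English sentence pair is only an illustration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# XLM-R checkpoint fine-tuned for STS; it embeds all compared
# languages into one shared vector space.
model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens")

original = "Bună dimineața!"   # Romanian source sentence
translation = "Good morning!"  # its English translation

emb_src, emb_tgt = model.encode([original, translation])
score = cosine_similarity([emb_src], [emb_tgt])[0][0]
print(f"cosine similarity: {score:.3f}")
```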
### Token Compression Rate Computation
$\text{Compression rate} = \frac{\text{token count of the translated sentence}}{\text{token count of the original sentence}}$
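Token counts come from OpenAI's `tiktoken` with the `cl100k_base` encoding, so a ratio below 1 means the translation uses fewer tokens than the original. A small worked example (the sentence pair is hypothetical):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def compression_rate(original: str, translated: str) -> float:
    """Token count of the translated sentence over that of the original."""
    return len(encoding.encode(translated)) / len(encoding.encode(original))

print(compression_rate("Bună dimineața!", "Good morning!"))
```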
## Test Results
### Romanian Dataset:
#### RO-STS: Romanian Semantic Textual Similarity Dataset
[GitHub](https://github.com/dumitrescustefan/RO-STS/tree/master)
### Burmese Dataset:
#### mySentence: Corpus and models for Burmese (Myanmar language) Sentence Segmentation
[GitHub](https://github.com/ye-kyaw-thu/mySentence#Corpus-Annotation)
### (Indo-European / Italic / Romance / Eastern Romance) Romanian ->
### (Indo-European / Italic / Romance / Western Romance) French / English

### (Indo-European / Italic / Romance / Eastern Romance) Romanian ->
### (Indo-European / Italic / Romance / Western Romance) Spanish / English


### (Sino-Tibetan / Tibeto-Burman) Burmese -> (Sino-Tibetan / Sinitic) Chinese / English

### Code
```python
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from google.cloud import translate_v2 as translate
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
import pandas as pd
import tiktoken
import random
import html
import os

# Authenticate the Cloud Translation client with a service-account key.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'translation_key.json'

# Multilingual XLM-R sentence encoder used for the semantic-similarity check.
model = SentenceTransformer('xlm-r-bert-base-nli-stsb-mean-tokens')
translate_client = translate.Client()

def translate_text(text, target_language):
    """Translate text and unescape the HTML entities the API returns."""
    result = translate_client.translate(text, target_language=target_language)
    return html.unescape(result['translatedText'])

def num_tokens_from_string(string, encoding_name):
    """Count tokens under the given tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

# Sample 200 sentences from the RO-STS dev split.
with open("RO.dev.ro", "r", encoding="utf-8") as file:
    lines = file.readlines()
lines = random.sample(lines, 200)

data = []
for original_text in tqdm(lines, desc="Processing"):
    # Translate into the same-family strong language (French) and into English.
    translated_text_fr = translate_text(original_text, "fr")
    translated_text_en = translate_text(original_text, "en")

    # Token counts under the cl100k_base encoding.
    ori_num_tokens = num_tokens_from_string(original_text, "cl100k_base")
    num_tokens_fr = num_tokens_from_string(translated_text_fr, "cl100k_base")
    num_tokens_en = num_tokens_from_string(translated_text_en, "cl100k_base")

    # Embed the original and both translations, then score semantic similarity.
    original_text_embedding = model.encode(original_text)
    translated_text_embedding_fr = model.encode(translated_text_fr)
    translated_text_embedding_en = model.encode(translated_text_en)
    cos_sim_fr = cosine_similarity([original_text_embedding], [translated_text_embedding_fr])[0][0]
    cos_sim_en = cosine_similarity([original_text_embedding], [translated_text_embedding_en])[0][0]

    data.append({
        "Original Text": original_text,
        "Translated Text (FR)": translated_text_fr,
        "Translated Text (EN)": translated_text_en,
        "Token Count (Original)": ori_num_tokens,
        "Token Count (FR)": num_tokens_fr,
        "Token Count (EN)": num_tokens_en,
        "Token Ratio (FR)": num_tokens_fr / ori_num_tokens,
        "Token Ratio (EN)": num_tokens_en / ori_num_tokens,
        "Cosine Similarity (FR)": cos_sim_fr,
        "Cosine Similarity (EN)": cos_sim_en
    })

df = pd.DataFrame(data)

# Side-by-side box plots: token ratio (compression) and cosine similarity.
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df[["Token Ratio (FR)", "Token Ratio (EN)"]])
plt.title("Token Ratio Distribution")
plt.subplot(1, 2, 2)
sns.boxplot(data=df[["Cosine Similarity (FR)", "Cosine Similarity (EN)"]])
plt.title("Cosine Similarity Distribution")
plt.tight_layout()
plt.show()

df.to_csv("fr_translation_analysis_romania.csv", index=False)
```
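The script samples 200 sentences from the RO-STS dev split, translates each into French and English, and writes per-sentence token counts, token ratios, and cosine similarities to `fr_translation_analysis_romania.csv`, along with box plots of the two distributions. Swapping the target language code and the input file reproduces the other comparisons (Spanish for Romanian, Chinese for Burmese).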