# Token Compression: Language Family Comparison

The goal is to pick, within a single language family, a low-resource language and a high-resource language, then compare translating the low-resource text into the high-resource language of the same family versus translating it directly into English, and see which option gives better tokenization efficiency and better semantic fidelity.

## Cloud Translation API

![Screenshot 2023-12-29 2.39.16 PM](https://hackmd.io/_uploads/HJaAYk3wT.png)

### Semantic similarity computation

#### Model: XLM-R (Cross-lingual Language Model - RoBERTa)

#### CosSim

The semantic similarity of each translated result is compared against the original sentence separately.

### Token compression rate computation

$\text{Compression rate} = \frac{\text{token count of the translated sentence}}{\text{token count of the original sentence}}$

A minimal single-pair sketch of both metrics is shown just before the full script in the Code section below.

## Test results

### Romanian Dataset

#### RO-STS: Romanian Semantic Textual Similarity Dataset
[GitHub](https://github.com/dumitrescustefan/RO-STS/tree/master)

### Burmese Dataset

#### mySentence: Corpus and models for Burmese (Myanmar language) Sentence Segmentation
[GitHub](https://github.com/ye-kyaw-thu/mySentence#Corpus-Annotation)

### (Indo-European / Italic / Romance / Eastern Romance) Romanian ->
### (Indo-European / Italic / Romance / Western Romance) French / English

![Screenshot 2023-12-29 10.54.17 AM](https://hackmd.io/_uploads/H1uNO1hwa.png)

### (Indo-European / Italic / Romance / Eastern Romance) Romanian ->
### (Indo-European / Italic / Romance / Western Romance) Spanish / English

![Screenshot 2023-12-29 10.50.41 AM](https://hackmd.io/_uploads/H1uV_J3PT.png)
![Screenshot 2023-12-28 11.58.22 PM](https://hackmd.io/_uploads/Hy_N_1nwp.png)

### (Sino-Tibetan / Tibeto-Burman) Burmese -> (Sino-Tibetan / Sinitic) Chinese / English

![Screenshot 2023-12-28 5.56.58 PM](https://hackmd.io/_uploads/HydNuk3P6.png)
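To make the two metrics above concrete, here is a minimal single-pair sketch using the same `cl100k_base` tiktoken encoding and XLM-R sentence encoder as the full script in the Code section; the Romanian/English sentence pair is purely illustrative and is not taken from RO-STS.

```python
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
model = SentenceTransformer('xlm-r-bert-base-nli-stsb-mean-tokens')

# Illustrative sentence pair (not taken from the RO-STS data).
original_ro = "Un bărbat cântă la chitară."     # Romanian original
translated_en = "A man is playing the guitar."  # its English translation

# Compression rate: token count of the translation / token count of the original.
compression_rate = len(encoding.encode(translated_en)) / len(encoding.encode(original_ro))

# Semantic fidelity: cosine similarity between the two XLM-R sentence embeddings.
emb_ro = model.encode(original_ro)
emb_en = model.encode(translated_en)
similarity = cosine_similarity([emb_ro], [emb_en])[0][0]

print(f"Compression rate: {compression_rate:.2f}, cosine similarity: {similarity:.2f}")
```

A compression rate below 1.0 means the translation costs fewer tokens than the original, and a cosine similarity close to 1.0 means little semantic drift.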
### Code

```python!
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from google.cloud import translate_v2 as translate
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
import pandas as pd
import tiktoken
import random
import html
import os

# Google Cloud Translation credentials and the XLM-R sentence encoder.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'translation_key.json'
model = SentenceTransformer('xlm-r-bert-base-nli-stsb-mean-tokens')
translate_client = translate.Client()


def translate_text(text, target_language):
    """Translate text with Cloud Translation and unescape HTML entities."""
    result = translate_client.translate(text, target_language=target_language)
    return html.unescape(result['translatedText'])


def num_tokens_from_string(string, encoding_name):
    """Count tokens using the given tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))


# Sample 200 Romanian sentences from the RO-STS dev split.
with open("RO.dev.ro", "r", encoding="utf-8") as file:
    lines = file.readlines()
lines = random.sample(lines, 200)

data = []
for original_text in tqdm(lines, desc="Processing"):
    # Translate into the same-family high-resource language (French) and into English.
    translated_text_fr = translate_text(original_text, "fr")
    translated_text_en = translate_text(original_text, "en")

    # Token counts with the cl100k_base encoding.
    ori_num_tokens = num_tokens_from_string(original_text, "cl100k_base")
    num_tokens_fr = num_tokens_from_string(translated_text_fr, "cl100k_base")
    num_tokens_en = num_tokens_from_string(translated_text_en, "cl100k_base")

    # Semantic similarity between each translation and the original (XLM-R embeddings).
    original_text_embedding = model.encode(original_text)
    translated_text_embedding_fr = model.encode(translated_text_fr)
    translated_text_embedding_en = model.encode(translated_text_en)
    cos_sim_fr = cosine_similarity([original_text_embedding], [translated_text_embedding_fr])[0][0]
    cos_sim_en = cosine_similarity([original_text_embedding], [translated_text_embedding_en])[0][0]

    data.append({
        "Original Text": original_text,
        "Translated Text (FR)": translated_text_fr,
        "Translated Text (EN)": translated_text_en,
        "Token Count (Original)": ori_num_tokens,
        "Token Count (FR)": num_tokens_fr,
        "Token Count (EN)": num_tokens_en,
        "Token Ratio (FR)": num_tokens_fr / ori_num_tokens,
        "Token Ratio (EN)": num_tokens_en / ori_num_tokens,
        "Cosine Similarity (FR)": cos_sim_fr,
        "Cosine Similarity (EN)": cos_sim_en
    })

df = pd.DataFrame(data)

# Plot the distributions of token ratio and cosine similarity side by side.
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df[["Token Ratio (FR)", "Token Ratio (EN)"]])
plt.title("Token Ratio Distribution")
plt.subplot(1, 2, 2)
sns.boxplot(data=df[["Cosine Similarity (FR)", "Cosine Similarity (EN)"]])
plt.title("Cosine Similarity Distribution")
plt.tight_layout()
plt.show()

df.to_csv("fr_translation_analysis_romania.csv", index=False)
```
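Once the script has written the CSV, the per-language columns can be aggregated to answer the question posed at the top: which target language gives the lower token ratio and the higher semantic similarity. A minimal follow-up sketch, assuming the `fr_translation_analysis_romania.csv` produced above:

```python
import pandas as pd

# Per-sentence results written by the script above.
df = pd.read_csv("fr_translation_analysis_romania.csv")

# Mean and median token ratio (< 1.0 = fewer tokens than the Romanian original)
# and cosine similarity for each target language.
summary = df[[
    "Token Ratio (FR)", "Token Ratio (EN)",
    "Cosine Similarity (FR)", "Cosine Similarity (EN)",
]].agg(["mean", "median"])

print(summary.round(3))
```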