<h1 class="r-fit-text">Web Applications for Taiwan's Languages</h1>

[Haowen Jiang](https://howard-haowen.rohan.tw/)

Course: Natural Language Processing and Web Applications, National Taiwan University

2022-11-26

---

# Outline

- My background
- Streamlit apps (full-stack Python)
  - Language learning assisted by AI models
  - Dataset of Formosan-Mandarin sentence pairs
- React app (JavaScript, TypeScript)
  - Wordle games for Taiwan's languages

---

# My background

----

## Education

![](https://github.com/howard-haowen/NLP-demos/raw/main/img/education_blue.png)

----

## Academic experience

- English lecturer at Soochow University and National Taipei University of Technology
- Boya postdoctoral fellow at Peking University

![](https://pbs.twimg.com/media/EE8AfNgXkAUo55-?format=jpg)

----

## Industry experience

- AI engineer at Hamastar Technology and Wisers Information (Hong Kong)

![](https://img-9gag-fun.9cache.com/photo/aeAznm5_460s.jpg =600x)

----

- Current position: assistant manager of data analytics at Shin Kong Life Insurance

![](https://i0.wp.com/www.friendlyagentbot.com/wp-content/uploads/2020/10/IMG_0851.jpg?w=1242&ssl=1 =600x)

---

<h1 class="r-fit-text">Language Learning Assisted by AI Models</h1>

[Open the app](https://share.streamlit.io/howard-haowen/spacy-streamlit/app.py)

----

## Stack

- NLP models: [spaCy](https://spacy.io/)
- Front end: [Streamlit](https://streamlit.io/)
- Deployment: [Streamlit Cloud](https://streamlit.io/cloud)
- Code and data hosting: [GitHub](https://github.com)

----

## Development environment

- Throwaway
  - [Open in Binder](https://mybinder.org/v2/gh/howard-haowen/spacy-streamlit/HEAD)
- Persistent
  - Fork [my repo](https://github.com/howard-haowen/spacy-streamlit)
  - Connect [VS Code for the Web](https://vscode.dev/) to your forked repo
  - Create an account on [Streamlit Cloud](https://streamlit.io/cloud)
  - Create an app using your forked repo

----

## File structure

```txt=
README.md
app.py
jieba
models.json
pages
|-- 01_🍊Mandarin.py
|-- 02_🍣Japanese.py
|-- 03_🍔English.py
requirements.txt
tocfl_wordlist.csv
```

----

## Streamlit Cloud [dependency files](https://docs.streamlit.io/streamlit-cloud/get-started/deploy-an-app/app-dependencies)

- pip: `requirements.txt`
- conda: `environment.yml`
- pipenv: `Pipfile`
- poetry: `pyproject.toml`

----

## `requirements.txt`

```txt=
# ja_ginza is a Japanese model with a lemmatizer and a morphologizer more fine-grained than the default one in spaCy
ginza
ja_ginza
# ja_ginza depends on spacy>=3.2.0,<3.3.0
spacy>=3.2.0,<3.3.0
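# note: packages installed from direct URLs, like the two spaCy model
# tarballs below, have to match the spacy version pinned above (3.2.x)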
spacy-streamlit>=1.0.0rc1,<1.1.0
spacy-wordnet
# spacy-wordnet depends on nltk
nltk
# spaCy models
https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.2.0/zh_core_web_sm-3.2.0.tar.gz#egg=zh_core_web_sm
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0.tar.gz#egg=en_core_web_sm
# conversion between hanzi and transcriptions
dragonmapper
# Jisho online Japanese dictionary
jisho_api
# YAKE keyword extraction
spacy-ke
# interactive plotting
plotly
```

----

## app.py

- The program's entry point

```python=
import streamlit as st

st.markdown("""
# AI模型輔助語言學習
## Language Learning Assisted by AI Models

- 開啟左側選單可選擇語言,目前支援華語、日語和英語。
- Select a language from the sidebar. Supported languages include Mandarin, Japanese, and English.
- 選單自動隱藏時,點選左上角 > 符號以開啟選單。
- If the sidebar is hidden, click on the > symbol in the upper left corner to open it.
""")
```

----

## pages/01_:tangerine:Mandarin.py

- Choosing a tokenization model

```python=130
selected_tokenizer = st.radio("請選擇斷詞模型", ["jieba-TW", "spaCy"])
if selected_tokenizer == "jieba-TW":
    nlp.tokenizer = JiebaTokenizer(nlp.vocab)
```

----

## pages/01_:tangerine:Mandarin.py

- Replacing spaCy's built-in tokenizer with [jieba-tw](https://github.com/APCLab/jieba-tw)

```python=49
class JiebaTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = jieba.cut(text)  # returns a generator
        tokens = list(words)  # convert the generator to a list
        spaces = [False] * len(tokens)
        doc = Doc(self.vocab, words=tokens, spaces=spaces)
        return doc
```

----

## pages/01_:tangerine:Mandarin.py

- Filtering tokens with the [spaCy Token API](https://spacy.io/api/token)

```python=61
def filter_tokens(doc):
    clean_tokens = [tok for tok in doc if tok.pos_ not in PUNCT_SYM]
    clean_tokens = [
        tok for tok in clean_tokens
        if not tok.like_email
        and not tok.like_num
        and not tok.like_url
        and not tok.is_space
    ]
    return clean_tokens
```

----

## pages/01_:tangerine:Mandarin.py

- Selecting learning-aid features

```python=141
st.info("請勾選以下至少一項功能")
# keywords_extraction = st.sidebar.checkbox("關鍵詞分析", False)  # YAKE doesn't work for Chinese texts
analyzed_text = st.checkbox("增強文本", True)
defs_examples = st.checkbox("單詞解析", True)
# morphology = st.sidebar.checkbox("詞形變化", True)
freq_count = st.checkbox("詞頻統計", True)
ner_viz = st.checkbox("命名實體", True)
tok_table = st.checkbox("斷詞特徵", False)
```

----

## pages/01_:tangerine:Mandarin.py

- Named-entity visualization and the token-feature table are both built into spaCy

```python=204
if ner_viz:
    ner_labels = nlp.get_pipe("ner").labels
    visualize_ner(doc, labels=ner_labels, show_table=False, title="命名實體")
if tok_table:
    visualize_tokens(doc, attrs=["text", "pos_", "tag_", "dep_", "head"], title="斷詞特徵")
```

----

## pages/01_:tangerine:Mandarin.py

- The enhanced-text feature relies on [dragonmapper](https://pypi.org/project/dragonmapper/)

```python=150
if analyzed_text:
    st.markdown("## 增強文本")
    pronunciation = st.radio("請選擇輔助發音類型", ["漢語拼音", "注音符號", "國際音標"])
    for idx, sent in enumerate(doc.sents):
        tokens_text = [tok.text for tok in sent if tok.pos_ not in PUNCT_SYM]
        pinyins = [hanzi.to_pinyin(word) for word in tokens_text]
        sounds = pinyins
        if pronunciation == "注音符號":
            zhuyins = [transcriptions.pinyin_to_zhuyin(word) for word in pinyins]
            sounds = zhuyins
        elif pronunciation == "國際音標":
            ipas = [transcriptions.pinyin_to_ipa(word) for word in pinyins]
            sounds = ipas
```

----

## pages/01_:tangerine:Mandarin.py

- Word definitions rely on the [MoeDict API](https://github.com/g0v/moedict-webkit)

```python=22
def moedict_caller(word):
    st.write(f"### {word}")
    req = requests.get(f"https://www.moedict.tw/uni/{word}.json")
    try:
        definitions = req.json().get('heteronyms')[0].get('definitions')
        df = pd.DataFrame(definitions)
        df.fillna("---", inplace=True)
        if 'example' not in df.columns:
            df['example'] = '---'
        if 'synonyms' not in df.columns:
            df['synonyms'] = '---'
        if 'antonyms' not in df.columns:
            df['antonyms'] = '---'
        cols = ['def', 'example', 'synonyms', 'antonyms']
        df = df[cols]
        df.rename(columns={
            'def': '解釋',
            'example': '例句',
            'synonyms': '同義詞',
            'antonyms': '反義詞',
        }, inplace=True)
        with st.expander("點擊 + 查看結果"):
            st.table(df)
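    # The /uni/{word}.json response holds a "heteronyms" list whose
    # "definitions" supply the def/example/synonyms/antonyms fields used
    # above; an unknown word or a missing key falls through to the handler below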
    except Exception:
        st.write("查無結果")
```

----

## pages/01_:tangerine:Mandarin.py

- An add-on to `word definitions`: proportions of vocabulary levels according to the [Huayu 8,000-word list](http://huayutools.mtc.ntnu.edu.tw/ts/TextSegmentation.aspx)

```python=99
def get_level_pie(tocfl_result):
    level = tocfl_result['詞條分級'].value_counts()
    fig = px.pie(tocfl_result, values=level.values, names=level.index, title='詞彙分級圓餅圖')
    return fig
```

---

<h1 class="r-fit-text">Dataset of Formosan-Mandarin Sentence Pairs</h1>

[Open the app](http://howard-haowen.rohan.tw/Formosan-languages/)

----

## Stack

- Front end: [Streamlit](https://streamlit.io/)
- Deployment: [Streamlit Cloud](https://streamlit.io/cloud)
- Code and data hosting: [GitHub](https://github.com)

----

## Development environment

- Throwaway
  - [Open in Binder](https://mybinder.org/v2/gh/howard-haowen/Formosan-languages/HEAD)
- Persistent
  - Fork [my repo](https://github.com/howard-haowen/Formosan-languages)
  - Connect [VS Code for the Web](https://vscode.dev/) to your forked repo
  - Create an account on [Streamlit Cloud](https://streamlit.io/cloud)
  - Create an app using your forked repo

----

## File structure

```txt=
README.md
app.py
requirements.txt
Formosan-Mandarin_sent_pairs_139023entries.pkl
```

----

## `requirements.txt`

```txt=
streamlit
pandas
pandas_profiling
jupyterlab-git
```

----

## app.py

- Caching the dataset

```python=129
@st.cache  # import streamlit as st
def get_data():
    df = pd.read_pickle('Formosan-Mandarin_sent_pairs_139023entries.pkl')
    df = df.astype(str, errors='ignore')
    df = df.applymap(lambda x: x[1:] if x.startswith(".") else x)
    df = df.applymap(lambda x: x.strip())
    filt = df.Ch.apply(len) < 5
    df = df[~filt]
    return df
```

----

## app.py

- Caching the data-profiling report

```python=140
def get_report():
    # from pandas_profiling import ProfileReport
    df = get_data()
    report = ProfileReport(df, title='Report', minimal=True).to_html()
    return report
```

----

## app.py

- Filtering the dataset by user-specified conditions
  - `data source`

```python=52
sources = st.sidebar.multiselect(
    "請選擇資料來源",
    options=source_set,
    default='詞典',
)
```

```python=66
s_filt = df['來源'].isin(sources)
```

----

## app.py

- Filtering the dataset by user-specified conditions
  - data source
  - `target language`

```python=56
langs = st.sidebar.selectbox(
    "請選擇語言",
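    # each Chinese option is matched against the English names stored in the
    # dataframe's "Language" column (e.g. 噶瑪蘭 -> "Kavalan") by the
    # filters shown in the next snippet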
    options=['布農','阿美','撒奇萊雅','噶瑪蘭','魯凱','排灣','卑南',
             '泰雅','賽德克','太魯閣','鄒','拉阿魯哇','卡那卡那富',
             '邵','賽夏','達悟'],
)
```

```python=69
if langs == "噶瑪蘭":
    l_filt = df['Language'] == "Kavalan"
elif langs == "阿美":
    l_filt = df['Language'] == "Amis"
```

----

## app.py

- Filtering the dataset by user-specified conditions
  - data source
  - target language
  - `keyword-query language`

```python=61
texts = st.sidebar.radio(
    "請選擇關鍵詞查詢文字類別",
    options=['華語','族語'],
)
```

```python=103
text_box = st.sidebar.text_input('在下方輸入華語或族語,按下ENTER後便會自動更新查詢結果')
# search for keywords in Mandarin or Formosan
t_filt = df[texts].str.contains(text_box, flags=re.IGNORECASE)
```

----

## app.py

- Filtering the dataset by user-specified conditions
  - data source
  - target language
  - keyword-query language

```python=109
filt_df = df[(s_filt) & (l_filt) & (t_filt)]
```

---

<h1 class="r-fit-text">Wordle Games for Taiwan's Languages</h1>

- The original game: [English Wordle](https://www.nytimes.com/games/wordle/index.html)
- Inspiration: [AnyLanguage-Wordle](https://github.com/roedoejet/AnyLanguage-Wordle)
- Wordle for Taiwan's languages
  - [Taiwanese Southern Min](https://howard-haowen.github.io/tsm-wordle/)
  - [Paiwan](https://howard-haowen.github.io/paiwan-wordle/)
  - [Amis](https://howard-haowen.github.io/amis-wordle/)

----

## Stack

- Front end: [React](https://reactjs.org/)
- Deployment: [GitHub Pages](https://pages.github.com/)
- CI/CD: [GitHub Actions](https://github.com/features/actions)
- Code and data hosting: [GitHub](https://github.com)
  - [Taiwanese Southern Min](https://github.com/howard-haowen/tsm-wordle)
  - [Paiwan](https://github.com/howard-haowen/paiwan-wordle)
  - [Amis](https://github.com/howard-haowen/amis-wordle)

----

## GitHub Pages

- Free hosting for static sites ([example](https://howard-haowen.rohan.tw/))

![](https://pages.github.com/images/theme-chooser@2x.png)

----

## GitHub Actions

- Free automated testing and deployment

![](https://blog.kiprosh.com/content/images/2019/12/github-action.jpg)

----

## File structure of the [Paiwan Wordle](https://github.com/howard-haowen/paiwan-wordle)

```txt=
.env
.github
|-- workflows
| |-- publish.yml
.gitignore
.husky
|-- .gitignore
|-- pre-commit
.prettierrc
.vscode
|-- settings.json
Dockerfile
LICENSE
README.md
index.html
package-lock.json
package.json
postcss.config.js
public
|-- favicon.ico
|-- index.html
|-- locales
| |-- en
| | |-- translation.json
| |-- es
| | |-- translation.json
| |-- zh
| | |-- translation.json
|-- logo192.png
|-- logo512.png
|-- manifest.json
|-- robots.txt
src
|-- App.css
|-- App.test.tsx
|-- App.tsx
|-- components
| |-- alerts
| | |-- Alert.tsx
| |-- grid
| | |-- Cell.tsx
| | |-- CompletedRow.tsx
| | |-- CurrentRow.tsx
| | |-- EmptyRow.tsx
| | |-- Grid.tsx
| |-- keyboard
| | |-- Key.tsx
| | |-- Keyboard.tsx
| |-- mini-grid
| | |-- MiniCell.tsx
| | |-- MiniCompletedRow.tsx
| | |-- MiniGrid.tsx
| |-- modals
| | |-- AboutModal.tsx
| | |-- BaseModal.tsx
| | |-- InfoModal.tsx
| | |-- StatsModal.tsx
| | |-- TranslateModal.tsx
| |-- stats
| | |-- Histogram.tsx
| | |-- Progress.tsx
| | |-- StatBar.tsx
|-- constants
| |-- config.ts
| |-- orthography.ts
| |-- validGuesses.ts
| |-- wordlist.ts
|-- i18n.ts
|-- index.css
|-- index.tsx
|-- lib
| |-- keyboard.ts
| |-- localStorage.ts
| |-- share.ts
| |-- stats.ts
| |-- statuses.ts
| |-- tokenizer.ts
| |-- words.ts
|-- logo.svg
|-- react-app-env.d.ts
|-- reportWebVitals.ts
|-- setupTests.ts
tailwind.config.js
tsconfig.json
update_log.md
```

----

- Parts that need customization

```txt=
.github
|-- workflows
| |-- publish.yml
package.json
public
|-- locales
| |-- en
| | |-- translation.json
| |-- es
| | |-- translation.json
| |-- zh
| | |-- translation.json
src
|-- constants
| |-- config.ts
| |-- orthography.ts
| |-- validGuesses.ts
| |-- wordlist.ts
```

----

## Data sources

- Formosan languages: [Dataset of Formosan-Mandarin sentence pairs](https://github.com/howard-haowen/Formosan-languages)
- Taiwanese Southern Min: [the MoE Dictionary of Frequently-Used Taiwan Minnan](https://github.com/g0v/moedict-data-twblg)

----

## Data preparation

- Orthography-unit table `src/constants/orthography.ts`

```typescript=
import { CONFIG } from './config'

export const ORTHOGRAPHY = [
  'a', 'b', 'c', 'd', 'dj', 'dr', 'e', 'g', 'h', 'i', 'j', 'k', 'l', 'lj',
  'm', 'n', 'ng', 'o', 'p', 'q', 'r', 's', 't', 'tj', 'u', 'v', 'w', 'y', 'z',
]

if (CONFIG.normalization) {
  ORTHOGRAPHY.forEach(
    (val, i) => (ORTHOGRAPHY[i] = val.normalize(CONFIG.normalization))
  )
}
```

----

- Valid-guess word list `src/constants/validGuesses.ts`

```typescript
import { CONFIG } from './config'

export const VALIDGUESSES = [
  'timadju',
  'aravac',
  'nimadju',
  'caucau',
  'mareka',
]

if (CONFIG.normalization) {
  VALIDGUESSES.forEach(
    (val, i) => (VALIDGUESSES[i] = val.normalize(CONFIG.normalization))
  )
}
```

----

- Answer word list `src/constants/wordlist.ts`

```typescript
import { CONFIG } from './config'

export const WORDS = [
  'timadju',
  'aravac',
  'nimadju',
  'caucau',
  'mareka',
]

if (CONFIG.normalization) {
  WORDS.forEach((val, i) => (WORDS[i] = val.normalize(CONFIG.normalization)))
}

function shuffle(array: any[]) {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[array[i], array[j]] = [array[j], array[i]]
  }
}

if (CONFIG.shuffle) {
  shuffle(WORDS)
}
```

----

## Configuration

- App config file `src/constants/config.ts`

```typescript
export const CONFIG = {
  tries: 6, // This changes how many tries you get to finish the wordle
  language: 'Paiwan', // This changes the display name for your language
  wordLength: 6, // This sets how long each word is based on how many characters (as defined in orthography.ts) are in each word
  author: 'Haowen Jiang', // Put your name here so people know who made this Wordle!
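  // (the remaining fields below follow the AnyLanguage-Wordle template;
  // for a new language, only the values need to change)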
  authorWebsite: 'https://howard-haowen.github.io/', // Put a link to your website or social media here
  wordListSource: 'Dataset of Formosan-Mandarin sentence pairs', // Describe the source material for your words here
  wordListSourceLink: 'https://share.streamlit.io/howard-haowen/formosan-languages/main/app.py', // Put a link to the source material for your words here
  //
  // THESE NEXT SETTINGS ARE FOR ADVANCED USERS
  //
  googleAnalytics: '', // You can use this if you use Google Analytics
  shuffle: false, // whether to shuffle the words in the wordlist each time you load the app (note: you will lose the 'word of the day' functionality if this is true)
  normalization: 'NFC', // whether to apply Unicode normalization to words and orthography - options: 'NFC', 'NFD', 'NFKC', 'NFKD', false
  startDate: 'March 1, 2022 00:00:00', // what date and time to start your game from
  defaultLang: 'zh', // the default interface language
  availableLangs: ['en', 'zh'], // the options available to the user for translation languages
}
```

----

- UI translation file `locales/zh/translation.json`

```json=
{
  "about": "關於本遊戲",
  "gameName": "Pinaiwanan, 排灣語",
  "enterKey": "輸入",
  "deleteKey": "刪除",
  "notEnoughLetters": "字母數不足,請填滿方塊",
  "wordNotFound": "查無此單詞",
  "solution": "正解為 {{solution}}",
  "gameCopied": "已將您的輝煌戰績複製至剪貼簿,開始分享吧!",
  "howToPlay": "遊戲規則",
  "instructions": "您總共有 {{tries}} 次猜測單詞的機會。輸入的詞必須填滿所有方塊。每輸入一個單詞,方塊將會改變顏色,透過顏色的變化您將能推測正解。",
  "pickYourLanguage": "選擇語言",
  "correctSpotInstructions": "字母 T 出現在正解當中,並且在正確的位置。",
  "wrongSpotInstructions": "字母 Y 出現在正解當中,但不在正確的位置。",
  "notInWordInstructions": "字母 U 不在正解當中。",
  "guessDistribution": "猜詞分佈",
  "statistics": "統計",
  "totalTries": "總猜測次數",
  "successRate": "成功率",
  "currentStreak": "當前連續獲勝次數",
  "bestStreak": "最佳連續獲勝次數",
  "newWordCountdown": "距離下個新單詞出現還有",
  "share": "分享",
  "winMessages": [
    "太棒了!",
    "完美!",
    "做得好!"
  ],
  "firstExampleWord": [
    { "letter": "T", "highlight": true },
    { "letter": "I", "highlight": false },
    { "letter": "M", "highlight": false },
    { "letter": "A", "highlight": false },
    { "letter": "DJ", "highlight": false },
    { "letter": "U", "highlight": false }
  ],
  "secondExampleWord": [
    { "letter": "P", "highlight": false },
    { "letter": "A", "highlight": false },
    { "letter": "Y", "highlight": true },
    { "letter": "U", "highlight": false },
    { "letter": "A", "highlight": false },
    { "letter": "N", "highlight": false }
  ],
  "thirdExampleWord": [
    { "letter": "TJ", "highlight": false },
    { "letter": "U", "highlight": false },
    { "letter": "R", "highlight": false },
    { "letter": "U", "highlight": false },
    { "letter": "V", "highlight": false },
    { "letter": "U", "highlight": true }
  ],
  "aboutAuthorSentence": "本遊戲源自 Wordle 開源專案,由<3>{{author}}</3>改編為排灣語版本。",
  "aboutCodeSentence": "原專案作者為<3>Hannah Park</3>,原始碼在<1>the original code</1>。本遊戲的開發依據為<5>Aidan Pine's fork</5>。",
  "aboutDataSentence": "本遊戲的單詞來源為 <1>{{wordListSource}}</1>。",
  "aboutOriginalSentence": "您也可以前往 <1>here</1> 玩一玩原版的 Wordle 喔!",
  "languages": {
    "zh": "中文",
    "en": "English"
  }
}
```

----

## Automated testing and deployment

- The workflow file can have any name, e.g. `publish.yml`
- But it must live under `.github/workflows/`
- npm packages are installed according to the project's `package.json`

```yaml=
name: Deploy
on:
  pull_request:
  push:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install
        run: |
          npm install
      - name: Test
        run: |
          npm run test
      - name: Build
        run: |
          npm run-script build
      - name: Deploy 🚀
        uses: JamesIves/github-pages-deploy-action@v4.2.3
        with:
          branch: gh-pages # The branch the action should deploy to.
          folder: build # The folder the action should deploy.
```

---

# Homework

Pick one of the following:

- Fork the [AI-assisted language learning app](https://github.com/howard-haowen/spacy-streamlit), add one more language (one [supported by spaCy models](https://spacy.io/models)), and open a pull request
- Modeled on the Formosan-Mandarin sentence-pair app, build a text-dataset app with a corpus of your choice
- Build a Wordle game for another Taiwan language (i.e. one not yet covered: Taiwanese Southern Min, Paiwan, and Amis already exist)
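----

## Hint: multigraphs count as single cells

Orthographies like Paiwan treat digraphs such as `dj`, `lj`, `ng`, and `tj` as single letters, so the grid and keyboard operate on orthography units rather than raw characters; that is why `timadju` fills only six cells. The app does this with the unit list in `src/constants/orthography.ts` (see `src/lib/tokenizer.ts`). The snippet below is only a minimal Python sketch of the same greedy longest-match idea, handy when checking word lengths for a new language; it is not the app's actual TypeScript code.

```python
# Paiwan orthography units, copied from src/constants/orthography.ts
ORTHOGRAPHY = [
    'a', 'b', 'c', 'd', 'dj', 'dr', 'e', 'g', 'h', 'i', 'j', 'k', 'l',
    'lj', 'm', 'n', 'ng', 'o', 'p', 'q', 'r', 's', 't', 'tj', 'u', 'v',
    'w', 'y', 'z',
]

def to_units(word):
    """Split a word into orthography units, preferring longer matches."""
    units = sorted(ORTHOGRAPHY, key=len, reverse=True)  # try 'dj' before 'd'
    result, i = [], 0
    while i < len(word):
        for u in units:
            if word.startswith(u, i):
                result.append(u)
                i += len(u)
                break
        else:  # no unit matches: keep the raw character as its own cell
            result.append(word[i])
            i += 1
    return result

print(to_units('timadju'))  # ['t', 'i', 'm', 'a', 'dj', 'u'] -> 6 cells
```

A word belongs in `wordlist.ts` only if its unit count equals `wordLength` in `config.ts` (6 for the Paiwan game).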