# Abstract

Services of personalized TTS systems for the Mandarin-speaking speech impaired are rarely reported. Taiwan started the VoiceBanking project in 2020, aiming to build a complete set of services that deliver personalized Mandarin TTS systems to amyotrophic lateral sclerosis (ALS) patients. This paper reports the corpus design, corpus recording, data purging and correction, and evaluations of the personalized TTS systems developed for the VoiceBanking project. The developed corpus is named the VoiceBank-2023 speech corpus after its release year. The corpus contains 29.78 hours of utterances, with prompts of short paragraphs and common phrases, spoken by 111 native Mandarin speakers. The corpus is labeled with gender, degree of speech impairment, type of user, transcription, SNR, and speaking rate. VoiceBank-2023 is available by request for non-commercial use, and all parties are welcome to join the VoiceBanking project to improve services for the speech impaired.

## Introduction

The voice of each individual may be regarded as part of his or her identity. Amyotrophic lateral sclerosis (ALS) patients gradually lose the ability to control their muscles, which affects control of the glottal folds and the shape of the vocal tract, making it difficult to pronounce words and communicate smoothly. ALS patients are therefore encouraged to record their voices before they develop dysarthria. The recorded speech can be used to construct personalized text-to-speech (TTS) systems, which serve as speech-generating devices (SGD) for augmentative and alternative communication (AAC).
---

In English-speaking countries, many companies and research institutes provide services that build personalized TTS systems for ALS patients. A significant one is Model Talker \cite{ModelTalker}, the earliest and largest research platform in the US, established by the Nemours Speech Research Laboratory at the Alfred I. duPont Hospital for Children in Delaware, US. With the advances in speech technologies, the following commercial SGD providers can be found: Cereproc Cerevoice ME \cite{CereProc}, VocalID \cite{VOCALiD}, Acapela my-own-voice DNN \cite{Acapela}, the Voice Keeper \cite{VoiceKeeper}, and SpeakUnique\footnote{SpeakUnique powers the user-friendly voice-banking platform, “I Will Always Be Me” (https://www.iwillalwaysbeme.com/)} \cite{SpeakUnique}.

---

In 2011, the Motor Neurone Disease (MND) Association in the UK \cite{mndauk} started its voice banking project, recommending that patients deposit their voices and messages with the services provided by the above-mentioned institutes and companies. Similarly, in 2018, the ALS Association in the US initiated Project Revoice \cite{usrv} to encourage patients to contact those institutes and vendors to build customized TTS systems for themselves before they develop dysarthria.
---

# Specification for the VoiceBank-2023 corpus

| | |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Corpus Name | VoiceBank-2023 (URL: https://github.com/VoiceBank-NTPU-TW/VoiceBank-2023) |
| Language | mostly Taiwanese Mandarin |
| Text/Prompt Materials | 1) Part-1 (VoiceBanking): 133 short paragraphs<br>2) Part-2 (Common Phrases): 556 common phrases |
| Speaking Style | 1) read speech for Part-1 (VoiceBanking)<br>2) spontaneous-like for Part-2 (Common Phrases) |
| Uses | 1) personalized TTS; 2) assessments of dysarthria, voice quality (jitter/shimmer), and sound quality (regarding recording) |
| # of Speakers (speaker type, gender, dysarthria degree) | 111 (all)<br>= 39 (ALS patients) + 63 (voice donors) + 9 (unknown)<br>= 47 (female) + 64 (male)<br>= 86 (degree 1: high speech intelligibility) + 11 (degree 2) + 12 (degree 3) + 2 (degree 4: low speech intelligibility) |
| # of Utterances (prompt type, gender, speaker type, dysarthria degree) | 12,875 (all)<br>= 7,625 (Part-1: VoiceBanking) + 5,250 (Part-2: Common Phrases)<br>= 5,677 (female) + 7,198 (male)<br>= 8,876 (patients) + 3,875 (donors) + 124 (unknown)<br>= 8,760/2,246/1,849/20 (degree 1/2/3/4) |
| Total Duration (hours) | 29.78 (all)<br>= 28.18 (Part-1: VoiceBanking) + 1.60 (Part-2: Common Phrases)<br>= 12.47 (female) + 17.31 (male)<br>= 17.66 (patients) + 11.78 (donors) + 0.34 (unknown)<br>= 19.37/5.74/4.58/0.09 (degree 1/2/3/4) |
| Duration for each Speaker (minutes) | Part-1 (VoiceBanking): 15.37±10.97<br>Part-2 (Common Phrases): 5.99±5.34 |
| # of Syllables | 360,586 (all)<br>= 342,486 (Part-1: VoiceBanking) + 18,100 (Part-2: Common Phrases)<br>= 153,396 (female) + 207,190 (male)<br>= 185,401 (patients) + 170,387 (donors) + 4,798 (unknown)<br>= 270,805/55,490/33,835/456 (degree 1/2/3/4) |
| Utterance Length in Syllables | Part-1 (VoiceBanking): 44.13±9.03<br>Part-2 (Common Phrases): 3.30±0.54<br>(utterance-wise mean±standard deviation) |
| Utterance Length in Seconds | Part-1 (VoiceBanking): 13.16±4.87<br>Part-2 (Common Phrases): 1.08±0.32<br>(utterance-wise mean±standard deviation) |
| Waveform Encoding | linear PCM, 48 kHz sample rate, 16-bit resolution, mono channel |
| Microphone/Recording Environment | mostly USB-quality microphone / mostly home or office |
| Files for each Utterance | 1) *.TextGrid: time alignments at the phonetic (initial/final), syllabic (tone), and word (part of speech and punctuation marks) levels<br>2) *.txt: raw text file in UTF-8<br>3) *.wav: WAVE file |

# II.A. Consideration

With the fast development of AI technologies, it is easy to find open resources for constructing TTS systems. Famous open-source speech corpora for constructing TTS include AISHELL3 [10] for Mandarin and VCTK [11], ARCTIC [12], LibriTTS [13], and HiFiTTS [14] for English. However, none of the corpora stated above is specially designed for constructing personalized systems for ALS patients. Since ALS patients gradually lose their ability to speak, it is desirable to collect patients' speech with the following considerations:

1) Consistent Text Material: Because the corpus size for each patient is small, each personalized TTS system needs to be constructed by a speaker adaptation technique. The VoiceBanking project used two large speech corpora, the Treebank-SR corpus [15] and the Danei corpus², to train the reference/prior model. Each personalized TTS model is then jointly trained with the two large speech corpora. To make the adaptation data consistent with the data for the reference model, the text material for the patients' adaptation speech corpus is designed to be a subset of the Treebank-SR corpus.
2) Phonetically Balanced Voice Banking with Limited Text Material: The most important purpose of speech recording for a patient is to let the TTS model produce intelligible synthesized speech and learn the speaker's identity from the recorded data. It is desirable to collect as little speech data as possible while still covering most pronunciations. Utterances for a speaker are therefore recorded in a sorted sequence such that the utterances at the beginning incrementally cover most pronunciations quickly.

²The Danei speech corpus is a Mandarin-English mixed speech corpus owned by AcoustInTek Co., Ltd. and licensed for the VoiceBanking project.

3) Common Phrases for Daily Life: To enrich the personalized TTS with communicative or expressive functions, recorded common phrases can be enrolled in the training of the TTS model or directly played back when using AAC.

According to these considerations, the VoiceBank-2023 corpus was designed to have two parts comprising eight sub-corpora:

1) Part-1: VoiceBanking (sub-corpora 1 and 2)
   - Sub-corpus 1: covers all Mandarin initial and final types
   - Sub-corpus 2: enlarges the sample size for voice banking

2) Part-2: Common Phrases (sub-corpora 3 to 8)
   - Sub-corpora 3 to 8: comprised of 1- to ≥6-character phrases to enrich the communicative functions

# II.B. Part-1: VoiceBanking (Sub-corpora 1 and 2)
1) Corpus Text Resource – the Treebank-SR Corpus: The prompts for voice banking are drawn from the Treebank-SR corpus, which has been used to build a Mandarin speaking-rate-controlled TTS system and to study prosody and grammar. In the VoiceBanking project, the Treebank-SR corpus was also used to build the initial TTS models. The corpus consists of four sub-corpora with identical text content, read at fast, medium, normal, and slow speaking rates. Each sub-corpus has 376 sentences with a total of 52,031 syllables; the average sentence length is 138.38 syllables with a standard deviation of 24.97. Note that each sentence usually corresponds to a short paragraph.

2) Problems with the Text Resource: The sentences of the Treebank-SR corpus are too long for non-professional speakers to read fluently. We therefore sort the 376 sentences by pronunciation coverage and split them into shorter sentences to make recording easier. Note that we do not split and sort the paragraphs from the outset, because we prefer to record speakers' voices with more comprehensive text content.

3) Method for Deriving the Sub-corpora: The prompts of Part-1: VoiceBanking, i.e., $\{S_{j}\}|_{j=1,2,...,J}$, are derived from the 376 paragraphs, i.e., $\{P_{k}\}|_{k=1,2,...,376}$, by the following steps:

Step 1 (compute the sorting keys): For the $k$-th paragraph $P_{k}$, compute 1) $|P_k|$, the number of syllables in the paragraph, and 2) $|T(P_k)|$, the number of distinct initial and final types appearing in the paragraph, where $T(\cdot)$ is the operator that extracts the set of initial/final types.

Step 2 (initial paragraph sorting): Sort the 376 paragraphs by $|T(P_k)|$ and then $|P_k|$ in descending order, denoted $\{P_{k}\}|_{k=1,2,...,376} \gets sort(\{P_{k}\}|_{k=1,2,...,376})$.

Step 3 (initialize the priority paragraph set): Let $V=\{P_1\}$ be the priority paragraph set and let $k=2$ be the priority paragraph pointer. The set $V$ is used to count the covered pronunciation units, i.e., initials and finals, over the first through $k$-th sorted paragraphs, and to check whether the counts of pronunciation units are balanced.

Step 4 (update the priority set and the paragraph ordering): For $i=k$ to 376, if $|T(V \cup \{P_i\})| > |T(V)|$, update the paragraph ordering by swapping the contents of $P_i$ and $P_k$, update the priority set by $V \gets V \cup \{P_k\}$, set $k=k+1$, and go to Step 5.

Step 5 (check the termination criterion): Let $C_{i}(V)$ denote the count of the $i$-th initial or final type in the priority set $V$. If $k>376$ or $C_{i}(V)$ reaches the required count for all $i=1,2,...,I$, go to Step 6; otherwise, go to Step 4.

Step 6 (derive the initial prompts of Part-1: VoiceBanking): We manually split the sorted paragraphs $\{P_{k}\}|_{k=1,2,...,376}$ into an ordered sequence of short sentences $\{S_{j}\}|_{j=1,2,...,J}$. Note that the split is adjusted so that each short sentence reads better semantically.

Step 7 (refine the prompts for readability): Some of the short sentences $\{S_{j}\}|_{j=1,2,...,J}$ are not easy to read. We therefore add a few characters to the hard-to-read sentences to make them easier to read. The refined texts are the prompts of Part-1: VoiceBanking.

# III. Corpus Recording

Here, we report the recording process for the VoiceBank-2023 corpus from January 2020 to June 2023. The corpus recording is divided into four phases according to the research purposes and recording environments. TABLE III summarizes the four phases. In Phase 1, speech corpora were collected by on-site recording using a high-quality microphone with the help of audio technicians.
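Steps 1-5 of Section II.B amount to a greedy coverage ordering. Below is a minimal, hypothetical sketch: paragraphs are represented simply as lists of initial/final labels (the real procedure derives them from the Treebank-SR text and additionally checks the balanced unit counts $C_i(V)$, which is omitted here):

```python
def sort_by_coverage(paragraphs):
    # Step 2: sort by (number of distinct unit types, syllable count), descending.
    order = sorted(paragraphs, key=lambda p: (len(set(p)), len(p)), reverse=True)
    covered = set()  # pronunciation-unit types covered by the priority set V
    for k in range(len(order)):
        for i in range(k, len(order)):
            # Step 4: take the first paragraph that adds unseen unit types.
            if len(covered | set(order[i])) > len(covered):
                order[k], order[i] = order[i], order[k]  # swap P_i and P_k
                covered |= set(order[k])                 # V <- V ∪ {P_k}
                break
        else:
            break  # Step 5 (simplified): no remaining paragraph adds coverage
    return order

# Toy usage: three "paragraphs" over a tiny initial/final inventory.
paras = [["b", "a"], ["b", "a", "ang"], ["zh", "i"]]
ordered = sort_by_coverage(paras)
print(ordered)  # → [['b', 'a', 'ang'], ['zh', 'i'], ['b', 'a']]
```

Recording the paragraphs in this order lets the earliest utterances cover most pronunciation-unit types quickly, which matters when a patient may not finish the whole prompt list.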
On the other hand, in Phases 2-4, users banked their voices by logging into https://voicebank.ce.ntpu.edu.tw/, which provides a self-service GUI for voice recording.

# III.A. Phase 1: On-site recording from 2020/4 to 2021/4

We provided on-site recording services so that the enrolled patients could bank their voices at their homes with the in-person help of the research team, ensuring that the recordings can be fully used in constructing personalized TTS systems. On-site recording was adopted for the following reasons:

1. The project needed to collect patients' voices as soon as possible, before their speech became impaired.
2. It is not easy for the enrolled patients to commute to a professional recording studio. On-site recording may increase the patients' willingness to bank their voices.
3. Patients have to make more effort to read text materials than most people do. In-person assistance during recording may alleviate patients' workload.
4. The on-site recording engineers ensured that the recording setups met the high-quality requirements for constructing TTS.
5. Because Taiwan is densely populated, the on-site recording engineers helped reduce unwanted environmental interference from neighbors or outdoors.

# III.B. Phases 2-4: Web-based recording (from 2022/1)

Because the COVID-19 pandemic became severe in Taiwan in 2021, on-site recording was no longer feasible.
Meanwhile, quality microphone headsets became prevalent because people needed to work from home with conference-call systems. We therefore developed a web-based recording platform, the VoiceBank website (https://voicebank.ce.ntpu.edu.tw/), to facilitate voice banking for everyone.

---

TABLE III: A summary of the four phases

| | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
| -------- | -------- | -------- | -------- | -------- |
| Description | onsite recording | trial web-based service<br>for voice donors | trial web-based service<br>for patients | web-based service announced |
| Dates | Apr. 2020 - Apr. 2021 | Jan.-Feb. 2022 | Feb. 2022 - Jan. 2023 | Mar.-Jun. 2023 |
| Purpose | Test the feasibility of the VoiceBanking sub-corpus (Part-1) in constructing personalized TTS | Test the feasibility of web-based<br>recording at home by voice donors | Test the feasibility of web-based<br>recording at home by ALS patients | Test the feasibility of web-based<br>recording at home for all |
| Microphone | audio-technica ATM73a | various USB microphones<br>owned by speakers | various USB microphones<br>owned by speakers | various USB microphones<br>owned by speakers |
| Audio Interface | Steinberg<br>UR RT-2 | USB microphones built-in | USB microphones built-in | USB microphones built-in |
| Audio Recording Software | Adobe Audition | login to https://voicebank.<br>ce.ntpu.edu.tw/ with a browser | login to https://voicebank.<br>ce.ntpu.edu.tw/ with a browser | login to https://voicebank.<br>ce.ntpu.edu.tw/ with a browser |
| Guidance for<br>Recording | recording technicians | self-service GUI | self-service GUI | self-service GUI |
| Prompts | printed on paper | displayed on GUI | displayed on GUI | displayed on GUI |
| Registration | manual | manual in the backend of the website | manual in the backend of the website | online on the website |
| Speakers # (patient/donor/<br>unknown)(gender)(dysarthria degree 1/2/3/4) | 20 (18/2/0)(6F/14M)(9/5/5/1) | 61 (0/61/0)<br>(27F/34M)(61/0/0/0) | 18 (17/0/1)<br>(7F/11M)(5/5/7/1) | 12 (4/0/8)<br>(7F/5M)(11/1/0/0) |

---

The audio recording functions are provided by the RecordRTC library [21]. The display and playback of a recorded waveform are powered by the wavesurfer.js library [22]. The recorded waveforms are in a linear PCM format with a 48 kHz sample rate and 16-bit resolution. Note that we turn off the speech enhancement and auto-gain control functions of RecordRTC to avoid degrading the original speech quality. Instead, we encourage the enrolled speakers to use USB microphones to record their voices in a quiet environment.

---

Fig. 2 shows the GUI of the recording page on the VoiceBank website. The left pane shows the selection of sub-corpora and their paragraphs. The right pane contains, from top to bottom, a sentence index display, a waveform display, a text prompt, and a recording control pane. To record the corpora, users first select a sub-corpus and one of its paragraphs in the left pane. Once a paragraph is selected, the prompt of the first sentence of that paragraph is shown. The waveform display shows the waveform if the sentence has already been recorded; otherwise it shows a black background.

---

![](https://hackmd.io/_uploads/BkCbMoSy6.png)

Fig. 2: GUI of https://voicebank.ce.ntpu.edu.tw/

---

The user may push the green button on the recording control pane to start a take. While recording, the green button turns red, and the waveform display shows the recorded waveform in real time.
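As a reference for downstream users, the storage format stated above (linear PCM, 48 kHz sample rate, 16-bit resolution, mono) can be written and verified with Python's standard `wave` module. A minimal sketch (the file name and the test tone are hypothetical, not part of the corpus):

```python
# Minimal sketch: write and verify a WAV file in the corpus's storage
# format (linear PCM, 48 kHz sample rate, 16-bit resolution, mono).
import math
import struct
import wave

SAMPLE_RATE = 48_000   # Hz
SAMPLE_WIDTH = 2       # bytes per sample -> 16-bit resolution
CHANNELS = 1           # mono

def write_tone(path, freq_hz=1000.0, seconds=0.1):
    """Write a short sine tone in the VoiceBank-2023 waveform format."""
    n = int(SAMPLE_RATE * seconds)
    samples = (int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
               for t in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

write_tone("example.wav")
with wave.open("example.wav", "rb") as w:
    print(w.getframerate(), w.getsampwidth() * 8, w.getnchannels())
# → 48000 16 1
```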
The red button turns green again when the user presses it to stop recording. After recording stops, the recorded waveform is shown on the waveform display and saved in the backend of the VoiceBank website. Users may click on the waveform display to play back the recorded sentence. If users are dissatisfied with a take, they can re-record the sentence. All takes are stored on the server, and only the last one is shown to the user. The last take is regarded as the usable sample.

---

# IV. Data Purging and Correction (Phases 2-4)

Because the recordings in Phases 2-4 were made by the speakers themselves, their quality can vary much more than that of the Phase 1 recordings, which were made on site by trained personnel who could correct prompt-reading errors immediately. A data-cleaning process is therefore needed, covering both the filtering of poor-quality audio files and error correction. Fig. 3 shows the flowchart of this process, which can be roughly divided into three parts: file format checking, forced-alignment checking, and automatic speech recognition (ASR) checking.

![](https://hackmd.io/_uploads/B1ys3doTh.png)

---

First, the file format check removes corrupted files, e.g., files with damaged headers or files whose waveform length is zero due to network problems. Second, offline forced alignment with SLMTK removes audio-transcript pairs that cannot be aligned. SLMTK performs text analysis, acoustic feature extraction, and text-speech alignment: text analysis produces the sequence of linguistic units corresponding to the text input, and the alignment step produces a TextGrid file containing the time-aligned linguistic units. Pairs for which no TextGrid can be generated are removed; we found that failure to generate a TextGrid usually means the speech content deviates too much from the transcript.

In the third step, we use ASR to further check how well the speech matches the text. We use a Branchformer end-to-end model, pre-trained on the Aishell-1 corpus (85 hours) and then adapted with the combination of TCC-300 (25 hours) and 6 hours of VoiceBank data that had passed the correctness and forced-alignment checks. Since Aishell-1 carries a Mainland Chinese accent, the Taiwanese-accented TCC-300 data is needed to adapt the model to Taiwanese-accented Mandarin. In addition, because the Branchformer cannot accept overly long inputs, we cut each audio file into shorter segments as model inputs; based on the alignment information produced by forced alignment, we obtain pre-segmented sentences delimited by silences and punctuation marks.

Running all the VoiceBank data through this model yields a character error rate (CER) of 4.0% and a sentence error rate (SER) of 9.4% over 32,104 pre-segmented short sentences. Because the CER is very low, we can trust the recognition results to help with error correction. Finally, we manually inspected the short sentences whose per-sentence CER exceeded 30%, removing those that were hard to fix and correcting the transcripts that were easy to fix.

# Samples

## Phase-2, speakerid=019, degree of the speech impairment=1

# Download

The VoiceBank-2023 corpus is available by request for non-commercial use. Please email Prof.
Chen-Yu Chiang, NTPU, Taiwan, with the request: cychiang@mail.ntpu.edu.tw

# V. Labeling and Metadata

Each utterance of the VoiceBank-2023 corpus has a WAVE file, a raw text file, and a TextGrid file. In the following, we illustrate the content of a TextGrid file opened along with its WAVE file in Praat [27], and the metadata derived from the corpus.

# V.A. TextGrid Labeling

Fig. 4 shows an exemplar TextGrid file opened with the corresponding WAVE file. The upper two panes display the waveform and the spectrogram with the F0 trajectory. The lower five panes show five tiers of time alignments, from top to bottom: HMM (acoustic units of initials and finals), SyllableTone (syllable tone), Word (lexical word), POS (part of speech associated with Word), and PM (sentence-like units delimited by punctuation marks). The TextGrid labelings of the corpus can be used in constructing speech models that need information about linguistic units and the associated time alignments.

# V.B. Metadata

Since Phases 1-4 recorded speakers' utterances with registration information stored in a database management system, the following metadata can be obtained along with the TextGrid labeling files: speaker type, gender, the number of recording takes, the time elapsed for recording each utterance (seconds), speech rate, articulation rate, and estimated signal-to-noise ratio (SNR). The degree of dysarthria was labeled for each speaker with four levels (degrees).

Fig. 4: Exemplar WAVE and TextGrid files opened by Praat
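The interval-tier structure of the TextGrid files described above can be read with a few lines of standard-library code. The sketch below is a hypothetical minimal reader for long-format interval tiers, for illustration only; real corpus files are better handled with a dedicated TextGrid library:

```python
# Hypothetical minimal reader for long-format Praat TextGrid interval
# tiers, returning {tier_name: [(xmin, xmax, label), ...]}.
import re

TIER_RE = re.compile(r'name = "([^"]*)"')
INT_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')

def read_intervals(tg_text):
    """Extract interval tiers from the text of a long-format TextGrid."""
    result = {}
    # Split the file at each tier header so intervals stay with their tier.
    for part in re.split(r'item \[\d+\]:', tg_text)[1:]:
        name = TIER_RE.search(part)
        if name:
            result[name.group(1)] = [
                (float(a), float(b), t) for a, b, t in INT_RE.findall(part)]
    return result

# Toy usage on a synthetic two-interval SyllableTone tier.
tg = '''
item [1]:
    class = "IntervalTier"
    name = "SyllableTone"
    xmin = 0
    xmax = 1.0
    intervals: size = 2
    intervals [1]:
        xmin = 0
        xmax = 0.5
        text = "ni3"
    intervals [2]:
        xmin = 0.5
        xmax = 1.0
        text = "hao3"
'''
tiers = read_intervals(tg)
print(tiers["SyllableTone"][0])  # → (0.0, 0.5, 'ni3')
```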
The four degrees were assigned by informal subjective tests of listening to one to three utterances per speaker, considering the following features: speech intelligibility, dysarthria, prosodic fluency, and speaking rate. The four degrees and their characteristics are: 1) degree 1: fluent speech without speech impairment; 2) degree 2: disfluent prosody; 3) degree 3: light dysarthria but high speech intelligibility; 4) degree 4: dysarthria and low speech intelligibility. We also label each speaker with a sound quality score regarding the recording, again using informal listening tests. Sound quality is measured by a five-point mean opinion score (MOS), in which 1 and 5 points represent low and high sound quality, respectively.

Some notable statistics can be found in the metadata:

1) Speakers with higher dysarthria degrees have lower speech and articulation rates.
2) More recording takes were made for Part-1: VoiceBanking than for Part-2: Common Phrases (1.9 vs. 1.3 takes on average), showing that short utterances alleviate the workload.
3) The speakers' self-service recordings with quality USB microphones in Phases 2-4 generally have similar or even higher SNRs than the on-site recordings in Phase 1. This partially confirms the feasibility of online web-based recording for voice banking.

# VI. Corpus Evaluation for Constructing Personalized TTS Systems
We built and evaluated TTS systems using the Part-1: VoiceBanking sub-corpora contributed by the Phase-1 speakers. The TTS is a modular system that can be expressed as $speech = TTS(text) = WG(SG(PG(TA(text))))$. The modules (functions) are:

1. TA (Text Analysis): generates linguistic features. It is a cascade of three sub-modules: rule-based text normalization, a CRF-based word segmenter and part-of-speech tagger, and a hybrid lexicon- and data-driven grapheme-to-phoneme converter.
2. PG (Prosody Generation): generates prosodic parameters from the linguistic features given by TA, using a speaker-adapted prosody model.
3. SG (Speech Parameter Generation): generates frame-based mel-generalized cepstra (MGC), log fundamental frequency (logF0), and voiced/unvoiced (U/V) flags from the features given by PG. The sub-modules of SG are similar to those of a conventional HTS speech synthesizer, with a state duration module, a logF0 module, and an MGC module. The state duration model is a five-layer CNN with a speaker embedding as a hidden-layer bias; the acoustic model is a stack of a four-layer CNN and one LSTM layer, also with a speaker embedding as a hidden-layer bias, which generates the frame-based MGC, logF0, and U/V.
4. WG (Waveform Generation): uses the WORLD vocoder.

The offline version of SLMTK 1.0 labeled the banked speech corpora with linguistic-speech alignments and prosodic tags. With these labels, the PG and SG models were constructed by speaker adaptation and multi-speaker training, respectively. The constructed personalized TTS systems were rated by 15 enrolled ALS patients and 17 of their caregivers, yielding speaker-similarity MOS of 3.92 and 3.55, respectively. These results indicate that the Part-1: VoiceBanking sub-corpus can be used to build acceptable personalized TTS systems.

# VII. Conclusions and Future Work

This paper reports the design, recording, data purging, and correction of the VoiceBank-2023 speech corpus. The usefulness of the Part-1 (VoiceBanking) sub-corpus of VoiceBank-2023 for constructing personalized TTS systems was evaluated, with reasonable MOS in speaker similarity. Given the statistics on the number of recording takes and the feedback from the enrolled patients and voice donors, we will shorten the prompts of the Part-1 (VoiceBanking) sub-corpus displayed on the GUI to make them easier for speakers to read. Besides continuing to improve the performance of the personalized TTS systems, constructing automatic mechanisms to label the degrees of dysarthria, voice quality, and sound quality will be worthwhile.