
The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0

Authors:
Chen-Yu Chiang, cychiang@mail.ntpu.edu.tw, National Taipei University, Taiwan
Wu-Hao Lee, hank12451@gmail.com, National Taipei University, Taiwan
Yen-Ting Lin, d26923050@gmail.com, National Taipei University, Taiwan
Jia-Jyu Su, boss.su2000@gmail.com, National Taipei University, Taiwan
Wei-Cheng Chen, weichengchen@acoustintek.com, AcoustInTek Co., Ltd., Taiwan
Cheng-Che Kao, erickao@acoustintek.com, AcoustInTek Co., Ltd., Taiwan
Shu-Lei Lin, st23990504@gmail.com, National Taipei University, Taiwan
Pin-Han Lin, joseph861030@gmail.com, National Taipei University, Taiwan
Shao-Wei Hong, pisces860224@gmail.com, National Taipei University, Taiwan
Guan-Ting Liou, tn00320663@gmail.com, National Yang Ming Chiao Tung University, Taiwan
Wen-Yang Chang, chrischang@acoustintek.com, AcoustInTek Co., Ltd., Taiwan
Jen-Chieh Chiang, jackchiang@acoustintek.com, AcoustInTek Co., Ltd., Taiwan
Yen-Ting Lin, ice.lin@udngroup.com.tw, udnDigital Co., Ltd., Taiwan
Yih-Ru Wang, yrwang@speech.cm.nctu.edu.tw, National Yang Ming Chiao Tung University, Taiwan
Sin-Horng Chen, schen@nycu.edu.tw, National Yang Ming Chiao Tung University, Taiwan

Abstract


This paper introduces the Speech Labeling and Modeling Toolkit version 1.0 (SLMTK 1.0), which facilitates the automatic labeling of text and speech for constructing text-to-speech (TTS) systems and analyzing speech prosody. The SLMTK 1.0 accepts raw mixed Chinese-English text and the associated mixed Mandarin-English speech as inputs. The inputs are processed by the following seven modules in sequence: 1) text analysis, 2) acoustic feature extraction, 3) linguistic-speech alignment, 4) integration of syllable-based linguistic and prosodic-acoustic features, 5) prosody labeling, 6) construction of the prosody generation model, and 7) construction of acoustic models for speech synthesis. The outputs of the seven modules are, correspondingly, 1) linguistic labels, 2) acoustic features, 3) linguistic-speech alignments, 4) syllable-based linguistic and prosodic-acoustic features, 5) prosody tags, 6) a prosody generation model, and 7) acoustic models for speech synthesis. The SLMTK 1.0 has been applied to constructing several commercial TTSs and personalized TTSs for augmentative and alternative communication. The toolkit has also been applied to the phonetic and prosodic labeling of L2 Mandarin speech to accelerate and facilitate studies on prosody analysis. The SLMTK 1.0 service is available online at slmtk.ce.ntpu.edu.tw for non-commercial use. The SLMTK working group welcomes all parties to enrich the functions of the SLMTK.


Index Terms: text-speech alignment, prosody labeling, prosody modeling, text-to-speech, Mandarin, mixed Mandarin-English


1. Introduction


The Speech Labeling and Modeling Toolkit version 1.0 (SLMTK 1.0) is built to facilitate the construction of text-to-speech (TTS) systems. The SLMTK labels speech corpora with linguistic and prosodic-acoustic information. The labeled information can be used to construct models for prosody generation and speech synthesis, and it can also be used for speech analysis. For users' convenience, the labeled information is mainly saved in Praat's TextGrid format [1].




Before this release of the SLMTK 1.0, the core SLMTK functions had already supported the Revoice Project in constructing personalized TTS systems used in augmentative and alternative communication (AAC) for 20 amyotrophic lateral sclerosis (ALS) patients [2,3,4]. Given raw texts and the associated patients' speech utterances as inputs, the SLMTK automatically produced alignments between the speech, linguistic features, and prosodic features/tags. The decision tree-based prosody generation and the HMM/DNN-based acoustic parameter generation for TTS were then constructed with the alignment information obtained. Since August 2021, the personalized TTS systems for the 20 ALS patients have been available online. The patients can log in to the web-based TTS service to communicate with their synthesized speech.




In addition, the SLMTK has been applied to provide precise time-alignment information for a study on the tone analysis of Chinese interlanguage produced by Romance-language learners [5].




The SLMTK 1.0 is constructed and reported in this paper to share and continue developing the functions of the SLMTK. The SLMTK 1.0 service is available online at slmtk.ce.ntpu.edu.tw for non-commercial use. Users do not need to write any programs; they simply upload speech utterances and the associated text to the SLMTK 1.0 website to obtain TTS demo systems and the associated linguistic, acoustic, and prosodic labels. The SLMTK working group welcomes all parties to use the SLMTK 1.0 and encourages users to give feedback and suggestions that enrich the functions of the SLMTK.








[Figure: the input corpus (*.txt, *.wav) feeds the SLMTK Tools, which draw on the SLMTK Prior Models & Codebase and output Corpus Label Files (*.TextGrid), TTS Models, and TTS Demos; Manual Correction (with Praat or other tools) turns the label files into Modified Corpus Label Files (*.TextGrid) that feed back into the SLMTK Tools.]

Figure 1: Conceptual diagram of the use of the SLMTK




The use of the SLMTK version 1.0 is shown in Fig. 1. After login, users may upload a speech corpus recorded by a speaker to the SLMTK service. The corpus contains speech utterances saved as *.wav files and the raw texts (*.txt) encoded in UTF-8. The users may then conduct speech labeling and modeling with the SLMTK, powered by pre-trained prior speech processing models [6-12]. The SLMTK 1.0 processes the input corpus with the following seven steps:

  1. text analysis
  2. acoustic feature extraction
  3. linguistic-speech alignment
  4. integration of syllable-based linguistic and prosodic-acoustic features
  5. prosody labeling
  6. construction of prosody generation model
  7. construction of acoustic models for speech synthesis




After the seven steps are done, the SLMTK creates a TTS demo system on the SLMTK webpage. In addition, the SLMTK outputs corpus label files that contain linguistic, acoustic, and prosodic-acoustic features. The corpus label files are primarily saved in TextGrid format [1], making them easy for users to inspect.



The SLMTK may produce erroneous corpus labels due to inconsistencies between the input text and speech or to the limited robustness of the SLMTK prior models. Users can modify these erroneous labels manually with GUI editing tools such as Praat [13] and WaveSurfer [14]. The modified label files can be inserted into the intermediate steps of the SLMTK, and users may re-run these steps to make the models more robust. In addition, the corpus label files saved in Praat's TextGrid format provide information about linguistic-speech alignments and prosodic tags, alleviating the heavy workload of manual speech labeling for speech analysis.




This paper is organized as follows. Section 2 introduces the TTS framework that the SLMTK supports. Section 3 illustrates the framework of the SLMTK 1.0 and the seven steps for speech labeling and modeling. Section 4 describes the design of each step of the SLMTK 1.0. Section 5 discusses the potential advantages of using the SLMTK and related works. Finally, Section 6 summarizes the SLMTK 1.0 and outlines future work.


2. Knowledge-Rich TTS Framework


The development of a TTS commonly starts with recording speech utterances of designed text material. Mapping functions are then designed to convert the text into speech.





The SLMTK supports the TTS framework defined by the following function composition that maps text to speech:

speech = TTS(text) = WG(SG(PG(TA(text))))

The functions are:

  1. TA: text analysis
  2. PG: prosody generation
  3. SG: speech parameter generation
  4. WG: waveform generation

The corresponding flowchart is shown in Figure 2.
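As a minimal sketch of this composition (placeholder functions only, not the SLMTK's actual code), the data flow can be written directly as nested calls. Note that, as in Fig. 2, SG consumes both the linguistic features and the generated prosody parameters:

```python
from typing import Any, List

# Skeleton of the knowledge-rich TTS composition. All four functions are
# placeholders standing in for trained models; only the data flow is real.
def TA(text: str) -> List[Any]:
    """Text analysis: raw text -> linguistic features."""
    raise NotImplementedError

def PG(ling: List[Any]) -> List[Any]:
    """Prosody generation: linguistic features -> prosody parameters (uses the PGM)."""
    raise NotImplementedError

def SG(ling: List[Any], pros: List[Any]) -> List[Any]:
    """Speech parameter generation: -> acoustic parameters (uses the AM)."""
    raise NotImplementedError

def WG(acoustic: List[Any]) -> bytes:
    """Waveform generation: acoustic parameters -> speech samples (uses the VM)."""
    raise NotImplementedError

def TTS(text: str) -> bytes:
    ling = TA(text)                # speech = WG(SG(PG(TA(text))))
    return WG(SG(ling, PG(ling)))  # Fig. 2: linguistic features feed both PG and SG
```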








[Figure: text → TA: text analysis → linguistic features → PG: prosody generation (driven by the PGM: prosody generation model) → prosody parameters; the linguistic features and prosody parameters both feed SG: speech parameter generation (driven by the AM: acoustic model for speech synthesis) → acoustic parameters → WG: vocoder/waveform generation (driven by the VM: vocoding model) → synthesized speech. TA corresponds to the inverse of message planning, PG to utterance planning, and SG/WG to motor command generation and speech sound production.]

Figure 2: Knowledge-rich TTS framework



The TTS framework is inspired by the speech production process [15,16]. The process can be interpreted with knowledge gained from linguistics and signal processing. We therefore call this framework the "knowledge-rich TTS framework."




The TA extracts linguistic information from the input text. The linguistic information contains lexical, syntactic, and partly semantic features. Text is a representation organized with human-defined symbols to record the information that a speaker wants to convey. The TA can thus be regarded as an inverse of the message planning process: it converts the message represented in the text back into the linguistic information that the speaker wants to convey.



The PG produces prosodic information from the linguistic features extracted by the TA, using the prosody generation model (PGM). The prosodic information considered in the SLMTK contains prosodic breaks, prosodic states, and syllable-based prosodic-acoustic features. Note that the prosodic breaks and prosodic states together represent a prosodic structure. The syllable-based prosodic-acoustic features are the prosodic realizations of each prosodic unit (i.e., the syllable in the SLMTK). The PG is therefore analogous to the utterance planning process that converts planned messages into prosodic realizations.



The SG generates the speech parameters that control the fundamental frequency and spectral envelope, using the acoustic model for speech synthesis. Finally, the WG produces speech signals from the speech parameters using a vocoding model. The SG and WG processes simulate humans' motor command generation and speech sound production.


3. Speech Labeling and Modeling Framework for SLMTK Version 1.0


Fig. 3 shows the framework of the SLMTK 1.0. The inputs are text and the associated speech; the final output is a knowledge-rich TTS system powered by the trained prosody generation model and acoustic models for speech synthesis.


First, step "1) text analysis" extracts linguistic features from the input text and saves them as linguistic label files. Then, step "2) acoustic feature extraction" produces frame-based acoustic features such as the spectrogram, MFCCs, and F0. Step "3) linguistic-speech alignment" labels the beginning and end times of speech units using the extracted linguistic and acoustic features. For example, the speech units used in the SLMTK 1.0 are initials and finals for Mandarin syllables and phonemes for English words. The time intervals of some upper-layer linguistic units, e.g., syllables, words, and sentence-like units (delimited by punctuation marks), are also available.








[Figure: text (*.txt) → 1) text analysis → linguistic label files (*.llf); speech (*.wav) → 2) acoustic feature extraction → acoustic feature files (*.f0); both feed 3) linguistic-speech alignment → linguistic-speech alignment files (*.TextGrid) → 4) integration of syllable-based linguistic and prosodic-acoustic features → syllable-based linguistic and prosodic-acoustic feature files (*.slp, *.TextGrid) → 5) prosody labeling → prosody tag files (*.TextGrid) → 6) construction of prosody generation model → PGM and, together with the speech and acoustic features, 7) construction of acoustic models for speech synthesis → AM; the PGM and AM power the knowledge-rich TTS that maps text to synthesized speech.]

Figure 3: Speech labeling and modeling framework of the SLMTK 1.0


Next, prosody labeling requires observing suprasegmental prosodic-acoustic features to label speech with a prosodic structure. Before the labeling, the SLMTK extracts syllable-based prosodic-acoustic and linguistic features in step "4) integration of syllable-based linguistic and prosodic-acoustic features." The SLMTK regards syllables as the basic units of prosodic segments; the syllable-based prosodic-acoustic features over an utterance therefore form the suprasegmental prosodic features used for prosody labeling.



Then, step "5) prosody labeling" labels the speech corpus with the following prosodic tags: tones, prosodic break types, and prosodic states, and saves them as prosody tag files. Next, step "6) construction of prosody generation model" learns the mapping from linguistic features to prosodic information given the prosody tag files. Last, step "7) construction of acoustic models for speech synthesis" builds acoustic models for speech synthesis from the derived prosody tag files, acoustic features, and input waveforms.


4. Design of the SLMTK Version 1.0

This section describes the designs of the seven modules in the SLMTK 1.0.

4.1. Text Analysis


The text analysis accepts raw text files (.txt) with traditional Chinese and English characters encoded in UTF-8. The outputs, called linguistic label files (.llf), contain information about 1) linguistic/pronunciation units, 2) lexical tones, 3) syllable boundaries, 4) lexical words, 5) parts-of-speech (POS), and 6) punctuation marks. Each linguistic label file (.llf) is associated with one input raw text file (.txt). The word tagger, implemented with the CRF++ tool [17,18,20], delimits the input characters with word boundaries and labels each word with a POS in the CKIP POS standard [19]. Pronunciation labeling for Chinese words is implemented as a hybrid method combining dictionary lookup with CRF model-based labeling [21]. Pronunciation labeling for English words is a hybrid of dictionary lookup and the CMU letter-to-sound rules [20].
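The on-disk layout of a *.llf file is not specified in this paper, so the following is only a hypothetical record type mirroring the six kinds of information listed above; all field names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LlfEntry:
    """One hypothetical *.llf record; the real file layout may differ."""
    unit: str                   # linguistic/pronunciation unit (e.g., Mandarin initial/final or English phoneme)
    tone: Optional[int]         # lexical tone for Mandarin syllables; None for English units
    syllable_final: bool        # True if this unit closes a syllable boundary
    word: str                   # the lexical word the unit belongs to
    pos: str                    # part-of-speech tag in the CKIP POS standard
    punctuation: Optional[str]  # punctuation mark following the word, if any
```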


4.2. Acoustic Feature Extraction


The acoustic feature extraction accepts speech waveforms saved in the Waveform Audio File Format (*.wav). A sampling rate of at least 16 kHz and a bit depth of at least 16 bits are preferable. The outputs are the following files: 1) frame-based F0 with a 10 ms frame interval (*.f0), and 2) mono *.wav files at sampling rates of 16 kHz and 20 kHz. Because F0 is extracted by RAPT [22] and may contain voiced/unvoiced or pitch doubling/halving errors, the SLMTK 1.0 outputs the extracted F0 so that users can modify it if necessary. Other acoustic features, such as the spectrogram and MFCCs, are not output here because they cannot be manually modified/corrected for better modeling.
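As one possible way to reproduce this step outside the toolkit (a sketch only; the SLMTK's exact extractor configuration is not published here), RAPT is available in pysptk:

```python
import numpy as np
import pysptk                      # provides a RAPT implementation [22]
from scipy.io import wavfile

def extract_f0(wav_path: str, frame_shift_ms: float = 10.0) -> np.ndarray:
    """Frame-based F0 at a 10 ms interval, matching the *.f0 files above."""
    fs, x = wavfile.read(wav_path)            # assumes a mono 16-bit PCM file
    x = x.astype(np.float32)                  # pysptk.rapt expects float32
    hop = int(fs * frame_shift_ms / 1000.0)   # 160 samples at 16 kHz
    # The 60-400 Hz search range is a common default, not the SLMTK's setting.
    return pysptk.rapt(x, fs=fs, hopsize=hop, min=60.0, max=400.0, otype="f0")

# f0 = extract_f0("utt0001.wav"); unvoiced frames are returned as 0.0. These
# are the frames where the voiced/unvoiced and pitch halving/doubling errors
# mentioned above would be corrected by hand.
```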


4.3. Linguistic-Speech Alignment


The inputs for the linguistic-speech alignment are the linguistic label files (.llf) produced in Section 4.1 and the acoustic features extracted in Section 4.2. Note that only the linguistic label files (.llf) are editable, for correcting erroneous linguistic labels made by the text analysis. The acoustic features are passed directly from the acoustic feature extraction to the forced aligner, powered by HTK 3.4.1 [23]. The outputs are TextGrid files that contain the alignment information of acoustic units (initials, finals, or phonemes), syllables, tones, words associated with POSs, and sentence-like units (delimited by punctuation marks); a minimal TextGrid reader is sketched at the end of this subsection.


Note that the following processes are essential for the alignment:

  1. New English words found in the text analysis results, i.e., the linguistic label files (*.llf), are added with their pronunciations to the forced aligner's default English dictionary, ensuring that all English words in the corpus can be aligned.
  2. The forced aligner performs polyphone disambiguation for Chinese and multiple pronunciation labeling for English to alleviate the labor of manually correcting pronunciations.
  3. A DNN-based articulatory manner recognizer, constructed on the TIMIT corpus (with precise manual labeling), outputs the probabilities of seven articulatory manners. Dynamic time warping over these probabilities is then applied to refine the manner boundaries derived from the forced aligner's alignment. The refinement reduces the labor of adjusting alignments and ensures robust measurement of syllable durations for prosody labeling and modeling.
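Since the alignments are delivered as TextGrid files [1], a minimal pure-Python reader such as the following sketch (assuming the "long" TextGrid text format; not the SLMTK's own code) suffices to pull the interval tiers into a script:

```python
import re

def read_interval_tiers(path: str) -> dict:
    """Parse interval tiers from a Praat TextGrid in the long text format.

    Returns {tier_name: [(xmin, xmax, label), ...]}. A sketch only: it
    skips point tiers and does not handle escaped quotes in labels.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tiers = {}
    for block in re.split(r"item \[\d+\]:", text)[1:]:
        if '"IntervalTier"' not in block:
            continue                                   # skip point tiers
        name = re.search(r'name = "(.*)"', block).group(1)
        intervals = re.findall(
            r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "(.*)"', block)
        tiers[name] = [(float(a), float(b), s) for a, b, s in intervals]
    return tiers

# tiers = read_interval_tiers("utt0001.TextGrid")
# e.g., tiers["syllable"] -> [(0.0, 0.42, "ni3"), ...]
```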


4.4. Integration of Syllable-Based Linguistic and Prosodic-Acoustic Features


The inputs are the TextGrid files produced by the linguistic-speech alignment (Section 4.3) and the frame-based F0/power extracted by the acoustic feature extraction (Section 4.2). Note that the information in the TextGrids and the frame-based F0 are editable to obtain accurate prosodic-acoustic and associated linguistic features. The outputs are TextGrid files and a syllable-based linguistic and prosodic-acoustic feature file. Each TextGrid file contains both the linguistic-speech alignment information and the syllable-based prosodic-acoustic features of one input utterance. The syllable-based linguistic and prosodic-acoustic feature file (*.slp) is a UTF-8 encoded text file with the input corpus's syllable-based linguistic and prosodic-acoustic features; users may view the *.slp file to probe the corpus. The syllable-based prosodic-acoustic features contain syllable logF0 contours represented by third-order Legendre polynomial expansions [24] (a worked example follows), syllable durations, syllable energy levels, inter-syllabic pause durations, and inter-syllabic energy dips.
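As a worked example of the contour representation (a sketch using NumPy's Legendre routines; the SLMTK's own implementation follows [24] and may normalize differently), a syllable's voiced logF0 frames can be reduced to four coefficients:

```python
import numpy as np
from numpy.polynomial import legendre

def syllable_logf0_coeffs(logf0_frames) -> np.ndarray:
    """Fit a third-order Legendre expansion to one syllable's logF0 contour.

    The frame times are mapped onto [-1, 1], so the four returned
    coefficients summarize the contour's mean, slope, curvature, and
    higher-order shape. Expects at least 4 voiced frames.
    """
    y = np.asarray(logf0_frames, dtype=float)
    x = np.linspace(-1.0, 1.0, len(y))   # normalized syllable time axis
    return legendre.legfit(x, y, deg=3)  # coefficient array of shape (4,)

# Reconstruction for inspection:
# y_hat = legendre.legval(np.linspace(-1, 1, n_frames), coeffs)
```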


4.5. Prosody Labeling


The inputs are the TextGrid files produced in Section 4.4 that contain the syllable-based linguistic and prosodic-acoustic features. The outputs are TextGrid files that contain syllable-based prosody tags, linguistic features, and prosodic-acoustic features (a schematic of the per-syllable tag record follows the steps below). The following steps execute the prosody labeling:

  1. Tone recognition/labeling: an NN-based tone recognizer is constructed to re-label the syllables affected by tone sandhi and polyphones with their correct tones. Details of the tone recognizer can be found in [25,26].
  2. Adaptive prosody labeling and modeling (A-PLM) [6-8,12]: this works on the basis of the adaptive speaking rate-dependent hierarchical prosodic model (SR-HPM). The adaptive SR-HPM is formulated as MAP estimation with a reference SR-HPM serving as an informative prior [6-8,12]. The prior SR-HPM was trained on large Mandarin and mixed Mandarin-English corpora uttered by a professional speaker. The inputs are the syllable-based linguistic and prosodic-acoustic features. The outputs are the speaking rate and syllable-based prosody tags, including 1) prosodic breaks, which represent the boundaries that delimit utterances into the prosodic units of prosodic phrase groups (PGs), prosodic phrases (PPhs), prosodic words (PWs), and syllables (SYLs); and 2) prosodic states, which represent the patterns of the syllable logF0 mean, syllable duration, and syllable energy level caused by the PWs, PPhs, and PGs.
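As a schematic of the per-syllable tag record named above (field names and state encodings are illustrative assumptions, not the SLMTK's exact schema):

```python
from dataclasses import dataclass

@dataclass
class SyllableProsodyTags:
    """Hypothetical per-syllable bundle of the prosody tags named above."""
    tone: int            # lexical/sandhi tone after re-labeling (step 1)
    break_after: str     # prosodic break type following the syllable,
                         # marking a SYL-, PW-, PPh-, or PG-level boundary
    pitch_state: int     # prosodic state of the syllable logF0 mean
    duration_state: int  # prosodic state of the syllable duration
    energy_state: int    # prosodic state of the syllable energy level
```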


4.6. Construction of Prosody Generation Model


The inputs are the TextGrid files produced in Section 4.5, containing syllable-based prosody tags, linguistic features, and prosodic-acoustic features. The outputs are three prosody generation sub-models that can be used in the knowledge-rich TTS framework:

  1. Break prediction model: a decision tree-based model that produces prosodic breaks from the linguistic features.
  2. Prosodic state prediction model: a decision tree and Markov model-based mechanism that predicts prosodic states from the predicted breaks and the linguistic features.
  3. Prosodic-acoustic feature prediction model: an additive model with speaking rate-dependent Gaussian normalization that generates syllable logF0 contours, syllable durations, syllable energy levels, and inter-syllabic pause durations (see the schematic after this list).
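Schematically (the exact decomposition is defined in [6,12]; the symbols below are illustrative, not the toolkit's notation), the additive model expresses an observed syllable-level prosodic-acoustic feature as a sum of interpretable effects plus a Gaussian residual:

```latex
% Illustrative additive decomposition for a syllable-level feature x_n,
% assuming the general form of the SR-HPM family [6,12]:
x_n \;\approx\; \mu(r) \;+\; \beta^{\mathrm{tone}}_{t_n}
            \;+\; \beta^{\mathrm{state}}_{p_n}
            \;+\; \beta^{\mathrm{break}}_{b_n} \;+\; \varepsilon_n,
\qquad \varepsilon_n \sim \mathcal{N}(0, \sigma^2)
```

where mu(r) is a speaking rate-dependent mean and the beta terms denote the effects of the syllable's tone t_n, prosodic state p_n, and neighboring break type b_n.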


The three sub-models are obtained by the structural maximum a posteriori speaker adaptation method [6,12], which adjusts existing models trained on the large Mandarin and mixed Mandarin-English corpora uttered by a professional speaker.

4.7. Construction of Acoustic Models for Speech Synthesis


The inputs are the *.wav files, the F0 files (*.f0), and the TextGrid files that contain the syllable-based prosody tags, linguistic features, and prosodic-acoustic features. The outputs are:

  1. Speaker-dependent (SD) HMM-based TTS models [27]
  2. Speaker-adaptive (SA) HMM-based TTS models [28]
  3. HMM state alignments (produced by the SD and SA models), which help develop DNN-based speech synthesis models (a sketch of deriving frame-level labels from alignments follows this list).
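To illustrate why such alignments matter for DNN models (a sketch, not SLMTK code), interval alignments such as those read in Section 4.3 can be expanded into per-frame labels at the paper's 10 ms frame interval, which is the kind of frame-level supervision DNN-based synthesis training consumes:

```python
def frames_from_intervals(intervals, frame_shift: float = 0.01):
    """Expand [(xmin, xmax, label), ...] (seconds, sorted) into per-frame labels.

    A sketch: each 10 ms frame takes the label of the interval covering
    its center time; empty strings mark gaps or trailing silence.
    """
    t_end = intervals[-1][1]
    n_frames = int(round(t_end / frame_shift))
    labels = []
    for i in range(n_frames):
        t = (i + 0.5) * frame_shift  # frame-center time
        labels.append(next((lab for a, b, lab in intervals if a <= t < b), ""))
    return labels

# frames_from_intervals(tiers["phone"]) pairs one unit label with each F0 frame.
```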



5. Related Studies and Discussion

5.1. Advances Made by the Deep Learning-based End-to-End TTS Framework


Advances in deep learning have helped developers construct end-to-end deep neural network-based TTS systems without knowledge of linguistics and signal processing. The end-to-end DNN frameworks disregard the handcrafted designs of the text analysis, prosody generation, and speech parameter generation mechanisms. With a large training speech corpus, an end-to-end DNN can learn knowledge from the observed data and store it in the network's parameters. Well-known end-to-end TTS systems include Transformer TTS [29], Tacotron [30,31], and FastSpeech [32,33].









[Diagrams: the knowledge-rich pipeline (text → text analysis → linguistic parameters → speech synthesis → synthesized speech) versus the end-to-end pipeline (text → TTS/speech synthesis → synthesized speech)]

5.2. Problems with the End-to-End Framework


However, training end-to-end TTS systems requires large amounts of speech data to make the trained TTS model robust. The alignment between text and speech tends to fail when the speech corpus is small. Therefore, Microsoft's FastSpeech 2 [33] replaces the alignment derived from the attention map with one produced by the Montreal Forced Aligner [34], which is powered by the Kaldi speech recognition toolkit [35]. The alignment made with Kaldi's acoustic model improves duration modeling. Moreover, many studies have reported that adding prosodic break tags to the input of end-to-end TTS training improves the naturalness of the synthesized speech [36-40]. In practice, developers need to construct an additional prosodic break prediction mechanism that generates prosodic breaks from text, providing prosodic phrasing information to the input of an end-to-end TTS. The prosodic break information helps the training because prosodic breaks manifest the organization or grouping of speech units and align better with the speech than the raw text does. A speaker may utter the same text with different prosodic phrasing strategies, so prosodic breaks represent prosodic phrasing more directly than the text.


From the above discussion, we may conclude that alignment and prosody labeling play essential roles in improving the performance of end-to-end TTS systems.



5.3. Potential Advantages of Using the SLMTK


The functions of the SLMTK were applied to constructing personalized TTSs for 20 ALS patients in 2020 and 2021 [2-4]. Because ALS patients are not professional announcers and their speech must be recorded as soon as possible before severe dysarthria occurs, the amount of recorded speech was limited. The patients could not speak as fluently as professional announcers, who can easily contribute large speech corpora. Each ALS patient recorded at most 30 minutes of speech with specially designed, phonetically balanced text material. It was also found that each patient adopted different prosodic phrasing strategies for the same text. The discrepant prosodic phrasings may be caused by the degree of dysarthria or by differing comprehension of the text material.



The limitations mentioned above motivated the developers to design the SLMTK functions that label speech corpora with precise linguistic-speech alignments and meaningful prosodic tags. These labels help ensure robust TTS models because developers can analyze the performance of a TTS against them using their knowledge of linguistics and signal processing. The DNN-based personalized TTS systems constructed with the SLMTK's labels were assessed by the mean opinion score (MOS) for speaker similarity: the 15 enrolled ALS patients and 17 patients' caregivers rated the constructed personalized TTSs at 3.92 and 3.55, respectively [2-4].



In the SLMTK, prosody labeling plays a vital role in identifying the alignment between the speech's prosodic-acoustic features and its prosodic structure. The prosody labeling powered by the adaptive SR-HPM with the A-PLM method [6,12] overcomes the problems of conventional manual prosody labeling, which is time-consuming and yields inconsistent labels owing to different human labelers' subjective judgments. Moreover, the prosody tags obtained by the SLMTK are linguistically and prosodically interpretable. The SR-HPM has been further extended to process read Chinese dialects [6], English [7], mixed Mandarin-English [8], and spontaneous Mandarin speech [41], and to estimate local speaking rate variations [9-11]. These features show that it is promising to keep developing the functions of the SLMTK.



6. Conclusions and Future Work


The SLMTK 1.0 is a platform that systematically labels input text and speech with linguistic features, acoustic features, linguistic-speech alignments, syllable-based linguistic/prosodic-acoustic features, and prosody tags. The labeled information can be used in speech analysis and in building models for prosody generation and speech synthesis. Furthermore, the labels produced by the SLMTK can be probed and modified with knowledge of linguistics and signal processing to enhance the robustness of the trained TTS models, especially when the corpus is small. In the future, the SLMTK working group will extend the functions to support other under-resourced Chinese dialects. In addition, parts of the SLMTK will be released as open-source projects, welcoming developers to enrich the functions of the SLMTK.


7. Acknowledgments


This work was mainly supported by the MOST of Taiwan under Contract Nos. MOST-109-2221-E-305-010-MY3 and MOST-109-3011-F-011-001. Part of this work was also supported by the MOST of Taiwan under Contract No. MOST-110-2627-M-006-007. The authors would like to thank Mr. Y.-S. Gong and Ms. Z.-Y. Tseng for their helpful guidance. The authors also thank the Taiwan Motor Neuron Disease Association for coordinating the speech recordings of the enrolled ALS patients.


References

[1] TextGrid file formats. [Online]. Available: https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html
[2] Y. H. Liu, C. Y. Chiang, C. K. Peng, F. C. Yang, and Y. K. Lin, “Development of a Smart Sclerosis Communication System for Patients with ALS: Outcome Add-On and Application,” Ministry of Science and Technology, Taiwan, Oct. 2021. [Online]. Available: https://www.grb.gov.tw/search/planDetail?id=13443860 (in Chinese)
[3] C. Y. Chiang, “Revoice Project Taiwan.” Jan. 25, 2022. [Online]. Available: https://hackmd.io/@cychiang-ntpu/BJ5Hfv2s_ (in Chinese)
[4] C. Y. Chiang et al., “Project Save and Sound: Constructing Personalized Mandarin Text-to-Speech Systems for ALS Patients,” prepared to submit to Augmentative and Alternative Communication, [Online]. Available: https://hackmd.io/@cychiang-ntpu/Bk7_iARM9
[5] B. X. Huang, C. H. Lai, and T. H. Liu, “On the Interaction between L1 Transfer and Universal Constraints–Evidence from the Acquisition of Mandarin Tones by French Speakers,” submitted to the North American Conference on Chinese Linguistics, USA, Sep. 2022.
[6] C. Y. Chiang, “Cross-Dialect Adaptation Framework for Constructing Prosodic Models for Chinese Dialect Text-to-Speech Systems,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 26, no. 1, pp. 108–121, Jan. 2018.
[7] C.-Y. Tsai, C.-K. Kuo, Y.-R. Wang, S.-H. Chen, I.-B. Liao, and C.-Y. Chiang, “Hierarchical prosody modeling of English speech and its application to TTS,” in 2014 17th O-COCOSDA, Phuket, Thailand, Sep. 2014, pp. 1–6.
[8] C.-Y. Tsai and S.-H. Chen, “Prosody Hierarchy Construction for Mixed Chinese-English Spelling Speech and its Application to TTS,” National Chiao Tung University, Hsinchu, Taiwan, 2010. [Online]. Available: https://ir.nctu.edu.tw/handle/11536/44584
[9] G.-T. Liou, C.-Y. Chiang, Y.-R. Wang, and S.-H. Chen, “Estimation of Hidden Speaking Rate,” in Speech Prosody 2018, Jun. 2018, pp. 592–596.
[10] G.-T. Liou, C.-Y. Chiang, Y.-R. Wang, and S.-H. Chen, “An Exploration of Local Speaking Rate Variations in Mandarin Read Speech,” in Interspeech 2018, Sep. 2018, pp. 42–46.
[11] C. Y. Chiang, G. T. Liou, Y. R. Wang, and S. H. Chen, “Method of generating estimated value of local inverse speaking rate (ISR) and device and method of generating predicted value of local ISR accordingly,” US20210035598A1, Feb. 04, 2021
[12] I.-B. Liao, C.-Y. Chiang, Y.-R. Wang, and S.-H. Chen, “Speaker Adaptation of SR-HPM for Speaking Rate-Controlled Mandarin TTS,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 11, pp. 2046–2058, Nov. 2016.
[13] P. Boersma and D. Weenink, Praat: doing phonetics by computer. 2022. [Online]. Available: https://www.fon.hum.uva.nl/praat/
[14] WaveSurfer. 2022. [Online]. Available: https://sourceforge.net/projects/wavesurfer/
[15] H. Fujisaki, “Information, prosody, and modeling - with emphasis on tonal features of speech,” in Speech Prosody 2004, Nara, Japan, Mar. 2004, pp. 1–10.
[16] W. J. M. Levelt, A. Roelofs, and A. S. Meyer, “A theory of lexical access in speech production,” Behav. Brain Sci., vol. 22, no. 01, Feb. 1999.
[17] T. Kudo, CRF++: Yet another CRF toolkit. 2013. [Online]. Available: https://taku910.github.io/crfpp/
[18] A.-H. Lin, Y.-R. Wang, and S.-H. Chen, “Traditional Chinese parser and language modeling for Mandarin ASR,” in 2013 O-COCOSDA/CASLRE, Gurgaon, India, Nov. 2013, pp. 1–5.
[19] K.-J. Chen and C.-R. Huang, “Part of speech (POS) analysis on Chinese language,” CKIP Technical Report No. 93-05, Institute of Information Science, Academia Sinica, Taiwan, 1993.
[20] W.-Y. Chang et al., “A Mandarin-English Mixed TTS Constructed by Independent Mandarin and English Corpora,” in 20th O-COCOSDA 2017, Seoul, South Korea, Nov. 2017, pp. 115–120.
[21] G.-T. Liou et al., “A Study on Polyphone Disambiguation and Tone 3 Sandhi Labeling for Traditional Chinese,” presented at the 17th International Conference Oriental COCOSDA, Phuket, Thailand, Sep. 2014.
[22] D. Talkin and W. B. Kleijn, “A robust algorithm for pitch tracking (RAPT),” Speech coding and synthesis, vol. 495, p. 518, 1995.
[23] HTK speech recognition toolkit. [Online]. Available: https://htk.eng.cam.ac.uk/
[24] S.-H. Chen and Y.-R. Wang, “Vector quantization of pitch information in Mandarin speech,” IEEE Trans. Commun., vol. 38, no. 9, pp. 1317–1320, Sep. 1990.
[25] C.-Y. Chiang, X.-D. Wang, Y.-F. Liao, Y.-R. Wang, S.-H. Chen, and K. Hirose, “Latent Prosody Model of Continuous Mandarin Speech,” in ICASSP ’07, Honolulu, HI, Apr. 2007, p. IV-625-IV–628.
[26] H.-Y. Chen and S.-H. Chen, “Tone Recognition Using MLP and Prosody Model,” National Chiao Tung University, Hsinchu, Taiwan, 2007. [Online]. Available: https://etd.lib.nctu.edu.tw/cgibin/gs32/gsweb.cgi/ccd=IF5emJ/record?r1=2&h1=0
[27] K. Oura and K. Tokuda, HMM/DNN-based Speech Synthesis System (HTS). 2012. [Online]. Available: http://hts.sp.nitech.ac.jp/?Download
[28] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, “Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm,” IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 1, pp. 66–83, Jan. 2009.
[29] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Neural Speech Synthesis with Transformer Network,” arXiv:1809.08895 [cs], Jan. 2019, Accessed: Mar. 28, 2022. [Online]. Available: http://arxiv.org/abs/1809.08895
[30] Y. Wang et al., “Tacotron: Towards End-to-End Speech Synthesis,” arXiv:1703.10135 [cs], Apr. 2017, Accessed: Jan. 21, 2022. [Online]. Available: http://arxiv.org/abs/1703.10135
[31] J. Shen et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions,” in ICASSP’18, Calgary, AB, Apr. 2018, pp. 4779–4783.
[32] Y. Ren et al., “FastSpeech: Fast, Robust and Controllable Text to Speech,” arXiv:1905.09263 [cs, eess], Nov. 2019, Accessed: Jan. 21, 2022. [Online]. Available: http://arxiv.org/abs/1905.09263
[33] Y. Ren et al., “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
[34] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,” in Interspeech 2017, Aug. 2017, pp. 498–502.
[35] D. Povey et al., The Kaldi Speech Recognition Toolkit. 2011. [Online]. Available: https://kaldi-asr.org/
[36] J. Liu, Z. Xie, C. Zhang, and G. Shi, “A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2,” Int. J. Mach. Learn. & Cyber., vol. 12, no. 10, pp. 2809–2823, Oct. 2021.
[37] C. Zhang, S. Zhang, and H. Zhong, “A Prosodic Mandarin Text-to-Speech System Based on Tacotron,” in 2019 APSIPA ASC, Lanzhou, China, Nov. 2019, pp. 165–169.
[38] R. Liu, B. Sisman, F. Bao, G. Gao, and H. Li, “Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS,” IEEE Signal Process. Lett., vol. 27, pp. 1470–1474, 2020.
[39] J. Zhu, “Probing the phonetic and phonological knowledge of tones in Mandarin TTS models,” arXiv:1912.10915 [cs, eess], Dec. 2019, Accessed: Mar. 28, 2022. [Online]. Available: http://arxiv.org/abs/1912.10915
[40] Y. Lu, M. Dong, and Y. Chen, “Implementing Prosodic Phrasing in Chinese End-to-end Speech Synthesis,” in ICASSP 2019, Brighton, United Kingdom, May 2019, pp. 7050–7054.
[41] C.-H. Lin, C.-L. You, C.-Y. Chiang, Y.-R. Wang, and S.-H. Chen, “Hierarchical prosody modeling for Mandarin spontaneous speech,” The Journal of the Acoustical Society of America, vol. 145, no. 4, pp. 2576–2596, Apr. 2019.