---
title: SLMTK-Home
tags: SLMTK-WebSite
---

# Speech Labeling and Modeling Toolkit (SLMTK) - Home

# 0. Table of Contents
[TOC]

# 1. Introduction

The Speech Labeling and Modeling Toolkit version 1.0 (SLMTK 1.0) is built to facilitate constructing text-to-speech (TTS) systems. The SLMTK labels speech corpora with linguistic and prosodic-acoustic information. The labeled information can be used to construct TTS models and to analyze speech. For users’ convenience, the labeled information is mainly saved in Praat’s [TextGrid format](https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html). For more details, please refer to [the paper submitted to the Interspeech 2022 conference](https://drive.google.com/file/d/1UwwrL3hSMAJbrd54rt_qdFM95vMIhSjg/view?usp=sharing).

The core SLMTK functions have already supported the construction of personalized TTS systems used in augmentative and alternative communication (AAC) for 20 amyotrophic lateral sclerosis (ALS) patients. Since August 2021, the personalized TTS systems for the 20 ALS patients have been available online: the patients can log in to the web-based TTS service whenever they would like to communicate with their synthesized speech. For more details, please refer to the [Revoice/Save & Sound Project Webpage](https://hackmd.io/@cychiang-ntpu/回聲計畫).

## 1.1. Overview of SLMTK 1.0

The usage of the [SLMTK 1.0](https://slmtk.ce.ntpu.edu.tw/slmtk.php) is shown in Fig. 1. Users upload a speech corpus recorded by a speaker to the SLMTK service. The corpus contains speech saved in \*.wav files and transcriptions saved in \*.txt files encoded in UTF-8. The [SLMTK 1.0](https://slmtk.ce.ntpu.edu.tw/slmtk.php) then conducts speech labeling and modeling with the pre-trained prior models.
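Before uploading, it can help to verify locally that a corpus folder meets these requirements (paired \*.wav/\*.txt files, UTF-8 text, and, as noted in the upload instructions below, at most 600 seconds of audio in total). The sketch below is **not** part of SLMTK; it is a hypothetical stdlib-only pre-flight helper based on the requirements stated on this page.

```python
"""Sanity-check a corpus folder before zipping it for the SLMTK service.

NOT an SLMTK tool -- a local pre-flight sketch based on the requirements
stated on this page: paired *.wav/*.txt files, UTF-8 text, and at most
600 seconds of audio in total.
"""
import wave
from pathlib import Path


def check_corpus(folder: str, max_total_seconds: float = 600.0) -> list:
    """Return a list of problem descriptions (empty list = looks OK)."""
    problems = []
    root = Path(folder)
    wavs = sorted(root.glob("*.wav"))
    txts = sorted(root.glob("*.txt"))

    # Every *.wav needs a *.txt transcript with the same stem, and vice versa.
    wav_stems = {p.stem for p in wavs}
    txt_stems = {p.stem for p in txts}
    for stem in sorted(wav_stems ^ txt_stems):
        problems.append(f"unpaired file stem: {stem}")

    # Transcripts must be decodable as UTF-8.
    for txt in txts:
        try:
            txt.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            problems.append(f"not UTF-8: {txt.name}")

    # Total audio length must stay under the service limit.
    total = 0.0
    for w in wavs:
        with wave.open(str(w), "rb") as f:
            total += f.getnframes() / f.getframerate()
    if total > max_total_seconds:
        problems.append(
            f"total audio {total:.1f}s exceeds {max_total_seconds:.0f}s")

    return problems
```

Running `check_corpus` on the unzipped example corpus (Section 2.2) should return an empty list; any returned strings describe what to fix before zipping and uploading.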
```graphviz
digraph G {
    compound=true
    rankdir=LR
    graph [ fontname="Source Sans Pro", fontsize=16 ];
    node [ fontname="Source Sans Pro", fontsize=16];
    edge [ fontname="Source Sans Pro", fontsize=16 ];
    subgraph cluster_Input {
        label = "Input"
        fontsize = 12
        style = dotted
        subgraph cluster_I {
            label = "Input Corpus"
            fontsize = 12
            style = dotted
            {rank=same I2 I1}
        }
        O3
    }
    subgraph cluster_T {
        label = "Speech Labeling and Modeling"
        fontsize = 12
        style = dotted
        subgraph cluster_P {
            label = "SLMTK version 1.0"
            fontsize = 13
            style = dotted
            {rank=same M0 M2}
        }
        M3
    }
    subgraph cluster_Output {
        label = "Output"
        fontsize = 12
        style = dotted
        {rank=same O1 O4 O2}
    }
    I1 [fixedsize=false label="*.txt" shape=folder fontsize="" ]
    I2 [fixedsize=false label="*.wav" shape=folder fontsize=""]
    M0 [fillcolor=yellow fixedsize=false label="SLMTK\nPrior Models &\nCodebase" shape=cylinder fontsize=""]
    #M1 [fillcolor=yellow fixedsize=false label="SLMTK\nCodebase" shape=cylinder fontsize=""]
    M2 [fixedsize=false label="SLMTK Tools" shape=box fontsize="18"]
    O1 [fillcolor=yellow fixedsize=false label="Corpus Label Files\n(*.TextGrid)" shape=folder fontsize=""]
    O2 [fillcolor=yellow fixedsize=false label="TTS Models" shape=folder fontsize=""]
    O4 [fillcolor=yellow fixedsize=false label="TTS Demos" shape=folder fontsize=""]
    M3 [fixedsize=false label="Manual Correction\n(with Praat or other tools)" shape=box fontsize=""]
    O3 [fillcolor=yellow fixedsize=false label="Modified\nCorpus Label Files\n(*.TextGrid)" shape=folder fontsize=""]
    I1->M2
    I2->M2
    M0->M2
    M2->O1
    M2->O2
    O1->M3
    M3->O3
    M2->O4
    O3->M2
}
```
<center> Fig. 1: The Usage of the SLMTK 1.0 </center>
<br>

The SLMTK 1.0 processes the input corpus with the following seven steps: 1) **t**ext **a**nalysis (ta), 2) **a**coustic **f**eature **e**xtraction (afe), 3) **l**inguistic-**s**peech **a**lignment (lsa), 4) **i**ntegration of syllable-based **l**inguistic and **p**rosodic-acoustic features (ilp), 5) **p**rosody **l**abeling (and **m**odeling) (plm), 6) **c**onstruction of **p**rosody **g**eneration model (cpg), and 7) construction of acoustic models for speech synthesis with HMM-based speech synthesis (HTS) (hts).

## 1.2. Speech Labeling & Modeling Framework

These seven steps form the framework of the SLMTK 1.0, as shown in Fig. 2. Users may trace the seven steps in Fig. 2.

```graphviz
digraph {
    compound=true
    rankdir=TD
    graph [ fontname="Source Sans Pro", fontsize=36 ];
    node [ fontname="Source Sans Pro", fontsize=36];
    edge [ fontname="Source Sans Pro", fontsize=36 ];
    subgraph core { I1 [label="text (\*.txt)" fixedsize=false] [shape=plaintext] }
    subgraph core { I2 [label="speech (\*.wav)" fixedsize=false] [shape=plaintext] }
    subgraph core { A [label="1) text analysis" fixedsize=false] [shape=box margin=0.2 style=bold] }
    subgraph core { Q [label="2) acoustic feature extraction" fixedsize=false] [shape=box margin=0.2 style=bold] }
    subgraph core { R [label="acoustic feature files (*.f0)" fixedsize=false] [shape=plaintext] }
    subgraph core { L [label="linguistic label files (\*.llf)" fixedsize=false] [shape=plaintext] }
    subgraph core { B [label="3) linguistic-speech alignment" fixedsize=false] [shape=box margin=0.2 style=bold] }
    subgraph core { Z [label="linguistic-speech alignment files (\*.TextGrid)" fixedsize=false] [shape=plaintext] }
    subgraph core { X [label="4) integration of syllable-based linguistic and\nprosodic-acoustic features" fixedsize=false][shape=box margin=0.25 style=bold] }
    subgraph core { U [label="syllable-based linguistic and prosodic-acoustic\nfeature files (\*.slp, \*.TextGrid)"][shape=plaintext margin=0.2] }
    subgraph core { C [label="5) prosody labeling"] [shape=box margin=0.2 style=bold] }
    subgraph core { T [label="prosody tag files (\*.TextGrid)"] [shape=plaintext margin=0.2] }
    subgraph core { D [label="6) construction of prosody\ngeneration model"] [shape=box margin=0.2 style=bold] }
    subgraph core { E [label="7) construction of acoustic models\nfor speech synthesis"] [shape=box margin=0.25 style=bold] }
    subgraph core { F [label="AM: acoustic model for speech synthesis"] [shape=plaintext margin=0.2] }
    subgraph core { G [label="PGM: prosody generation model"] [shape=plaintext margin=0.2] }
    subgraph core { H [label="Knowledge-Rich TTS" style=bold] [shape=box margin=0.2] }
    subgraph core { HI [label="text"] [shape=plaintext margin=0.1] }
    subgraph core { HO [label="synthesized speech"] [shape=plaintext margin=0.2] }
    I1->A [arrowsize=2.0]
    R->B [arrowsize=2.0]
    I2->Q [arrowsize=2.0]
    Q->R [arrowsize=2.0]
    A->L [arrowsize=2.0]
    L->B [arrowsize=2.0]
    B->Z [arrowsize=2.0]
    C->T [arrowsize=2.0]
    T->D [arrowsize=2.0]
    D->G [arrowsize=2.0]
    T->E [arrowsize=2.0]
    E->F [arrowsize=2.0]
    #L->C [arrowsize=2.0]
    #L->D [arrowsize=2.0]
    R->E [arrowsize=2.0]
    #Z->E [arrowsize=2.0]
    #L->E [arrowsize=2.0]
    #Z->D [arrowsize=2.0]
    #R->D [arrowsize=2.0]
    X->U [arrowsize=2.0]
    R->X [arrowsize=2.0]
    Z->X [arrowsize=2.0]
    #L->X [arrowsize=2.0]
    U->C [arrowsize=2.0]
    F->H [arrowsize=2.0]
    G->H [arrowsize=2.0]
    HI->H [arrowsize=1.0]
    H->HO [arrowsize=1.0]
    #Z->C [arrowsize=2.0]
    I2->E [arrowsize=2.0]
    subgraph cluster_TTS {
        style=dotted
        {rank=same HI H HO}
    }
}
```
<center> Fig. 2: The Framework of the SLMTK 1.0 </center>
<br>

After finishing the seven steps, the SLMTK generates the parameters of the TTS models and creates a TTS demo system (in the ***[Knowledge-Rich Text-to-Speech Framework](#13-Knowledge-Rich-Text-to-Speech-Framework)***) on the SLMTK service. In addition, the SLMTK outputs the corpus label files that contain linguistic, acoustic, and prosodic-acoustic features.
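Since the label files are TextGrid files, they can be probed programmatically as well as in Praat. The following is a minimal, illustrative Python reader for Praat's long ("full text") TextGrid format. It is **not** an SLMTK tool: it handles only interval tiers, ignores point tiers and escaped quotes, and a maintained library (e.g., praatio or textgrid) is preferable for real work.

```python
"""Minimal reader for Praat's long ("full text") TextGrid format.

A sketch for probing label files, NOT an SLMTK tool: it reads only
IntervalTier items and skips point (TextTier) tiers.
"""
import re


def read_intervals(textgrid: str) -> dict:
    """Map each IntervalTier name to a list of (xmin, xmax, text) tuples."""
    tiers = {}
    # Split the file into per-tier chunks at each `item [n]:` header;
    # the leading chunk (file header) is discarded.
    for chunk in re.split(r'item \[\d+\]:', textgrid)[1:]:
        if '"IntervalTier"' not in chunk:
            continue  # skip TextTier (point) tiers
        name = re.search(r'name = "([^"]*)"', chunk).group(1)
        intervals = []
        # Each interval is an xmin/xmax/text triple; the tier-level
        # xmin/xmax lines do not match because no `text =` follows them.
        for m in re.finditer(
                r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"',
                chunk):
            intervals.append((float(m.group(1)), float(m.group(2)),
                              m.group(3)))
        tiers[name] = intervals
    return tiers
```

For example, `read_intervals(open("utt01.TextGrid").read())` (a hypothetical file name) returns a dictionary whose keys are the tier names, so the syllable boundaries and labels produced by the alignment step can be inspected with a few lines of code.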
The corpus label files are mostly saved in Praat’s TextGrid format, which makes them easy for users to inspect.

## 1.3. Knowledge-Rich Text-to-Speech Framework

The TTS framework (shown in Fig. 3) is inspired by the [speech production process](https://www.isca-speech.org/archive/speechprosody_2004/fujisaki04_speechprosody.html). The process can be interpreted with knowledge gained from linguistics and signal processing; we therefore call this framework the "knowledge-rich TTS framework."

```graphviz
digraph {
    compound=true
    rankdir=TD
    graph [ fontname="Source Sans Pro", fontsize=16 ];
    node [ fontname="Source Sans Pro", fontsize=36];
    edge [ fontname="Source Sans Pro", fontsize=16 ];
    subgraph core { T [label="text"] [shape=plaintext] }
    subgraph core { TA [label="TA: text analysis"] [shape=box style=bold margin=0.3] }
    subgraph core { L [label=" linguistic features"] [shape=plaintext margin=0.0] }
    subgraph core { PG [label="PG: prosody generation"] [shape=box style=bold margin=0.3] }
    subgraph core { P [label="prosody parameters"] [shape=plaintext margin=0.01] }
    subgraph core { SG [label="SG: speech parameter generation"] [shape=box style=bold margin=0.3] }
    subgraph core { A [label=" acoustic parameter"] [shape=plaintext margin=0.0] }
    subgraph core { WG [label="WG: vocoder/waveform generation"] [shape=box style=bold margin=0.3] }
    subgraph core { VM [label=" VM: vocoding model"] [shape=plaintext margin=0.0] }
    subgraph core { S [label="synthesized speech"] [shape=plaintext margin=0.0] }
    subgraph core { PGM [label="PGM: prosody generation model"] [shape=plaintext margin=0.2] }
    subgraph core { AM [label=" AM: acoustic model for speech synthesis "] [shape=plaintext margin=0.0] }
    subgraph cluster_MP {
        label = "message planning"
        fontsize = 30
        style="dashed"
        T TA L
        #{rank=same T TA L}
    }
    subgraph cluster_UP {
        label = "utterance planning"
        fontsize = 30
        style="dashed"
        P PG PGM
        #{rank=same T TA L}
    }
    subgraph cluster_MS {
        label = "motor command generation and\nspeech sound production"
        fontsize = 30
        style="dashed"
        AM SG A WG VM S
        #{rank=same T TA L}
    }
    T->TA [arrowsize=2.0]
    TA->L [arrowsize=2.0]
    L->PG [arrowsize=2.0]
    L->SG [arrowsize=2.0]
    PG->P [arrowsize=2.0]
    P->SG [arrowsize=2.0]
    SG->A [arrowsize=2.0]
    A->WG [arrowsize=2.0]
    WG->S [arrowsize=2.0]
    PGM->PG [arrowsize=2.0]
    AM->SG [arrowsize=2.0]
    VM->WG [arrowsize=2.0]
}
```
<center> Fig. 3: The Knowledge-Rich TTS Framework </center>
<br>

The TA extracts linguistic information from the input text. The linguistic information contains lexical, syntactic, and partly semantic features. Text is a representation organized by symbols defined by humans to record the information that a speaker wants to convey. The TA can thus be regarded as an inverse of the message planning process: it converts a message represented as text into the linguistic information that the speaker wants to convey.

The PG produces prosodic information using the prosody generation model (PGM), given the linguistic features extracted by the TA. The prosodic information considered in the SLMTK contains prosodic breaks, prosodic states, and syllable-based prosodic-acoustic features. Note that prosodic breaks and prosodic states together represent a prosodic structure. The syllable-based prosodic-acoustic features are the prosodic realizations of each prosodic unit (i.e., the syllable in the SLMTK). The PG is therefore analogous to the utterance planning process that converts planned messages into prosodic realizations.

The SG generates the speech parameters that control the fundamental frequency and spectral envelope, using the acoustic model for speech synthesis (AM). Last, the WG produces speech signals from the speech parameters with a vocoding model (VM). The SG and WG simulate humans’ motor command generation and speech sound production.

# 2. Tutorial Part-1

## 2.1. Registration & Login

* Please register at https://slmtk.ce.ntpu.edu.tw/login/signup.php.
* After registration, you may log in at https://slmtk.ce.ntpu.edu.tw/login/login.php with your username and password.
* If you forget your password, please [contact us](https://hackmd.io/@cychiang-ntpu/HkL_WdHf9). The SLMTK working group will help you.
* You can watch the following video for your reference: {%youtube sLWzhzaaTNw %}

## 2.2. Download the example wave and text files

* We provide example zipped text and wav files for your reference:
    * zipped \*.txt file `txt.zip` at: https://drive.google.com/file/d/1KnU2yxbHQm7_yQzR7qBvJHw0bm1RezVJ/view?usp=sharing
    * zipped \*.wav file `wav.zip` at: https://drive.google.com/file/d/1Z66_dJnlUG2GewSf66MJrWWqCpC0NMTF/view?usp=sharing
* You may unzip `wav.zip` and `txt.zip` to observe the 13 \*.wav and 13 \*.txt files. The \*.wav files were recorded by Chen-Yu Chiang (江振宇), the PI of the SLMTK. The corresponding material is mixed Chinese-English text.
* You can watch the following video for your reference: {%youtube TN4oRHJ7K0Y %}

## 2.3. Upload zipped files

* Go to the SLMTK 1.0 page: https://slmtk.ce.ntpu.edu.tw/slmtk.php and upload `txt.zip` and `wav.zip` to the website. ***Note that 1) the total size of `wav.zip` + `txt.zip` cannot exceed 50MB, and 2) the total length of the \*.wav files cannot exceed 10 minutes (600 seconds).*** See the snapshot below for your reference.
![](https://hackmd.io/_uploads/Bkv87iLXc.png)
* When the upload is done, you will see a webpage like the snapshot below:
![](https://hackmd.io/_uploads/SyPGs-zm5.png)
* You can watch the following video for your reference: {%youtube be2KAwENvAo %}

## 2.4. Set F0 range and run 0. initializing worksite

* The F0 upper and lower bounds of the example speech are 350 Hz and 60 Hz, respectively. You may set these values as in the following snapshot.
![](https://hackmd.io/_uploads/rklVaWMXq.png)
* You may download the uploaded files for checking, as shown in the following snapshot.
![](https://hackmd.io/_uploads/BytiRZGXq.png)
* You can watch the following video for your reference: {%youtube AQMNDwve7ts %}

# 3. Tutorial Part-2

## Step 1. Text Analysis (ta)

* This step takes less than 1 minute.
* When 0. initializing worksite is done and checked, you may press the Next button to execute 1. Text Analysis (ta) as shown in the following snapshot.
* When 1. Text Analysis (ta) is done, you may press the Download button to download the results of 1. Text Analysis.
* You can watch the following video for your reference: {%youtube vTGPrZFBFtc %}

## Step 2. Acoustic Feature Extraction (afe)

* This step takes less than 1 minute.
* When 1. Text Analysis (ta) is done and checked, you may press the Next button to execute 2. Acoustic Feature Extraction (afe) as shown in the following video. {%youtube 2WAQp7xg_yg %}

## Step 3. Linguistic-Speech Alignment (lsa)

* This step takes less than 1 minute.
* When 2. Acoustic Feature Extraction (afe) is done and checked, you may press the Next button to execute 3. Linguistic-Speech Alignment (lsa) as shown in the following video. {%youtube E0EqkU4op-0 %}
* After 3. Linguistic-Speech Alignment (lsa) is done and checked, you may download the TextGrid files and view them with Praat as shown in the following video. {%youtube ZEZcgAy7k9Y %}

## Step 4. Integration of Linguistic Features and Prosodic Features (ilp)

* This step takes around 2 minutes.
* When 3. Linguistic-Speech Alignment (lsa) is done and checked, you may press the Next button to execute 4. Integration of Linguistic Features and Prosodic Features (ilp) as shown in the following video. {%youtube bJOdZTRZx4Y %}

## Step 5. Prosody Labeling and Modeling (plm)

* This step takes around 10 minutes.
* When 4. Integration of Linguistic Features and Prosodic Features (ilp) is done and checked, you may press the Next button to execute 5. Prosody Labeling and Modeling (plm) as shown in the following video. {%youtube 5bemg7mqnkg %}
* You may download the prosody labeling result as shown in the following video. {%youtube FMZUcMytjmY %}

## Step 6. Construction of Prosody Generation Model (cpg)

* This step takes around 2 minutes.
* When 5. Prosody Labeling and Modeling (plm) is done and checked, you may press the Next button to execute 6. Construction of Prosody Generation Model (cpg) as shown in the following video. {%youtube Ii_VQ5J7Qk8 %}

## Step 7. HTS Training (hts)

* This step takes around 26 minutes.
* When 6. Construction of Prosody Generation Model (cpg) is done and checked, you may press the Next button to execute 7. HTS Training (hts) as shown in the following video. {%youtube guwHxOS5fGE %}

## Step 8. Deploy TTS

* After 7. HTS Training (hts) is done and checked, you may deploy the TTS system with the adapted prosody generation and speech synthesis models on the SLMTK 1.0 and play with the TTS on the SLMTK 1.0. Please watch the following video to see how to deploy and use the TTS system. You may enter mixed Chinese-English text in the TTS GUI to test the system. Here is an example text for your reference: `您好,我是江振宇。Hello! This is Chen-Yu Chiang. Nice to meet you.` {%youtube jsRAyjkbx5w %}

---

# 4. SLMTK Working Group

## SMSPLab, Department of Communication Engineering, National Taipei University, Taiwan

* Chen-Yu CHIANG (江振宇)
    * Principal investigator/Associate Prof./Dept. of Communication ENG
    * System design, prosody generation
* Wu-Hao Lee (李武豪)
    * PhD student
    * Text analysis, acoustic modeling, text-speech alignment, tone recognition
* Yan-Ting Lee (林彥廷)
    * PhD student
    * Acoustic modeling, vocoding modeling
* Shu-Lei Lin (林書磊)
    * Master student
    * Cross-language attribute detection, voice conversion
* Jia-Jyu Su (蘇家駒)
    * Undergraduate student
    * Frontend and backend development for the SLMTK 1.0 and VoiceBank websites
* Pin-Han Lin (林品翰)
    * Master student
    * Frontend and backend development for the SLMTK 1.0 and AAC websites
* Shao-Wei Hong (洪紹瑋)
    * Master student (Alumni, CE'21)
    * Speech recognition, text normalization

## AcoustInTek Co., Ltd.
* Wen-Yang Chang (張文陽)
    * Founder
    * Corpus design, recording engineering
* Cheng-Che Kao (高晟哲)
    * Co-founder
    * System integration, corpus processing
* Wei-Cheng Chen (陳韋成)
    * Engineer
    * TTS system design
* Jen-Chieh Chiang (江仁杰)
    * Engineer
    * TTS system design

## Other parties

### National Yang Ming Chiao Tung University

* Guan-Ting Liou (劉冠廷)
    * PhD student
    * Corpus design
* Yih-Ru Wang (王逸如)
    * Associate Prof./EECS
    * Consultant
* Sin-Horng Chen (陳信宏)
    * Prof./EECS
    * Consultant

### udnDigital Co., Ltd., Taiwan

* Yen-Ting Lin (林衍廷)
    * Engineer
    * Corpus processing

### Individuals

* Kenny Peng (彭康硯)
    * Scientist, Yahoo
    * Consultant
* Jefferey Sac Chang (張軍毅)
    * THT (Taiwan Hacker Tech)
    * Consultant

# 5. Learning Materials

## [A Short Introduction to Spoken Language Processing](https://hackmd.io/@cychiang-ntpu/An-Introduction-to-Spoken-Language-Processing)

# Contact

{%hackmd @cychiang-ntpu/SLMTK-Contact %}

# Acknowledgements

This work was mainly supported by the MOST of Taiwan under Contract Nos. MOST-109-2221-E-305-010-MY3 and MOST-109-3011-F-011-001. Part of this work was also supported by the MOST of Taiwan under Contract No. MOST-110-2627-M-006-007. The authors would like to thank Mr. Y.-S. Gong and Ms. Z.-Y. Tseng for their helpful guidance.