# Project Save and Sound (Project Revoice): Constructing Personalized Mandarin Text-to-Speech Systems for ALS Patients (Draft)
> C. Y. Chiang et al., “Project Save and Sound: Constructing Personalized Mandarin Text-to-Speech Systems for ALS Patients,” prepared to submit to *Augmentative and Alternative Communication*
Chen-Yu Chiang$^1$, Yen-Ting Lin$^1$, Wu-Hao Lee$^1$, Cheng-Che Kao$^2$, Wei-Cheng Chen$^2$, Jen-Chieh Chiang$^2$, Guan-Ting Liou$^3$, Pin-Han Lin$^1$, Shu-Lei Lin$^1$, Shao-Wei Hong$^1$, Jia-Jyu Su$^1$
$^1$Department of Communication Engineering, National Taipei University, Taiwan
$^2$AcoustInTek Co., Ltd., Taiwan
$^3$National Yang Ming Chiao Tung University, Taiwan
Table of Contents
------
[TOC]
## Author Note
This work was mainly supported by the MOST of Taiwan under Contract No. MOST-109-3011-F-011-001. Part of this work was also supported by the MOST of Taiwan under Contract No. MOST-109-2221-E-305-010-MY3. We have no conflicts of interest to disclose. Correspondence concerning this article should be addressed to Chen-Yu Chiang, Department of Communication Engineering, National Taipei University, No. 151, University Rd., Sanxia Dist., New Taipei City 237303, Taiwan (R.O.C.). Email: cychiang@mail.ntpu.edu.tw
## Abstract
Patients diagnosed with amyotrophic lateral sclerosis (ALS) gradually lose the ability to control their muscles. Losing control of vocal fold vibration and the shape of the vocal tract makes it difficult to pronounce words and to communicate smoothly. Project Save and Sound provides opportunities for ALS patients to record and rebuild their own unique voices, i.e., customized text-to-speech (TTS) systems. After text is input to a constructed customized TTS system, the system generates voices carrying the patient's speaker identity. During the running of the project, from April 2020 to August 2021, we built customized TTS systems for the 20 enrolled patients. From the experience of this project, we found that speech disorders at different stages of ALS affect the similarity of the rebuilt voices. To make the synthesized voices more similar to their own unique voices, we encourage ALS patients to preserve their voices as early as possible, before speech disorders become severe.
> Keywords: text-to-speech, personalized, Mandarin, amyotrophic lateral sclerosis (ALS), motor neuron disease
# Introduction
Everyone has his/her unique voice. The features of speech, which convey information to others, can be categorized into linguistic, para-linguistic, and non-linguistic features. Linguistic features carry lexical, syntactic, semantic, and pragmatic information. Para-linguistic features carry intentional, attitudinal, and stylistic information. Non-linguistic features carry physical and emotional information. Together, these features reflect the physical size, age, gender, race, intellectual ability, and personality of the speaker. ALS patients gradually lose the ability to control their muscles, which affects control of the glottal folds and the shape of the vocal tract, making it difficult to pronounce words and to communicate smoothly.
In English-speaking countries, many research institutes and commercial companies provide services to build personalized TTS systems for ALS patients. Although the AI boom in recent years has made Mandarin TTS systems perform much better than before, services and techniques for building personalized TTS systems for ALS patients are rarely mentioned. The main reasons are that the speech data ALS patients can provide is extremely scarce and that personalized TTS techniques struggle to deliver acceptable and usable services for patients with so little data. Thus, Project Save and Sound has built a complete set of services for making personalized TTS systems, aiming to overcome the lack of data from patients. With professional recording tools, specially designed texts, and dedicated staff to assist with voice recording, we collect and preserve precious speech data, combine it with modern speech synthesis technologies, and provide a service that lets ALS patients and their families speak with their unique voices after entering text on an augmentative and alternative communication (AAC) device.
---
## Timeline
This subsection briefly describes previous events related to this project.
* 2002: ModelTalker, from Nemours Biomedical Research and the University of Delaware, started to provide a voice banking service.
* 2014: The Ice Bucket Challenge drew public attention to the issues faced by ALS patients.
* 2018: Project Revoice initiated a platform to introduce several voice cloning services for English-speaking ALS patients.
* 2018: The Ministry of Science and Technology (MOST), Taiwan, started the technology breakthrough project “Life Fix experimental project series — case 1: Developing and integrating the smart communication systems for ALS patients.” Voice conversion technology was applied to convert voices from commercial TTS systems to those of specific ALS patients.
* 2020-2021: The MOST, Taiwan, funded the Save and Sound (Revoice Taiwan) project (this study), which created 20 personalized TTS systems for the enrolled ALS patients.
---
## Related Projects, Services, and Technologies
### *Projects*
* Project Revoice: A platform built by the ALS Association in the US. It intends to let patients know that technologies exist to build customized TTS systems for them, encourages patients to contact the relevant institutes or vendors as soon as possible, and provides recommendation information.
* Voice Banking: A project proposed by the Motor Neurone Disease (MND) Association in the UK. The MND Association recommends that patients *deposit* their voices and messages.
### *Services*
* Model Talker (www.modeltalker.org):
The largest research platform in US, established by the Nemours Speech Research Laboratory located at the Alfred I. duPont Hospital for Children in Delaware, US.
* Cereproc Cerevoice ME (www.cereproc.com):
A private company established in 2005 in the UK that provides voice cloning technology in English, Spanish, French, Swedish, Italian, and Romanian; it is also known for its customized TTS service for children with cerebral palsy.
* VocalID (https://vocalid.ai/):
A company located in Massachusetts, US, founded in 2014, well known for its approach to voice banking.
* Acapela my-own-voice DNN (https://www.acapela-group.com/):
A customized TTS platform based on deep neural network technology launched by Acapela Group in 2020. Acapela Group was officially founded in 2004 as a merger of three European companies: Babel Technologies (BE), Elan speech (FR) and Infovox (SWE).
* The Voice Keeper (thevoicekeeper.com):
A company founded in 2011 in Israel. Since 2015, it has provided customized TTS services for ALS patients. It can build a TTS system with a voice like the patient's by using only 3 minutes of the patient's speech.
* SpeakUnique (https://www.speakunique.co.uk/about-us/)
Note that SpeakUnique and VocaliD provide speech repair for mildly to moderately impaired speech.
### *State of the Art TTS Technologies*
Digital text-to-speech technologies have been developed for over 40 years.
Thanks to advances in big data and deep learning, the TTS paradigm has shifted from a pipeline of cascaded machine learning models to a pipeline of cascaded deep learning models, or even a single end-to-end deep learning structure.
Fig. x depicts a general framework for an end-to-end TTS system.
The end-to-end DNN frameworks disregard the handcrafted designs of text analysis, prosody generation, and speech parameter generation. With a large training speech corpus, an end-to-end DNN can learn the mapping from raw text to speech parameters, extracting knowledge from the observed data and storing it in the network's weights and biases. Well-known end-to-end TTS systems include Transformer TTS, Tacotron, and FastSpeech.
Since non-professional speakers and ALS patients generally cannot utter large amounts of speech, the model adaptation approach is suitable for constructing personalized TTS systems. There are mainly two methods to perform speaker adaptation with little data. One is fine-tuning a pre-trained model [1]. The other is adopting speaker embeddings to model the information of each speaker. Usually, the speaker encoders are trained jointly with the TTS pipeline [2], [3]. However, one can also borrow encoders from speaker verification [3], [8], e.g., i-vectors [4], x-vectors [5], LDE [6], GE2E [7], and use those embeddings as extra inputs to the TTS model. With this approach, one can modify the embedding to interpolate or extrapolate the voices of different speakers, i.e., zero-shot TTS. A performance evaluation of different encoders can be found in [9].
Regarding vocoders, new waveform generation models based on the diffusion process have been proposed, e.g., [10], [11]. Recently, some researchers have combined the source-filter model or the harmonic-plus-noise model with GANs [12]-[14]. Most vocoders directly generate the speech signal from a raw spectrum.
Most previous studies constructed personalized TTS systems with corpora from speakers without difficulties in speaking.
Speakers diagnosed with ALS may produce creaky voice in which F0 manifests jitter and shimmer. Besides, the spectral envelope, which represents vocal tract information, can be severely distorted by ALS patients' reduced motor control.
---
## Previous Study Regarding Mandarin Personalized TTS for ALS Patients
Before 2018, projects and services regarding personalized TTS for ALS patients were mainly devoted to English-speaking users. Inspired by Project Revoice, the Ministry of Science and Technology, Taiwan, and the Taiwan Motor Neuron Disease Association initiated a project to construct smart communication systems for ALS patients, named “Life Fix experimental project series — case 1: Developing and integrating the smart communication systems for ALS patients." As shown in Fig. 1, the project constructed a prototype personalized TTS system built with voice conversion, speech denoising, and super-resolution techniques (Huang et al., 2019). This system was designed to convert the voice identity of a commercial TTS system, recorded by a professional speaker, to that of the target ALS patient. The speech corpora collected for training this voice conversion were usually noisy, recorded at low sample rates, and compressed by codecs, because they were recorded in the patients' daily lives with cellular phones or handheld video cameras. Therefore, the system applies upsampling (super-resolution) to expand the bandwidth of the converted speech. The system is advantageous in the following three respects:
1. Constructing a TTS system with high speech quality requires a large speech corpus, which ALS patients usually cannot provide. Voice conversion may require much less speech data to construct an acceptable system than constructing a TTS system does. Therefore, using an existing commercial TTS system to synthesize speech and converting the synthesized speech to the voice identity of the target speaker is economical.
2. Denoising the codec-compressed and noisy speech made it possible to use as much of the patients' limited speech recordings as possible in constructing the voice conversion mechanism.
3. Super-resolution may increase the quality of the speech generated by the commercial TTS system and converted by the voice conversion.
```graphviz
digraph {
compound=true
graph [ fontname="Source Sans Pro", fontsize=12 ];
node [ fontname="Source Sans Pro", fontsize=12];
edge [ fontname="Source Sans Pro", fontsize=12 ];
subgraph core {
I1 [label="text" fixedsize=false] [shape=plaintext]
}
subgraph core {
T [label="General Purpose TTS\n(provided by Google or Microsoft)" fixedsize=false] [shape=box]
}
subgraph core {
O1 [label="synthesized voice\n (professional speaker)" fixedsize=false] [shape=plaintext]
}
subgraph core {
VC [label="voice conversion" fixedsize=false] [shape=box]
}
subgraph core {
O2 [label="converted voice of\n target speaker\n (with low bandwidth)" fixedsize=false] [shape=plaintext]
}
subgraph core {
SR [label="upsamping\n(super-resolution)" fixedsize=false] [shape=box]
}
subgraph core {
O3 [label="converted voice of\n target speaker\n (with high bandwidth)" fixedsize=false] [shape=plaintext]
}
subgraph core {
VM [label="voice conversion\nmodel" fixedsize=false] [shape=cylinder]
}
subgraph core {
SM [label="super-resolution\nmodel" fixedsize=false] [shape=cylinder]
}
I1->T->O1->VC->O2->SR->O3
VM->VC
SM->SR
}
```
###### Fig. 1: A personalized TTS system constructed from a commercial TTS system augmented with voice conversion and bandwidth expansion.
```graphviz
digraph {
compound=true
rankdir=TD
graph [ fontname="Source Sans Pro", fontsize=12 ];
node [ fontname="Source Sans Pro", fontsize=12];
edge [ fontname="Source Sans Pro", fontsize=12 ];
subgraph core {
I1 [label="speech of target speaker\n(noisy and low-bandwidth)" fixedsize=false] [shape=plaintext]
}
subgraph core {
Ix [label="text associated with\nspeech of target speaker" fixedsize=false] [shape=plaintext]
}
subgraph core {
GPT [label="General Purpose TTS\n(provided by Google or Microsoft)" fixedsize=false] [shape=box]
}
Ix->GPT
subgraph core {
I2 [label="synthesized speech of professional speaker" fixedsize=false] [shape=plaintext]
}
GPT->I2
subgraph core {
T [label="speech denoising" fixedsize=false] [shape=box]
}
subgraph core {
SDM [label="speech denoising model" fixedsize=false] [shape=cylinder]
}
subgraph core {
O1 [label="denoised speech of target speaker" fixedsize=false] [shape=plaintext]
}
subgraph core {
VC [label="training of voice conversion" fixedsize=false] [shape=box]
}
subgraph core {
VM [label="voice conversion model" fixedsize=false] [shape=cylinder]
}
I1->T->O1
SDM->T
O1->VC
I2->VC
VC->VM
}
```
###### Fig. 2: The training of the voice conversion model used in Fig. 1.
> Huang, B.-H., Liao, Y.-F., Deng, G.-F., Pleva, M., & Hládek, D. (2019). 適合漸凍人使用之語音轉換系統初步研究 (Deep Neural-Network Bandwidth Extension and Denoising Voice Conversion System for ALS Patients). *International Journal of Computational Linguistics & Chinese Language Processing*, 24(2). https://aclanthology.org/2019.ijclclp-2.3
Besides the voice conversion-based system, the project also constructed the Voice Bank website (Liao, n.d.) to encourage ALS patients to record their voices as early as possible. The website also encourages volunteer speech donors to register and contribute their voices, increasing the variety of speaker identities as much as possible. With these collected corpora, patients may use *proxy* TTS systems built from donors' voices that are similar to their own. In addition, the performance of the voice conversion-based system may be improved by incorporating these large collected corpora into the training.
> Liao, Y.-F. (n.d.). Voice Bank. https://ivoice.tw/
---
## Difficulties
From the discussion in the previous sections, we can see that there were no projects or commercial services directly providing personalized Mandarin TTS construction services. Although the Life Fix project had already built voice conversion technology to transform voices generated by commercial TTS systems, together with a voice banking platform to collect voices from speech donors, the project encountered the following difficulties:
1. Data scarcity in ALS patients' voices: the Life Fix project was a pilot study that only collected the speech corpora of two ALS patients. The collected corpora were all recorded by cellphones or handheld video cameras with low speech quality, and the usable speech corpus for each ALS patient was at most 16 minutes. This data scarcity limited the performance of the voice conversion.
2. Limits of the technologies: the project had already applied deep learning to voice conversion, speech denoising, and super-resolution; note that the speech denoising and super-resolution models were trained with large amounts of speech data. Even so, the personalized TTS systems constructed for the enrolled ALS patients were still far from practical use, and they were never used by the enrolled patients in their daily lives.
3. Data scarcity in speaker variety of the Voice Bank: in the one-year project, the Voice Bank did not collect enough speech corpora to cover the speaker identities that the ALS patients needed.
The above three difficulties pushed the Ministry of Science and Technology, Taiwan, and the Taiwan Motor Neuron Disease Association to seek a personalized TTS solution that directly provides Mandarin-speaking ALS patients with speech recording and personalized TTS construction services. It is hoped that this solution can be applied to each ALS patient before severe dysarthria occurs. The solution still faces the following problems:
1. Constructing a high-quality deep learning-based TTS system needs a large speech corpus for the target speaker. For a non-professional speaker, it is hard to record a large speech corpus in a short time.
2. ALS patients generally need to make more effort than others to record speech, even when no significant dysarthria has occurred.
3. ALS patients may not be able to speak fluently, and the degree of speech fluency depends on each patient's condition. This makes the speech data hard to use in constructing a high-quality personalized TTS system.
4. Speech recording is usually done in a professional recording studio, which is inconvenient for ALS patients to commute to.
We therefore proposed the Save and Sound project to overcome the above-mentioned difficulties.
---
## Purpose of the Present Study
With the experience derived from the previous projects, we conclude that ALS patients should be encouraged to record their voices before dysarthria occurs. The present study aims to provide solutions to the difficulties discussed in the previous subsection. We address the purpose of this study in the following points:
1. Variety of the enrolled ALS patients: we invited ALS patients in different conditions regarding 1) the degree of dysarthria and 2) whether or not respirators are used.
2. Proxy speaker: if an enrolled ALS patient has difficulty speaking fluently (probably because dysarthria has occurred) and accepts using synthesized speech with the speaker identity of a close relative, we encourage the patient to use the relative's voice as a "proxy voice." For example, a female patient may adopt her daughter's voice as the proxy voice because a mother and her daughter may have similar voice identities.
3. Customized corpus design: we designed various types of recording text materials suitable for ALS patients according to their conditions. The corpora recorded from these text materials should cover all initials, finals, and tones of Mandarin Chinese. The corpus for a speaker needs to be economical: small in size but covering a wide phonetic variety that represents the speaker's identity well.
4. On-site recording: we provide on-site recording services for the enrolled ALS patients, so the patients do not need to commute to a studio. Recording engineers guide the ALS patients during recording. This service ensures high-quality recordings that can be fully used in constructing a personalized TTS system.
5. Making the constructed personalized TTS systems usable in patients' daily lives rather than just experimental prototypes: this project integrates the advantages of deep learning technologies with a speech labeling and modeling framework to construct personalized TTS systems from the limited amounts of speech recorded by the patients. The speech labeling and modeling framework, developed by the SMSPLab of NTPU, Taiwan, provides sophisticated phonetic and prosodic labelings that are crucial for constructing TTS systems. Deep learning-based models are used to construct the mapping from the text input to the labelings produced by this framework. The proposed method has already been applied to construct a commercial TTS system used in the media for news reading. We therefore expect that the method may also benefit ALS patients.
---
# Method
## Overview
We summarize the method of constructing personalized TTS systems for the ALS patients according to the research progress made from April 2020 to August 2021. An overview of the method is shown in Fig. 3. The project was conducted in five parts:
1. participant enrolment,
2. corpora design and speech assessment,
3. corpora recording and preprocessing,
4. construction of personalized text-to-speech models, and
5. uses/tests of the constructed personalized TTS systems.
The project started with "1. participant enrolment" and "2. corpora design and speech assessment." The first and second parts were conducted at the same time. Part 1, "participant enrolment," was conducted by the Taiwan Motor Neuron Disease Association, which invited ALS patients to join the project. Part 2, "corpora design and speech assessment," was mainly operated by SMSPLab, NTPU, and AcoustInTek Co., Ltd., Taiwan. After several participants agreed to join the project, the third part, "3. corpora recording and preprocessing," was conducted to record the participants' voices, and the recordings were edited to be usable for the fourth part, "4. construction of personalized text-to-speech models." Last, in the fifth part, the constructed personalized TTS systems were examined by the users, i.e., the enrolled ALS patients and their caregivers. The performance of a constructed TTS system may be unsatisfactory for the following two reasons:
1. The enrolled speakers did not record enough speech data.
2. The text materials used for recording were not suitable for eliciting speech that manifests the patients' own voice identities.
Therefore, some of the enrolled patients recorded more than one recording session.
```graphviz
digraph {
compound=true
rankdir=TD
graph [ fontname="Source Sans Pro", fontsize=12 ];
node [ fontname="Source Sans Pro", fontsize=12];
edge [ fontname="Source Sans Pro", fontsize=12 ];
subgraph core {
S [label="Start"] [shape=oval]
}
subgraph core {
A [label="2. Corpora Design and Speech Assessment"] [shape=box margin=0.2]
}
subgraph core {
B [label="3. Corpora Recording and Preprocessing"] [shape=box margin=0.2]
}
subgraph core {
C [label="4. Construction of Personalized\nText-to-Speech Models
"] [shape=box margin=0.15]
}
subgraph core {
D [label="5. Uses/Tests of Constructed Personalized TTS systems"] [shape=box margin=0.2]
}
subgraph core {
E [label="End"] [shape=oval]
}
subgraph core {
R [label="1. Participants Enrolment"] [shape=box margin=0.2]
}
S -> A
{
rank=same
S;A;
}
S -> R
{
rank=same
S;R;
}
R -> B
A -> B -> C -> D
D -> E
{
rank=same
D;E;
}
D -> B
}
```
###### <center> Fig. 3: The overview of the proposed method </center>
---
## Participants Enrolment
The participant enrolment was organized by the Taiwan Motor Neuron Disease Association (TMNDA). The TMNDA started to call for participation with the flyer https://hackmd.io/v8EbM8ibSeqawGDJePsqDw on Feb. 16, 2020. Social workers of the TMNDA asked their clients about their willingness to participate in the project. The maximum number of enrolled patients was set to 20, and we hoped that at least ten personalized TTS systems could be constructed and tested by the participants. The project was planned to run for one year in three phases, each lasting four months. We planned to construct at least three, three, and four personalized TTS systems during the first, second, and third phases, respectively.
The project enrolled participants who were willing to test the feasibility of the proposed personalized TTS solution. Therefore, we expected to enroll ALS patients in various conditions, described as follows:
1) The project welcomed and encouraged ALS patients who could still utter intelligible speech to participate. The enrolled patients may have different degrees of dysarthria or speech disorder. In addition, patients who use respirators may still utter very intelligible speech. We expected that the personalized TTS systems could be constructed from intelligible speech.
2) If an enrolled ALS patient could not utter intelligible speech and accepted using synthesized speech with a speaker identity close to his or her original one, we encouraged the patient to use the voice of a "proxy speaker." A patient's close relative can usually serve as the proxy speaker and record the speech used to construct the patient's personalized TTS system. For example, a female patient may adopt her daughter's voice as the proxy voice because a mother and her daughter may have similar voice identities.
The participant enrolment not only invited patients to join the project but also helped the patients understand how important it is to record their voices before a severe speech disorder makes them lose their voice identities. The research team also explained to the enrolled patients that the similarity of the technologically rebuilt voice to the patient's original voice may depend on the degree of speech disorder. "Banking" patients' voices makes reconstructing their voice identities possible.
## Corpora Design and Speech Assessment
### *The Loop with Adjustment*
In order to record the minimum number of utterances that efficiently capture a speaker's identity, we needed to design text materials suitable for ALS patients in various conditions. As shown in Fig. 4, corpora design and speech assessment form a loop in which the designed text passages for ALS patients' recording are adjusted according to the speech assessment of trial recordings.
We started by designing phonetically balanced text passages excerpted from the Sinica Treebank text corpus; each passage is close to a sentence in length. Then, we invited the patients to conduct trial recordings of 1 to 3 sentences with their smartphones. From the trial recordings, we found that the designed passages were too long for some patients who have difficulty speaking fluently. We therefore further designed two-syllable-word and monosyllable-word text materials to lighten those patients' load. In the meantime, we also finalized a speech assessment method with four degrees. Last, we provided the text materials most suitable for each enrolled patient to rehearse for the on-site recording.
```graphviz
digraph {
compound=true
rankdir=TD
graph [ fontname="Source Sans Pro", fontsize=12];
node [ fontname="Source Sans Pro", fontsize=12];
edge [ fontname="Source Sans Pro", fontsize=12 ];
subgraph core {
S [label="Start"] [shape=oval]
}
subgraph core {
A [label="Corpora Design (NTPU+AIT)"] [shape=box margin=0.2]
}
subgraph core {
B [label="Providing Text Material\nof the Designed Corpora (MNDA)"] [shape=box margin=0.2]
}
subgraph core {
C [label="Trial Recording (1 to 3 utterances)\nwith Smart Phones (Patients)"] [shape=box margin=0.2]
}
subgraph core {
D [label="Speech Assessment (NTPU+AIT)"] [shape=box margin=0.2]
}
subgraph core {
F [label="Providing Complete Text Materials of the Corpora\nfor Recording Rehearsal (NTPU+AIT)"] [shape=box margin=0.23]
}
subgraph core {
E [label="End"] [shape=oval]
}
S -> A
{
rank=same
S;A;
}
A -> B -> C -> D -> F
D -> A
A -> F
F -> E
{
rank=same
F;E;
}
}
```
###### <center> Fig. 4: The overview of the corpora design and speech assessment. </center>
### *Speech Assessment*
We classified enrolled speakers into four degrees according to the following features of the trial recordings (1 to 3 utterances):
1) speech intelligibility,
2) dysarthria,
3) fluency regarding prosody, and
4) speaking rate.
The above four features were measured subjectively according to the research team's experience in constructing TTS systems and speech analyses of the trial recordings of the first six ALS patients who participated in this project. The characteristics of the four degrees are shown in Table 1.
<br>
###### <center> Table 1: The four degrees and the corresponding characteristics </center>
| degree | characteristics |
| -------- | -------- |
| 1st | fluent speech as ordinary people |
| 2nd | disfluent in prosody |
| 3rd | light dysarthria but high speech intelligibility |
| 4th | dysarthria and low speech intelligibility |
### *The Designed Corpora*
The designed corpora are briefly described in Table 2.
###### <center> Table 2: Description of the designed corpora </center>
| <small>corpus name</small>| <small>degree by speech assessment</small>| <small>text examples</small> | <small>size</small> |
| ----------------------- | ---------------- | ----------------------------------------- |-----|
| <small>y1: 411 base syllables </small> | <small>3/4</small> | <small>知(ㄓ) 吃(ㄔ) 師(ㄕ) 日(ㄖˋ) ... 阿(ㄚ) </small>| |
| <small>w1: 2-syllable words</small> | <small>3/4</small> | <small>一樣 (ㄧˊ ㄧㄤˋ) ... 午安 (ㄨˇ ㄢ) </small> | |
| <small>s1: sentences A</small> | <small>1/2</small> | <small>以二號女友為例,她堅持不下廚,...就形成對立局面。</small> | <small>60 utterances </small>|
| <small>s2: sentences A-1</small> | <small>1/2</small> | <small>我在網路上看到一家新開的餐廳,… </small> | <small>75 utterances</small> |
| <small>s3: sentences B</small> | <small>3/4</small> | <small>但可以想像,不外乎是來自貧窮。 </small> | <small>15 utterances </small>|
---
## Corpora Recording and Preprocessing
It is known that most English-speaking personalized TTS services can be used from home with a quality microphone. Users may bank their voices with the apps provided by the service providers. The apps generally guide users through the voice banking process and monitor the recording quality in terms of volume, background noise, or even the correctness of pronunciation. ALS patients may use these services by themselves if they are able.
This project, however, did not build such apps. Instead, we provided on-site recording services for the enrolled ALS patients: the patients may bank their voices at home with in-person help from the research team. On-site recording was adopted for the following reasons:
1) This project is a pilot study aiming to construct personalized Mandarin TTS systems that can be used in ALS patients' daily lives; apps for recording are not yet ready.
2) It is not easy for most of the enrolled patients to commute to a professional recording studio. On-site recording may increase the patients' willingness to bank their voices.
3) ALS patients make more effort in reading text materials than ordinary people do. On-site, in-person assistance during recording may reduce the patients' workload in reading, speaking, and recording.
4) The performance of a TTS system depends on the quality of the recording. The on-site recording engineers can ensure that all recording setups meet the high-quality requirements.
5) Taiwan is densely populated. Besides room acoustics such as reverberation, echo, and natural room modes, recording at home can be interfered with by environmental noise from neighbors or outdoors. The on-site recording engineers can help reduce the pickup of unwanted interference.
Fig. 5 shows the flowchart of corpora recording and preprocessing. First, for each enrolled patient, the recording venue was arranged by the patient and the research team together. The venue could be the patient's home, the TMNDA office, or another suitable place. Then, the research team went to the appointed venue and recorded the enrolled patient's speech with a quality directional headset microphone, a high-SNR mixer, and a laptop. All of these recording devices were connected through digital terminals to ensure a high-SNR recording.
For each arranged on-site recording session, two or three recording engineers joined to assist with corpora recording and preprocessing. One engineer controlled the recording devices. Another engineer helped or instructed the enrolled speaker in reading the text materials of the custom-designed corpora. If available, a third on-site engineer preprocessed the recorded speech, editing the recorded wave files and correcting the texts according to the recorded speech if necessary. If the amount of recorded speech was not enough, or the enrolled speaker was able to record more, additional recording sessions could be arranged.
```graphviz
digraph {
compound=true
rankdir=TD
graph [ fontname="Source Sans Pro", fontsize=12];
node [ fontname="Source Sans Pro", fontsize=12];
edge [ fontname="Source Sans Pro", fontsize=12 ];
S [
label = "Start";
shape = oval;
];
A [
label = "Appointment of Recording (MNDA+AIT)";
shape = box;
];
B [
label = "Corpora Recording at Appointed Site \n(Patients' Home or MNDA) (Patients+AIT)";
shape = box;
margin =0.15;
];
C [
label = "Speech Waveform Editing and Preprocessing (AIT)";
shape = box;
margin = 0.15
];
D [
label = "Correction of Texts According to the Recordings (AIT)";
shape = box;
];
F [
label = "Corpura Enough?(NTPU+AIT)";
shape = diamond;
];
E [
label = "End\n(Corpora\nRecording\nFinished)";
shape = oval;
];
S -> A -> B -> C -> D -> F
F -> A [label="No"]
#{
# rank=same;
# F; E
#}
F -> E [ label = "Yes" ];
}
```
###### <center> Fig. 5: The overview of the corpora recording and preprocessing. </center>
---
## Construction of Personalized Text-to-Speech Models
### *The Knowledge-Rich TTS Framework*
Fig. 6 shows the TTS framework adopted for the personalized TTS systems. The framework is inspired by the speech production process [15,16], which can be interpreted with knowledge from linguistics and signal processing. We therefore call this framework the “knowledge-rich TTS framework.” The framework maps text to speech through the following function composition:
$$speech=TTS(text)=WG(SG(PG(TA(text))))$$
The functions are:
1) TA: text analysis
2) PG: prosody generation
3) SG: speech parameter generation
4) WG: waveform generation
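To make the composition concrete, the following minimal Python sketch wires the four stages together in the order given above. The stage interfaces (function names, argument lists, and the placeholder feature containers) are illustrative assumptions, not the project's actual implementation.

```python
# A minimal sketch of speech = WG(SG(PG(TA(text)))); all stage bodies are stubs.
from dataclasses import dataclass
from typing import List

@dataclass
class LinguisticFeatures:        # output of TA (assumed container)
    syllables: List[str]         # e.g., pinyin with lexical tones
    pos_tags: List[str]

@dataclass
class ProsodyParameters:         # output of PG (assumed container)
    breaks: List[int]            # prosodic break types
    prosodic_states: List[int]   # quantized pitch/duration/energy states

def text_analysis(text: str) -> LinguisticFeatures: ...                        # TA
def prosody_generation(ling, speaking_rate: float) -> ProsodyParameters: ...   # PG
def speech_parameter_generation(ling, prosody): ...                            # SG
def waveform_generation(acoustic_params): ...                                  # WG

def tts(text: str, speaking_rate: float = 1.0):
    ling = text_analysis(text)
    prosody = prosody_generation(ling, speaking_rate)
    acoustic = speech_parameter_generation(ling, prosody)
    return waveform_generation(acoustic)
```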
```graphviz
digraph {
compound=true
rankdir=TD
graph [ fontname="Source Sans Pro", fontsize=16 ];
node [ fontname="Source Sans Pro", fontsize=16];
edge [ fontname="Source Sans Pro", fontsize=16 ];
subgraph core {
T [label="text"] [shape=plaintext]
}
subgraph core {
TA [label="TA: text analysis"] [shape=box]
}
subgraph core {
L [label="L: linguistic features"] [shape=plaintext]
}
subgraph core {
PG [label="PG: prosody generation"] [shape=box]
}
subgraph core {
P [label="P: prosody parameters"] [shape=plaintext]
}
subgraph core {
SG [label="SG: speech parameter generation"] [shape=box]
}
subgraph core {
A [label="A: acoustic parameter"] [shape=plaintext]
}
subgraph core {
WG [label="WG: vocoder/waveform generation"] [shape=box]
}
subgraph core {
VM [label="VM: vocoding model"] [shape=cylinder]
}
subgraph core {
S [label="y: synthesized speech"] [shape=plaintext]
}
subgraph core {
PGM [label="PGM: prosody generation model"] [shape=cylinder]
}
subgraph core {
AM [label="AM: acoustic model for speech synthesis"] [shape=cylinder]
}
T->TA
TA->L
L->PG
L->SG
PG->P
P->SG
SG->A
A->WG
WG->S
PGM->PG
AM->SG
VM->WG
}
```
###### <center> Fig. 6: The framework of the personalized TTS systems. </center>
### *Text Analysis (TA)*
The TA extracts linguistic information from the input text. The linguistic information contains lexical, syntactic, and partly semantic features. The TA consists of the following processes, applied in sequence:
1) [Rule-based text normalization (RBTN)](https://github.com/cewarman/NTPU_online_text_normalization): this process converts written-form text to spoken-form text.
2) Word segmentation and POS tagging (parser): this process segments the spoken-form text into word sequences with the corresponding part-of-speech labels.
3) Dictionary-based pronunciation labeling for Chinese: this process labels each Chinese word with Mandarin pinyin and the lexical tone of each syllable in the word.
4) Polyphone disambiguation: this process tags polyphonic Chinese characters with their correct pronunciations.
5) Tone sandhi labeling: this process relabels the tones of syllable sequences containing consecutive third tones (a minimal sketch follows this list).
6) POS tagging of English words: this process tags the POS of English words in mixed Chinese-English text.
7) Pronunciation labeling for English words: a hybrid of dictionary lookup and CMU letter-to-sound rules is implemented.
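As a concrete illustration of step 5, the sketch below applies the basic Mandarin third-tone sandhi rule: a syllable carrying tone 3 is relabeled as tone 2 when the following syllable also carries tone 3. This is only a minimal illustration of the rule, not the project's implementation; the actual labeling also has to respect word and phrase boundaries, which are not modeled here.

```python
def apply_tone3_sandhi(tones):
    """Relabel consecutive third tones: a tone-3 syllable followed by another
    tone-3 syllable (in the underlying sequence) surfaces as tone 2."""
    out = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            out[i] = 2
    return out

# Example: underlying tones 3 3 3 are relabeled as 2 2 3.
print(apply_tone3_sandhi([3, 3, 3]))  # -> [2, 2, 3]
```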
### *Prosody Generation (PG)*
The PG produces prosodic information using the prosody generation model (PGM), given the linguistic features extracted by the TA. The produced prosodic information contains prosodic breaks and prosodic states. The PG consists of the following processes, applied in sequence:
1) Break prediction
This process generates the prosodic breaks ($B$) given the linguistic features ($L$) derived by the TA and the speaking rate ($x$) specified by the user. The break prediction model is a decision tree that describes the probability $Pr(B|L,x)$ (a minimal sketch follows this list).
2) Prosodic state prediction
This process generates the prosodic state sequence ($P=\{p, q, r\}$) given the prosodic breaks ($B$) predicted by the break prediction and the user-specified speaking rate $x$, where $p$, $q$, and $r$ represent the prosodic state sequences for pitch, duration, and energy, respectively. Note that the prosodic states $p$, $q$, and $r$ can be viewed as quantized representations of the global intonation pattern of an utterance. The prosodic state prediction is powered by the SR-dependent prosodic state model ($Pr(P|B,x,\lambda_p)$), constructed with Markov models, and the prosodic state-syntax model ($Pr(P|L,\lambda_{PL})$), constructed with a decision tree model.
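The following sketch shows what a decision-tree break predictor of the form $Pr(B|L,x)$ might look like, using scikit-learn as a stand-in. The numeric encoding of the linguistic features and the toy data are assumptions for illustration only; the actual feature set and tree-growing procedure of the SR-HPM are richer than shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy features per syllable/word boundary (assumed encoding):
# [POS id of left word, POS id of right word, right word length, speaking rate x]
X = np.array([
    [1, 2, 2, 3.5],
    [2, 3, 1, 3.5],
    [3, 1, 4, 5.0],
    [1, 1, 2, 5.0],
])
# Break types B, e.g., 0 = no break, 1 = minor break, 2 = major break
y = np.array([0, 1, 2, 0])

break_model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Pr(B | L, x) for a new boundary at speaking rate x = 4.0
print(break_model.predict_proba([[2, 1, 3, 4.0]]))
```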
### *Speech Parameter Generation (SG) and Waveform Generation (WG)*
The SG generates the speech parameters that control the fundamental frequency and spectral envelope using the acoustic model for speech synthesis. The WG produces the waveform from the vocoding model, given the speech parameters generated by the SG. The speech parameters are generated by the following two processes, applied in sequence:
1) State duration generation: this process uses a deep neural network-based model to generate the durations of acoustic units given the linguistic features extracted by the TA and the prosodic information generated by the PG. In this study, the acoustic units are the initial and final for Mandarin syllables and the phoneme for English. Each initial, final, and phoneme is modeled by a five-state, one-state-forward-only hidden Markov model (HMM). The HMMs in this study are constructed by the HMM-based speech synthesis (HTS) method with the speaker adaptation technique CMAPLR (constrained maximum a posteriori estimation linear regression). The HMM state durations are obtained by forced alignment with the speaker-adapted HMMs (by the CMAPLR method).
2) Acoustic parameter generation: this process adopts a deep neural network-based model to generate frame-based acoustic parameters for the vocoder, given the predicted HMM state durations, the linguistic features extracted by the TA, and the prosodic information produced by the PG. The acoustic parameters include the spectral envelope represented by the mel-generalized cepstrum (MGC), the voiced/unvoiced flag, and logF0.
The structure of the deep neural network models for state duration generation and acoustic parameter generation is discussed in the subsection "Deep Neural Network-based Acoustic Models for Speech Synthesis." The WG in this study adopts the WORLD vocoder, which converts the spectral envelope derived from the MGC, the voiced/unvoiced flag, and logF0 into the speech waveform.
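A minimal sketch of the WG stage is given below, assuming the pyworld Python binding of the WORLD vocoder (the text only states that WORLD is used, not which binding). The spectral envelope is assumed to have been recovered from the MGC features beforehand, and a flat aperiodicity is supplied purely to make the call self-contained; neither detail is taken from the original description.

```python
import numpy as np
import pyworld

fs = 16000                 # sampling rate (assumed)
frame_period = 5.0         # frame shift in ms (assumed)
n_frames, fft_size = 200, 1024

log_f0 = np.full(n_frames, np.log(120.0))            # continuous logF0 from the SG stage
vuv = np.ones(n_frames)                              # voiced/unvoiced flags from the SG stage
sp = np.full((n_frames, fft_size // 2 + 1), 1e-4)    # spectral envelope (placeholder values)
ap = np.full((n_frames, fft_size // 2 + 1), 0.1)     # aperiodicity (not part of the parameters above)

# Unvoiced frames are given F0 = 0 so WORLD excites them with noise.
f0 = np.where(vuv > 0.5, np.exp(log_f0), 0.0)

waveform = pyworld.synthesize(f0, sp, ap, fs, frame_period)
```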
### *Speech Labeling and Modeling*
Fig. 7 shows the framework of speech labeling and modeling for constructing personalized TTS systems for the enrolled ALS patients. The inputs are the text and the associated speech, and the final output is a knowledge-rich TTS system powered by the trained prosody generation model and acoustic model for speech synthesis.
First, the 1) text analysis extracts linguistic features from the input text. The 2) acoustic feature extraction produces frame-based acoustic features such as the spectrogram, MFCCs, F0, and frame power. With the linguistic and acoustic features extracted, the 3) linguistic-speech alignment labels the beginning and end time instants of the speech units; here, the speech units are the initial and final for Mandarin syllables and the phoneme for English words. The time intervals of some upper-layer linguistic units, e.g., syllables, words, and sentence-like units (delimited by punctuation marks), are also obtained.
```graphviz
digraph {
compound=true
rankdir=TD
graph [ fontname="Source Sans Pro", fontsize=36 ];
node [ fontname="Source Sans Pro", fontsize=36];
edge [ fontname="Source Sans Pro", fontsize=36 ];
subgraph core {
I1 [label="text" fixedsize=false] [shape=plaintext]
}
subgraph core {
I2 [label="speech" fixedsize=false] [shape=plaintext]
}
subgraph core {
A [label="1) text analysis" fixedsize=false] [shape=box margin=0.2 style=bold]
}
subgraph core {
Q [label="2) acoustic feature extraction" fixedsize=false] [shape=box margin=0.2 style=bold]
}
subgraph core {
R [label="acoustic features" fixedsize=false] [shape=plaintext]
}
subgraph core {
L [label="linguistic features" fixedsize=false] [shape=plaintext]
}
subgraph core {
B [label="3) linguistic-speech alignment" fixedsize=false] [shape=box margin=0.2 style=bold]
}
subgraph core {
Z [label="linguistic-speech alignment" fixedsize=false] [shape=plaintext]
}
subgraph core {
X [label="4) integration of syllable-based linguistic and\nprosodic-acoustic features" fixedsize=false][shape=box margin=0.25 style=bold]
}
subgraph core {
U [label="syllable-based linguistic and prosodic-acoustic\nfeatures"][shape=plaintext margin=0.2]
}
subgraph core {
C [label="5) prosody labeling"] [shape=box margin=0.2 style=bold]
}
subgraph core {
T [label="prosody tag: prosodic break and\n prosodic state"] [shape=plaintext margin=0.2]
}
subgraph core {
D [label="6) construction of prosody\ngeneration model"] [shape=box margin=0.2 style=bold]
}
subgraph core {
E [label="7) construction of acoustic models\nfor speech synthesis"] [shape=box margin=0.25 style=bold]
}
subgraph core {
F [label="AM: acousitc model for speech synthesis"] [shape=plaintext margin=0.2]
}
subgraph core {
G [label="PGM: prosody generation model"] [shape=plaintext margin=0.2]
}
subgraph core {
H [label="Knowledge-Rich TTS" style=bold] [shape=box margin=0.2]
}
subgraph core {
HI [label="text"] [shape=plaintext margin=0.1]
}
subgraph core {
HO [label="synthesized speech"] [shape=plaintext margin=0.2]
}
I1->A [arrowsize=2.0]
R->B [arrowsize=2.0]
I2->Q [arrowsize=2.0]
Q->R [arrowsize=2.0]
A->L [arrowsize=2.0]
L->B [arrowsize=2.0]
B->Z [arrowsize=2.0]
C->T [arrowsize=2.0]
T->D [arrowsize=2.0]
D->G [arrowsize=2.0]
T->E [arrowsize=2.0]
E->F [arrowsize=2.0]
#L->C [arrowsize=2.0]
#L->D [arrowsize=2.0]
R->E [arrowsize=2.0]
#Z->E [arrowsize=2.0]
#L->E [arrowsize=2.0]
#Z->D [arrowsize=2.0]
#R->D [arrowsize=2.0]
X->U [arrowsize=2.0]
R->X [arrowsize=2.0]
Z->X [arrowsize=2.0]
#L->X [arrowsize=2.0]
U->C [arrowsize=2.0]
F->H [arrowsize=2.0]
G->H [arrowsize=2.0]
HI->H [arrowsize=1.0]
H->HO [arrowsize=1.0]
#Z->C [arrowsize=2.0]
I2->E [arrowsize=2.0]
subgraph cluster_TTS {
style=dotted
{rank=same HI H HO}
}
}
```
###### <center> Fig. 7: The flowchart of constructing personalized TTS systems. </center>
<br>
Next, the 5) prosody labeling requires observing supra-segmental prosodic-acoustic features to label the speech with a prosodic structure. Before the labeling, we extract syllable-based prosodic-acoustic and linguistic features in step 4), "integration of syllable-based linguistic and prosodic-acoustic features," because we regard the syllable as the basic unit of a prosodic segment. The syllable-based prosodic-acoustic features over an utterance therefore form the supra-segmental prosodic features used in prosody labeling.
Then, the 5) prosody labeling labels the speech corpus with the following prosodic tags: 1) prosodic break type and 2) prosodic state. Next, the prosody generation model, which consists of sub-models for break prediction and prosodic state prediction, is constructed to learn the mapping from linguistic features to prosodic information given the prosody tags. Furthermore, the deep neural network-based acoustic models for speech synthesis are constructed from the derived prosody tags, acoustic features, and input waveforms. For details of 6) the construction of the prosody generation model, please refer to the studies on the speaking rate-dependent hierarchical prosodic model (SR-HPM). The construction of the deep neural network-based acoustic models for speech synthesis was newly developed for this study and is described in the next subsection.
### *Deep Neural Network-based Acoustic Models for Speech Synthesis*
The HMM state duration generation model and the acoustic parameter generation model are implemented with the same structure, shown in Fig. 8. The differences between the two models are the input features, the prediction targets, and the cost function. Training these two models requires the HMM state alignment obtained by standard HMM-based speech synthesis (HTS) training with the speaker adaptation technique CMAPLR (constrained maximum a posteriori estimation linear regression). The training corpora for the HTS training and the deep neural network-based acoustic models comprise speech data of the enrolled ALS patients and two professional speakers. The corpus of each enrolled ALS patient is generally shorter than 30 minutes, while the corpora of the two professional speakers are each longer than 5 hours.
<!--[](https://i.imgur.com/MCkFWdD.png)-->
###### Fig. 8: The structure for deep neural network-based state duration model and acoustic model. Up: basic convolution block (ConvBlock); down: full structure.
The structure in Fig. 8 is composed of stacked convolution blocks (ConvBlocks), a left-to-right (causal) LSTM, and two feedforward layers, from input to output in sequence, with hyperbolic tangent activation functions inserted between the layers. Each ConvBlock has two input terminals and two output terminals. One input terminal accepts the input features ($l_2$) or the output of the preceding layer, while the other accepts the one-hot speaker vector ($s$), which selects the speaker embedding modeled by the speaker-dependent weights ($W$) and biases ($b$ and $e$) that represent the speaker's identity. One output terminal of a ConvBlock is connected to the input of the next ConvBlock or to the LSTM, while the other output terminal is a cross-layer (skip) connection toward the second-to-last layer. All of these cross-layer connections are summed and then fed into the last layer. The speaker embedding information is also integrated into the LSTM by adding speaker-dependent weights and biases.
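The sketch below is one possible PyTorch reading of the ConvBlock described above; it is an interpretation of Fig. 8, not the authors' code. The one-hot speaker vector selects speaker-dependent scale and bias terms that modulate the convolution output, and each block returns both the features passed onward and a skip-connection output that is summed across blocks.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, channels: int, n_speakers: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        # Speaker-dependent weights/biases (the W, b, e of the text), looked up via the one-hot s.
        self.spk_scale = nn.Linear(n_speakers, channels, bias=False)
        self.spk_bias = nn.Linear(n_speakers, channels, bias=False)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x, s):
        # x: (batch, channels, time); s: (batch, n_speakers) one-hot speaker vector
        h = self.conv(x)
        scale = self.spk_scale(s).unsqueeze(-1)
        bias = self.spk_bias(s).unsqueeze(-1)
        h = torch.tanh(h * scale + bias)
        return h, self.skip(h)      # (input to the next block, skip connection)

# Stacked blocks: the skip outputs are summed before the final layers.
# 22 speakers is illustrative only (e.g., enrolled patients plus professional speakers).
blocks = nn.ModuleList([ConvBlock(64, n_speakers=22) for _ in range(4)])
x, s = torch.randn(2, 64, 100), torch.zeros(2, 22)
s[:, 0] = 1.0
skip_sum = torch.zeros_like(x)
for block in blocks:
    x, skip = block(x, s)
    skip_sum = skip_sum + skip
```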
For HMM state duration generation, the last linear layer shown in Fig. 8 is followed by a softplus activation to ensure positive values for the five HMM state durations of each acoustic unit, and there is no LSTM block after the stack of ConvBlocks. The training criterion for HMM state duration generation is the minimum mean square error (MMSE). For acoustic parameter generation, the last linear layer shown in Fig. 8 is connected to two streams of activations that represent a mean vector and a variance vector for the 37-dimensional frame-based acoustic parameters. The acoustic parameters are:
1) 35-dimensional spectral envelope features: 34th-order mel-generalized cepstral coefficients
2) 1-dimensional voiced/unvoiced flag
3) 1-dimensional fundamental frequency feature: the continuous log F0 curve
The training criterion for the acoustic parameter generation is log likelihood.
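The log-likelihood criterion described above can be read as the negative log likelihood of a diagonal-covariance Gaussian whose per-frame mean and (log-)variance are the two output streams. The sketch below shows that loss for 37-dimensional frames; the constant term is omitted, and the exact parameterization used by the authors may differ.

```python
import torch

def gaussian_nll(target, mean, log_var):
    # target, mean, log_var: (batch, frames, 37); constant 0.5*log(2*pi) per dimension omitted
    return 0.5 * (log_var + (target - mean) ** 2 / torch.exp(log_var)).sum(dim=-1).mean()

# Random tensors standing in for ground-truth frames and the network's two output streams.
target = torch.randn(8, 200, 37)
mean = torch.randn(8, 200, 37, requires_grad=True)
log_var = torch.zeros(8, 200, 37, requires_grad=True)
loss = gaussian_nll(target, mean, log_var)
loss.backward()   # gradients flow to the mean and log-variance streams
```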
---
## Test of Personalized Text-to-Speech Systems
The tests of the constructed personalized TTS systems were arranged as shown in Fig. 9. First, we invited the enrolled patients to provide ten short text passages to the research team; the passages could be sentences used frequently in their daily communication. Then, for each provided text passage, the research team produced synthesized speech with the corresponding personalized TTS system at seven different speaking rates, from slow to fast. This synthesized speech at various speaking rates was used in a speaking rate survey (auditioned by the patients and their caregivers) that let each enrolled patient choose the most suitable speaking rate for his or her personalized TTS system. The survey result was used to set the speaking rate of the personalized TTS for the subsequent speaker similarity, willingness to use, and satisfaction surveys.
The research team deployed the constructed personalized TTS systems on a server located at SMSPLab, NTPU. This server provides a web-based personalized TTS GUI. The research team gave usernames and the corresponding passwords to the enrolled patients. The patients could log in to the web page to complete the four surveys and then use their personal TTS systems as AAC in their daily lives. The four surveys are the speaking rate survey, the speaker similarity survey, the willingness to use survey, and the satisfaction survey.
```graphviz
digraph {
graph [ fontname="Source Sans Pro", fontsize=12];
node [ fontname="Source Sans Pro", fontsize=12];
edge [ fontname="Source Sans Pro", fontsize=12 ];
S [
label = "Start";
shape = oval;
];
A [
label = "Providing ten short text passages to\nthe research team (ALS Patients)";
shape = box;
margin = 0.17;
];
B [
label = "Synthesizing speech given the texts\nin various speaking rates (NTPU)";
shape = box;
margin = 0.17;
];
C [
label = "Providing Username and Password\nto the ALS Patients (NTPU)";
shape = box;
margin = 0.17;
];
D [
label = "Loging in the web-based system:\nhttps://rvtw.ce.ntpu.edu.tw (ALS Patients)";
shape = box;
margin = 0.17;
];
F [
label = "Speaking Rate Survey (ALS Patients)";
shape = box;
];
G [
label = "Speaker Similarity Survey (ALS Patients)";
shape = box;
];
H [
label = "Willingness to Use Survey";
shape = box;
];
I [
label = "Satisfaction Survey";
shape = box;
];
E [
label = "End";
shape = oval;
];
S -> A
{
rank=same;
S; A
}
A -> B -> C -> D
D -> F -> G
G -> H -> I ->E
}
```
###### <center> Fig. 9: The flowchart for the test of personalized TTS systems. </center>
# Results
## Participants Enrolled and Corpora Collection
###### <center> Table ?: Numbers of Speakers in the Four Degrees of Speech Assessment </center>
| degree | 1st | 2nd | 3rd | 4th |
| --------| -------- | -------- | -------- | ----- |
| number | 11 | 5 | 4 | 1 |
| remark | remark1 | remark2 | remark2| remark2|
* remark1: 9 ALS patients + 2 proxy speakers
* remark2: ALS patients
## Speaker Similarity Survey
- ALS patients and their caregivers were asked to listen to the ten synthesized utterances generated by the personalized TTS systems from the text materials provided by the patients.
- For each synthesized utterance, the subject (an ALS patient or a caregiver) was asked to rate the utterance against the following statement:
    - The synthesized speech utterance is very close to the authentic speech uttered by the ALS patient.
- The rating is a 5-point mean opinion score (MOS):
    - 1. strongly disagree (the synthesized speech sounded like someone else)
    - 2. disagree
    - 3. neutral
    - 4. agree
    - 5. strongly agree (the synthesized speech sounded like the patient)
The result is shown in Table x.
###### <center> Table x: The 5-point speaker similarity survey result </center>
| group | M | N | 1 | 2 | 3 | 4 | 5 | MOS |
| ----- | -- | --- | --- | --- | --- | --- | --- | --- |
| all patients | 15 | 150 | 0% | 5% | 16% | 60% | 19% | 3.92 |
| patients 1st degree | 8 | 80 | 0% | 0% | 14% | 76% | 10% | |
| patients 2nd degree | 4 | 40 | 0% | 0% | 18% | 33% | 50% | |
| patients 3rd degree | 3 | 30 | 0% | 27% | 20% | 53% | 0% | |
| all caregivers | 17 | 170 | 6% | 9% | 22% | 49% | 14% | 3.55 |
| caregivers 1st degree | 7 | 70 | 0% | 21% | 24% | 51% | 3% | |
| caregivers 2nd degree | 6 | 60 | 0% | 2% | 15% | 55% | 28% | |
| caregivers 3rd degree | 4 | 40 | 25% | 0% | 30% | 35% | 10% | |
| patients + caregivers | 32 | 320 | 3% | 8% | 19% | 54% | 16% | |
> <p align="justify"> M represents the number of subjects; N represents the number of synthesized test utterances; columns 1-5 are the 5-point MOS ratings for speaker similarity. </p>
---
## Willingness to Use Survey
- This survey was designed to understand to what degree the enrolled patients and their caregivers were willing to use the constructed personalized TTS systems.
- For the patients, the survey question was: Do you agree that you are willing to use your personalized TTS system for AAC?
- For the caregivers, the survey question was: Do you agree that you are willing to encourage the patient to use his/her personalized TTS system for AAC?
- The rating is the 5-point mean opinion score:
    1. strongly disagree (low willingness)
    2. disagree
    3. neutral
    4. agree
    5. strongly agree (high willingness)
The survey results for the enrolled patients are shown in Table x while the results for the patients' caregivers are shown in Table x.
###### Table x: The 5-point Willingness to Use Survey result for the enrolled patients
| | M | 1 | 2 | 3 | 4 | 5 |
| ----- | -- | -- | -- | --- | --- | --- |
| all patients | 15 | 0% | 0% | 7% | 40% | 53% |
| 1st degree | 8 | 0% | 0% | 0% | 38% | 63% |
| 2nd degree | 4 | 0% | 0% | 0% | 50% | 50% |
| 3rd degree | 3 | 0% | 0% | 33% | 33% | 33% |
> <p align="justify"> M represents the number of subjects; columns 1-5 are the 5-point MOS ratings for the degree of willingness to use. </p>
###### Table x: The 5-point Willingness to Use Survey result for the patients' caregivers
| | M | 1 | 2 | 3 | 4 | 5 |
| ----- | -- | -- | -- | -- | --- | --- |
| all caregivers | 17 | 0% | 0% | 0% | 41% | 59% |
| 1st degree | 7 | 0% | 0% | 0% | 71% | 29% |
| 2nd degree | 6 | 0% | 0% | 0% | 17% | 83% |
| 3rd degree | 4 | 0% | 0% | 0% | 25% | 75% |
> <p align="justify"> M represents the number of subjects; columns 1-5 are the 5-point MOS ratings for the degree of willingness to use. </p>
---
## Satisfaction Survey
- This survey was designed to collect feedback from both the enrolled patients and their caregivers on whether they were satisfied with the performance of the constructed personalized TTS systems.
- The survey question for both the patients and the caregivers was: Do you agree that you are satisfied with the performance of the constructed personalized TTS system?
- The rating is the 5-point mean opinion score:
    1. strongly disagree (unsatisfied)
    2. disagree
    3. neutral
    4. agree
    5. strongly agree (satisfied)
###### Table x: The 5-point Satisfaction Survey result for the enrolled patients and their caregivers
| | M | 1 | 2 | 3 | 4 | 5 |
| ----- | -- | -- | --- | --- | --- | --- |
| all patients | 15 | 0% | 7% | 7% | 53% | 33% |
| patients 1st degree | 8 | 0% | 0% | 0% | 75% | 25% |
| patients 2nd degree | 4 | 0% | 0% | 25% | 0% | 75% |
| patients 3rd degree | 3 | 0% | 33% | 0% | 67% | 0% |
| all caregivers | 17 | 0% | 6% | 6% | 59% | 29% |
| caregiver 1st degree | 7 | 0% | 0% | 14% | 71% | 14% |
| caregiver 2nd degree | 6 | 0% | 0% | 0% | 50% | 50% |
| caregiver 3rd degree | 4 | 0% | 25% | 0% | 50% | 25% |
| all=all patients + all caregivers | 32 | 0% | 6% | 6% | 56% | 31% |
> <p align="justify"> M represents the number of subjects; columns 1-5 are the 5-point MOS ratings for the degree of satisfaction. </p>
# Discussion
## Speaker Similarity
We analyze the results of the speaker similarity survey with respect to the degrees defined by our speech assessment.
### *Degree 1*
The speech recorded by the enrolled patients of the 1st degree has the same properties as speech from people not diagnosed with ALS. Most of the 1st degree patients agreed (MOS=4) or strongly agreed (MOS=5) that the synthesized utterances are very close to their authentic speech. The caregivers of the enrolled patients, however, are very familiar with the patients' voice identities and therefore chose "neutral" (MOS=3) more often than the patients did.
Among the eleven 1st degree participants, two patients, i.e., speakers #003 and #016, have started to use the constructed personalized TTS systems, together with eye-tracking and accessibility keyboard systems, to communicate.
* Patient #003: uses a TTS built from his own voice and is currently using the system.
* Patient #016: uses speech synthesized from her daughter's voice and is currently using the system.
* Patient #009: reported that the system's pronunciation does not sound like the patient (jitter & shimmer).
### *Degree 2*
It is found the patients and the corresonding caregivers of the 2nd degree rated higher MOSs than the those of the 1st degree. The reason for this higher rating may be caused by the following two points:
1. Although the patients of the 2nd degree produced some disfluent utterances at a lower speaking rate, they tended to articulate each syllable more carefully and clearly. The prosody labeling and modeling method used in this study labels the utterances with break tags that describe the disfluent parts produced by the speakers (see the sketch after this list). This lets the speech synthesis model be trained on cleaner, more stable data and generate well-enunciated speech.
2. The patients of the 2nd degree generally spoke more slowly, so the constructed personalized TTS systems generated synthesized utterances with more fluent prosody than the patients themselves produced.
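As an illustration of the break-tag idea in point 1, the sketch below labels a short utterance with hypothetical break tags at the positions where the speaker paused; the tag names and the actual prosody-label inventory used in this study are assumptions here.

```python
# Hypothetical labeled utterance from a 2nd-degree patient: "B2"/"B3"
# mark the pauses (disfluencies) produced while reading. The tag names
# are illustrative only.
labeled = "今天 B2 天氣 B3 真的 很 好"

# Separating the break tags from the syllable tokens lets the pauses be
# treated as explicit prosodic boundaries rather than as noisy segments,
# so the acoustic model is trained on cleaner, well-enunciated speech.
tokens = labeled.split()
syllables = [t for t in tokens if not t.startswith("B")]
breaks = [(i, t) for i, t in enumerate(tokens) if t.startswith("B")]
print(" ".join(syllables))   # 今天 天氣 真的 很 好
print(breaks)                # [(1, 'B2'), (3, 'B3')]
```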
### *Degree 3*
Two of the patients (speakers #008 and #013) of the 3rd degree used respirators in their daily lives. When recording, they still needed the respirators to inhale and produced speech by releasing air from the lungs without active muscle contraction. Although the respirators may generate noise, the noise usually did not co-occur with the speech. Speaker #008 could produce very intelligible speech with very little speech disorder (occasional nasalized syllables). Therefore, the constructed personalized TTS could generate synthesized speech rated as similar (MOS=4) or very similar (MOS=5) to the speaker's authentic speech. The speech of speaker #013 showed some speech disorders, such as vowel modification from close vowels toward mid vowels. The corresponding personalized TTS generated less natural synthesized speech, which was mostly rated as MOS=2 or MOS=3 by the patient and the caregiver.
The speech recorded by speaker #006 was the least intelligible among all the speakers of the 3rd degree. The constructed personalized TTS could only generate synthesized speech rated as MOS=1 or MOS=2 by the patient and the caregiver.
Speaker #015 did not need a respirator in daily life and could control his/her vocal tract well, but had less ability to control the vocal folds. Therefore, the speaker generally produced creaky voice with considerable jitter and shimmer. To make his/her synthesized speech more intelligible, the logF0 and unvoiced/voiced submodels in this speaker's acoustic model were replaced with those of another speaker of the same gender and a similar pitch range. The corresponding personalized TTS could then generate fairly natural synthesized speech, which was mostly rated as MOS=3 or MOS=4 by the patient and the caregiver.
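A rough sketch of this substitution is given below, assuming the acoustic model's stream-wise submodels (spectrum, logF0, voiced/unvoiced decision) are stored in a dictionary-like checkpoint; the key names and loading functions are hypothetical and depend on the actual implementation.

```python
import copy

def swap_prosody_submodels(target_model: dict, donor_model: dict) -> dict:
    """Return a copy of target_model whose pitch-related submodels
    come from donor_model, while the spectral submodels stay untouched."""
    patched = copy.deepcopy(target_model)
    for key in ("logf0", "uv"):          # pitch contour and voicing decision (hypothetical keys)
        patched[key] = copy.deepcopy(donor_model[key])
    return patched

# Usage sketch (checkpoints loaded elsewhere, e.g., via pickle or torch.load):
# patient_015 = load_acoustic_model("spk015.ckpt")          # hypothetical loader
# donor       = load_acoustic_model("donor_same_gender.ckpt")
# patched     = swap_prosody_submodels(patient_015, donor)
```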
## Willingness to Use and Satisfaction
It was found that most of the enrolled patients and caregivers rated both the willingness to use and the satisfaction with high MOSs. Although the project has provided a web-based GUI, the system has not yet been integrated with the eye-tracking mouse and on-screen accessibility keyboard for formal usability testing, and only two of the enrolled patients use the system in their daily lives. One user (speaker #016) uses a system constructed from a proxy speaker's voice, because she had already lost her intelligible voice before the project started; the proxy speaker is her daughter. Before using the constructed personalized TTS, speaker #016 had been using the TTS of Google Translate, operated with an eye-tracking mouse, as her speech-generating device. The other user (speaker #003) uses a system constructed from his own voice; he started to use the system as his ability to speak clearly gradually deteriorated during the project. Speaker #003 can no longer speak, but he had already practiced using the system sufficiently before losing his voice.
At the end of Project Save and Sound, most of the enrolled patients of the 1st or 2nd degree did not frequently use the constructed personalized TTS systems because they could still speak without significant speech disorder.
## Conclusion
By the end of the project, twenty personalized TTS systems had been built, and evaluation results were obtained for the personalized TTS systems of 17 patients. From this experience, we recommend that patients record their voices as soon as possible, before dysarthria becomes apparent, so that a better personalized speech model and a more acceptable system can be built. Even for patients who wear respirators or have poor vocal-fold control, as long as control of the oral cavity and tongue position remains normal, current techniques can still build personalized TTS systems that the patients and their families find acceptable. Finally, the practicality and actual usage of the speech system are related to whether the patient is accustomed to using an eye-tracking mouse.
## Future Works
- Server maintenance: migrate the system host from the university to a more secure external server, such as the Azure platform.
- Strengthen system security: encrypt stored content (see the sketch after this list).
- Improve synthesis quality: enhance the synthesized audio quality so that listeners hear clearer pronunciation.
- Improve the user interface: revise the interface so that patients can use it more easily (feedback from patients on their actual usage is welcome), and develop interfaces for tablets and other mobile devices.
- Complete the "VoiceBank" voice-banking platform:
    - Advantage: patients can record at home whenever it is convenient for them, without time pressure.
    - Possible drawback: recordings made at home may contain environmental background noise and room reverberation.
    - Add "message banking" so that frequently used sentences can be recorded in advance.
    - Allow patients to record any speech they wish, and provide verbatim transcripts to facilitate building the personalized speech synthesis system.
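As one possible realization of the content-encryption item above (an assumption, not the project's chosen design), recordings and transcripts could be encrypted symmetrically before being moved to the external server, for example with the `cryptography` package's Fernet recipe:

```python
# Sketch: encrypt a recorded waveform (or transcript) before storing it on
# the external server. Key management (who holds the key, how it is rotated)
# is out of scope here; the file names are hypothetical.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, kept in a key vault, not with the data
fernet = Fernet(key)

with open("patient_003_utt_0001.wav", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("patient_003_utt_0001.wav.enc", "wb") as f:
    f.write(ciphertext)

# Decryption on the TTS-training side:
with open("patient_003_utt_0001.wav.enc", "rb") as f:
    waveform_bytes = fernet.decrypt(f.read())
```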
# References
[1] Z. Kons, S. Shechtman, A. Sorin, C. Rabinovitz, and R. Hoory, “High Quality, Lightweight and Adaptable TTS Using LPCNet,” in Interspeech 2019, Sep. 2019, pp. 176–180. doi: 10.21437/Interspeech.2019-1705.
[2] S. Arik et al., “Deep voice 2: Multi-speaker neural text-to-speech,” arXiv preprint arXiv:1705.08947, 2017.
[3] M. Chen et al., “Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding.,” in INTERSPEECH, 2019, pp. 2105–2109.
[4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2018, pp. 5329–5333.
[6] W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in The Speaker and Language Recognition Workshop (Odyssey 2018), Jun. 2018, pp. 74–81. doi: 10.21437/Odyssey.2018-11.
[7] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4879–4883.
[8] Y. Jia et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in neural information processing systems, vol. 31, 2018.
[9] E. Cooper et al., “Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020, pp. 6184–6188. doi: 10.1109/ICASSP40776.2020.9054535.
[10] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” arXiv preprint arXiv:2009.09761, 2020.
[11] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” arXiv preprint arXiv:2009.00713, 2020.
[12] R. Yoneyama, Y.-C. Wu, and T. Toda, “Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN,” in Interspeech 2021, Aug. 2021, pp. 2187–2191. doi: 10.21437/Interspeech.2021-517.
[13] M.-J. Hwang, R. Yamamoto, E. Song, and J.-M. Kim, “High-fidelity parallel WaveGAN with multi-band harmonic-plus-noise model,” in Proc. Interspeech 2021, 2021, pp. 2227–2231. doi: 10.21437/Interspeech.2021-976.
[14] Y. Hono, S. Takaki, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6049–6053.
---
## Tables
---
## Figures