# TTS - Exam Questions
:::info
| Info | Description |
| ------: | ----------- |
| Required: | **31 P > 60 %** |
| 1.0: | **49 P** |
| Total marks in question superset: | **100 P** |
:::
## Marks
:::danger
**Should be about $15.\overline{6}$ marks per person to answer.**
**Should be about $31.\overline{3}$ marks per person to verify.**
:::
| Marks | Verification | Person |
| ----: | ------------ | ---------- |
| 15 | 30 | Insa |
| 14 | 31 | Christophe |
| 20 | 34 | Anna |
| 14 | 40 | Christine |
| 22 | 26 | Dana |
| 15 | 31 | Svetlana |
## Questions
| Question | Marks | Assigned to | Verified by |
| -------: | ----: | --------------- | -----------------------------|
| 1 | 6 | ~~Christine~~ | ~~Anna~~, ~~Insa~~ |
| 2 | 10 | ~~Dana~~ | ~~Anna~~ |
| 3 | 8 | ~~Anna~~ | ~~Insa~~, ~~Christophe~~ |
| 4 | 7 | ~~Svetlana~~ | ~~Christine~~, ~~Christophe~~|
| 5 | 15 | ~~Insa~~ | ~~Christine~~, ~~Svetlana~~ |
| 6 | 12 | ~~Anna~~ | ~~Christine~~, ~~Dana~~ |
| 7 | 4 | ~~Svetlana~~ | ~~Christophe~~, ~~Dana~~ |
| 8 | 6 | ~~Christophe~~ | ~~Svetlana~~, ~~Dana~~ |
| 9 | 4 | ~~Christophe~~ | ~~Insa~~, ~~Anna~~ |
| 10 | 6 | ~~Dana~~ | ~~Anna~~, ~~Svetlana~~ |
| 11 | 4 | ~~Christophe~~ | ~~Svetlana~~, ~~Dana~~ |
| 12 | 6 | ~~Dana~~ | ~~Christine~~, ~~Insa~~ |
| 13 | 8 | ~~Christine~~ | ~~Anna~~, ~~Christophe~~ |
| 14 | 4 | ~~Svetlana~~ | ~~Insa~~, ~~Christophe~~ |
## 1. Define formants from an acoustic point of view. Which are the three parameters that characterize a formant? How do formants emerge in the speech production process? *(6 P)*
Formants (resonances) are frequency regions in which certain harmonics of the source signal are amplified; in the spectrum they appear as peaks with higher amplitude.
The three parameters that characterize a formant are frequency, amplitude and bandwidth. The frequency of a formant tells at which frequency the maximal energy lies. The amplitude measures how "loud" the formant is at its maximum, that is, the height of the peak. The bandwidth describes the range of frequencies around the peak within which the level drops by less than 3 dB from the maximum, and therefore the width of the peak.
There is a connection between amplitude and bandwidth: a higher amplitude comes with a smaller bandwidth and vice versa, because the integral of the formant function is constant.
In the source-filter model of speech production, formants are created by applying a filter function to the source signal. In human speech production, formants are formed by the different shapes of the vocal tract.
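To make the three parameters concrete, here is a small toy sketch (all values are invented, not taken from the slides): it builds a synthetic magnitude spectrum with a single resonance and reads off the peak frequency, its amplitude, and the -3 dB bandwidth.

```python
import numpy as np

fs = 16000
freqs = np.linspace(0, fs / 2, 4000)

# Synthetic magnitude spectrum with one resonance near 500 Hz
# (Lorentzian-shaped peak; 80 Hz is the "true" -3 dB bandwidth).
f_c, bw_true = 500.0, 80.0
spectrum_db = 20 * np.log10(1.0 / np.sqrt((freqs - f_c) ** 2 + (bw_true / 2) ** 2))

peak = np.argmax(spectrum_db)
formant_frequency = freqs[peak]              # parameter 1: frequency of the peak
formant_amplitude = spectrum_db[peak]        # parameter 2: amplitude at the peak
within_3db = spectrum_db >= formant_amplitude - 3.0
formant_bandwidth = freqs[within_3db][-1] - freqs[within_3db][0]  # parameter 3: -3 dB width

print(formant_frequency, formant_amplitude, formant_bandwidth)    # ~500 Hz, peak level, ~80 Hz
```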
> concerning vowels: the frequency of the first formant (F1) depends on the opening height of the mouth, the second formant (F2) on the position of the tongue (formant_synthesis p. 13, 14)
> [name=Anna Welker]
-> slides Formant Synthesis, p.9 & 10, p.16 following:
- frequency, amplitude, bandwidth
-> Taylor:
- chapter 10.5.3 Filter characteristics (pdf 323)
- chapter 7.2.2 Acoustic characteristics (pdf 180)
## 2. Describe the architecture and the functional principle of a formant synthesizer, based on the Klatt synthesizer. Explain the relationship between a formant synthesizer and the source-filter model of speech production. How do formants emerge in the speech production process? *(10 P)*
The Klatt synthesizer is a formant synthesizer, so it builds on the source-filter model. The source is provided by an impulse generator and a random number generator; the filter (= filter function) is realized by cascade and parallel resonators. For more info on all the parts, see below.
Impulse Generator: Creates the signal which is sent through a first set of resonators and amplitude controls (= voicing source).
Cascade and Parallel Vocal Tract Transfer Functions: Create complex signals (in terms of the source-filter model, this is the filter).
> The Klatt synthesizer is a hybrid of parallel and cascade formant synthesizers, using the resonator systems of both synthesizer types. Most implementations you can find online use the cascade resonator system for creating vowels and the parallel resonator system for creating consonants.
> [name=Anna Welker]
- Cascade:
- Advantage: Generates formants of correct amplitudes -> + naturalness
- Disadvantage: When two formants are close the resultant values can differ from the input formant parameters (interaction between formants) -> lack of control
- Used for vowels (easier to get right amplitudes)
> creation of the formants: the source signal is passed through the resonators in series, so the output of the first resonator feeds the second one and so on (the transfer functions multiply)
> [name=Anna Welker]
- Parallel:
- Advantage: each formant is produced separately -> the amplitude (and frequency) of each formant can be controlled individually
- Disadvantage: amplitudes must be controlled carefully, and more transfer functions are needed to feed the results into the output (namely, one per formant computed)
- Used for nasals (one group of resonators for nasal cavity and the other group for oral cavity), fricatives (to model back and front oral cavity with two groups of resonators)
Random Number Generator: Creates stochastic signals mimicking frication.
> creation of the formants: all formants are created separately and are later combined into the output
> [name=Anna Welker]
Relationship to source-filter model:
Source: Impulse Generator + first set of resonators and amplitude controls (that is, this is used to create the $F_0$ with all its harmonics)
Filter: Cascade and Parallel Vocal Tract Transfer Functions (that is, this is used to manipulate the signal, attenuating certain frequencies while enhancing others -> creating the formant structure).
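A minimal sketch of the cascade vs. parallel structure, assuming a simple Klatt-style second-order resonator (the difference equation below is the standard Klatt resonator; the formant values and gains are illustrative only, not from the slides):

```python
import numpy as np

def resonator(x, f, bw, fs=16000):
    """Klatt resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2], parameterized by
    formant frequency f and bandwidth bw (both in Hz)."""
    r = np.exp(-np.pi * bw / fs)
    B = 2 * r * np.cos(2 * np.pi * f / fs)
    C = -r ** 2
    A = 1 - B - C                      # normalizes the gain at 0 Hz to 1
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = A * x[n] + B * y[n - 1] + C * y[n - 2]
    return y

fs = 16000
source = np.zeros(fs // 10)
source[::fs // 100] = 1.0              # 100 Hz impulse train as the voicing source

formants = [(500, 60), (1500, 90), (2500, 120)]   # illustrative (frequency, bandwidth) in Hz

# Cascade: resonators in series; relative formant amplitudes come out automatically.
cascade = source
for f, bw in formants:
    cascade = resonator(cascade, f, bw)

# Parallel: each resonator is fed separately; amplitudes must be set explicitly.
gains = [1.0, 0.5, 0.25]
parallel = sum(g * resonator(source, f, bw) for g, (f, bw) in zip(gains, formants))
```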
-> slides formant_synthesis, Klatt: p.19
-> Taylor, Paul: Text-to-Speech Synthesis, page 403 etc.
## 3. What is a diphone? What is the motivation for using the diphone as the basic synthesis unit rather than phones? If diphones are better than phones, why not use triphones or syllables? *(8 P)*
-> slides introduction, p.19: Praat example
(also occurring in the unit selection slides, explained using the same example)
- A diphone is a unit starting at the middle of one phone and ending at the middle of the next, so it actually consists of the second half of one phone and the first half of the following one. That way, what the diphone captures is the transition from one phone to another.
- The motivation for using diphones rather than phones is that when reordering and concatenating them to create new speech output, the outcome is much "smoother" with diphones. That is because the overall variation in the middle of a phone like [a] is much smaller than, e.g., the variation at the border between [a] and [n] in [an] versus the variation at the border between [a] and [k] in [ak].
- sparse data...
- The justification for using diphones follows directly from the target-transition model, where we have a stable ‘target’ region (the phone middle) which then has a transition period to the middle of the next phone. Diphone boundaries are at the stable target region, so adjacent units should have similar vocal tract shapes at these points and therefore join together well. Concatenating units at phone boundaries, by contrast, is much less likely to be successful, as this is where the variance in the vocal tract shape is the greatest.
---> diphones? [Ger: 2,025]
---> triphones? [Ger: 91,125]
---> syllables? [Ger: 12,500+]
- *bullet points in text*
- In German there are roughly 45 phones. To find the number of diphones we square this number, which gives roughly 2,025 diphones in German. To use triphones we would cube the number of phones: $45^3$ is approximately 91,000 triphones for German. To record every triphone for a concatenative synthesis system for German would be expensive and difficult for a small gain in accuracy. Although the situation is not as severe with syllables (12,500+) as with triphones (91,000+), it would still be much more difficult than using diphones (see the quick calculation below).
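Quick sanity check of the counts, assuming roughly 45 German phones as stated above:

```python
# Back-of-the-envelope unit counts for German.
phones = 45
print(phones ** 2)   # 2,025 possible diphones
print(phones ** 3)   # 91,125 possible triphones
```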
> the question has 8 marks, is this really enough? Although I don't know what to add...
> [name=Christophe Biwer] [color=#2b5556]
> My thoughts exactly (only for the last part) but I'm not sure what to add here either, as it isn't specified any further on the slides.[name=Insa]
## 4. What was the motivation for using diphones, rather than phones, as the basic building blocks in early concatenative synthesis systems? Which procedures can be used to ensure that the concatenation between any two diphones is maximally smooth or, in other words, that the discontinuities caused by concatenation are minimized? *(7 P)*
Natural speech is subject to coarticulation effects, where the formants of a phone are affected by surrounding phones. Using diphones allows us to include the left/right context of the phone to better model coarticulation effects.
One technique to reduce discontinuities is to cut at the location of minimal spectral change, usually in the middle of a phone, since variation in the middle of a phone is smaller than at the boundary between two phones. Another is to pick a diphone with average pitch and duration so that pitch/duration have to be modified less.
> Finally, you can store multiple examples for a diphone and select the best for your context
During concatenation, you can **evaluate spectral discrepancies** at the concatenation point using the DMAX measure, the maximum bandwidth-normalized formant discrepancy, which is compared against an acceptance threshold:
$D_{max} = \max_{i \in \{1,2,3\}} \frac{|T_i - F_i|}{B_i}$
$T_i$ = target formant values
$F_i$ = actual formant values
$B_i$ = formant bandwidths (used for normalization)
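A minimal sketch of this check in code (the formant and bandwidth values below are invented for illustration; the acceptance threshold is application-dependent):

```python
def dmax(targets, actuals, bandwidths):
    """Largest bandwidth-normalized formant deviation over F1-F3."""
    return max(abs(t - f) / b for t, f, b in zip(targets, actuals, bandwidths))

T = [500.0, 1500.0, 2500.0]   # target formant frequencies (Hz)
F = [520.0, 1450.0, 2550.0]   # formants measured at the join (Hz)
B = [60.0, 90.0, 120.0]       # formant bandwidths (Hz), used for normalization

print(dmax(T, F, B))          # reject the join if this exceeds the accepted threshold
```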
## 5. Describe the speech synthesis method known as unit selection. What are the differences between unit selection and previous concatenative synthesis methods (e.g. diphone synthesis)? How does unit selection work and what types of acoustic units are available? Discuss problems and solutions pertaining to the design of acoustic unit inventories for unit selection synthesis (e.g. corpus design, domain coverage). *(15 P)*
--> slides unitselection_synthesis (13.12.16)
Unit selection synthesis seeks to avoid audible glitches by synthesizing larger units as a whole whenever possible. When concatenation does have to occur and the parts do not fit together nicely, you can still often hear glitches or other artifacts. Previous concatenative methods always had to use units of a fixed length (e.g. always diphones), whereas unit selection synthesis can use a larger unit if it is available in the database.
The unit inventory in unit selection synthesis is a large database of recorded speech, segmented at different levels including:
- phrases
- words
- morphemes
- syllables
- diphones
- phones
> For each part, there are several candidates which differ e.g. in prosody or length
During runtime, we select the longest possible units which cover the target phone sequence. We do this by minimizing two cost functions: the target cost, which represents the similarity of a candidate unit to the target, and the concatenation (join) cost, which represents how well the unit concatenates with its neighbouring units.
In practice, these cost functions are attached to the nodes and transitions of a graph of candidate units for the wanted sequence. The goal is to select the cheapest path through this graph from start to end (see the sketch after the example below):
example:
```!
(START) (I) (have) (time) (on) (Monday) (END)
(START) (I) ( ) (time) (on) ( ) (END)
(START) (I) ( ) (time) (on) (Monday) (END)
(START) ( ) (have) ( ) (on) (Monday) (END)
Every "layer" ("I", "have", ...) has a transition to every instance of the next layer (sorry I didn't know how to do that here). So, all "I"'s would connect to every "have" and every "have" to every "time" and so on.
```
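Here is a minimal path-search sketch of the selection step. The `target_cost` and `join_cost` functions and the candidate features (`pitch`, `dur`, corpus adjacency via `next`) are simplified stand-ins for the real cost functions, not the implementation of any particular system:

```python
def target_cost(target, unit):
    """How well a candidate matches the target specification (0 = perfect match)."""
    return abs(target["pitch"] - unit["pitch"]) + abs(target["dur"] - unit["dur"])

def join_cost(left, right):
    """How smoothly two candidates concatenate (0 if they were adjacent in the corpus)."""
    return 0.0 if left.get("next") == right.get("id") else abs(left["pitch"] - right["pitch"])

def select(targets, candidates):
    """Viterbi search for the cheapest candidate sequence covering the targets."""
    # best[i][j] = (cost of the best path ending in candidate j of layer i, backpointer)
    best = [{j: (target_cost(targets[0], u), None) for j, u in enumerate(candidates[0])}]
    for i in range(1, len(targets)):
        layer = {}
        for j, u in enumerate(candidates[i]):
            prev_j, cost = min(
                ((pj, best[i - 1][pj][0] + join_cost(candidates[i - 1][pj], u))
                 for pj in best[i - 1]),
                key=lambda pc: pc[1])
            layer[j] = (cost + target_cost(targets[i], u), prev_j)
        best.append(layer)
    j = min(best[-1], key=lambda k: best[-1][k][0])    # cheapest final candidate
    path = []
    for i in range(len(targets) - 1, -1, -1):          # backtrace
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: two targets, two candidates each.
targets = [{"pitch": 120, "dur": 80}, {"pitch": 110, "dur": 90}]
candidates = [
    [{"id": "a1", "pitch": 118, "dur": 82, "next": "b7"}, {"id": "a2", "pitch": 140, "dur": 60}],
    [{"id": "b7", "pitch": 112, "dur": 88}, {"id": "b9", "pitch": 95, "dur": 120}],
]
print([u["id"] for u in select(targets, candidates)])  # -> ['a1', 'b7']
```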
The problem in designing a corpus for unit selection synthesis is that you need a lot of data to have larger-level units (e.g. words) in context.
> Additionally, a complete coverage of diphones is essential.
This means that, first of all, your corpus will be very large (and possibly expensive to record). Second, your application will perform better when the domain of the content matches the domain of the corpus that was used; otherwise it will have to back off to smaller units (e.g. diphones) to synthesize out-of-domain content, which can lead to lower-quality synthesis. Another problem of these large corpora is annotation: a large corpus does not do any good if you don't really know what is in it. Therefore a small, (near-)perfectly annotated corpus can result in better quality than one that is large but not (correctly) annotated.
Finally, to overcome most of the problems in unit selection synthesis it is important to be consistent in the quality of your speech samples and the speech style.
## 6. Big question *(12 P)*
### a) Describe statistical parametric speech synthesis and emphasize the different problems which have to be solved. Indicate how HMM-based TTS solves these problems.
-> http://www.cstr.ed.ac.uk/downloads/publications/2010/king_hmm_tutorial.pdf :
The model is parametric because it describes the speech using parameters, rather than stored exemplars.
It is statistical because it describes those parameters using statistics (e.g., means and variances of probability density functions) which capture the distribution of parameter values found in the training data.
-> slides Le Maguer 1-10
Like most speech synthesis approaches, statistical parametric speech synthesis consists of two main stages, connected by the database that stores what was learned in the first stage.
- **Offline stage**: we compute coefficients and features based on the acoustic signals and corresponding text snippets in the corpus
  - signal processing: compute the acoustic coefficients found in the signal
  - natural language processing: compute descriptive features for the text snippet
>coefficients may be static (context, prosody etc.) or dynamic (changes of values over time)
- the mapping of descriptive features to acoustic coefficients is stored in the database
>we estimate the acoustic model as:
>$\lambda' = \arg\max_{\lambda} P(y \mid x, \lambda)$
>where $\lambda'$ is the final acoustic model (= the mapping of descriptive features to acoustic ones),
>$x$ = descriptive/linguistic features,
>$y$ = extracted acoustic features,
>$\lambda$ = initial (untrained) model
- **Online stage**: In order to convert the text input into speech output, compute descriptive features from it (the natural language processor used for this does not have to be the same as that used in the offline stage, but it should return the same results; otherwise the mapping of text to acoustic coefficients will differ from what was learned offline). Based on the descriptive features, use the model trained offline to generate speech; afterwards, do some post-processing to "smooth" the result.
>the output acoustic features are generated as $y' = \arg\max_{y} P(y \mid x, \lambda')$,
>i.e. the most probable acoustic features given the extracted linguistic features and the trained model
>All feature extraction is done in cepstrum space, i.e. on a log scale (the Mel-frequency scale)
-> slides Le Maguer "chapter" 4
**Acoustic modeling** (Gaussians, Hidden Semi-Markov Models)
A big problem is how to do the acoustic modeling. HMM-based TTS uses a Multi-Space Distribution to model the $F_0$: a Gaussian (mixture) distribution for voiced frames and a simpler δ-type distribution for unvoiced frames (because unvoiced sounds have no $F_0$; their excitation is just noise). Building on this $F_0$ baseline, an HSMM (Hidden Semi-Markov Model) is then used to compute all the other features needed to build the speech.
**Heterogeneous data** (Multi-Space Distribution)
Natural speech has voiced as well as unvoiced sounds. The important thing to know is that voiced sounds have an underlying harmonic structure (the $F_0$), while unvoiced sounds only have noise as their base. To cover this aspect, HMM-based TTS uses a Multi-Space Distribution with different component distributions for the voiced and unvoiced parts.
**Sparseness** (decision trees, multi-stage training process)
As with probably all statistically trained systems, there is the problem of sparse data: it is always possible that not all combinations of speech sounds (contexts) have been encountered in the training phase. HMM-based TTS solves this by clustering contexts with decision trees, so that the best acoustic coefficients can still be predicted for unseen combinations of descriptive features.
### b) Compare unit-selection and statistical parametric speech synthesis: what are the differences and what are respective advantages and drawbacks of these two methods?
- **Unit selection**:
- stores the training corpus as it was in the database and during online computation "select the smallest number of the longest units covering the target phone sequence" to put them together in the output ("The best solution to the synthesizer problem is to avoid it.")
- [+] keeps the number of concatenations (and the amount of signal processing) as low as possible, which reduces the perceptual impression of unnaturalness caused by concatenation and signal processing
- [+] sounds pretty natural because the sounds already are natural (natural waveforms are preserved)
- [+] reduces the need for signal processing
- [+-] the quality of the training corpus has a big influence on the quality of the resulting voice
- [-] needs a big corpus to get enough examples to work well; the bigger the corpus that has to be searched online, the more computation power is needed to not delay the output too much
- [-] there is noise (glitches, other artifacts) that can be heard at each concatenation point; if the units are not big enough, the voice is pretty hard to listen to
- [-] inflexible w.r.t. speaking style and speaker voice
- **statistical parametric speech synthesis**:
- only stores the models trained on the descriptive features combined with the acoustic coefficients of the corpus in the database
- [+] as there is no concatenation, there are no glitches in between the units; the output is more like one smooth stream
- [-] there are no natural waveforms, only those computed by the model - which often leads people to perceive the resulting voice as more robot-like, because the prosody is flatter and less "vivid" than in natural recordings
### c) What is the most important equation in parametrical speech synthesis? Explain this equation and why it is so fundamental.
-> slides Le Maguer:
Observations = Windowing Matrix * Coefficients (O = W * C)
-> following https://pdfs.semanticscholar.org/ad8c/23e268e97c0e513af57a72198eb878de5627.pdf here:
- O = the speech parameter (observation) vector sequence, where each element $o_t$ ($t$ being a point in time) consists of a static feature vector and two dynamic (delta and delta-delta) feature vectors and describes exactly which features the speech output should have at that point in its timeline
- W = a constant window matrix that expresses the dynamic features as weighted sums of neighbouring static feature vectors
- C = the sequence of static coefficients that is actually generated at synthesis time (the unknowns that are solved for under the constraint $O = WC$)
This equation is fundamental because it is the basis of the method used to "smooth" the resulting speech. It is the means of simulating the dynamics that natural speech has, making sure that the speech signal is continuous and does not just consist of concatenated sounds with jumps at the points in between them.
This equation is what makes the results of the parametrical speech synthesis sound nice :).
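A small sketch of what $O = WC$ looks like for a one-dimensional static stream. The delta and delta-delta window weights below are common textbook choices, not taken from the slides, and the means and precisions are toy values; the last lines show how a smoothed static trajectory $C$ is obtained from the system implied by $O = WC$:

```python
import numpy as np

def build_window_matrix(T):
    """Stack static, delta and delta-delta rows for T frames into W (shape 3T x T)."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                  # static: o_static[t] = c[t]
        for tau, w in ((-1, -0.5), (1, 0.5)):              # delta:  0.5*(c[t+1] - c[t-1])
            if 0 <= t + tau < T:
                W[3 * t + 1, t + tau] = w
        for tau, w in ((-1, 1.0), (0, -2.0), (1, 1.0)):    # delta-delta: c[t-1] - 2c[t] + c[t+1]
            if 0 <= t + tau < T:
                W[3 * t + 2, t + tau] = w
    return W

# Given the means mu and (diagonal) precisions P of the observations O predicted by
# the model, the smoothed static trajectory C solves (W^T P W) C = W^T P mu.
T = 5
W = build_window_matrix(T)
mu = np.random.randn(3 * T)     # toy means for the static/delta/delta-delta streams
P = np.eye(3 * T)               # toy precisions (identity)
C = np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)
print(C)
```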
## 7. What is concept-to-speech (CTS) synthesis? How does it differ from text-to-speech synthesis? What are the requirements for making CTS work? Which properties of synthesized speech would benefit most from CTS (as opposed to TTS)? *(4 P)*
Concept-to-speech synthesis (from Taylor 2000) uses Natural Language Generation (NLG) to generate text in response to a query.
CTS uses the Phonological Structure Matching (PSM) algorithm, which works similarly to unit selection. Units can be of varying length, from words^1^ down to phonemes. Larger nodes such as words are preferred to smaller nodes such as phones. The goal is to minimize the target and concatenation cost functions. The difference between PSM and unit selection is that PSM selects on phonological criteria while standard unit selection selects on phonetic and acoustic criteria. (In practice, I think this means that in PSM the smallest unit that can be selected is the phoneme rather than the diphone.) This means that the synthesizer doesn't have to worry about the exact phonetic representation of the sequence; it just has to search for units that are described by a particular phonological representation.
> ^1^ AFAIK even sentences may be used as units.
> [name=Christophe Biwer]
It differs from TTS in that there is no raw text input that needs to be synthesized, and therefore there is no text analysis.
Since the NLG system creates the text that will be synthesized, it can directly provide the speech synthesizer with the information it needs. This means that the problems of prosody, part of speech, stress, and ambiguity are all alleviated.
> Section 12.1.2 in `moebius_habil_aims.pdf` could also be very interesting:
> > ### Definition:
> > Concept-to-speech (CTS) synthesis enables the generation of synthetic speech from pragmatic, semantic and discourse knowledge. The idea is that a CTS system **“knows” what it intends to say, and it even knows how best to render it**.
> > ### Requirements:
> > It knows, because it generates, the complete linguistic representation of the sentence: the deep underlying structure is known; the intended interpretation may be available; and its corresponding syntactic structure is known.
> > **CTS synthesis requires knowledge and models from many linguistically relevant research areas, such as pragmatics, semantics, syntax, morphology, phonology, phonetics, and speech acoustics.** (It thus integrates several disciplines, such as computational linguistics, artificial intelligence, cognition research, signal processing.)
> > ### Benefits:
> > CTS research so far has **mainly focussed on improving the symbolic and acoustic representations of prosody**. Whence this focus on prosody? Written text in most languages contains only an imperfect and impoverished representation of prosodic features; therefore, **prosodic modeling is almost necessarily one of the weakest components in TTS systems**. Furthermore, phrasing and accenting are surface reflections of the underlying semantic and syntactic structure of a sentence.
> [name=Christophe Biwer]
>Message-to-speech
>This is a component in a larger natural language system, where the input to the synthesizer comes from an NLG system (it is thus based on the system's internal representations)
>As the input comes from an internal source, it knows all of the context information (POS and structure of the utterance). This avoids all errors a text analysis system might make during preprocessing.
>Advantage: no text analysis needed (as there is no ambiguity)
>Disadvantage: internal representations differ from system to system --> no compatibility
>Meaning-to-speech
>Like message-to-speech, but based on semantic representations
>Easy to create prosody with it as meanings are clear
>Also has the problem of no standardised representations at hand
>Requirements:
>rigid and systematic internal representations
>Who benefits: Prosody due to lack of ambiguity
## 8. Explain, and illustrate with at least three examples, why a text normalization step that precedes all other linguistic text analysis components is doomed to fail in languages such as German or English. *(6 P)*
:::danger
:fire: not to be confused with word normalization (stemming)!
:::
### General information about text normalization
Text normalization is **part of the linguistic text analysis** and not part of preprocessing. It aims to expand alphanumerical expressions, abbreviations, acronyms and other textual and orthographic phenomena correctly. In order to do so, we need
- reliable identification of types of phenomena
- good coverage dictionaries (abbreviations, acronyms)
- local grammars (alphanumerical expressions)
- context analysis for
- disambiguation of type
- morphosyntactic agreement of expansion
> *from the lecture: tts_components.pptx, p.10*
> [name=Christophe Biwer] [time=Feb 18, 2017 15:16] [color=#2b5556]
### Explain...
1. The period is ambiguous in German and English, because it delimits sentences but also marks abbreviations and occurs in numeral expressions.
>ex. "Prof. Carlson" vs. "I spoke to the Prof."
2. A string surrounded by white space (or punctuation symbols) does not always constitute a word.
3. German allows extensive compounding (e.g. *Donaudampfschifffahrt*) and complex expressions with letters, digits and other symbols (e.g. *42%*)
4. Abbreviations and acronyms have to be expanded into regular words (e.g. *kg*) or spelled letter by letter (e.g. *USA*).
Performing these text normalization tasks in pre-processing steps, as is done in conventional systems for German, [...], often leads to incorrect analyses because sufficient context information is not available at the time the expansion is performed. Such context information may comprise lexical and morphological analysis of surrounding words in the text, part-of-speech assignment, or syntactic parse trees.
> *from the pdf: moebius_habil_aims.pdf, 19ff. (3.1 Generalized text analysis)*
> [name=Christophe Biwer] [time=Feb 18, 2017 15:16] [color=#2b5556]
### Illustrate...
1. "Die Konferenz soll am 22.9.1997 beginnen."
(*The conference is supposed to begin on 9-22-1997.*)
- The numerical expression has to be recognized as a date:
- *1997* should be expanded to *neunzehnhundert* (nineteen hundred) and not as *eintausend neunhundert* (one thousand nine hundred)
- the numerals representing the day and the month have to be interpreted as ordinals. Unfortunately, a conventional pre-processor would expand the ordinal numbers to their default forms (most likely the nominative singular masculine). The correct form (dative singular masculine) can only be found if a grammatical constraint or language model is applied that enforces number, case, and gender agreement between the numeral expression and the preceding preposition *am*, and rules out all non-agreeing alternatives.
:::danger
Just for your information: in the slides, "Ludwig XIV" is written without period in the German version. My explanation is based on this. Afterwards I realized that it should be written with a period.
:::
2. "Ludwig XIV. ist auch dabei."
(*Ludwig XIV is also participating / present.*)
- The string *XIV* could either be pronounced as [ksɪf] or spelled letter by letter (as in *USA*).
- The correct expansion would be *der Vierzehnte* (the fourteenth). There must be context information about *Ludwig* being a king, so that the string in capital letters following his name is recognized as his title. Another solution could be to have a dictionary of Roman numerals, but there would still be exceptions to take care of, e.g. **CC** (*200* or *Carbon Copy* (as used in e-mails)) or **I** (*1* or the pronoun *I*).
3. "Die Fahrt kostet €8.10."
(*The ride costs €8.10.*)
- As in the first example, the numerical expression has to be recognized correctly. The preceding euro sign makes it clear that *8.10* is not a date but a price. Similar to time indications (*8:10 Uhr*), a price is read as *[number before the period] Euro [number after the period]*.
- Besides, a system has to recognize that the sentence does not end after the first period. This could be done by checking for a white space after the period; unfortunately, this technique could lead to problems in texts with typing errors or texts that are not orthographically consistent.
> *from the lecture: tts_components.pptx, p.9 and by myself*
> [name=Christophe Biwer] [time=Feb 18, 2017 15:16] [color=#2b5556]
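As a small illustration of a "local grammar" for alphanumerical expressions, here is a hypothetical sketch that merely recognizes German date patterns. Note that recognizing the pattern does not yet yield the correctly inflected expansion, which is exactly why later context analysis is still needed:

```python
import re

# Hypothetical local grammar for German date expressions such as "22.9.1997".
DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

def tag_dates(text):
    """Mark date expressions so a later sentence splitter does not break at their periods."""
    return DATE.sub(lambda m: f"<date d={m.group(1)} m={m.group(2)} y={m.group(3)}>", text)

print(tag_dates("Die Konferenz soll am 22.9.1997 beginnen."))
# -> Die Konferenz soll am <date d=22 m=9 y=1997> beginnen.
```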
## 9. Is morphological analysis a necessary component of TTS for English? Which morphological phenomena or processes occur in the internal structure of English words that require analysis (rather than a full-form lexicon)? Use concrete examples to show that the morphological structure of words influences the pronunciation of these words. *(4 P)*
Uncertain. The following phenomena or processes occur:
- Inflection:
- Plural: `s`
- Singular genitive: `'s`
- => is easily distinguishable
- Derivation and Compounding
- Unlimited vocabulary
- e.g. names
- productive compounding (with whitespaces, not glued together as in German)
- => unknown words need to be morphologically analysed
- morphological processes like adding prefixes and suffixes to a word can be iterated and thereby create ever new words whose pronunciation is not in the lexicon:
```
establish
establish-ment
dis-establish-ment
dis-establish-ment-arian
dis-establish-ment-arian-ist
dis-establish-ment-arian-ist-s
anti-dis-establish-ment-arian-ist-s
pseudo-anti-dis-establish-ment-arian-ist-s
```
> (example taken from slide set 2 from "Einführung in die Allgemeine Sprachwissenschaft)
A concrete example for the influence on the pronunciation of a word based on its morphological structure would be **"import"**. As a verb the stress lies on the second syllable "port" whereas as a noun the stress is on the first syllable "im".
Another example is the word **"read"**. As an infinitive it's pronounced as [ri:d] and in the past-tense [rɛd].
Lastly, with **"polish"** we have a case of different pronunciations where there would also be a difference in the morphological derivation tree. On the one hand we have the verb "to polish" (-> make something shiny) and (to my knowledge) it can't be dissected any further than this. On the other hand we have "Polish" (an adjective for someone or something from Poland), here we can dissect the word a bit further as its root is "Pol-" and "-ish" is a suffix that indicates an adjective.
Therefore, for correct text-to-speech synthesis the morphological structure of the word is needed - and not only for unknown words.
> examples by [name=Insa] [time=Feb 19, 2017 21:58]
>
> *from the lecture: textanalysis.pdf, p.30ff and myself*
> [name=Christophe Biwer] [time=Feb 18, 2017 17:25] [color=#2b5556]
## 10. Languages differ with respect to the complexity of the relation between orthography and pronunciation. On a scale describing a continuum from “almost phonemic” (i.e., one-to-one relation between spelling and pronunciation) to extremely complex, where would you rank the relation between spelling and pronunciation of your native language (please specify what your native language is)? How difficult would you say is the task of writing a set of pronunciation rules for your language? Feel free to use concrete examples. Note: There is no “correct or wrong” answer here, but do motivate your assessment. *(6 P)*
On a scale from Finnish to Japanese, where Finnish is the most consistent and Japanese is a complete disaster...
```
|--------------------------------------------------------------------|
Finnish German Swedish English Japanese
Spanish Luxembourgish Danish
Chinese French
```
German: less consistent orthography-pronunciation correlation than Spanish or Finnish, but better than English, Danish, French etc.
Some reasons why it is not perfect:
- dealing with `<r>`, which often becomes `[ɐ]` (SAMPA `[6]`); use a simple rewrite rule `/r/ --> [ɐ]/V__[#/C]`
- diphthongs like `<äu>`, `<eu>`, `<ei>` (but those are always pronounced `[ɔʏ]` `[ɔʏ]` `[aɪ]`)
- morphological factors such as `rauch.en [raʊxn]` vs `Frau.chen [fraʊçn]`
- final devoicing: `[+plosive][+voice]-->[-voice]/__#`
- how to deal with foreign words? Garage vs. Trage... dictionary?
- etc.
--> There are inconsistencies within the orthography of German (regarding German words only), but they can be dealt with using simple rewrite rules (which may not be the case for a language like French, for example!). These rules should take context into account, including morpheme boundaries (therefore, morphological analysis is necessary). The only remaining problem (as in most languages): how to deal with foreign words, names, etc.?
> Japanese pronunciation is very hard to predict from the characters. Each character has multiple readings depending on the word that it is part of, and there is very little phonetic information encoded in most of the writing system. Personally I would say English and French are equally bad, but in different ways. English has a lot of borrowings, mostly from French/Latin but from other languages as well, and English didn't bother to change the spelling of those borrowings even when it didn't really match how the words would normally sound if read aloud. English also has a lot of archaic spellings that used to be pronounced very differently but are still spelled the way they were pronounced before the Great Vowel Shift and other historic phonological changes, e.g. through/though/tough. French, on the other hand, is a little more consistent in how it represents individual phonemes, but French exhibits a lot of phonological changes at word boundaries in natural speech, e.g. liaison/enchaînement, which are hard to predict and change based on dialect.
>[name= Svetlana]
## 11. Describe and explain the formalism of rewrite rules. For a concrete example, use rewrite rules to account for the regular plural formation of English nouns. *(4 P)*
:::info
Syntax: `A → B / L __ R ;`
The input string **A** is re-written as (substituted by) the string **B**,
given a left context string **L** and a right context string **R**.
:::
- `→` substitution symbol
- `/` introduces the context in which the rule applies
- `__` is the slot where the substitution takes place
- `;` terminates the rule
- `L` and `R` are optional; e.g.:
`A -> B ;` is a **context-free** rule (i.e. it always applies); otherwise a rule is **context-sensitive**
> *from the lecture: textanalysis.pdf, p.25*
> [name=Christophe Biwer] [time=Feb 19, 2017 12:43] [color=#2b5556]
### Supplementary explanations:
- deterministic rule approach
- define a general rule format, a rule processor or engine and a clearly defined set of rules
- engine / rule separation
- the way the rules are processed and the rules themselves are kept separate, which helps with system maintenance, optimisation, modularity and expansion.
> *from the pdf: ttsbook_draft_2.pdf, p.103*
> [name=Christophe Biwer] [time=Feb 19, 2017 13:12] [color=#2b5556]
### Concrete example: Regular plural formation of English nouns.
`[Eps] → s / [noun]__[#] ;`
this means that we insert an `s` (replace epsilon with `s`) after a string whose category is `[noun]` and directly before the word boundary (so that we won't have problems with compound words like *water-bottle*)
:::danger
this is not clear at all; Möbius never explains in his slides what exactly he means by category labels like `[verb]`. I guess it means that the preceding string has the given category. Similar examples may be found in `moebius_habil_aims.pdf, p.39, e.g. allomorphy rule`
:::
> *by myself*
> [name=Christophe Biwer] [time=Feb 19, 2017 13:47] [color=#2b5556]
> Plural nouns:
> `[+PL] → [s] / C[-voiced][-sibilant]__;` ([s] after unvoiced nonsibilant consonants)
> `[+PL] → [ɪz] / [+sibilant]__;` ([ɪz] after sibilants)
> `[+PL] → [z] / ELSEWHERE;` ([z] for rest)
> [name=Unknown Person]
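A minimal sketch of a rule engine applying these three plural rules; the phoneme symbols and the sibilant/voiceless sets below are SAMPA-like assumptions for illustration, not taken from the slides:

```python
SIBILANTS = {"s", "z", "S", "Z", "tS", "dZ"}
VOICELESS = {"p", "t", "k", "f", "T", "s", "S", "tS", "h"}

def pluralize(phones):
    """Append the regular English plural allomorph to a phoneme sequence."""
    last = phones[-1]
    if last in SIBILANTS:
        return phones + ["I", "z"]   # [Iz] after sibilants
    if last in VOICELESS:
        return phones + ["s"]        # [s] after voiceless non-sibilant consonants
    return phones + ["z"]            # [z] elsewhere

print(pluralize(["k", "{", "t"]))    # cat -> ...[s]
print(pluralize(["d", "O", "g"]))    # dog -> ...[z]
print(pluralize(["b", "V", "s"]))    # bus -> ...[Iz]
```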
## 12. What is the task of the duration prediction module in TTS? What is the target unit whose duration is predicted? Why is duration prediction difficult, and which solutions are available to overcome these difficulties? *(6 P)*
Task: Predicts duration of speech sound as precisely as possible, based on factors affecting duration.
Unit:
- In Unit Selection: Phone duration is manipulated based on target specification (target cost)
- In HMM Synthesis: Probability of staying in the same state
- There are also some syllable oriented models (which are not widely used though)
Difficulty:
- Extremely context dependent durations
- ex. [E]=35ms in "jetzt" vs. 252ms in "Herren"
- Factors which affect durations are many: accent status of word, syllable stress, position in utterance, segmental context... --> large feature space
Solutions:
- Rule based systems ("expert systems" with hand written rules --> impractical!)
- Sum-of-products model (see the sketch below this list)
- $DUR(f) = \sum_i \prod_j S_{i,j}(f_j)$
- where the $S_{i,j}$ are parameters (factor scales) and $f$ is the feature vector (see slide 29 on http://www.coli.uni-saarland.de/courses/sprachsynthese/2016_WS/slides/duration_intonation.pdf for a concrete example)
- defines factors which affect duration. The parameters are estimated using a greedy algorithm, which goes through the annotated corpus, selecting the smallest subset of features that accounts for the same number of events (i.e. has the same coverage as a larger model)
- > needs annotated corpora and expert knowledge to find features to begin with
- current best practice (according to slide 27)
- yields: phonetic decision tree
- General machine learning on very large annotated corpora --> problem: Sparsity of Data (Zipf's law)
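A toy sketch of a sum-of-products duration model; all factor scales below are invented to show the mechanics, not real parameters:

```python
# Two hypothetical product terms over the features (stress, position); values are invented.
S = [
    {"stress": {"stressed": 80.0, "unstressed": 50.0},
     "position": {"final": 1.4, "non_final": 1.0}},
    {"stress": {"stressed": 20.0, "unstressed": 10.0},
     "position": {"final": 1.0, "non_final": 1.0}},
]

def duration(features):
    """DUR(f) = sum over terms of the product of that term's factor scales."""
    total = 0.0
    for term in S:
        prod = 1.0
        for factor, table in term.items():
            prod *= table[features[factor]]
        total += prod
    return total

print(duration({"stress": "stressed", "position": "final"}))        # 80*1.4 + 20*1.0 = 132.0
print(duration({"stress": "unstressed", "position": "non_final"}))  # 50*1.0 + 10*1.0 = 60.0
```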
## 13. What is the task of the intonation prediction module in TTS? Why is intonation modeling and F0 prediction difficult? Characterize the two major types of intonation models applied in TTS (Pierrehumbert’s model / ToBI; Fujisaki’s model). *(8 P)*
The intonation prediction module is responsible for computing the fundamental frequency $F_0$. This frequency is influenced by many factors: Where are stresses in the sentence and on syllables? Are there lexical tones (e.g. in tone or pitch accent languages)? Is the utterance a question? Where are prosodic phrases?
There are several difficulties in modeling intonation: To sound natural, the fundamental frequency should be continuous. The input text typically lacks explicit information about intonation. Also, the fundamental frequency is influenced by so many factors that purely acoustic models cannot determine what the actual cause of an $F_0$ variation was.
### Phonological tone-sequence models (Pierrehumbert)
Pierrehumbert's model is probably the most influential tone-sequence model and is based on the theory of autosegmental-metrical phonology. Tone-sequence models represent prosodic structure as a sequence of distinct tones that don't interact with each other. Here, there are two different tones: high (`H`) and low (`L`), which can also be combined into bitonal events such as "`H+L`". Besides the tones themselves, there are three types of tonal events:
- Boundary tones (`H%`, `L%`) at the edges of an intonational phrase,
- Phrase tones (`H-`, `L-`) control the pitch movement between a pitch accent and a boundary tone,
- Pitch accents (`H*`, `H*L`, `L*H`) are assigned to prosodic words and mark accented syllables.
A (regular) grammar describes how these tones can be combined.
ToBI (Tones and Break Indices) is a transcription system based on the tone-sequence model theory. It is language dependent. While it was developed and works best for American English, it has been adapted to other languages.
> *What I would add:*
> It is implemented in many TTS systems.
> Abstract tonal representation converted to $F_0$ contours by means of phonetic realization rules.
> [name=Christophe Biwer]
### Acoustic-phonetic superposition models (Fujisaki)
Fujisaki's model creates the $F_0$ by overlaying several components. It is a superpositional model. Components that are combined are a basic $F_0$ value ($F_{min}$), a phrase component, and an accent component.
> How the $F_0$ is manipulated for phrases: "Each phrase is initiated with an impulse, which when passed through the (phrase component) filter, makes the $F_0$ contour rise to a local maximum value and then slowly decay"
> [name=Anna Welker]
The model aims to simulate how humans physiologically produce the fundamental frequency.
Fujisaki's model has been applied to many (different) languages.
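A minimal sketch of the Fujisaki-style superposition of a baseline, phrase commands and accent commands in the log-$F_0$ domain; all amplitudes, timings and time constants below are invented illustrative values:

```python
import numpy as np

def phrase_component(t, t0, Ap, alpha=2.0):
    """Phrase control filter response: Ap * alpha^2 * (t - t0) * exp(-alpha * (t - t0))."""
    d = np.maximum(t - t0, 0.0)
    return Ap * alpha ** 2 * d * np.exp(-alpha * d)

def accent_component(t, t1, t2, Aa, beta=20.0, gamma=0.9):
    """Accent control filter: difference of two step responses (onset t1, offset t2)."""
    def step(d):
        d = np.maximum(d, 0.0)
        return np.minimum(1.0 - (1.0 + beta * d) * np.exp(-beta * d), gamma)
    return Aa * (step(t - t1) - step(t - t2))

t = np.linspace(0.0, 2.0, 200)          # a 2-second utterance
ln_f0 = (np.log(80.0)                   # baseline F_min = 80 Hz
         + phrase_component(t, 0.0, 0.5)
         + accent_component(t, 0.4, 0.7, 0.4)
         + accent_component(t, 1.2, 1.5, 0.3))
f0 = np.exp(ln_f0)                      # superposition happens in the log-F0 domain
print(round(f0.min(), 1), round(f0.max(), 1))
```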
> - slides duration_intonation, p.7ff
> - Taylor
> - chapter 9.3.3 Autosegmental-Metrical and ToBI models (pdf 259)
> - chapter 9.3.5 The Fujisaki model and Superimpositional Models (pdf 261)
> - Möbius, Bernd: German and Multilingual Speech Synthesis
> - chapter 8.1 Intonation theory (pdf p. 126)
## 14. Describe two synthesis quality evaluation methods: one test for segmental intelligibility, and one test that blocks top-down linguistic knowledge from affecting the evaluation. *(4 P)*
- textbook page 534; 31.01.2017 Slides from Sebastien
A test for **segmental intelligibility** is the Modified Rhyme Test (MRT). You select 50 groups of 6 words each, where the words within a group are identical except for either the initial or the final consonant, e.g. "*ban*, *bath*, *bat*, ...". Play them for listeners, and ask the listeners to record which word they heard, usually in a multiple-choice format.
An example of a **test that blocks top-down linguistic knowledge** would be the Haskins sentences, which are semantically (but not syntactically) unpredictable. These sentences make it very difficult to guess the next word, e.g. "*The wrong shot led the farm.*"