# Speech and Audio Processing
## Introduction
Speech processing is an application of **DSP** (Digital Signal Processing) to the analysis and processing of audio or speech recordings.
To produce speech, the talker first formulates a thought, which is then transmitted to the vocal cords, where the message is converted into a sequence of phonemes enriched with **prosody**, which denotes the **duration** of sounds, their loudness, and the **pitch** associated with them.
Speech sounds are divided into two broad classes depending on the role of the vocal cords in the speech production mechanism:
- **voiced speech** - sounds whose production requires the vocal cords to vibrate and play an active role. Examples: _a_, _e_, _i_.
- **unvoiced speech** - sounds produced while the vocal cords are inactive. Examples: _f_, _s_.
**Voiced speech** occurs when air flows through the vocal cords into the vocal tract in discrete "puffs" rather than as a continuous flow. The vocal cords vibrate at a particular frequency, called the **fundamental frequency** of the sound.
- 50 - 200Hz for male speakers
- 150 - 300Hz for female speakers
- 200 - 400Hz for child speakers
**Unvoiced speech**
For unvoiced speech, the vocal cords are open and air flows continuously through them in a turbulent manner. Unvoiced speech is characterized by high-frequency components.
**Formants** are the resonant frequencies of the vocal tract. For voiced speech, the magnitudes of the lower formant frequencies are larger than the magnitudes of the higher formant frequencies.
## Linguistics
Speech can be seen as a hierarchical organization of elementary units of increasing time scale. At the lowest level are **phones**, the physical realizations of **phonemes**: phones are what is actually spoken, while phonemes are the abstract sound units of the language. Next in the hierarchy are **syllables**, formed out of phones; then words, formed from syllables; and finally **utterances**, consisting of words. An utterance is the smallest unpaused act of speech produced by one speaker, delimited by clear pauses (or changes of speaker).
![](https://i.imgur.com/IVL0T2a.png)
Phones classes:
- **Vowels** - a vowel is a syllabic speech sound pronounced without any constriction of the vocal tract. Vowels in English: _a_, _e_, _i_, _o_, _u_ and _y_.
- **Consonants** - a consonant is a speech sound that is articulated with a complete or partial closure of the vocal tract; these are the sounds not classified as **vowels**. Examples: _p_, _t_, _k_, _l_, _m_, _n_, _f_, _d_, _g_, etc. We can distinguish the following types of consonants.
- **Nasal sounds** - Sound radiated from nostrils as well as lips. Example: _m_, _n_,_ing_.
- **Plosive sounds** - Build up of pressure behind closure, sudden release. Example: _p_, _t_, _k_
There are many systems of phonetic notation; two of the most commonly used are IPA and ARPABET.
- **IPA** (International Phonetic Alphabet) - The general principle of the IPA is to provide one letter for each distinctive sound. Among the symbols of the IPA, 107 letters represent consonants and vowels, 31 diacritics are used to modify these, and 17 additional signs indicate suprasegmental qualities such as length, tone, stress, and intonation.
- **ARPABET** - IPA was not always suitable for computer applications, as it consists of characters outside the ASCII character set, so a mapping from IPA to ASCII, called ARPABET, was introduced. ![](https://i.imgur.com/3fw7yf8.jpg)
Another important concept is coarticulation. **Coarticulation** in its general sense refers to a situation in which a conceptually isolated speech sound is influenced by, and becomes more like, a preceding or following speech sound. It arises because the human speech apparatus cannot move instantaneously from one articulatory configuration to the next.
## Speech Analysis
Speech analysis is the process of analyzing the speech signal to obtain relevant information from it in a more compact form than the speech signal itself. The products of speech analysis are signal features such as pitch, loudness or F0. Some of them, such as loudness, are easily observable in the raw signal, while others are hard to determine by observing the raw signal alone.
### Representations
**Waveform** - Speech signals are sound signals, defined as pressure variations travelling through the air. These variations in pressure can be described as waves and correspondingly they are often called sound waves.
![](https://i.imgur.com/t8KhqpI.png)
If an audio signal has been recorded by a microphone and converted into a digital form consisting of a sequence of numbers which represent the relative pressure at given moments in time, we call this representation **Pulse Code Modulation (PCM)**. The accuracy of this representation is defined by two features:
- **sample rate** - the frequency at which the pressure measurements are taken. A sample rate of 20 kHz means that the value of the pressure is measured 20,000 times per second. Sampling is a classic topic of signal processing. The most important aspect here is the Nyquist frequency, which is half the sampling rate $F_{s}$ and defines the upper end of the largest bandwidth that can be uniquely represented. For example, a sampling rate of 1000 Hz allows us to record only frequencies up to 500 Hz. It is very important to choose a sampling rate able to capture the formants, which are usually contained within the range from 300 Hz to 3500 Hz; therefore the sampling rate should not be below 8000 Hz. Some consonants, for example /s/, contain a large part of their energy above 4 kHz. As a result there may be significant differences between the sound of /s/ in an audio signal recorded with a sample rate of 8000 Hz and one recorded with a sample rate of 16000 Hz (see the aliasing sketch after this list).
| **Name** | **Bandwidth** |
|---------------|-------------------|
| Narrowband | 300 Hz to 3.3 kHz |
| Wideband | 50 Hz to 7 kHz |
| Superwideband | 50 Hz to 16 kHz |
| Fullband | 50 Hz to 22 kHz |
| Original | 0 Hz to 22050 Hz |
- **dynamicity of the sound** - a more dynamic sound will not be represented as well as a less dynamic sound using the same encoding. This is directly connected with the sample rate: a low sample rate cannot represent a fast-changing sound as well as a high sample rate. This dependency is presented in the plots below.
| ![space-1.jpg](https://i.imgur.com/GTwWJeb.png) | ![](https://i.imgur.com/PpbRJli.png) |
|:--:| :--: |
| *Low sampling rate* | *High sampling rate* |
As you can see, a low sampling rate encodes the sound at much worse quality than a high sampling rate.
- **bit depth** - the number of bits used to store one sample. To represent sound on a discrete machine such as a computer, we need to discretize the analogue signal into a sequence of values. Bit depth can be thought of as the number of discrete levels onto which we can map the actual value of the analogue signal; a limited number of bits can only encode a limited number of values on the Y-axis. Thus the higher the bit depth, the better the sound quality, because the real value of the signal can be mapped onto more discrete levels, achieving a lower disparity between the real-life signal and the encoded signal.
| ![space-1.jpg](https://i.imgur.com/DFiBsPp.png) | ![](https://i.imgur.com/LtaP6ss.png) |
|:--:| :--: |
| *bit depth = 2* | *bit depth = 3*|
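To make the Nyquist limit from the sample-rate discussion above concrete, here is a minimal numpy sketch (the tone frequency and duration are arbitrary choices): a 6 kHz tone is above the 4 kHz Nyquist limit of an 8 kHz recording and therefore shows up at the wrong frequency.

```python
import numpy as np

def dominant_frequency(x, fs):
    """Return the frequency (Hz) of the strongest component in the spectrum of x."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

tone = 6000.0                                  # Hz, above the Nyquist limit of an 8 kHz recording
for fs in (8000, 16000):
    t = np.arange(0, 1.0, 1.0 / fs)            # one second of signal
    x = np.sin(2 * np.pi * tone * t)
    print(fs, dominant_frequency(x, fs))       # at 16 kHz: ~6000 Hz; at 8 kHz it aliases to ~2000 Hz
```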
#### Quantization
We need a method to convert an analogue signal into a discrete signal, and that is what quantization is. It maps a large, infinite set of values (in the case of audio, real values) to a smaller set of values which can be expressed using a limited number of bits. The two most commonly used types of quantization are linear and logarithmic.
- **linear quantization** is calculated with the following formula, where $x'$ is the discrete signal, $x$ is the analogue signal and $Δq$ is the step size.
$$ x'=Δq⋅round(x/Δq)$$
The quantization step size $Δq$ has to be chosen such that $x'$ remains in the range which can be represented by the given bit depth, to avoid numerical overflow. For a bit depth of 16, $Δq$ has to be chosen such that all possible values are within the range $(−2^{15},2^{15}]$.
| ![space-1.jpg](https://i.imgur.com/ygkzzha.png) |
|:--:|
| *Quantized sinusoidal signal: original signal (blue) and discrete signal after quantization (red). $Δq=4$* |
- **logarithmic quantization** - was introduced in order to encode waveform values more accurately. Linear quantization divides the space of values into equally sized parts, so values that are very rare are encoded with the same accuracy as the most frequent values. Logarithmic quantization lets us make a trade-off: more frequent values are encoded more accurately while rare values are encoded with less precision. In short, the distribution of quantization levels is denser in frequently used ranges. Logarithmic quantization can be calculated with the following formula.
$$x'=sign(x)⋅exp[Δq⋅round(log(|x|)/Δq)]$$
However, the logarithm causes small values of $x$ to drive the intermediate value towards negative infinity ($\log|x| \to −\infty$). In order to solve that problem, $\log$ is replaced with the **mu-law algorithm**:
$$F(x)=\operatorname{sign}(x)\frac{\log(1+\mu|x|)}{\log(1+\mu)}$$
Substituting the intermediate expression $\operatorname{round}(\log(|x|)/\Delta q)$ with $F(x)$ we get:
$$x'=\operatorname{sign}(x)\cdot\exp\left[\Delta q\cdot\operatorname{sign}(x)\frac{\log(1+\mu|x|)}{\log(1+\mu)}\right]$$
| ![space-1.jpg](https://i.imgur.com/BFdtDAE.png) | ![](https://i.imgur.com/bhGy6bo.png) |
|:--:| :--: |
| *Logarithmic quantization without mu-law.* | *Logarithmic quantization with mu-law.*|
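Both quantizers can be sketched in a few lines of numpy. This follows the usual compand-quantize-expand scheme for mu-law rather than the exact expression above, and assumes a signal normalized to $[-1, 1]$ with $\mu=255$ (the 8-bit telephony value):

```python
import numpy as np

def linear_quantize(x, step):
    """x' = step * round(x / step)"""
    return step * np.round(x / step)

def mu_law_quantize(x, step, mu=255.0):
    """Compress with mu-law, quantize linearly in the compressed domain, then expand back."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    compressed_q = step * np.round(compressed / step)
    return np.sign(compressed_q) * np.expm1(np.abs(compressed_q) * np.log1p(mu)) / mu

x = np.sin(2 * np.pi * np.linspace(0, 1, 100))   # toy signal in [-1, 1]
print(linear_quantize(x, step=0.25)[:5])
print(mu_law_quantize(x, step=0.25)[:5])
```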
#### Wav files
The most typical format for storing sound signals is the wav file format. It is basically just a way to store a time sequence, typically with 16- or 32-bit accuracy, as integer, mu-law or float values. Sampling rates can vary over a large range, between 8 and 384 kHz. The files typically have no compression (neither lossless nor lossy coding), so recording hours of sound can require a lot of disk space. For example, an hour of mono (single-channel) 16-bit sound with a sampling rate of 44.1 kHz requires about 300 MB of disk space (44,100 samples/s × 2 bytes × 3600 s).
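Reading and writing wav files is straightforward with scipy; a small sketch (the file name is a placeholder, and 16-bit PCM is assumed):

```python
import numpy as np
from scipy.io import wavfile

fs, data = wavfile.read("speech.wav")                 # placeholder file; int16 PCM samples
print(f"sample rate: {fs} Hz, duration: {len(data) / fs:.2f} s, dtype: {data.dtype}")

x = data.astype(np.float32) / 32768.0                 # normalize to floats in [-1, 1] for processing
wavfile.write("speech_copy.wav", fs, (x * 32767).astype(np.int16))   # save back as 16-bit PCM
```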
### Speech features
Speech, like any other signal, has features that may be extracted. Some of those features can be subjectively assessed using solely the raw audio signal, while others need to be extracted with mathematical methods.
#### Prosody
Prosody, sometimes loosely called accent, covers the suprasegmental phenomena in speech: patterns that take place at time scales larger than individual phones (segments). Many of the suprasegmental phenomena, such as intonation, stress, and rhythm, play a linguistic function, hence providing an additional means to alter the meaning and implications of the spoken message without changing the lexical and grammatical structure of the sentence. Others are related to other information encoded in speech, such as cues for the speaker's emotional state, attitude, or social background. Here we only focus on those aspects of suprasegmentals that play a linguistic role.
#### Pitch
Pitch, in speech, is the relative highness or lowness of a tone as perceived by the ear, which depends on the number of vibrations per second produced by the vocal cords. The pitch frequency is the fundamental frequency of the speech signal, while formant frequencies are essentially the resonance frequencies of the vocal tract. These frequencies vary among different persons and words, but they stay within a certain frequency range. In short, pitch reflects the frequencies that your vocal cords produce while you speak.
#### Loudness and decibel scale
Loudness is that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud. It is a subjective perception of sound pressure. Loudness is often confused with **SPL**, which is a physical measure expressed in decibels.
| ![](https://i.imgur.com/zzOFRni.png) |
|:--:|
| *This plot depicts the relation between loudness and SPL. (The phon is a logarithmic unit of loudness level.)* |
Loudness can also be expressed in dBA, which adjusts dB to human hearing sensitivity. For example, a 100 dB level at 100 Hz is perceived as having the same loudness as only 80 dB at 1000 Hz.
**SPL**, the **Sound Pressure Level**, is a logarithmic measure of the effective pressure of a sound relative to a reference value. It is expressed in decibels.
$$L_p = 20{log}_{10}(\frac{p}{p_0})$$
**$p$** is the root mean square sound pressure
**$p_0$** is the reference value (conventionally 20 µPa in air).
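As a quick worked example (the 0.02 Pa RMS pressure is just an illustrative round number for conversational speech at close range), the formula gives:

```python
import numpy as np

p0 = 20e-6                        # reference pressure: 20 µPa
p_rms = 0.02                      # illustrative RMS pressure in pascals
spl = 20 * np.log10(p_rms / p0)   # 20 * log10(1000) = 60 dB SPL
print(f"{spl:.1f} dB SPL")
```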
Exemplary values of Loudness for human speech.
| Sound | value |
|:--:|:--:|
| Maximum shout | 90 dBA |
|Shout| 84 dBA|
| Very loud | 78 dBA|
| Loud |72 dBA|
| Raised |66 dBA|
|Normal| 60 dBA|
|Relaxed|54 dBA|
#### Fourier Transform and duality of a signal
A signal can be represented in the time domain as well as in the frequency domain. Both are equivalent representations of the same signal; the only differences lie in which signal features can be observed in each representation. In the time domain the magnitude of the signal changes with respect to time, while in the frequency domain the signal is described with respect to frequency. To convert from the time domain to the frequency domain we use the Fourier transform; to go from the frequency domain back to the time domain we use the inverse Fourier transform. The Fourier transform essentially splits the input signal into (co)sinusoidal partial signals. The formulas are presented below.
**Forward Fourier transform** (time domain ➝ frequency domain)
$$X(\omega ) = \int\limits_{ - \infty }^{ + \infty } {x(t){e^{ - j\omega t}}dt}$$
**Inverse Fourier transform** (frequency domain ➝ time domain)
$$x(t) = {1 \over {2\pi }}\int\limits_{ - \infty }^{ + \infty } {X(\omega ){e^{j\omega t}}d\omega }$$
|![](https://i.imgur.com/ANKoLXi.png)|![](https://i.imgur.com/ZuFNI6S.png)|
|:--:|:--:|
|*Signal in the time domain consisting of two components, $sin(3{\pi}x) + 0.5sin(2{\pi}x)$.*| *Signal in the frequency domain; there are likewise two components, corresponding to $sin(3{\pi}x)$ and $0.5sin(2{\pi}x)$.*|
For discrete signals, we use the **Discrete Fourier Transform (DFT)**, which is the analogue of the Fourier transform for discrete signals; the integrals are replaced with sums.
$$X_k = \sum^{N-1}_{n=0}x_ne^{\frac{-i2\pi}{N}kn}$$
where $n$ is the sample index and $N$ is the number of samples.
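The two-component example plotted above can be reproduced with numpy's FFT (a sketch; the sampling rate and duration are arbitrary, chosen so that both components fall exactly on DFT bins):

```python
import numpy as np

fs = 100                                    # samples per second
t = np.arange(0, 2, 1 / fs)                 # 2 s of signal -> 0.5 Hz frequency resolution
x = np.sin(3 * np.pi * t) + 0.5 * np.sin(2 * np.pi * t)   # components at 1.5 Hz and 1 Hz

X = np.fft.rfft(x)                          # DFT of a real-valued signal
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

peaks = freqs[np.argsort(np.abs(X))[-2:]]   # the two strongest bins
print(sorted(peaks))                        # ≈ [1.0, 1.5] Hz
```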
But what if we want to analyze only a fragment of the signal in the frequency domain? We would have to cut a segment out of our signal and apply the Fourier transform to it, but it isn't that easy, due to a phenomenon called **spectral leakage**.
|![](https://i.imgur.com/J6h8z6l.png)|![](https://i.imgur.com/8FGt3T9.png)|
|:--:|:--:|
|*That is a fragment of the signal that we want to put through the FFT.*|*That is how the FFT "sees" the whole signal based on the small fragment. The FFT then tries to approximate a signal (green line) that abruptly ends, using components that in fact do not exist in the signal. Those components are the direct result of so-called spectral leakage.*|
In order to minimize spectral leakage, we can use **window functions**. In signal processing and statistics, a window function is a mathematical function that is zero-valued outside of some chosen interval, normally symmetric around the middle of the interval, usually near a maximum in the middle, and usually tapering away from the middle. The window function is constructed in such a way as to minimize spectral leakage.
| ![](https://i.imgur.com/4UylAWv.png) |![](https://i.imgur.com/LwjZA4z.png)|
|:--:|:--:|
|*Hann window function.*|*Above is an example of what spectral leakage looks like. If there were no spectral leakage, the whole signal would consist only of the main lobe.*|
Windowing is used to cut a signal into time fragments (temporal segments) while keeping its statistical properties. We can use the window function while calculating the Fourier transform; to do that, we multiply the signal by the window function.
**Forward Fourier transform with windowing** (time domain ➝ frequency domain)
$$X(\omega ) = \int\limits_{ - \infty }^{ + \infty } {w(t)x(t){e^{ - j\omega t}}dt}$$
where $w(t)$ is the window function and $x(t)$ is the signal.
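A small numpy sketch of windowing before the FFT: the 440 Hz tone below does not fit an integer number of periods into the 20 ms segment, so the plain FFT leaks energy across the spectrum, while a Hann window suppresses most of it (all parameters are arbitrary choices):

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.02, 1 / fs)               # a 20 ms segment cut out of a longer signal
segment = np.sin(2 * np.pi * 440 * t)        # 440 Hz: not an integer number of periods in 20 ms

window = np.hanning(len(segment))            # w(t): Hann window
spectrum_plain = np.abs(np.fft.rfft(segment))
spectrum_windowed = np.abs(np.fft.rfft(segment * window))

# Far away from the tone (bin 40 ≈ 2 kHz) the leaked energy is much smaller with the window.
print(spectrum_plain[40], spectrum_windowed[40])
```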
#### Spectrograms and STFT
Another problem with frequency domain representation is the lack of temporal information meaning that we do not know what changes occurred and when those changes occurred. We only get information on what are the partials of a given signal and what is a frequency composition. In order to acquire information on the signal change in time, we use something called **Short-Term Fourier Transform (STFT)**.
STFT formula
$$STFT\{x_n\}(h,k) = X(h,k) = \sum_{n=0}^{N-1} x_{n+h} w_n e^{-i2\pi\frac{kn}N}.$$
In this method, we shift the window function over the signal and calculate the Fourier transform for every time fragment of the signal. As a result, we get something called a **spectrogram**. A spectrogram is a plot of the spectrum for every time fragment cut out by the window function.
| ![](https://i.imgur.com/A11bp1B.png) |
|:--:|
|*From the above spectrogram we can learn three things: how the signal changed in time, what the frequencies of its components were, and what the magnitude of every component was.*|
It is very important to properly adjust the window length. The wider the window, the more information on spectral composition we get, but the less information on change over time, because the time resolution of the spectrogram decreases.
There is one kind of spectrogram that is especially useful in speech processing: the mel spectrogram. A mel spectrogram is a spectrogram whose frequencies are converted to the mel scale, a scale on which equal distances in pitch sound equally distant to the listener.
|![](https://i.imgur.com/vOic0zW.gif)|
|:--:|
|Mel scale|
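A sketch of computing an STFT spectrogram and a mel spectrogram, assuming librosa is installed (the file name is a placeholder, and the 25 ms / 10 ms framing and 80 mel bands are common but arbitrary choices):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)            # placeholder file, resampled to 16 kHz

# STFT power spectrogram: 25 ms windows with a 10 ms hop.
stft = librosa.stft(y, n_fft=400, hop_length=160, window="hann")
spectrogram = np.abs(stft) ** 2

# Mel spectrogram: the same power spectrogram warped onto 80 mel bands, then log-compressed.
mel = librosa.feature.melspectrogram(S=spectrogram, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel)

print(spectrogram.shape, mel_db.shape)                   # (frequency bins, frames), (mel bands, frames)
```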
#### Energy
By signal energy, we usually mean the variance of the signal, which is the average squared deviation from the mean.
$$Energy(x)=var(x)=E[(x-\mu)^2]$$
where $\mu=E[X]$ is the mean of $X$.
Since the amplitude of an oscillating signal varies through the period of the oscillation, it does not usually make sense to estimate the instantaneous energy, but only averaged over some window. Observe however that the windowing function reduces the average energy (it multiplies the signal by a quantity smaller than unity), which introduces a bias that should be corrected if an estimate of the absolute energy is required. Usually, however, the bias is consistent throughout a dataset and can be ignored.
Energy can be calculated over spectral bands, often called energy bands, that is, a range of frequencies of a time-frequency transform, such as 0 to 1000 Hz, 1000 Hz to 2000 Hz and so on for 1 kHz bands. Observe that the bands should be wide enough that they have a "large" number of frequency components within them such that the variance can be estimated. Such a representation is equivalent with a spectral envelope model of the signal.
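A sketch of frame-wise energy with the window-bias correction mentioned above (numpy only; the frame and hop lengths are arbitrary):

```python
import numpy as np

def frame_energy(x, frame_len=400, hop=160):
    """Windowed variance (energy) per frame, corrected for the energy removed by the window."""
    window = np.hanning(frame_len)
    bias = np.mean(window ** 2)                     # average energy loss caused by the window
    energies = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * window
        energies.append(np.var(frame) / bias)
    return np.array(energies)

x = np.random.randn(16000)                          # 1 s of unit-variance noise as a stand-in signal
print(frame_energy(x)[:5])                          # each value ≈ 1.0 after bias correction
```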
##### Power spectrum
The power spectrum, sometimes referred to as the PSD (Power Spectral Density), represents the proportion of the total signal power contributed by each frequency component. It can be treated as a distribution of power over frequency components. To obtain the PSD, we can use a periodogram, which is an estimate of the spectral density, or we can calculate the power of every component using the following formula:
$$P_x=\frac{1}{2N+1}\sum_{n=-N}^{N}|x(n)|^2$$
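A sketch of estimating the PSD with scipy's Welch method, a standard averaged-periodogram estimator (the tone and segment length are arbitrary):

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.random.randn(len(t))   # 200 Hz tone buried in noise

freqs, psd = signal.welch(x, fs=fs, nperseg=1024)   # averaged periodogram
print(freqs[np.argmax(psd)])                        # ≈ 200 Hz, where most of the power sits
```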
**Spectral envelope** is the shape of the power spectrum of sound. It is an important cue for the identification of sound sources such as voices or instruments, and particular classes of sounds such as vowels. In everyday life, sounds with similar spectral envelopes are perceived as similar: we recognize a voice or a vowel regardless of pitch and intensity variations, and we recognize the same vowel regardless of whether it is voiced (a spectral envelope applied to a harmonic series) or whispered (a spectral envelope applied to noise).
#### F0, fundamental frequency
**Fundamental frequency**, denoted $F_0$ and sometimes called simply the fundamental, is defined as the **lowest frequency of a periodic waveform**. In music, the fundamental frequency is the musical pitch of a note that is perceived as the lowest partial present. Since the fundamental is the lowest frequency and is also perceived as the loudest, the ear identifies it as the specific pitch of the musical tone. The individual partials are not heard separately but are blended together by the ear into a single tone. Since speech originates from an organic structure, it is not exactly periodic but contains significant fluctuations. In particular, the amounts of variation in period length and amplitude are known as **jitter** and **shimmer**, respectively. Moreover, the F0 is typically not stationary, but changes constantly within a sentence. In fact, the F0 can be used for expressive purposes to signify, for example, emphasis and questions.
| ![](https://i.imgur.com/WUoSYdC.png) |
|:--:|
| *Blue, red and green signals are partials of black signal. Orange indicates the period of the fundamental frequency.* |
| ![](https://i.imgur.com/8A57j1k.png) |
|:--:|
| *On this plot of a speech signal $F_0=93$ Hz. We can check it: the number of periods in the first 50 ms is around 4.5, so one period lasts $T=50\,\text{ms}/4.5 = 11.1\,\text{ms} = 0.0111\,\text{s}$, and the inverse ($f=\frac{1}{T}$) gives around 90 Hz.* |
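A simple autocorrelation-based F0 estimate (a sketch rather than a robust pitch tracker; the 50-400 Hz search range follows the speaker ranges listed in the introduction):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=50, fmax=400):
    """Pick the autocorrelation peak whose lag corresponds to a plausible F0."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags 0..N-1
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / best_lag

fs = 16000
t = np.arange(0, 0.05, 1 / fs)                                          # a 50 ms voiced frame
frame = np.sin(2 * np.pi * 93 * t) + 0.3 * np.sin(2 * np.pi * 186 * t)  # F0 = 93 Hz plus one harmonic
print(f"{estimate_f0(frame, fs):.1f} Hz")                               # ≈ 93 Hz
```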
#### Jitter and shimmer
The speech production system is not a rigid, mechanical machine, but is composed of an assortment of soft-tissue components. Therefore, although parts of a speech signal might seem stationary, there are always small fluctuations in it, as vocal fold oscillation is not exactly periodic. Variations in signal frequency and amplitude are called jitter and shimmer, respectively. Jitter and shimmer are acoustic characteristics of voice signals, and they are caused by irregular vocal fold vibrations. They are perceived as roughness, breathiness, or hoarseness in a speaker's voice. All natural speech contains some level of jitter and shimmer, but measuring them is a common way to detect voice pathologies. Personal habits such as smoking or alcohol consumption might increase the level of jitter and shimmer in the voice. However, many other factors can have an effect as well, such as loudness of voice, language, or gender. As jitter and shimmer represent individual voice characteristics that humans might use to recognize familiar voices, these measures could even be useful for speaker recognition systems.
| ![](https://i.imgur.com/xjQMfGf.png) |
|:--:|
|*Jitter refers to fluctuations in the period, while shimmer refers to fluctuations in the amplitude.*|
We can measure jitter by calculating absolute jitter. Absolute jitter measures the average absolute difference between the consecutive periods.
$$J_{absolute}=\frac{1}{N-1}\sum^{N-1}_{i=1}|{T_i-T_{i+1}|}$$
And relative jitter.
$$J_{relative}=\frac{J_{absolute}}{\frac{1}{N}\sum^{N}_{i=1}{T_i}}$$
where, $T_{i}$ are the extracted $F_{0}$ periods and $N$ is the number of extracted $F_{0}$ periods.
Shimmer is measured using the following formula and expresses the average absolute base-10 logarithm of the ratio between the amplitudes of consecutive periods, in decibels.
$$S =\frac{1}{N-1}\sum^{N-1}_{i=1}{|20log(A_{i+1}/A_{i})|}$$
where $A_{i}$ are the extracted peak-to-peak amplitude data and N is the number of extracted fundamental frequency periods.
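The jitter and shimmer formulas translate directly into code. The sketch below assumes the F0 periods and peak-to-peak amplitudes have already been extracted (the example values are illustrative):

```python
import numpy as np

def absolute_jitter(periods):
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods)))          # mean |T_i - T_{i+1}| over N-1 pairs

def relative_jitter(periods):
    return absolute_jitter(periods) / np.mean(periods)

def shimmer_db(amplitudes):
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(20 * np.log10(amplitudes[1:] / amplitudes[:-1])))

periods = [0.0107, 0.0109, 0.0106, 0.0108]            # extracted F0 periods in seconds (illustrative)
amplitudes = [0.82, 0.79, 0.84, 0.80]                 # extracted peak-to-peak amplitudes (illustrative)
print(absolute_jitter(periods), relative_jitter(periods), shimmer_db(amplitudes))
```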
#### Zero crossing rate
By looking at different speech and audio waveforms, we can see that depending on the content, they vary a lot in their smoothness. For example, voiced speech sounds are more smooth than unvoiced ones. Smoothness is thus an informative characteristic of the signal.
A very simple way of measuring the smoothness of a signal is to count the number of zero crossings within a segment of that signal. A voiced signal oscillates slowly - for example, a 100 Hz sinusoid crosses zero 200 times per second - whereas an unvoiced fricative can have 3000 zero crossings per second.
$$Z(i)=\frac{1}{2W_L}\sum_{n=1}^{W_L}\left|\operatorname{sgn}[x_i(n)]-\operatorname{sgn}[x_i(n-1)]\right|$$
where $x_i$ is the $i$-th frame of the signal and $W_L$ is its length in samples.
| ![](https://i.imgur.com/1JlgbKk.jpg)|
|:--:|
|*Histogram of the standard deviation of the zero crossing rate for music and for speech. Bayesian networks based on this signal feature can achieve up to 80% accuracy in the music vs. speech classification task.*|
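The zero crossing rate formula is a one-liner over the signs of consecutive samples; a sketch comparing a voiced-like and an unvoiced-like frame (parameters arbitrary):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ (per the formula above)."""
    signs = np.sign(frame)
    return np.mean(np.abs(np.diff(signs))) / 2.0

fs = 16000
t = np.arange(0, 0.02, 1 / fs)
voiced = np.sin(2 * np.pi * 100 * t)                  # slow oscillation -> low ZCR
unvoiced = np.random.randn(len(t))                    # noise-like -> high ZCR
print(zero_crossing_rate(voiced), zero_crossing_rate(unvoiced))
```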
## Speech processing
### Dynamic Range compression
Dynamic range compression or simply compression is an audio signal processing operation that reduces the volume of loud sounds or amplifies quiet sounds, thus reducing or compressing an audio signal's dynamic range. Compression is commonly used in sound recording and reproduction. There are two methods of compression first is downward compression and the second is upward compression.
|![](https://i.imgur.com/W8Ik9nO.png)|![](https://i.imgur.com/4UCECGj.png)|
|:--:|:--:|
|*Downward compression*|*Upward compression*|
In speech processing, dynamic range compression is usually used to balance the quiet parts and the loud parts of speech.
| ![](https://i.imgur.com/SHdIxwZ.png)|
|:--:|
|*Diagram of a compressor. The value of the gain depends on the value of the input signal and is taken from the static characteristic.*|
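A minimal sketch of downward compression using only a static characteristic in the dB domain (the threshold and ratio are illustrative; a real compressor also smooths the gain with attack and release times):

```python
import numpy as np

def compress(x, threshold_db=-20.0, ratio=4.0):
    """Attenuate samples whose level exceeds the threshold, according to the compression ratio."""
    eps = 1e-10
    level_db = 20 * np.log10(np.abs(x) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)       # how far above the threshold we are
    gain_db = -over * (1.0 - 1.0 / ratio)                 # gain reduction from the static curve
    return x * 10 ** (gain_db / 20.0)

x = np.concatenate([0.05 * np.random.randn(8000),          # quiet part
                    0.8 * np.random.randn(8000)])          # loud part
y = compress(x)
print(np.abs(x).max(), np.abs(y).max())                    # the loud part is attenuated
```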
### Noise reduction and denoising
When using speech technology in realistic environments, such as at home, in the office or in a car, there will invariably be other sounds present, not only the speech of the desired speaker. There will be the background hum of computers and air conditioning, cars honking, other speakers, and so on. Such sounds reduce the quality of the desired signal, making it more strenuous to listen to, more difficult to understand or, in the worst case, they might render the speech signal unintelligible. A common feature of these sounds, however, is that they are independent of and uncorrelated with the desired signal.
That is, we can usually assume that such noises are additive, such that the observed signal $y$ is the sum of the desired signal $x$ and the interfering noise $v$, that is, $y=x+v$. To improve the quality of the observed signal, we would like to make an estimate $\hat x$ of the desired signal. The estimate should approximate the desired signal, $x\approx\hat x$, or conversely, we would like to minimize the distance $d(x,\hat x)$ with some distance measure $d(\cdot,\cdot)$.
#### Classical methods
##### Spectral subtraction
The STFT spectrum of a signal is a good domain for noise attenuation because we can reasonably safely assume that spectral components are uncorrelated with each other, such that we treat each component separately. In other words, in the spectrum, we can apply noise attenuation on every frequency bin with scalar operations, whereas if the components would be correlated, we would have to use vector and matrix operations.
In spectral subtraction we assume that we have an estimate of the noise energy. Then we subtract the noise energy ($E[|v|^2]=\sigma^2_v$) from the signal energy.
$$|\hat x|^2 := |y|^2 - \sigma_v^2$$
Of course, energy cannot be negative, so negative results should be clamped to zero. For signals where the energy of the noise is much smaller than the energy of the signal ($v \ll x$) we can use the following formula to calculate the noiseless spectrum.
$$\hat x := \frac{y}{|y|} \sqrt{ |y|^2 - \sigma_v^2} = y \sqrt{\frac{|y|^2 - \sigma_v^2}{|y|^2}}$$
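A sketch of power spectral subtraction on STFT frames, assuming the noise power $\sigma_v^2$ per frequency bin has been estimated from a speech-free segment (scipy is used for the STFT; all parameters are arbitrary):

```python
import numpy as np
from scipy import signal

def spectral_subtraction(y, fs, noise_power, nperseg=512):
    """Subtract the estimated noise power from each STFT bin, clamp at zero, keep the noisy phase."""
    _, _, Y = signal.stft(y, fs=fs, nperseg=nperseg)
    clean_power = np.maximum(np.abs(Y) ** 2 - noise_power[:, None], 0.0)
    X_hat = Y / (np.abs(Y) + 1e-10) * np.sqrt(clean_power)
    _, x_hat = signal.istft(X_hat, fs=fs, nperseg=nperseg)
    return x_hat

fs = 16000
noise = 0.1 * np.random.randn(fs)                        # 1 s of stationary noise
speech = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)    # stand-in "speech" signal
_, _, N = signal.stft(noise, fs=fs, nperseg=512)
noise_power = np.mean(np.abs(N) ** 2, axis=1)            # sigma_v^2 estimated per frequency bin
denoised = spectral_subtraction(speech + noise, fs, noise_power)
```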
##### Wiener filtering
In signal processing, the Wiener filter is a filter used to produce an estimate of a target random process by filtering an observed noisy process, assuming known stationary signal and noise spectra and additive noise. The Wiener filter minimizes the mean square error between the estimated random process and the desired process. If we assume that the estimate of the target signal can be obtained as $\hat x = g\cdot y$, where $g$ is a scalar scaling coefficient, we can derive the $g$ which gives the smallest error, for example the minimum mean square error (MMSE). Specifically, the error energy expectation is
$$E\left[|e|^2\right] = E\left[|x-\hat x|^2\right] = E\left[|x-gy|^2\right]= E\left[|x|^2\right] + g^2 E\left[|y|^2\right] - 2g E\left[xy\right].$$
If we assume that noise and speech signal are uncorrelated then $E[xv]=0$.
$$E[xy]=E[x(x+v)]=E[|x|^2]$$
and
$$E\left[|e|^2\right]
= E\left[|x|^2\right] + g^2 E\left[|y|^2\right] - 2g E\left[|x|^2\right]
= (1-2g)E\left[|x|^2\right] + g^2 E\left[|y|^2\right].$$
The minimum is found by setting the derivative to zero, because $E[|e|^2]$ is a quadratic function of $g$ which opens upwards ($g\rightarrow \pm\infty \Rightarrow E[|e|^2] \rightarrow +\infty$), so the stationary point is a minimum.
$$0 = \frac{\partial}{\partial g}E\left[|e|^2\right]
= -2E\left[|x|^2\right] + 2 g E\left[|y|^2\right]$$
after some transformation
$$g = \frac{E\left[|x|^2\right]}{E\left[|y|^2\right]} = \frac{E\left[|y|^2\right]-\sigma_v^2}{E\left[|y|^2\right]}$$
finally
$$\hat x := y \left(\frac{|y|^2 - \sigma_v^2}{|y|^2}\right)$$
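The same STFT setup turned into a per-bin Wiener gain (a sketch; `noise_power` is estimated exactly as in the spectral subtraction example above):

```python
import numpy as np
from scipy import signal

def wiener_denoise(y, fs, noise_power, nperseg=512):
    """Apply the gain g = (|Y|^2 - sigma_v^2) / |Y|^2 to every time-frequency bin."""
    _, _, Y = signal.stft(y, fs=fs, nperseg=nperseg)
    power = np.abs(Y) ** 2
    gain = np.maximum(power - noise_power[:, None], 0.0) / (power + 1e-10)
    _, x_hat = signal.istft(gain * Y, fs=fs, nperseg=nperseg)
    return x_hat

# Reusing the signals from the spectral subtraction sketch:
# denoised = wiener_denoise(speech + noise, fs, noise_power)
```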
#### Masks and power spectra
As seen above, we can attenuate noise if we have a good estimate of the noise energy. In fact, both the spectral subtraction and Wiener filtering methods use models of the speech and noise energies, or rather of their ratio. The SNR (signal-to-noise ratio) as a function of frequency and time is often referred to as a mask. It is however important to understand that mask-based methods operate on the power (or magnitude) spectrum and thus do not include any model of the complex phase. To model speech signals, we can begin by looking at spectral envelopes, the high-level structure of the power spectrum. It is well-known that the phonetic information of speech lies primarily in the shape of the spectral envelope, and the lowest-frequency peaks of the envelope identify the vowels uniquely. In other words, the distribution of spectral energy varies smoothly across the frequencies. This is something we can model and use to estimate the spectral mask. Similarly, we know that phonemes vary slowly over time, which means that the envelope varies slowly over time. Thus, again, we can model envelope behaviour over time to generate spectral masks.
A variation of such masks is known as binary masks, where we can set, for example, the mask to 1 if the speech energy is larger than the noise energy, and to 0 otherwise. Clearly this is equivalent to thresholding the SNR at unity (0 dB), such that an SNR mask can always be converted to a binary mask, but the reverse is not possible. The benefit of binary masks is that they simplify some computations.
#### Machine learning methods
There are two approaches that use machine learning methods for noise reduction. The first is to take a (usually simple) neural network and a big database of noise and speech signals, and train the network to minimize the distance between the output speech signal and the clean speech signal. With a big database we can generate a large number of unique noisy samples. However, this approach may leave the model unable to correctly denoise speech or noise types it wasn't trained on: if the database consists of adult male voices, the model may not perform well on a noisy recording of a girl's speech. Therefore we can use a GAN, where a generative network generates noises and an enhancement network attenuates the noises which corrupt the speech.
## Evaluation of speech
There are a few metrics that help evaluate generated speech as well as automatic speech recognition. For the evaluation of generated speech, MUSHRA and MOS are used; these are subjective metrics, most commonly used in TTS performance evaluation.
- **MOS** stands for **Mean Opinion Score** - The MOS is expressed as a single rational number, typically in the range 1–5, where 1 is the lowest perceived quality, and 5 is the highest perceived quality. Other MOS ranges are also possible, depending on the rating scale that has been used in the underlying test.
- **MUSHRA** - stands for **Multiple Stimuli with Hidden Reference and Anchor** and is a methodology for conducting a codec listening test to evaluate the perceived quality of the output from lossy audio compression algorithms. The main advantage over the Mean Opinion Score (MOS) methodology (which serves a similar purpose) is that it requires fewer participants to obtain statistically significant results. This is because all systems are presented at the same time, on the same samples (sentences), so that a paired t-test can be used for statistical analysis.
Moreover, there are some objective metrics such as SNR and WER.
**SNR** - **signal-to-noise ratio** measures how noisy the audio signal is; the higher, the better.
**WER** - **word error rate** is the main metric for ASR quality evaluation. It is computed as the number of substituted, inserted and deleted words divided by the number of words in the reference transcript, so it roughly indicates what fraction of words were recognized incorrectly.
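WER is usually computed with an edit (Levenshtein) distance over words; a self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """(Substitutions + insertions + deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```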
## SOURCES:
Main source (well written): https://wiki.aalto.fi/display/ITSP/Introduction
Mel spectrograms: https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53
PSD: https://vru.vibrationresearch.com/lesson/calculating-psd-time-history/
Window function: https://en.wikipedia.org/wiki/Window_function
Pressure of speech: https://service.shure.com/s/article/typical-sound-pressure-levels-of-speech?language=en_US
Zero crossing rate: https://www.sciencedirect.com/topics/engineering/zero-crossing-rate
Fourier transform: https://docs.scipy.org/doc/scipy/tutorial/optimize.html