# Real-time formant analyser planning

## Discussion

Formant calculation algorithm: https://www.fon.hum.uva.nl/praat/manual/Sound__To_Formant__burg____.html

> computes the LPC coefficients with the algorithm by Burg, as given by Childers (1978) and Press et al. (1992)

> Each formant_j contains the following attributes:
>
> frequency
> &ensp; the formant's centre frequency (in Hz).
>
> bandwidth
> &ensp; the formant's bandwidth (in Hz).

What's bandwidth? http://glottopedia.org/index.php/Formant_bandwidth

Effect of bandwidth on voice: https://www.scielo.br/pdf/rcefac/v11n2/26-08.pdf

https://en.m.wikipedia.org/wiki/Resonance_width

> Emilia Today at 9:26 AM
>> the link i found seems to imply it is related to damping if the mouth is considered as a vibrating system. So probably a robot would have low bandwidth and fleshy creatures a higher one :thinking: just my guess

> Lílian Today at 9:28 AM
>> So, I think that's somewhat accurate, according to this research paper the things that affect bandwidth are
>> nasality
>> breathiness
>> pharyngeal constriction
>> tongue height
>> and aperiodicity

> Emilia Today at 9:29 AM
>> makes sense yea, anything that loses energy

---

## Automatic detection of vowels and their location in speech

This could be useful for vowel modification practice when speaking words or sentences.

https://archive-ouverte.unige.ch/unige:18188

---

## Praat's formant calculation algorithm

### Reference

The Praat [manual](https://www.fon.hum.uva.nl/praat/manual/Sound__To_Formant__burg____.html) describes the algorithm in some detail:

> The sound will be resampled to a sampling frequency of twice the value of **Formant ceiling**, with the algorithm described at Sound: Resample.... After this, pre-emphasis is applied with the algorithm described at Sound: Pre-emphasize (in-place)....
> For each analysis window, Praat applies a Gaussian-like window, and computes the LPC coefficients with the algorithm by Burg, as given by Childers (1978) and Press et al. (1992). The number of "poles" that this algorithm computes is twice the **Maximum number of formants** (that's why you can set the Maximum number of formants to any multiple of 0.5).
>
> The algorithm will initially find **Maximum number of formants** formants in the whole range between 0 Hz and **Formant ceiling**. The initially found formants can therefore sometimes have very low frequencies (near 0 Hz) or very high frequencies (near **Formant ceiling**). Such low or high "formants" tend to be artefacts of the LPC algorithm, i.e., the algorithm tends to use them to match the spectral slope if that slope differs from the 6 dB/octave assumption. Therefore, such low or high "formants" cannot usually be associated with the vocal tract resonances that you are looking for. In order for you to be able to identify the traditional F1 and F2, all formants below 50 Hz and all formants above **Formant ceiling** minus 50 Hz are therefore removed. If you don't want this removal, you may experiment with Sound: To Formant (keep all)... instead. If you prefer an algorithm that always yields the requested number of formants, nicely distributed across the frequency domain, you may try the otherwise rather unreliable Split-Levinson procedure Sound: To Formant (sl)....

Press et al. (1992) above refers to their book, *Numerical Recipes in C*.

What actually happens above is the following. Praat pre-processes the speech signal (resampling and pre-emphasis), then passes it to the *LPC algorithm*, which models speech with a simple source-filter model. The LPC algorithm estimates several parameters of the speech signal, some of which are the formants. Praat takes this list of formants, cleans it up, and reports the result.
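As a concrete sketch, the two pre-processing steps named in the quote (pre-emphasis and the Gaussian-like analysis window) might look like this in numpy. The pre-emphasis recursion is the one Praat documents; the Gaussian width used here is our own assumption, not Praat's exact formula:

```python
import numpy as np

def pre_emphasize(x, fs, f_cutoff=50.0):
    """Praat-style pre-emphasis: x[i] -= exp(-2*pi*F*dt) * x[i-1],
    boosting the spectrum above F (50 Hz is Praat's default)."""
    alpha = np.exp(-2.0 * np.pi * f_cutoff / fs)
    y = np.asarray(x, dtype=float).copy()
    y[1:] -= alpha * y[:-1]
    return y

def gaussian_window(n, sigma_fraction=0.125):
    """Gaussian-like analysis window of n samples; the width
    (sigma = 0.125 * n) is an assumed value, not Praat's."""
    i = np.arange(n)
    mid = (n - 1) / 2.0
    return np.exp(-0.5 * ((i - mid) / (sigma_fraction * n)) ** 2)
```

The windowed frame would then go to Burg's LPC (e.g. `librosa.lpc`), with the order set to twice the maximum number of formants.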
```flow
st=>start: Speech
e=>end: Formants
op=>operation: pre-processing
op2=>operation: LPC algorithm
op3=>operation: cleanup
st(right)->op(right)->op2(right)->op3(right)->e
```

### LPC algorithm

LPC ([linear predictive coding](https://en.wikipedia.org/wiki/Linear_predictive_coding)) is a mathematical model of voice production, or a way of parametrising voices. It is not yet clear to us how LPC gives the formants.

LPC is a model that predicts the current value of a time signal from the LPC coefficients and the previous values of that signal. See Section 2.2 in [this book](https://books.google.com.br/books?id=136wRmFT_t8C). The summary ([section 2.2.6](https://books.google.com.br/books?id=136wRmFT_t8C&pg=PA50)) mentions

> LPC is used to estimate F0, vocal tract area functions, and the frequencies and bandwidths of spectral poles (e.g. formants).

The question is: how are the formants (and other vocal parameters) deduced from the LPC coefficients? The coefficients seem to relate directly only to predicting future values of the acoustic signal from past ones.

- Note: The summary in 2.2.6 claims that LPC can be used to determine the vocal tract area function. There is no reference for this, and I didn't see it done in the text. Discussing with [Jarmo Malinen](http://math.aalto.fi/~jmalinen/research.html.en) led me to believe that it cannot be determined by LPC.
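For what it's worth, a common answer to the question above is root-finding: the LPC coefficients [1, a1, ..., ap] define a prediction polynomial A(z), and each complex root of A near the unit circle corresponds to a resonance, with frequency given by the root's angle and bandwidth by its distance from the circle. A numpy sketch of this conversion (the coefficients themselves could come from e.g. `librosa.lpc`):

```python
import numpy as np

def lpc_to_formants(a, fs):
    """Convert LPC coefficients a = [1, a1, ..., ap] into candidate
    formant frequencies and bandwidths (Hz), sorted by frequency.

    Each complex pole r = |r| * exp(i*theta) of 1/A(z) maps to
      frequency = theta * fs / (2*pi)
      bandwidth = -ln(|r|) * fs / pi
    """
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]  # keep one pole per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```

A cleanup step like Praat's would then drop candidates below 50 Hz or above the ceiling minus 50 Hz.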
---

## Our Python code for formant tuning Fortune

### TODO

* ~~record vowel samples for testing (pitch, vowel, different device, breathiness/no breathiness, CQ/OQ)~~ \[except breathiness and quotient]
* ~~test different chunk sizes 1 ms, 20 ms, 100 ms, 200 ms, 400 ms (in surfboard and praat)~~
* ~~understand why pitch bugs at higher than 260 Hz~~
* ~~get stable values (matching praat) from short (10-30 ms) voice clips~~

### Findings

* the pitch algorithm (surfboard) fails above roughly 250 Hz; no idea why, but it's not our code's fault
* chunks shorter than 20 ms can't be analysed correctly for anything
* chunks shorter than 100 ms can't be analysed for pitch
* the surfboard algorithm throws away the first two values given by the formant estimation via the LPC algorithm. See: https://github.com/novoic/surfboard/blob/master/surfboard/formants.py
* in many instances, however, F1 is among those discarded values and not in the "expected" third value; the reason for this is unknown, as is why surfboard assumes it can never be there. We reimplemented surfboard's code ourselves to debug this, using `librosa`'s LPC algorithm.
* Emilia can belt〜
* Creakiness may or may not mess with the enveloping necessary for the formant estimation.
  See: https://www.researchgate.net/profile/Jody_Kreiman/publication/281119746_Acoustic_properties_of_different_kinds_of_creaky_voice/links/55d75bad08aeb38e8a85a866.pdf

### January 25th 2021

* we checked our implementation against praat's values
* started trying to follow praat's code to improve our own implementation
* we learned a lot about how praat works and Emilia taught me some *maths*; in particular we looked into sliding-window usage in praat's formant contour calculation and the Hillenbrand paper
* changed the formant estimator to use a Gaussian window instead of a Hamming window (Emilia calculated the necessary standard deviation); that's because it has smaller side lobes and that's good(tm)(c)(r)
* messed around with the window length; 50 ms seems to be praat's preferred length, but our samples still perform slightly better at 20 ms?
* resampled audio on load to twice the frequency of the highest expected formant, which for female voices praat defines as 5500 Hz (i.e. sample rate = 11000 Hz)
* introduced audio pre-processing to increase the spectral slope (pre-emphasis): pre-emphasised-vector(i) = original-vector(i) - exp(-2 * pi * F * Δt) * original-vector(i-1), where F is the pre-emphasis frequency (50 Hz is the default in praat) and Δt is the sampling period of the sound = 1 / (sampling rate) (i.e., for us = 1 / 11000 Hz ≈ 0.000091 s)
* messed around with the order variable of the LPC algorithm, i.e. the number of poles used for the spectral envelope (?). Praat recommends twice the number of formants measured.
Praat recommends measuring 5 formants for female voices: https://www.fon.hum.uva.nl/praat/manual/FAQ__Formant_analysis.html
* we saw that 10 does indeed seem to be a good value for the order (Hillenbrand uses 14), but there are still some extra formants in between, especially for my cursed clip, as well as some lower formants in Emilia's
* we snuggled in bed :heart:

## Research articles about voice in general

**[2021-02-01 Mon 14:57]** I'm starting to read the titles and abstracts and look at the pictures of Jarmo Malinen's [publication](http://math.aalto.fi/~jmalinen/research.html.en) list.

- A. Hannukainen, J. Malinen, A. Ojalammi. PU-CPI solution of Laplacian eigenvalue problems. 2020, submitted. [Download from arXiv](https://arxiv.org/abs/2006.10427).
  - A numerical study of a new method to calculate the eigenvalues of the Dirichlet-Laplacian on a 2D or 3D domain that lie on a given interval. The method seems suitable for parallel computing in settings where communication is expensive, e.g. cloud clusters.
  - <span class="underline">How related to speech?</span> Perhaps to Helmholtz resonances of the vocal tract? They don't mention explicit applications at the beginning of the paper.
- J. Kuortti, J. Malinen, T. Gustafsson. Numerical modelling of coupled linear dynamical systems. 2019. [Download from arXiv](https://arxiv.org/abs/1911.04219).
  - It deals with issues arising from modelling components of a coupled physical system as linear dynamical systems. In particular, ill-posed feedback loops can cause problems. An example of this kind of situation is an acoustic waveguide terminating in "an irrational impedance". It has a good description of Webster's horn equation and how it relates to acoustic pressure and volume velocity.
  - <span class="underline">How related to speech?</span> The example of an acoustic waveguide relates to the vocal tract.
The issue comes from the boundary conditions at the lips: in essence, modelling a tube connected to an infinite half-space with sound-hard, perfectly reflecting walls.
- P. Alku, T. Murtola, J. Malinen, J. Kuortti, B. Story, M. Airaksinen, M. Salmi, E. Vilkman, A. Geneid. OPENGLOT - An open environment for the evaluation of glottal inverse filtering. Speech Communication, 107, 2019, 38-47. [Download from Speech Communication](https://linkinghub.elsevier.com/retrieve/pii/S0167639318303509).
  - It explains OPENGLOT, whose main purpose is to provide test data for validating and testing glottal inverse filtering (GIF) algorithms. The source signals, vocal tract filter and output signals are known. They are generated purely algorithmically, by physical modelling, by a 3D-printed plastic model of the vocal tract, and from live test subjects.
  - <span class="underline">How related to speech?</span> For testing GIF algorithms. **Perhaps we could use this to test our formant detection algorithm?**
- A. Hannukainen, J. Malinen, and A. Ojalammi. Efficient solution of symmetric eigenvalue problems from families of coupled systems. SIAM Journal on Numerical Analysis, 57(4), 2019, 1789-1814. [Download from SINUM](https://doi.org/10.1137/18M1202323). [Download from arXiv](https://arxiv.org/abs/1806.07235).
  - They study how to efficiently compute the lowest eigenmodes of coupled systems in which only a subsystem changes. Applications include e.g. imaging the vocal tract in an MRI machine while the subject produces sound: the vocal tract changes, but the MRI machine's air column stays constant.
  - <span class="underline">How related to speech?</span> I guess this would make it faster to calculate PDE solutions in cases like the example, e.g. the same person speaking different vowels in the same enclosed space.
- L. Schickhofer, J. Malinen, and M. Mihaescu. Compressible flow simulations of voiced speech using rigid vocal tract geometries acquired by MRI. Journal of the Acoustical Society of America, 145(4), 2019, 2049-2061.
[Download from JASA](https://doi.org/10.1121/1.5095250).
  - Studies the aerodynamics of vowel production.
  - *The goal of this study is an accurate computation of the whole audible frequency range of voiced speech under articulation of natural vowels. This requires resolution of both the harmonic component due to the regular pressure fluctuations by glottal modulation, as well as the inharmonic broadband component due to flow instabilities and turbulence, which is typically neglected by traditional models of voice acoustics based on wave equations. In order to capture possible high-frequency cross-modes, 3D MRI geometries are applied.*
  - <span class="underline">How related to speech?</span> It strives to understand how turbulent flow affects frequencies above 2 kHz in the produced speech. They have good pictures showing which formants are excited in which parts of the vocal tract. Formants with low frequencies tend to be amplified over large parts of the vocal tract, e.g. the laryngopharynx or oropharynx; formants with higher frequencies tend to be amplified in smaller parts. Note that even though it's only one sound, it is composed of a sum of eigenmodes which correspond to resonances (I guess). So a particular vowel's F1 corresponds to one eigenmode, its F2 to another, etc. Where the maxima of these eigenmodes lie seems to determine whether the formant has a high or low frequency.
  - **Interesting:** JASA suggested [this](https://asa.scitation.org/doi/10.1121/1.5095409) for reading. a) A very weird lipstick on page 1968. b) Starting on the same page, they mention a way to apparently estimate vocal tract length from vowel recordings.
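Regarding (b): a standard back-of-the-envelope version of that idea (not necessarily the paper's actual method) models the vocal tract as a uniform tube, closed at the glottis and open at the lips, so its resonances sit at odd multiples of c/(4L) and adjacent formants are spaced c/(2L) apart. A sketch, where the tube model and the speed-of-sound value are the assumptions:

```python
# Hypothetical uniform-tube estimate of vocal tract length (VTL);
# a closed-open tube resonates at F_n = (2n - 1) * c / (4 * L).
C = 35000.0  # speed of sound in warm, humid air, cm/s (assumed value)

def tube_formants(length_cm, n=3):
    """First n resonances (Hz) of a uniform closed-open tube of the given length."""
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, n + 1)]

def estimate_vtl(formants_hz):
    """Invert the model: adjacent resonances are spaced c / (2L) apart,
    so L = c / (2 * mean formant spacing)."""
    spacings = [b - a for a, b in zip(formants_hz, formants_hz[1:])]
    return C / (2.0 * (sum(spacings) / len(spacings)))
```

For a 17.5 cm tract this gives formants near 500, 1500 and 2500 Hz (the classic neutral-vowel values), and feeding those back in recovers 17.5 cm.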