---
tags: M1-Internship-TER
---

# Study and development of a vocal force model

Keywords: machine learning, voice strength, speech processing, expressive speech

## Context

The project aims to model the estimation of vocal force (VF) from speech recordings. VF is defined as the sound pressure level (SPL, in C-weighted decibels) measured in free field, one meter in front of the speaker's mouth (Liénard, 2019). This SPL is lost in the vast majority of recordings, yet the human ear can still estimate it, thanks to the spectral differences produced by the variations in vocal effort that accompany changes in VF.

<!-- Liénard (2019) also shows that estimating this magnitude with a good degree of accuracy is possible based on long-term spectral variations. -->

A corpus of calibrated/uncalibrated signal pairs will be used to build a model capable of estimating the original VF value (in dBC). Collaborations under development will benefit from and extend this effort by expanding the collected corpus and applying the resulting model to other tasks (e.g., expressive speech synthesis; Evrard et al., 2015).

## Objectives

The first objective is to increase the variability of the uncalibrated signal in each pair of the corpus. In practice, a series of degradations will be applied, reproducing variations in the speaker's distance and position relative to the microphone. Processing typically used in post-production (compression, gating, etc.) will also be applied; a minimal sketch of such degradations is given after the references.

A model will then be trained on these calibrated/uncalibrated pairs to produce a reliable estimate of the original VF from any recording. Different neural architectures will be evaluated, from simple feedforward networks to architectures built on more complex representations (e.g., CNN, LSTM). Different feature extraction methods will also be considered: raw and perceptually filtered (e.g., Mel) spectra, as well as representations extracted from self-supervised models (e.g., Baevski et al., 2020); a toy feature-plus-regression baseline is likewise sketched after the references.

## Tasks

* Reviewing speech corpus augmentation techniques
* Surveying learning architectures (neural and self-supervised) for processing audio pairs
* Augmenting the corpus by applying acoustic degradations
* Building a voice strength restoration model from the signal pairs
* Presenting an objective evaluation of the model's performance, as well as a subjective evaluation via perceptual experiments

## References

1. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
2. Evrard, M., Delalez, S., d'Alessandro, C., & Rilliard, A. (2015). Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis. In Sixteenth Annual Conference of the International Speech Communication Association.
3. Liénard, J. S. (2019). Quantifying vocal effort from the shape of the one-third octave long-term-average spectrum of speech. The Journal of the Acoustical Society of America, 146(4), EL369-EL375.
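
The following minimal Python sketch illustrates the kind of acoustic degradations mentioned in the Objectives (distance attenuation, compression, gating). All function names, parameter values, and thresholds are hypothetical placeholders chosen for illustration; an actual pipeline would operate frame-wise, with attack/release smoothing and measured room responses rather than a bare inverse-distance gain.

```python
import numpy as np

def distance_gain(x, d_ref=1.0, d_new=2.0):
    """Attenuate as if the speaker moved from d_ref to d_new metres
    (free-field inverse-distance law: gain = d_ref / d_new)."""
    return x * (d_ref / d_new)

def compress(x, threshold=0.1, ratio=4.0):
    """Naive sample-wise dynamic range compression above `threshold`
    (no attack/release smoothing; a stand-in for a real compressor)."""
    mag = np.abs(x)
    over = mag > threshold
    out = x.copy()
    out[over] = np.sign(x[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out

def gate(x, threshold=0.01):
    """Naive noise gate: silence samples below `threshold`."""
    out = x.copy()
    out[np.abs(out) < threshold] = 0.0
    return out

# Degrade a 1 s synthetic tone at 16 kHz (stand-in for a calibrated recording).
sr = 16000
t = np.arange(sr) / sr
x = 0.3 * np.sin(2 * np.pi * 220 * t)
x_degraded = gate(compress(distance_gain(x, d_new=3.0)))
```

Chaining such degradations with randomized parameters over the calibrated corpus would yield the augmented, uncalibrated side of each training pair.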
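
As a point of departure for the modeling step, here is a deliberately simple baseline, assuming librosa and scikit-learn are available: mean-pooled log-Mel features feeding a linear regressor that predicts VF in dBC. The waveforms and labels below are invented placeholders; the actual work would substitute the calibrated/uncalibrated corpus and the neural architectures listed above.

```python
import numpy as np
import librosa
from sklearn.linear_model import Ridge

def log_mel_features(x, sr=16000, n_mels=40):
    """Mean-pooled log-Mel spectrum: one fixed-size vector per utterance."""
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-8).mean(axis=1)

# Placeholder corpus: four noise bursts at different gains, with made-up VF labels.
rng = np.random.default_rng(0)
waves = [g * rng.standard_normal(16000) for g in (0.05, 0.1, 0.2, 0.4)]
vf_dbc = np.array([54.0, 60.0, 66.0, 72.0])  # invented reference VF values (dBC)

X = np.stack([log_mel_features(w) for w in waves])
model = Ridge(alpha=1.0).fit(X, vf_dbc)
print(model.predict(X[:1]))  # estimated VF for one recording
```

Replacing the linear head with a CNN or LSTM over the full spectrogram, or swapping the log-Mel front end for wav2vec 2.0 representations, gives the configurations to be compared in the evaluation.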