---
tags: M1-Internship-TER
---

# Implementation of prosodic control in a speech synthesis system for a low-resource language

Keywords: Machine learning, speech synthesis, low-resource languages, Nigerian Pidgin (Naija)

## Objectives

The aim of this position is to help complete and apply a text-to-speech (TTS) system developed by previous interns. The system is based on the [FastSpeech 2](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/articles/fastspeech-2-fast-and-high-quality-end-to-end-text-to-speech/) architecture and was trained on a corpus of spontaneous Nigerian Pidgin (Naija) speech. The primary goal will be to assist researchers in applying this system within an experimental context.

Experiments will involve manipulating the prosody of synthesized speech. Prosody, sometimes used synonymously with intonation, refers to variations in pitch ($f_0$), duration, and intensity that occur throughout natural speech. Prosody serves a wide variety of functions in the world's languages, including communicating attitudes and emotional states, emphasizing certain pieces of information, and distinguishing questions from statements. Many languages also use prosody to distinguish individual words, such as Mandarin Chinese's use of pitch to distinguish between 车 *chē* 'car' and 彻 *chè* 'thorough' or Spanish's use of stress to distinguish between *hablo* '(I) speak' and *habló* '(he) spoke'.

Concretely, the student will help researchers adapt the architecture of the TTS model to facilitate control over prosody using various inputs. By the end of the TER, they will have applied these adaptations in a series of basic perceptual experiments to shed light on the role of tone and intonation in this language.

## Context

This work is part of a larger project to study Nigerian Pidgin, a widely spoken but under-resourced language that increasingly serves as the primary vernacular of Africa’s most populous country.
Once stigmatized as a “broken” variety of English spoken only by the uneducated, Nigerian Pidgin is now a source of pride for many speakers, who view it as a home-grown vehicle for communication. It transcends class and ethnicity, lacking both the tribal associations of indigenous languages and the colonial baggage of English. The language can now be seen and heard on college campuses, in houses of worship, in advertisements, in Nigerian expat communities, and even on a local branch of the [BBC](https://www.bbc.com/pidgin).

## Primary tasks

* Reviewing the TTS system architecture and understanding its prosodic control mechanisms
* Implementing changes to the architecture that allow for fine-grained prosodic control
* Applying these modifications in a series of basic experiments

## References

1. Chien, C. M., Lin, J. H., Huang, C. Y., Hsu, P. C., & Lee, H. Y. (2021). Investigating on incorporating pretrained and learnable speaker representations for multi-speaker multi-style text-to-speech. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8588-8592). IEEE. https://arxiv.org/abs/2006.04558
2. Tan, X., Qin, T., Soong, F., & Liu, T. Y. (2021). A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561. https://arxiv.org/abs/2106.15561
3. Ning, Y., He, S., Wu, Z., Xing, C., & Zhang, L. J. (2019). A review of deep learning based speech synthesis. Applied Sciences, 9(19), 4050. https://www.mdpi.com/2076-3417/9/19/4050
4. Bigi, B., Caron, B., & Abiola, O. S. (2017). Developing resources for automated speech processing of the African language Naija (Nigerian Pidgin). In 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (pp. 441-445). https://hal.archives-ouvertes.fr/hal-01705707/document