---
tags: M1-Internship-TER
---

# Sketching Audio Sound FX Using Vocalization

Keywords: Machine Learning, Audio Generation, Sonic Icons, Foley

## Objectives

The main aim is to investigate a solution that facilitates the sonic design of sound FX by interpreting users' intentions through vocalizations. To that end, the trainee will mainly need to study the state-of-the-art machine learning approaches commonly used for vocal imitation and sound synthesis.

## Context

Sound FX, ranging from sonic icons (i.e., UI sounds such as earcons, auditory icons, morphocons, spearcons) to more diegetic environmental sounds, are very common in games and other digital applications, both to provide informational feedback and to immerse the user. However, their design is time-consuming: it involves browsing large sound datasets that are difficult to organize and maintain, and for which openly available data are often limited. Furthermore, a perfectly matching sound is rarely found in the dataset, so sounds must be refined manually with audio tools (e.g., DAWs) that are inaccessible to non-audio experts.

To facilitate the sonic design of these sound FX, we propose to develop a system that interprets users' intentions from their vocalizations alone and synthesizes, from a preexisting dataset, a sound that fits the vocalization(s).

Previous related projects often combined vocalization with another data type. These include the [skat-VG project](https://cordis.europa.eu/project/id/618067), which studied sound design obtained from the combination of vocal imitations and gestures; the Onoma-to-wave project from Okamoto et al. [1, 2], which mostly works on onomatopoeic words (textual information) or on vocal imitations combined with sound event labels; and other projects that also used images, videos, or text prompts. Some vocal imitation datasets have been released in the past [3] and could be used by the trainee to try the state-of-the-art methods.
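For illustration only, the sketch below shows one naive baseline for the "retrieve a fitting sound from a preexisting dataset" idea described above: each sound is summarized by MFCC statistics and ranked by cosine similarity to the user's vocal imitation. The file paths, the use of `librosa`, and the MFCC-based embedding are assumptions made for this sketch; the internship is expected to investigate the learned representations and synthesis models from the literature cited below rather than this hand-crafted baseline.

```python
"""
Minimal query-by-vocalization retrieval sketch (illustrative only).
Embeds every sound in a pre-existing FX dataset, then ranks the FX by
how close their embeddings are to that of the user's vocal imitation.
MFCC statistics stand in for a learned embedding.
"""
from pathlib import Path

import librosa
import numpy as np


def embed(path: str, sr: int = 22050, n_mfcc: int = 20) -> np.ndarray:
    """Summarize a sound file as the mean and std of its MFCCs."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def retrieve(vocalization_path: str, fx_dir: str, top_k: int = 5):
    """Rank dataset sounds by cosine similarity to the vocal imitation."""
    query = embed(vocalization_path)
    scores = []
    for fx_path in sorted(Path(fx_dir).glob("*.wav")):
        cand = embed(str(fx_path))
        sim = float(np.dot(query, cand)
                    / (np.linalg.norm(query) * np.linalg.norm(cand) + 1e-9))
        scores.append((sim, fx_path.name))
    return sorted(scores, reverse=True)[:top_k]


if __name__ == "__main__":
    # Placeholder paths for a local vocal imitation and FX dataset.
    for sim, name in retrieve("my_vocal_imitation.wav", "fx_dataset/"):
        print(f"{sim:.3f}  {name}")
```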
## Primary tasks

* Surveying existing models and datasets
* Selecting the most suitable approach
* Reproducing the results with the chosen model/dataset
* Possibly optimizing and evaluating the model

## References

1. Okamoto, Y., Imoto, K., Takamichi, S., Nagase, R., Fukumori, T., & Yamashita, Y. (2023). Environmental sound conversion from vocal imitations and sound event labels. arXiv preprint arXiv:2305.00302. https://arxiv.org/abs/2305.00302
2. Okamoto, Y., Imoto, K., Takamichi, S., Yamanishi, R., Fukumori, T., & Yamashita, Y. (2022). Onoma-to-wave: Environmental sound synthesis from onomatopoeic words. APSIPA Transactions on Signal and Information Processing, 11(1). [ResearchGate](https://www.researchgate.net/profile/Keisuke-Imoto/publication/349234381_Onoma-to-wave_Environmental_sound_synthesis_from_onomatopoeic_words/links/624ae4ee57084c718b865d51/Onoma-to-wave-Environmental-sound-synthesis-from-onomatopoeic-words.pdf).
3. Cartwright, M., & Pardo, B. (2015). VocalSketch: Vocally imitating audio concepts. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15) (pp. 43–46). Association for Computing Machinery. [psu.edu](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=d36d969f97aa424fcb3165327a3daf674b505fe2).