###### tags: `v1` `thesis` `jptw`

# Chapter 2 Related Work

:::info
Keywords | `audio retrieval` `acoustic feature` `semantic feature` `sound event detection`
:::

Since the proposed QBEAT system focuses on the query-by-example method, we first review previous work on audio retrieval techniques.

### Historical Method

The conventional, heuristic way to retrieve audio information is the cross-correlation method. In `[2]adrian2014acoustic`, Adrián-Martínez et al. proposed an acoustic detection method based on cross-correlation and evaluated it in practical scenarios with noise and reverberation. However, the method applies only to pure-tone or swept-sine signals. Thus, for signals composed of multiple harmonics, some research groups extracted features from spectrograms instead of retrieving in the time domain. Wold et al. `[3]wold1996content` built an audio retrieval system based on perceptual and acoustical features, and Wan et al. `[4]wan2001content` improved the system by reducing computation and search time, using the k-nearest-neighbor (kNN) method to classify the extracted features. In `[3][4]`, features were extracted not only from the time domain but also from the frequency and cepstral-coefficient domains, such as loudness, pitch, and MFCCs, indicating that audio carries both musical and acoustical content.

### Music Information Retrieval

Although the researchers in `[3][4]` described musical features such as pitch and melody, they evaluated their systems only on short, single-gestalt sounds. This revealed the weakness of their systems in retrieving musical melodies. Therefore, dedicated methods for music information retrieval (MIR) have been proposed. The survey in `[5]borjian2017survey` briefly reviews several QBE-based MIR systems. Tsai et al. `[6]tsai2005query` presented a method to retrieve cover versions of songs with similar vocal melodies. They extracted melodies by converting songs into MIDI files and compared the similarity between the notes of the main melody in the vocal portion. Because of the information loss incurred by extracting features from the converted MIDI files rather than from the audio itself, the method performed well only in query-by-singing-example cases. To address this limitation, some researchers processed the signals directly to reduce the information loss. Helén and Virtanen `[7]helen2007similarity` used GMM and HMM models to parameterize the audio signals and retrieved matches from the database using different similarity measures. Other solutions in the spectral domain were also proposed. Shazam `[8]wang2003industrial` and Musiwave `[9]`, as well-known query-by-example systems, extract features from the spectrogram to generate a sparse feature set, called audio fingerprints, for the search mechanism. They also provide an Application Programming Interface (API) for users to access the service from mobile devices, which demonstrates the robustness of their retrieval systems in noisy environments.
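To make the fingerprinting idea concrete, the sketch below illustrates one common landmark-style scheme: pick sparse spectrogram peaks and hash pairs of peaks into a searchable fingerprint. This is only a minimal sketch built on the `scipy` stack; the window length, peak-picking neighbourhood, and hash layout are assumptions for exposition, not the parameters of Shazam `[8]wang2003industrial` or Musiwave.

```python
# Minimal landmark-style fingerprint: sparse spectrogram peaks hashed in pairs.
# Window length, peak neighbourhood, and hash layout are illustrative assumptions.
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def fingerprint(signal, sr, fan_out=5, max_dt=200):
    # Magnitude spectrogram of the recording.
    _, _, spec = stft(signal, fs=sr, nperseg=1024, noverlap=512)
    mag = np.abs(spec)

    # Keep only local maxima above the median magnitude: a sparse, noise-robust set.
    peaks = (mag == maximum_filter(mag, size=20)) & (mag > np.median(mag))
    freq_idx, time_idx = np.nonzero(peaks)
    order = np.argsort(time_idx)
    freq_idx, time_idx = freq_idx[order], time_idx[order]

    # Pair each peak with a few later peaks; (f1, f2, time gap) forms the hash,
    # stored with the anchor time so matches can be checked for offset consistency.
    hashes = []
    for i in range(len(time_idx)):
        for j in range(i + 1, min(i + 1 + fan_out, len(time_idx))):
            dt = time_idx[j] - time_idx[i]
            if 0 < dt <= max_dt:
                hashes.append((hash((freq_idx[i], freq_idx[j], dt)), time_idx[i]))
    return hashes
```

Matching a query then amounts to counting hash collisions with a consistent time offset against the fingerprints stored for each reference recording, which is what makes such schemes robust to background noise.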
### Semantic-based Audio Retrieval

As mentioned in `[4]wan2001content`, conceptual semantics are also salient features for a content-based audio retrieval system. In `[5]slaney2002semantic`, Slaney built a model to link features in the acoustic and semantic spaces. The paper provided a proof of concept that semantic features are also significant for retrieval. Hence, some researchers worked with features in the semantic domain. For instance, Barrington et al. `[6]barrington2007audio` described the query-by-semantic-example (QBSE) method and compared it with the common query-by-acoustic-example (QBAE) method. They achieved better performance by retrieving with semantic similarities. However, the method exposed a weakness in handling the strong similarity among synonyms. To solve this problem, Mesaros et al. `[7]mesaros2013query` proposed a similarity measure that combines acoustic and semantic similarity, and improved the measure among synonymous labels with natural language processing and WordNet `[8]pedersen2004wordnet`.

### ==Sound Event Detection==

To provide QBSE in an audio retrieval system, the audio in the database must be labeled with tags. Thus, some researchers have been dedicated to sound event detection (SED), which aims at identifying the existence and occurrence of various sound events in a recording. Adavanne et al. `[9]` proposed a framework using a Convolutional Recurrent Neural Network (CRNN) with features extracted from multichannel audio. Since models like the CRNN are fully supervised, strongly labeled data are required for training. However, the labels available for the majority of multimedia data are generally weak. Methods that relax this requirement have been proposed; for example, Kumar and Raj `[11]` proposed a framework that learns from only weakly labeled data. Although they reduced the cost of full annotation, the demand for weakly labeled datasets remains. Therefore, training an SED model that can automatically label an audio dataset is still far from practical. Hence, in this thesis we propose an assistant system for data annotation using the query-by-acoustic-example method to bridge the gap between research and practice.
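For reference, the following is a minimal sketch of a CRNN-style SED model of the kind proposed by Adavanne et al. `[9]`, written with PyTorch. The single-channel log-mel input, layer sizes, and class count are illustrative assumptions rather than the configuration of the cited work; the point is only the structure of convolutional front-end, recurrent temporal modeling, and frame-wise sigmoid outputs.

```python
# Minimal CRNN-style sound event detector; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=10):
        super().__init__()
        # CNN front-end: learn local time-frequency patterns, pool frequency only.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        # RNN: model temporal context across frames.
        self.rnn = nn.GRU(64 * (n_mels // 16), 64,
                          batch_first=True, bidirectional=True)
        # Frame-wise classifier: per-class activity probabilities.
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                     # x: (batch, 1, frames, n_mels)
        x = self.cnn(x)                       # (batch, 64, frames, n_mels // 16)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, frames, 64 * n_mels // 16)
        x, _ = self.rnn(x)                    # (batch, frames, 128)
        return torch.sigmoid(self.fc(x))      # frame-level event probabilities

model = CRNN()
probs = model(torch.randn(2, 1, 500, 40))     # two 500-frame log-mel clips
print(probs.shape)                            # torch.Size([2, 500, 10])
```

Training such a model end to end requires frame-level (strong) labels, which is precisely the annotation cost that motivates the assistant system proposed in this thesis.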