###### tags: `v1` `thesis` `jptw`

# Chapter 1 Introduction

## 1.1 Background and Motivation

As the elderly population grows, population aging has drawn public attention to issues such as a shrinking working-age population and rising health care costs. As the working-age population shrinks, labor markets become imbalanced, with rising demand and limited supply, not to mention the growing need for social care and in-home health care. Hence, unobtrusive in-home health monitoring and surveillance have received sustained research attention.

In this context, monitoring devices and automatic detection are closely related to this work. For example, a video camera, a readily accessible device in daily life, captures human behavior through video streaming so that it can be detected automatically. Researchers have therefore proposed methods for multimedia processing based on a wide variety of models, such as hidden Markov models (HMMs), recurrent neural networks (RNNs), and deep learning (DL) models. Because these models are data-driven, they normally require a large amount of labeled data. However, labeling data at that scale is costly for a company in money, time, and human resources. To address this limitation, developing assistive labeling tools is extremely important.

## 1.2 Problem Statement

Multimedia is a form of communication that combines different content forms such as text, images, audio, and video. In signal processing of multimedia information, especially video content, multimedia event detection (MED) has been a well-known task for several years. To analyze video content, most research has focused on computer vision. However, audio analysis, which draws on rich information for event detection, could also contribute significantly to this development.
Thus, sound event detection (SED) has attracted growing interest in recent years, and there is even a community, DCASE, dedicated mainly to sound-related tasks. Since SED is commonly solved with data-driven DL models, fully or weakly labeled datasets are extremely important. In addition, when a tailored system must be built to meet the specific demands of a particular task, data annotation becomes a necessary step.

In this thesis, we focus on data annotation in the audio domain and aim to reduce the time cost of manually labeling sound events. We developed a user-friendly annotation tool based on the query-by-example (QBE) method. This annotation system, named QBEAT (Query-by-example Annotation Tool), helps users label sound events from an example query of the sound and exports the annotations to files in the format required by SED models. Finally, a pilot test is conducted to validate the system and to give early indications of further possibilities.

## 1.3 Thesis Overview

The rest of the thesis is organized as follows. Chapter 2 discusses related work on audio retrieval and sound event detection. Chapter 3 describes the proposed system design and the audio fingerprinting method. Chapter 4 presents the experimental analysis and results, and Chapter 5 concludes the thesis and outlines directions for future research.