###### tags: `draft` `thesis` `jptw`
# Chapter 3 Methodology
## 3.1 System Overview

**Fig. 3.1**: Overview of the proposed system. The system consists of a front end and a back end: the front end is the user interface, and the back end implements the algorithm of our method.
The overall architecture of the proposed system is illustrated in Fig. 3.1. The system provides web APIs for users to query audio from the reference database, which is described in Section 4. The system can thus be divided into a front end and a back end. The back end implements the main algorithm, Audio Fingerprinting (AF), which consists of two stages: the fingerprinting technique and the searching mechanism. The front end is a web-based interactive interface that makes detection and annotation user-friendly. More details are discussed in the following sub-sections.
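The chapter does not fix a particular web framework, so the following is only a minimal sketch of how the front end could reach the back-end algorithm through a web API, assuming Flask; the endpoint path and the `run_audio_fingerprinting` helper are illustrative placeholders, not the actual implementation.

```python
# A minimal sketch of the web API between the front end and the back end,
# assuming a Flask back end (framework, endpoint path, and helper name are
# assumptions for illustration only).
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_audio_fingerprinting(audio_file):
    """Placeholder for the AF back end (Section 3.2): returns top-N candidate segments."""
    raise NotImplementedError

@app.route("/api/query", methods=["POST"])
def query_audio():
    audio_file = request.files["query"]      # the uploaded 15-30 s audio excerpt
    candidates = run_audio_fingerprinting(audio_file)
    return jsonify(candidates)               # candidate segments for annotation
```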
## 3.2 Audio Fingerprinting (AF)

**Fig. 3.2**: The block diagram of audio fingerprinting.
This sub-section gives a comprehensive description of our method. The block diagram of the QBE-based audio retrieval system is shown in Fig. 3.2. The retrieval method used in the system is called Audio Fingerprinting (AF). The audio query is normally a 15-30 second audio excerpt. We extract features in the time and frequency domains from both the dataset and the audio excerpt, and generate a feature set called the audio fingerprint, which serves as a unique identifier for each audio file. We then compare the feature sets of the reference database and the query to filter out similar audio files as the top-$N$ candidates for users to choose from. The upper flow in Fig. 3.2 shows the generation of the feature set, and the lower flow shows the searching process.
### 3.2.1 Fingerprinting
The following steps describe the fingerprint extraction procedure; it is used throughout the system, since the method is the same for both the reference audio dataset and the audio query. A minimal code sketch of these steps is given after the list.
> Step 1. The audio signal is converted into a spectrogram using a frame length of $N$ with an overlap ratio of $0.5$.
> Step 2. Peaks are detected on the spectrogram using an image processing method: an $M\times M$ filter finds the local maxima within each neighborhood.
> Step 3. To handle background noise, the number of detected peaks is reduced by binary erosion [14] (a common method for peak detection in image processing) and a threshold value $K$ (peaks with energy below $K$ are removed).
> Step 4. Each peak point $c$ takes its turn as the starting point, called the cursor, and is paired with the peak points in its target zone $T$ (the zone covers frequency differences from $-\Delta_f$ to $+\Delta_f$ and time differences from $0$ to $+\Delta_t$ relative to the cursor $c$).
> Step 5. For each pair of the cursor $c$ and a peak $p$ in $T$, the frequencies of $c$ and $p$ and the time difference $\Delta t$ are passed to a hash function (we use the SHA-1 hashing algorithm) to ensure the uniqueness of each pair. The hashes generated from an audio file are its audio fingerprints.
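The following is a minimal sketch of Steps 1-5, assuming `librosa` and `scipy` for the spectrogram and peak picking. The concrete values below are placeholders standing in for $N$, $M$, $K$, $\Delta_t$, and $\Delta_f$, not the values used in our experiments.

```python
import hashlib

import numpy as np
import librosa
from scipy.ndimage import maximum_filter, binary_erosion

# Placeholder values standing in for the parameters N, M, K, Delta_t, Delta_f.
FRAME_LEN = 2048      # N: frame length (hop = N/2 gives the 0.5 overlap ratio)
NEIGHBORHOOD = 20     # M: size of the M x M local-maximum filter
MIN_AMP_DB = -40      # K: energy threshold for keeping a peak
DELTA_T = 100         # target zone extent in frames (0 .. +Delta_t)
DELTA_F = 50          # target zone extent in frequency bins (-Delta_f .. +Delta_f)

def fingerprint(y, sr):
    """Return a list of (hash, cursor onset time) pairs for one audio signal."""
    # Step 1: spectrogram with 50% overlap, converted to dB scale.
    S = np.abs(librosa.stft(y, n_fft=FRAME_LEN, hop_length=FRAME_LEN // 2))
    S_db = librosa.amplitude_to_db(S, ref=np.max)

    # Step 2: local maxima within an M x M neighborhood.
    local_max = maximum_filter(S_db, size=NEIGHBORHOOD) == S_db

    # Step 3: drop maxima inside the eroded low-energy background or below K.
    background = binary_erosion(S_db <= MIN_AMP_DB,
                                structure=np.ones((3, 3)), border_value=1)
    peaks = local_max & ~background & (S_db > MIN_AMP_DB)
    freq_idx, time_idx = np.nonzero(peaks)
    points = sorted(zip(time_idx, freq_idx))   # (frame, freq bin), time-ordered

    # Steps 4-5: pair each cursor with the peaks in its target zone and hash the pair.
    hashes = []
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1:]:
            dt, df = t2 - t1, f2 - f1
            if dt > DELTA_T:
                break                          # points are time-ordered
            if abs(df) <= DELTA_F:
                h = hashlib.sha1(f"{f1}|{f2}|{dt}".encode()).hexdigest()
                onset = t1 * (FRAME_LEN // 2) / sr   # cursor onset in seconds
                hashes.append((h, onset))
    return hashes
```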
For the reference audio dataset, we extract the audio fingerprints of all audio files and store these features, together with the onset time of each cursor, into the database (the fingerprint database); this completes the database preparation. When querying with a target audio snippet, we search for matching hashes in the fingerprint database. The details of the searching mechanism are explained in the next subsection.
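As a sketch of the database preparation, the reference index can be thought of as a mapping from each hash to the files and cursor onsets where it occurs; an in-memory dictionary stands in for the actual fingerprint database, and `fingerprint` refers to the sketch above.

```python
import librosa

def build_fingerprint_database(audio_paths):
    """Offline preparation: map each hash to the files and cursor onsets where it occurs."""
    db = {}
    for path in audio_paths:
        y, sr = librosa.load(path, sr=None)
        for h, onset in fingerprint(y, sr):          # fingerprint() from the sketch above
            db.setdefault(h, []).append((path, onset))
    return db
```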

**Fig. 3.3**: The spectrogram in dB scale converted from an audio file, and the peaks detected on the spectrogram.

**Fig. 3.4**: The cursor $c$ and its paired points in the target zone. The left figure shows the parameters $\Delta_t$ and $\Delta_f$ that define the zone, and the relation between the cursor $c$ and a peak point $p$ (i.e., the frequencies $f_1$ and $f_2$, and the time difference $\Delta t$).
### 3.2.2 Searching

**Fig. 3.5**: An illustration for the searching mechanism.
In the searching stage, we compare the fingerprint hashes extracted from the query with those from the dataset. As shown in Fig. 3.5, panel (a) illustrates the hashes found in both the query and the reference audio: hash A and hash B represent the expected hashes, while hash C is a noise distortion. We then compute the time difference between the query and the reference audio. Suppose there are two segments that match the query; we obtain different time differences $\Delta t_1$ and $\Delta t_2$, one for each segment. From Fig. 3.5 (b) and (c) we can see that the number of expected hashes is much larger than that of the distortion hashes. Next, we count the hashes that share the same time difference and sort them into a sorted hash series. Finally, we obtain the segments with $\Delta t_1$ and $\Delta t_2$ as the result.
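A minimal sketch of this matching step, building on the fingerprint database sketch above: matched hashes are grouped by reference file and by the time difference between the reference and the query (the rounding used for binning is an illustrative choice).

```python
from collections import defaultdict

def match_hashes(db, query_hashes):
    """Group matched reference onsets by (file, time difference to the query)."""
    groups = defaultdict(list)
    for h, q_onset in query_hashes:
        for path, ref_onset in db.get(h, []):
            # Hashes belonging to the same matching segment share (approximately)
            # the same time difference between the reference audio and the query.
            delta_t = round(ref_onset - q_onset, 2)  # simple binning choice
            groups[(path, delta_t)].append(ref_onset)
    return groups
```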

**Fig. 3.6**: The onset and offset time for each selected segment.
For each segment, we can also compute its duration from its onset and offset times, as shown in Fig. 3.6. The times $T_{onset}$ and $T_{offset}$ are the times $T$ at the beginning and the end of the sorted hash series. Since an audio file may contain several segments similar to the query, we define the following parameters to filter the candidate segments: $N$ for the top-$N$ candidates, and the thresholds $k_h$ and $k_d$. Here $k_h$ is the threshold on the number of hashes sharing the same time difference, and $k_d$ is the threshold on the duration of a selected segment; that is, a segment is not selected if it has no more than $k_h$ matched hashes or if its duration is less than $k_d$.
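Continuing the sketch above, candidate segments can be filtered with $k_h$ and $k_d$ and ranked to obtain the top-$N$ results; the default values below and the assumption that times are in seconds are placeholders, not the settings used in our experiments.

```python
def select_candidates(groups, top_n=5, k_h=10, k_d=1.0):
    """Keep segments with more than k_h matched hashes and duration of at least k_d."""
    candidates = []
    for (path, delta_t), onsets in groups.items():
        t_onset, t_offset = min(onsets), max(onsets)   # segment boundaries (Fig. 3.6)
        duration = t_offset - t_onset
        if len(onsets) > k_h and duration >= k_d:
            candidates.append((len(onsets), path, t_onset, t_offset))
    # Rank by the number of matched hashes and return the top-N candidates.
    candidates.sort(reverse=True)
    return [(path, t_on, t_off) for _, path, t_on, t_off in candidates[:top_n]]
```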
## ~~3.3 Cross Correlation (CC)~~
The cross-correlation function $c_{av}[k]$ with time lag $k$ is defined as follows,
$$c_{av}[k] = \sum_{i=0}^{\max(M,N)-1} a[i]\, v[i+k]$$
where $a$ and $v$ are the zero-padded data of the audio query and the reference audio, and $M$ and $N$ are the lengths of the two sequences, respectively.
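A direct NumPy sketch of this definition for non-negative lags $k$, with both sequences zero-padded to a common length as in the equation; for real-valued signals, `np.correlate(v, a, mode="full")` computes the same sums over negative lags as well.

```python
import numpy as np

def cross_correlation(a, v):
    """c_av[k] = sum_i a[i] * v[i + k], with both sequences zero-padded to the same length."""
    L = max(len(a), len(v))
    a = np.pad(np.asarray(a, dtype=float), (0, L - len(a)))
    v = np.pad(np.asarray(v, dtype=float), (0, L - len(v)))
    # One value per non-negative lag k; samples shifted past the end act as zeros.
    return np.array([np.dot(a[:L - k], v[k:]) for k in range(L)])
```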
## 3.4 User Interface

Since the proposed system is intended as an assistive tool, we developed a web-based user interface for it. With this interactive interface, users can upload an audio query to the system and obtain the recommended results, as shown in Fig. \ref{fig:system-ui}. Users can listen to the candidate segments of each audio file and adjust the region boundaries. After the modification, they can export their annotation record to a CSV file in the format shown in Table 3.1.
|index|onset|offset|label|
|---|---|---|---|
|1|19.27|29.73|TPE_Metro_BLine|
|2|77.76|88.22|TPE_Metro_BLine|
|3|138.25|148.71|TPE_Metro_BLine|
**Table 3.1**: An example of the annotation format of the exported CSV file.
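The export itself happens in the web interface; the following Python sketch only illustrates writing records in the Table 3.1 format (the function name and argument layout are illustrative).

```python
import csv

def export_annotations(records, path):
    """Write (onset, offset, label) records in the Table 3.1 format."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "onset", "offset", "label"])
        for i, (onset, offset, label) in enumerate(records, start=1):
            writer.writerow([i, f"{onset:.2f}", f"{offset:.2f}", label])

# e.g. export_annotations([(19.27, 29.73, "TPE_Metro_BLine")], "annotations.csv")
```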