###### tags: `v1` `thesis` `jptw`
# Chapter 3 Methodology
## 3.1 System Overview

**Fig. 3.1**: Overview of the proposed system. The system consists of a front-end and a back-end: the front-end is the user interface, and the back-end implements the retrieval algorithm.
The overall architecture of the proposed QBEAT system is illustrated in Fig. 3.1. The system can be separated into two components: a front-end and a back-end. The back-end contains the main algorithm for audio retrieval, while the front-end provides the web-based interactive interface of the system.
According to the related work reviewed in Section 2, Audio Fingerprinting (AF) provides an API for users to access the system and achieves good retrieval performance in noisy and reverberant environments. Since the audios in our dataset are real-life binaural recordings containing noise and reverberation, we adopted the AF method in our system. As shown in Fig. 3.1, AF consists of two stages, a fingerprinting technique and a searching mechanism, and we made some modifications to both. Apart from the algorithm, the design of the user interface (UI) is also crucial; we provide a user-friendly interface for manipulating annotations. More details are described in the following sub-sections.
## 3.2 Audio Fingerprinting (AF)

**Fig. 3.2**: The block diagram of audio fingerprinting. The upper flow is fingerprinting, and the lower flow is searching.
This section gives a comprehensive description of our method. The block diagram in Fig. 3.2 shows the procedure of retrieving information from audio. We first introduce the AF process and then describe our modifications.
In AF, we extract features in the time and spectral domains from both the audio query and the audio dataset. An audio query is normally a 15-30 second audio excerpt. These features are then combined into a feature set called audio fingerprints, which serves as a unique identification of an audio file. After preparing the reference audio fingerprint database, we can start retrieving the audio query. In the searching stage, we compare the fingerprints from the query and the database to filter out the candidate audio files containing a similar snippet. Fig. 3.2 illustrates the details of fingerprinting and searching in AF.
### 3.2.1 Fingerprinting
Since fingerprint extraction is used throughout the system for both the audio query and the reference audio dataset, we first describe this method step by step.
|sample rate (kHz)|frame length (samples)|hop size (samples)|window|
|---|---|---|---|
|44.1|4096|2048|Hann|
**Table 3.1**: The parameters for the conversion of spectrograms.
> **Step 1**: The audio signal is converted into a spectrogram with the parameters listed in Table 3.1 (the hop size is defined by an overlap ratio of 0.5 relative to the frame length).
>
> **Step 2**: Find peaks on the spectrogram. We implemented an image-based local-maxima filtering method: we determine which pixels are the maxima within a neighborhood of size $M\times M$. However, noise or distortion in the low-energy or high-frequency components of the background can also produce spurious maxima. To handle this issue, we used morphological erosion to remove the peaks detected in these distorted portions, as illustrated in Fig. 3.3 (see the code sketch after this list).
>
> **Step 3**: Generate fingerprints from the selected peaks. Each peak point takes its turn as the starting point, called the cursor. The cursor $c$ corresponds to a target zone $T$, indicated by the broken lines in Fig. 3.4. For each peak point $p$ in $T$, we compute the time difference from the cursor $c$ and hash the pair according to the rule defined in Eq. 3.1.
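
To make Steps 1 and 2 concrete, the following is a minimal Python sketch of the spectrogram conversion and peak picking, assuming SciPy is available; `NEIGHBORHOOD_SIZE` and `MIN_AMPLITUDE_DB` are illustrative placeholders rather than the exact values used in our implementation.
```python
import numpy as np
from scipy import signal
from scipy.ndimage import (binary_erosion, generate_binary_structure,
                           iterate_structure, maximum_filter)

# Parameters from Table 3.1; the neighborhood size and the low-energy
# threshold below are illustrative placeholders, not the thesis values.
SAMPLE_RATE = 44100
FRAME_LENGTH = 4096
HOP_SIZE = 2048            # overlap ratio 0.5 relative to the frame length
NEIGHBORHOOD_SIZE = 20     # number of dilations shaping the local-maxima neighborhood (assumed)
MIN_AMPLITUDE_DB = -60     # assumed threshold separating the low-energy background

def compute_spectrogram(samples):
    """Step 1: convert the audio signal into a magnitude spectrogram (dB)."""
    freqs, times, stft = signal.stft(
        samples, fs=SAMPLE_RATE, window="hann",
        nperseg=FRAME_LENGTH, noverlap=FRAME_LENGTH - HOP_SIZE)
    return freqs, times, 20 * np.log10(np.abs(stft) + 1e-10)

def find_peaks(spec_db):
    """Step 2: local-maxima filtering plus morphological erosion of the background."""
    neighborhood = iterate_structure(generate_binary_structure(2, 1),
                                     NEIGHBORHOOD_SIZE)
    # A pixel is a candidate peak if it equals the maximum of its neighborhood.
    local_max = maximum_filter(spec_db, footprint=neighborhood) == spec_db
    # Erode the low-energy background so spurious maxima inside it are discarded.
    background = spec_db <= MIN_AMPLITUDE_DB
    eroded_background = binary_erosion(background, structure=neighborhood,
                                       border_value=1)
    peaks = local_max & ~eroded_background
    freq_idx, time_idx = np.nonzero(peaks)
    return list(zip(freq_idx, time_idx))     # (frequency bin, time frame) pairs
```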


**Fig. 3.3**: Demonstration of morphological erosion for removing peaks caused by distortion in the low-energy or high-frequency components of the background. (a) shows the signal concentrated in the frequency range between 13 kHz and 19 kHz. The top panel of (b) shows the original background and the background detected after erosion; the bottom panel of (b) shows the peaks with and without removing the distortion peaks in the detected background.

**Fig. 3.4**: The cursor $c$ and a peak point $p$ in the corresponding target zone $T$. The range of $T$ is defined by a width of $\Delta_t$ and a height of $2\Delta_f$.
The hash for a pair $(c, p)$ with time difference $\Delta t$ is defined as follows:
$$h_{c,p} = \mathrm{SHA1}(f_1, f_2, \Delta t) \tag{3.1}$$
where $c$ and $p$ are the cursor and the peak point in the target zone $T$, and $f_1$ and $f_2$ are the frequencies of these two points, respectively. SHA-1 [15] is a hash algorithm for computing a condensed representation of a message or a data file.
With this hash function, we can ensure the uniqueness of each fingerprint hash, even when the message composed of the frequencies and the time difference is similar to that of another hash.
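As a sketch of Step 3 and Eq. 3.1, the pairing and hashing could look as follows; it assumes the peak list from the previous sketch, and `DELTA_T_MAX` and `DELTA_F_MAX` are placeholder bounds standing in for $\Delta_t$ and $\Delta_f$ in Fig. 3.4.
```python
import hashlib

# Assumed target-zone bounds (Fig. 3.4); the exact values are illustrative.
DELTA_T_MAX = 200      # zone width in time frames
DELTA_F_MAX = 100      # half of the zone height in frequency bins

def generate_fingerprints(peaks):
    """Step 3: hash every (cursor, peak) pair inside the target zone (Eq. 3.1)."""
    peaks = sorted(peaks, key=lambda p: p[1])       # sort peaks by time frame
    fingerprints = []
    for i, (f1, t1) in enumerate(peaks):            # cursor c
        for f2, t2 in peaks[i + 1:]:                # candidate peak p
            dt = t2 - t1
            if dt > DELTA_T_MAX:
                break                               # beyond the zone in time
            if abs(f2 - f1) > DELTA_F_MAX:
                continue                            # outside the zone in frequency
            message = f"{f1}|{f2}|{dt}".encode()
            h = hashlib.sha1(message).hexdigest()
            fingerprints.append((h, t1))            # keep the cursor time as the onset
    return fingerprints
```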
We extracted audio fingerprints for every audio in the reference dataset and recorded the time point of the cursor as the onset time of each hash. Next, we built a fingerprint database to store these feature sets, which completed the database preparation. When querying a target audio snippet, we simply search for matching hashes in the database. The searching mechanism is discussed in the next section.
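A minimal sketch of this database preparation, reusing `generate_fingerprints` from the previous sketch, is given below; the function and variable names are illustrative.
```python
from collections import defaultdict

def build_database(reference_peaks):
    """Map every hash to the (audio file, onset time) pairs where it occurs.

    `reference_peaks` maps an audio file name to its peak list; the names
    here are illustrative, not the thesis implementation.
    """
    database = defaultdict(list)
    for audio_id, peaks in reference_peaks.items():
        for h, onset in generate_fingerprints(peaks):
            database[h].append((audio_id, onset))
    return database
```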
### 3.2.2 Searching

**Fig. 3.5**: Illustration of the searching mechanism. (a) shows the hashes found in both the query and the reference dataset. The red and blue lines denote hashes of features, and the green line denotes a hash of distortion. (b) shows the time differences of the hashes between the query and the reference audio. (c) shows the hashes sorted by their time difference from the query.
In the searching stage, we retrieve the fingerprint hashes generated from both the query and the reference dataset. As shown in Fig. 3.5, we find the hashes that appear in both the query and an audio in the reference dataset. The mechanism in general AF is to count the total number of matched hashes in each audio file and rank the files accordingly. However, we need not only the candidate audio files but also the regions within them that are similar to the query. Thus, we modified the searching method as shown in Fig. 3.5(b) and (c). We compute the time difference of each matched hash between the query and the reference audio and count the number of hashes with the same time difference (i.e., $\Delta t1$ and $\Delta t2$). These hashes are then sorted by their time difference from the query, so that each group in the sorted sequence consists of hashes sharing the same time difference.
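Continuing the earlier sketches, the modified searching step could group the matched hashes by time difference as follows; the data structures are illustrative rather than the exact implementation.
```python
from collections import defaultdict

def match_query(query_peaks, database):
    """Group matched hashes by (audio file, time difference to the query)."""
    offset_groups = defaultdict(list)
    for h, query_onset in generate_fingerprints(query_peaks):
        for audio_id, ref_onset in database.get(h, []):
            delta = ref_onset - query_onset      # time difference between reference and query
            offset_groups[(audio_id, delta)].append(ref_onset)
    # Rank the groups by the number of matched hashes they contain.
    return sorted(offset_groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```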

**Fig. 3.6**: Illustration of sorting a group of hashes by their onset times to define the boundaries of a region.
We then sort the hashes within each group by their onset times, yielding the series shown in Fig. 3.6. In this sorted time sequence, the first and the last elements are defined as the onset and offset times of the region. Finally, we obtain several regions in an audio file that are similar to the query.
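Continuing the sketch above, the boundary of a region can be read off from the sorted onset times of one matched group; the names here are again illustrative.
```python
def group_to_region(group):
    """Read the region boundaries off one matched group of hashes."""
    (audio_id, delta), ref_onsets = group
    onsets = sorted(ref_onsets)                 # sort the hashes by onset time (Fig. 3.6)
    return audio_id, onsets[0], onsets[-1]      # first = onset, last = offset of the region
```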
<!-- ## 3.3 Cross Correlation (CC)
The cross correlation function $c[k]$ with time lag $k$ is defined as follows,
$$c_{au}[k] = \Sigma_{i=0}^{max(M,N)-1}a[i]v[i+k]$$
where $a$ and $v$ are the zero-padded data of the audio query and the referenced audio, and $M$ and $N$ are the length of each sequence, respectively.
-->
## 3.3 User Interface (UI)

**Fig. 3.7**: The user flow of our QBEAT system.

**Fig. 3.8**: The web-based user interface of our QBEAT system. This is the result page for annotation adjustment after the audio query has been retrieved.
We developed a user-friendly, web-based interface. As the user flow in Fig. 3.7 shows, users upload an example audio snippet of the sound event they want to label and then press 'start' to search and retrieve from our audio database. After the retrieval is done, the result page shown in Fig. 3.8 appears. On this page, users can listen to the labeled segments of the candidate audio files and adjust the boundaries of each region. They can also modify the names of the tags. Finally, users can export and download these annotations as CSV files, which are named by default after the corresponding audio files. The CSV files follow the format shown in Table 3.2.
|index|onset (s)|offset (s)|label|
|---|---|---|---|
|1|19.27|29.73|TPE_Metro_BLine|
|2|77.76|88.22|TPE_Metro_BLine|
|3|138.25|148.71|TPE_Metro_BLine|
**Table 3.2**: Example format for annotations in an exported CSV file.
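As a rough sketch of the export step, a file in this format could be written with Python's standard `csv` module; the function name and the example file name are hypothetical.
```python
import csv

def export_annotations(csv_path, regions, label):
    """Write annotations in the Table 3.2 format: index, onset, offset, label."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "onset", "offset", "label"])
        for i, (onset, offset) in enumerate(regions, start=1):
            writer.writerow([i, f"{onset:.2f}", f"{offset:.2f}", label])

# Example using the first two rows of Table 3.2 (the file name is hypothetical).
export_annotations("TPE_Metro_BLine.csv",
                   [(19.27, 29.73), (77.76, 88.22)],
                   "TPE_Metro_BLine")
```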