# 03/26 Meeting
## XiaoiceSing
1. A residual connection is added for F0 prediction to mitigate the off-key (out-of-tune) problem.
2. Besides the per-phoneme duration loss, the durations of all phonemes within a note are accumulated into a syllable duration, and a syllable-level duration loss is computed to strengthen the rhythm.
3. The WORLD vocoder is used: the model predicts the vocoder's features, Mel-Generalized Cepstrum (MGC) and Band Aperiodicity (BAP), instead of mel-spectrograms.
4. All modules are trained jointly to maintain better continuity and consistency.
(In previous work that modeled F0 and duration, the modules were trained separately, so the continuity and consistency of the synthesized singing voice were not considered.)
::: info
* MGC (Mel-Generalized Cepstral) features: a generalized form of the mel-cepstral coefficients of the speech signal. They represent vocal-tract information and are commonly used in speech synthesis **to capture the spectral characteristics and resonance properties of speech**.
* BAP (Band Aperiodicity) features: band-wise aperiodicity features of the speech signal. They are commonly used in speech synthesis **to describe the noise component of speech**, helping to better synthesize the high-frequency content of the signal.
:::
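A minimal sketch of extracting these vocoder features with the `pyworld` package (an assumption on tooling; the coded spectral envelope below only approximates MGC, and the feature dimensions are illustrative rather than the paper's exact settings):

```python
import numpy as np
import pyworld as pw
import soundfile as sf

# Load a mono waveform; pyworld expects float64.
x, fs = sf.read("singing.wav")
x = x.astype(np.float64)

# WORLD analysis: F0 contour, spectral envelope, aperiodicity.
f0, t = pw.harvest(x, fs)
sp = pw.cheaptrick(x, f0, t, fs)   # spectral envelope
ap = pw.d4c(x, f0, t, fs)          # aperiodicity

# Compress to low-dimensional vocoder features.
mgc = pw.code_spectral_envelope(sp, fs, 60)  # MGC-like coded spectrum (60 dims assumed)
bap = pw.code_aperiodicity(ap, fs)           # band aperiodicity

print(mgc.shape, bap.shape, f0.shape)
```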
---
### Architecture

:::success
1. A **musical score encoder** to convert phoneme name, note duration and pitch sequence into a dense vector sequence.
    * They are embedded **separately** into dense vectors of the same dimension and then **added together with the positional encoding**. The resulting vectors are passed to the encoder, which contains multiple FFT (Feed-Forward Transformer) blocks.
    * Each FFT block consists of a **self-attention network** and a **two-layer 1-D convolution network** with **ReLU** activation (a rough sketch follows after this box).
2. A **duration predictor** to get phoneme durations from the encoder vector sequence, and a **length regulator** to expand the encoder vector sequence according to the predicted phoneme durations. The **predictor** and **regulator** are based on the [FastSpeech model](https://blog.csdn.net/weixin_42721167/article/details/118226439).

3. A **decoder** to generate acoustic features from the expanded encoded vectors.
:::
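A rough PyTorch sketch of one FFT block as described above; `d_model`, the head count, hidden size, and kernel size are illustrative assumptions, not the paper's values:

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: self-attention + 2-layer 1-D convolution with ReLU."""
    def __init__(self, d_model=384, n_heads=2, d_conv=1536, kernel_size=3, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_conv, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_conv, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + self.dropout(conv_out))
```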

### Encoder
* Example at 120 BPM (120 beats per minute):
    * **Beats**: 1 -> a quarter note (one beat)
    * **Beats in seconds**: 0.5 -> 0.5 s (60 s ÷ 120 = 0.5 s per beat)
    * **Beats in 15 ms frame units**: 33 -> 33 frames (0.5 s × 1000 ms/s ÷ 15 ms/frame ≈ 33.33; see the sketch below)
* **Position Encoding**
    * To **let the model make use of the order of the sequence**, some information about the relative or absolute position of the tokens must be injected; this is the purpose of the positional encoding. The positional encoding is added to the token embeddings and fed into the bottom of the encoder and decoder stacks.
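A small sketch of the note-length-to-frame conversion above (the 15 ms frame size is from the paper; the rounding strategy is an assumption):

```python
def note_beats_to_frames(beats: float, bpm: float = 120.0, frame_ms: float = 15.0) -> int:
    """Convert a note length in beats into 15 ms frame units."""
    seconds = beats * 60.0 / bpm             # 1 beat at 120 BPM -> 0.5 s
    frames = seconds * 1000.0 / frame_ms     # 0.5 s / 15 ms -> 33.33 frames
    return round(frames)                     # rounding choice is an assumption

print(note_beats_to_frames(1))  # -> 33
```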
---
### Duration Predictor
* Used at inference time to predict phoneme durations from the input text
* Consists of a 2-layer 1-D convolutional network with ReLU activation; each layer is followed by layer normalization and dropout
* The module is stacked on top of the FFT blocks on the phoneme side and trained jointly with the FastSpeech model
* Ground-truth phoneme durations are extracted from a teacher TTS model and used to train the duration predictor
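A rough PyTorch sketch of such a duration predictor (channel sizes and kernel size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Predicts a per-phoneme (log-)duration from encoder hidden states."""
    def __init__(self, d_model=384, d_hidden=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.proj = nn.Linear(d_hidden, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (batch, time, d_model)
        x = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm1(x))
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm2(x))
        return self.proj(x).squeeze(-1)                      # (batch, time) durations
```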
:::info
Training stage:
* Train an autoregressive Transformer TTS model (the teacher) with the full phoneme sequence, pitch, phoneme durations, and the corresponding spectrograms
* Extract attention alignments from the teacher and use them as the target phoneme durations
* The duration predictor is trained jointly with the FastSpeech model on these extracted durations

Inference stage:
* Only the text input is available; there is no ground-truth spectrogram
* The duration predictor predicts phoneme durations from the text
* The predicted durations are then used together with the text to generate speech
:::
* **Length Regulator**
    * The **length of the phoneme sequence** is usually ==shorter than== the **length of its mel-spectrogram sequence**: each phoneme corresponds to several mel-spectrogram frames. The number of mel-spectrogram frames corresponding to a phoneme is called the phoneme duration.
    * Input sequence and phoneme durations:
        * The input is the phoneme hidden-state sequence $\mathcal{H}_{pho}=[h_1,h_2,...,h_n]$, where $n$ is the sequence length.
        * The phoneme duration sequence $\mathcal{D}=[d_1,d_2,...,d_n]$ gives the duration of each phoneme, where $\sum_{i=1}^n d_i = m$ and $m$ is the total length of the mel-spectrogram sequence.
    * Operation:
        * The Length Regulator ${LR}$ takes the phoneme sequence $\mathcal{H}_{pho}$, its duration sequence $\mathcal{D}$, and a ==hyperparameter $\alpha$== that controls the length of the expanded sequence and hence the speech speed.
        * The output is the expanded sequence $\mathcal{H}_{mel}$, whose length equals the mel-spectrogram length $m$.
        * Given phonemes $\mathcal{H}_{pho} = [h_1, h_2, h_3, h_4]$ with durations $\mathcal{D} = [2, 2, 3, 1]$ (a code sketch follows this section):
            * With $\alpha = 1$ (normal speed), the expanded sequence is $\mathcal{H}_{mel}=[h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_4]$
            * With $\alpha = 0.5$ (faster speed), $\mathcal{D}_{\alpha=0.5} = [1, 1, 1.5, 0.5] \approx [1, 1, 2, 1]$, giving $\mathcal{H}_{mel} = [h_1, h_2, h_3, h_3, h_4]$
* **Loss Function**
* $L_{dur} = w_{pd} \times L_{pd} + w_{sd} \times L_{sd}$
    * Focusing only on the phoneme-level duration loss $L_{pd}$ is not enough to achieve a good rhythmic pattern
    * They therefore add a **syllable-level duration** loss $L_{sd}$ for rhythm control; one syllable may correspond to one or more notes
:::info
Difference between phoneme and syllable duration ("**Hello /hɛloʊ/**" as an example):
* **Phoneme**: /h/ for 0.1 s, /ɛ/ for 0.2 s, /l/ for 0.3 s, /oʊ/ for 0.4 s
* **Syllable**: /hɛl/ for 0.6 s, /oʊ/ for 0.4 s (the sum of the predicted durations of all phonemes within the syllable)
:::
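A minimal sketch of the length regulator with the speed factor $\alpha$ (framework-independent; rounding half up is an assumption chosen to match the $[1, 1, 2, 1]$ example above):

```python
def length_regulator(h_pho, durations, alpha=1.0):
    """Expand each phoneme hidden state h_i into round(alpha * d_i) copies."""
    expanded = []
    for h, d in zip(h_pho, durations):
        n = int(d * alpha + 0.5)   # round half up, giving D = [1, 1, 2, 1] for alpha = 0.5
        expanded.extend([h] * n)
    return expanded

h_pho = ["h1", "h2", "h3", "h4"]
durations = [2, 2, 3, 1]
print(length_regulator(h_pho, durations, alpha=1.0))  # ['h1','h1','h2','h2','h3','h3','h3','h4']
print(length_regulator(h_pho, durations, alpha=0.5))  # ['h1','h2','h3','h3','h4']
```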

---
### Decoder
* The decoder **predicts MGC and BAP features** instead of a mel-spectrogram
* The **WORLD vocoder** is used to generate the waveform
* Loss function for the spectral parameters
    * $L_{spec} = w_{m} \times L_{m} + w_{b} \times L_{b}$
    * $L_{m}$ and $L_{b}$ are the losses for MGC and BAP
* Singing has a more complicated and sensitive F0 contour than speech (e.g. vibrato and overshoot)
    * Even a small deviation from the standard pitch impairs the listening experience
    * The training data can hardly cover the full pitch range with enough cases (F0 prediction **may have issues if the input note pitch is unseen or rare in the training data**)
    * Solution:
        * A **residual connection between the input and output pitch** (here logF0, the log scale of F0, is used)
        * In this way, the decoder only needs to predict the human bias from the standard note pitch (i.e. a relatively small deviation rather than the exact F0 value)
* F0 prediction is accompanied by a binary V/UV (voiced/unvoiced) decision
* Loss function for the decoder
    * $L_{dec} = L_{spec} + w_{f} \times L_{f} + w_{u} \times L_{u}$
    * $L_{f}$ and $L_{u}$ are the losses for logF0 and the V/UV decision
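A hedged sketch of the decoder loss with the residual logF0 connection (the specific loss types, L1 and binary cross-entropy, and the weight values are assumptions, not confirmed by the paper):

```python
import torch
import torch.nn.functional as F

def decoder_loss(pred_mgc, pred_bap, pred_logf0_residual, pred_vuv_logit,
                 tgt_mgc, tgt_bap, tgt_logf0, tgt_vuv, note_logf0,
                 w_m=1.0, w_b=1.0, w_f=1.0, w_u=1.0):
    """L_dec = L_spec + w_f * L_f + w_u * L_u, with a residual connection on logF0.

    The decoder only predicts the deviation from the note pitch; the final logF0
    is note_logf0 + predicted residual. tgt_vuv is a float tensor of 0/1 labels.
    """
    l_spec = w_m * F.l1_loss(pred_mgc, tgt_mgc) + w_b * F.l1_loss(pred_bap, tgt_bap)
    pred_logf0 = note_logf0 + pred_logf0_residual          # residual connection
    l_f = F.l1_loss(pred_logf0, tgt_logf0)
    l_u = F.binary_cross_entropy_with_logits(pred_vuv_logit, tgt_vuv)
    return l_spec + w_f * l_f + w_u * l_u
```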
---
## XiaoiceSing2

1. The mel-spectrogram generated by XiaoiceSing is over-smoothed in the middle- and high-frequency areas
2. XiaoiceSing2 reconstructs the full-band mel-spectrogram better by **adding ConvFFT blocks and a multi-band discriminator**
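A loose sketch of the multi-band idea (split the mel-spectrogram channels into frequency bands and score each band with a small CNN discriminator); the band split, channel sizes, and layer counts here are assumptions, not XiaoiceSing2's exact design:

```python
import torch
import torch.nn as nn

class BandDiscriminator(nn.Module):
    """Small CNN discriminator for one frequency band of the mel-spectrogram."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel_band):                 # mel_band: (batch, band_bins, time)
        return self.net(mel_band.unsqueeze(1))   # real/fake score map

class MultiBandDiscriminator(nn.Module):
    """Split mel channels into bands (low/mid/high assumed) and score each separately."""
    def __init__(self, bands=((0, 40), (40, 80), (80, 120))):
        super().__init__()
        self.bands = bands
        self.discs = nn.ModuleList(BandDiscriminator() for _ in bands)

    def forward(self, mel):                      # mel: (batch, n_mels, time)
        return [disc(mel[:, lo:hi, :]) for (lo, hi), disc in zip(self.bands, self.discs)]
```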
## ==CrossSinger==

1. Language Embedding
2. Singer Embedding
3. GRL (Gradient Reversal Layer)
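A generic PyTorch sketch of a gradient reversal layer, assuming GRL here refers to the standard gradient-reversal trick used to learn attribute-invariant features:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Typical use: pass encoder outputs through grad_reverse before a singer/language
# classifier, so the encoder is pushed to learn features invariant to that attribute.
```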
---
## ==BiSinger==

:dart:[Code](https://github.com/BiSinger-SVS/BiSinger)
---
## [DeepSinger](https://zhuanlan.zhihu.com/p/341982191)

(to be updated)
:::warning
1. The model is likewise based on **FastSpeech**
2. A web crawler collects singing data from the Internet; the vocals are then separated from the accompaniment, duration information is obtained, and the data is filtered before training
:::
---
## Taiwanese Hokkien references
* 臺灣言語工具 (Taiwanese language toolkit)
    * [Documentation](https://i3thuan5.github.io/tai5-uan5_gian5-gi2_kang1-ku7/index.html) / [GitHub](https://github.com/i3thuan5/tai5-uan5_gian5-gi2_kang1-ku7)
    * [Southern Min romanization conversion table](https://github.com/kfcd/pheng-im/blob/master/pheng-im-pio)
    * [Demo](https://colab.research.google.com/drive/1SdJF0mk1hflgmfrY4xm0mPA--yULxfg-#scrollTo=ucXtep-kIx9C)
* Taiwanese speech data
    * [台灣媠聲 2.0](https://suisiann-dataset.ithuan.tw/)
* Taiwanese lyrics
    * [台語歌真正正字歌詞 (correctly transcribed Taiwanese song lyrics)](https://hackmd.io/@Et47FKHKRS2m83n-aEjwAA/r1A-z8obE)
    * [台語歌詞共同編修平台 (collaborative Taiwanese lyrics editing platform)](https://kuasu.tgb.org.tw/)
### Building a Taiwanese dataset (following **DeepSinger**)
* Convert lyrics to IPA
* Align songs with lyrics
* Start by collecting songs from the same singer (?)