# 03/26 Meeting
## XiaoiceSing
1. A residual connection is added for F0 prediction to mitigate the off-key (out-of-tune) problem.
2. Besides the per-phoneme duration loss, the durations of all phonemes within a note are accumulated into a syllable duration, and a syllable-level duration loss is computed to strengthen the rhythm.
3. The WORLD vocoder is used: the model predicts the vocoder's features, Mel-Generalized Cepstrum (MGC) and Band Aperiodicity (BAP), instead of mel-spectrograms.
4. All modules are trained jointly to maintain better continuity and consistency.
(In previous work that modeled F0 and duration, the modules were trained separately, so the continuity and consistency of the synthesized singing voice were not considered.)
::: info
* MGC (Mel-Generalized Cepstral) features: a generalized form of the mel-cepstral coefficients of the speech signal. They represent vocal-tract information and are commonly used in speech synthesis **to capture the spectral characteristics and resonance properties of speech**.
* BAP (Band Aperiodicity) features: band-wise aperiodicity features of the speech signal. They are commonly used in speech synthesis **to describe the noise component of speech**, helping to better synthesize the high-frequency content of the signal.
:::
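A minimal sketch of extracting these vocoder features with the `pyworld` package (an assumption on tooling; the coded spectral envelope below only approximates MGC, and the feature dimensions are illustrative rather than the paper's exact settings):

```python
import numpy as np
import pyworld as pw
import soundfile as sf

# Load a mono waveform; pyworld expects float64.
x, fs = sf.read("singing.wav")
x = x.astype(np.float64)

# WORLD analysis: F0 contour, spectral envelope, aperiodicity.
f0, t = pw.harvest(x, fs)
sp = pw.cheaptrick(x, f0, t, fs)   # spectral envelope
ap = pw.d4c(x, f0, t, fs)          # aperiodicity

# Compress to low-dimensional vocoder features.
mgc = pw.code_spectral_envelope(sp, fs, 60)  # MGC-like coded spectrum (60 dims assumed)
bap = pw.code_aperiodicity(ap, fs)           # band aperiodicity

print(mgc.shape, bap.shape, f0.shape)
```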
---
### Architecture

:::success
1. A **musical score encoder** to convert phoneme name, note duration and pitch sequence into a dense vector sequence.
    * They are embedded **separately** into dense vectors of the same dimension and then **added together with the positional encoding**. The resulting vectors are passed to the encoder, which contains multiple FFT (Feed-Forward Transformer) blocks.
    * Each FFT block consists of a **self-attention network** and a **two-layer 1-D convolution network** with **ReLU** activation (a rough sketch follows after this box).
2. A **duration predictor** to get phoneme durations from the encoder vector sequence, and a **length regulator** to expand the encoder vector sequence according to the predicted phoneme durations. The **predictor** and **regulator** are based on the [FastSpeech model](https://blog.csdn.net/weixin_42721167/article/details/118226439).

3. A **decoder** to generate acoustic features from the expanded encoded vectors.
:::
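A rough PyTorch sketch of one FFT block as described above; `d_model`, the head count, hidden size, and kernel size are illustrative assumptions, not the paper's values:

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: self-attention + 2-layer 1-D convolution with ReLU."""
    def __init__(self, d_model=384, n_heads=2, d_conv=1536, kernel_size=3, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_conv, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_conv, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + self.dropout(conv_out))
```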

### Encoder
* Example at 120 BPM (120 beats per minute):
    * **Beats**: 1 -> a quarter note (one beat)
    * **Beats in seconds**: 0.5 -> 0.5 s (60 s ÷ 120 = 0.5 s per beat)
    * **Beats in 15 ms frame units**: 33 -> 33 frames (0.5 s × 1000 ms/s ÷ 15 ms/frame ≈ 33.33; see the sketch below)
* **Position Encoding**
    * To **let the model make use of the order of the sequence**, some information about the relative or absolute position of the tokens must be injected; this is the purpose of the positional encoding. The positional encoding is added to the token embeddings and fed into the bottom of the encoder and decoder stacks.
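A small sketch of the note-length-to-frame conversion above (the 15 ms frame size is from the paper; the rounding strategy is an assumption):

```python
def note_beats_to_frames(beats: float, bpm: float = 120.0, frame_ms: float = 15.0) -> int:
    """Convert a note length in beats into 15 ms frame units."""
    seconds = beats * 60.0 / bpm             # 1 beat at 120 BPM -> 0.5 s
    frames = seconds * 1000.0 / frame_ms     # 0.5 s / 15 ms -> 33.33 frames
    return round(frames)                     # rounding choice is an assumption

print(note_beats_to_frames(1))  # -> 33
```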
---
### Duration Predictor
* Used at inference time to predict phoneme durations from the input text
* Consists of a 2-layer 1-D convolutional network with ReLU activation; each layer is followed by layer normalization and dropout
* The module is stacked on top of the FFT blocks on the phoneme side and trained jointly with the FastSpeech model
* Ground-truth phoneme durations are extracted from a teacher TTS model and used to train the duration predictor
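A rough PyTorch sketch of such a duration predictor (channel sizes and kernel size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Predicts a per-phoneme (log-)duration from encoder hidden states."""
    def __init__(self, d_model=384, d_hidden=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.proj = nn.Linear(d_hidden, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (batch, time, d_model)
        x = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm1(x))
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm2(x))
        return self.proj(x).squeeze(-1)                      # (batch, time) durations
```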
:::info
Training stage:
* Train an autoregressive Transformer TTS model (the teacher) with the full phoneme sequence, pitch, phoneme durations, and the corresponding spectrograms
* Extract attention alignments from the teacher and use them as the target phoneme durations
* The duration predictor is trained jointly with the FastSpeech model on these extracted durations

Inference stage:
* Only the text input is available; there is no ground-truth spectrogram
* The duration predictor predicts phoneme durations from the text
* The predicted durations are then used together with the text to generate speech
:::
* **Length Regulator**
    * The **length of the phoneme sequence** is usually ==shorter than== the **length of its mel-spectrogram sequence**: each phoneme corresponds to several mel-spectrogram frames. The number of mel-spectrogram frames corresponding to a phoneme is called the phoneme duration.
    * Input sequence and phoneme durations:
        * The input is the phoneme hidden-state sequence $\mathcal{H}_{pho}=[h_1,h_2,...,h_n]$, where $n$ is the sequence length.
        * The phoneme duration sequence $\mathcal{D}=[d_1,d_2,...,d_n]$ gives the duration of each phoneme, where $\sum_{i=1}^n d_i = m$ and $m$ is the total length of the mel-spectrogram sequence.
    * Operation:
        * The Length Regulator ${LR}$ takes the phoneme sequence $\mathcal{H}_{pho}$, its duration sequence $\mathcal{D}$, and a ==hyperparameter $\alpha$== that controls the length of the expanded sequence and hence the speech speed.
        * The output is the expanded sequence $\mathcal{H}_{mel}$, whose length equals the mel-spectrogram length $m$.
        * Given phonemes $\mathcal{H}_{pho} = [h_1, h_2, h_3, h_4]$ with durations $\mathcal{D} = [2, 2, 3, 1]$ (a code sketch follows this section):
            * With $\alpha = 1$ (normal speed), the expanded sequence is $\mathcal{H}_{mel}=[h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_4]$
            * With $\alpha = 0.5$ (faster speed), $\mathcal{D}_{\alpha=0.5} = [1, 1, 1.5, 0.5] \approx [1, 1, 2, 1]$, giving $\mathcal{H}_{mel} = [h_1, h_2, h_3, h_3, h_4]$
* **Loss Function**
* $L_{dur} = w_{pd} \times L_{pd} + w_{sd} \times L_{sd}$
    * Focusing only on the phoneme-level duration loss $L_{pd}$ is not enough to achieve a good rhythmic pattern
    * They therefore add a **syllable-level duration** loss $L_{sd}$ for rhythm control; one syllable may correspond to one or more notes
:::info
Difference between phoneme and syllable duration ("**Hello /hɛloʊ/**" as an example):
* **Phoneme**: /h/ for 0.1 s, /ɛ/ for 0.2 s, /l/ for 0.3 s, /oʊ/ for 0.4 s
* **Syllable**: /hɛl/ for 0.6 s, /oʊ/ for 0.4 s (the sum of the predicted durations of all phonemes within the syllable)
:::
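A minimal sketch of the length regulator with the speed factor $\alpha$ (framework-independent; rounding half up is an assumption chosen to match the $[1, 1, 2, 1]$ example above):

```python
def length_regulator(h_pho, durations, alpha=1.0):
    """Expand each phoneme hidden state h_i into round(alpha * d_i) copies."""
    expanded = []
    for h, d in zip(h_pho, durations):
        n = int(d * alpha + 0.5)   # round half up, giving D = [1, 1, 2, 1] for alpha = 0.5
        expanded.extend([h] * n)
    return expanded

h_pho = ["h1", "h2", "h3", "h4"]
durations = [2, 2, 3, 1]
print(length_regulator(h_pho, durations, alpha=1.0))  # ['h1','h1','h2','h2','h3','h3','h3','h4']
print(length_regulator(h_pho, durations, alpha=0.5))  # ['h1','h2','h3','h3','h4']
```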

---
### Decoder
* The decoder **predicts MGC and BAP features** instead of a mel-spectrogram
* The **WORLD vocoder** is used to generate the waveform
* Loss function for the spectral parameters
    * $L_{spec} = w_{m} \times L_{m} + w_{b} \times L_{b}$
    * $L_{m}$ and $L_{b}$ are the losses for MGC and BAP
* Singing has a more complicated and sensitive F0 contour than speech (e.g. vibrato and overshoot)
    * Even a small deviation from the standard pitch impairs the listening experience
    * The training data can hardly cover the full pitch range with enough cases (F0 prediction **may have issues if the input note pitch is unseen or rare in the training data**)
    * Solution:
        * A **residual connection between the input and output pitch** (here logF0, the log scale of F0, is used)
        * In this way, the decoder only needs to predict the human bias from the standard note pitch (i.e. a relatively small deviation rather than the exact F0 value)
* F0 prediction is accompanied by a binary V/UV (voiced/unvoiced) decision
* Loss function for the decoder
    * $L_{dec} = L_{spec} + w_{f} \times L_{f} + w_{u} \times L_{u}$
    * $L_{f}$ and $L_{u}$ are the losses for logF0 and the V/UV decision
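A hedged sketch of the decoder loss with the residual logF0 connection (the specific loss types, L1 and binary cross-entropy, and the weight values are assumptions, not confirmed by the paper):

```python
import torch
import torch.nn.functional as F

def decoder_loss(pred_mgc, pred_bap, pred_logf0_residual, pred_vuv_logit,
                 tgt_mgc, tgt_bap, tgt_logf0, tgt_vuv, note_logf0,
                 w_m=1.0, w_b=1.0, w_f=1.0, w_u=1.0):
    """L_dec = L_spec + w_f * L_f + w_u * L_u, with a residual connection on logF0.

    The decoder only predicts the deviation from the note pitch; the final logF0
    is note_logf0 + predicted residual. tgt_vuv is a float tensor of 0/1 labels.
    """
    l_spec = w_m * F.l1_loss(pred_mgc, tgt_mgc) + w_b * F.l1_loss(pred_bap, tgt_bap)
    pred_logf0 = note_logf0 + pred_logf0_residual          # residual connection
    l_f = F.l1_loss(pred_logf0, tgt_logf0)
    l_u = F.binary_cross_entropy_with_logits(pred_vuv_logit, tgt_vuv)
    return l_spec + w_f * l_f + w_u * l_u
```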
---
## XiaoiceSing2

1. The mel-spectrogram generated by XiaoiceSing is over-smoothed in the middle- and high-frequency areas
2. XiaoiceSing2 reconstructs the full-band mel-spectrogram better by **adding ConvFFT blocks and a multi-band discriminator**
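A loose sketch of the multi-band idea (split the mel-spectrogram channels into frequency bands and score each band with a small CNN discriminator); the band split, channel sizes, and layer counts here are assumptions, not XiaoiceSing2's exact design:

```python
import torch
import torch.nn as nn

class BandDiscriminator(nn.Module):
    """Small CNN discriminator for one frequency band of the mel-spectrogram."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel_band):                 # mel_band: (batch, band_bins, time)
        return self.net(mel_band.unsqueeze(1))   # real/fake score map

class MultiBandDiscriminator(nn.Module):
    """Split mel channels into bands (low/mid/high assumed) and score each separately."""
    def __init__(self, bands=((0, 40), (40, 80), (80, 120))):
        super().__init__()
        self.bands = bands
        self.discs = nn.ModuleList(BandDiscriminator() for _ in bands)

    def forward(self, mel):                      # mel: (batch, n_mels, time)
        return [disc(mel[:, lo:hi, :]) for (lo, hi), disc in zip(self.bands, self.discs)]
```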
## ==CrossSinger==

1. Language Embedding
2. Singer Embedding
3. GRL (Gradient Reversal Layer)
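A generic PyTorch sketch of a gradient reversal layer, assuming GRL here refers to the standard gradient-reversal trick used to learn attribute-invariant features:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Typical use: pass encoder outputs through grad_reverse before a singer/language
# classifier, so the encoder is pushed to learn features invariant to that attribute.
```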
---
## ==BiSinger==

:dart:[Code](https://github.com/BiSinger-SVS/BiSinger)
---
## [DeepSinger](https://zhuanlan.zhihu.com/p/341982191)

(to be updated)
:::warning
1. The model is likewise based on **FastSpeech**
2. A web crawler collects singing data from the Internet; the vocals are then separated from the accompaniment, duration information is obtained, and the data is filtered before training
:::
---
## Taiwanese Hokkien references
* 臺灣言語工具 (Taiwanese language toolkit)
    * [Documentation](https://i3thuan5.github.io/tai5-uan5_gian5-gi2_kang1-ku7/index.html) / [GitHub](https://github.com/i3thuan5/tai5-uan5_gian5-gi2_kang1-ku7)
    * [Southern Min romanization conversion table](https://github.com/kfcd/pheng-im/blob/master/pheng-im-pio)
    * [Demo](https://colab.research.google.com/drive/1SdJF0mk1hflgmfrY4xm0mPA--yULxfg-#scrollTo=ucXtep-kIx9C)
* Taiwanese speech data
    * [台灣媠聲 2.0](https://suisiann-dataset.ithuan.tw/)
* Taiwanese lyrics
    * [台語歌真正正字歌詞 (correctly transcribed Taiwanese song lyrics)](https://hackmd.io/@Et47FKHKRS2m83n-aEjwAA/r1A-z8obE)
    * [台語歌詞共同編修平台 (collaborative Taiwanese lyrics editing platform)](https://kuasu.tgb.org.tw/)
### Building a Taiwanese dataset (following **DeepSinger**)
* Convert lyrics to IPA
* Align songs with lyrics
* Start by collecting songs from the same singer (?)