# 03/26 Meeting

## XiaoiceSing

1. A residual connection is added for F0 to mitigate the off-key (out-of-tune) problem.
2. In addition to the per-phoneme duration loss, the durations of all phonemes within a note are accumulated into a syllable-level duration loss, which is computed to improve rhythm.
3. The WORLD vocoder is used: the model predicts vocoder features, namely Mel-Generalized Cepstral (MGC) and Band Aperiodicity (BAP), instead of a mel-spectrogram.
4. All modules are trained jointly to preserve continuity and consistency in the synthesized singing (earlier work that modeled F0 and duration trained the modules separately, so continuity and consistency of the synthesized voice were not considered).

::: info
* MGC (Mel-Generalized Cepstral) features: a generalized form of the mel-cepstral coefficients of a speech signal. They represent vocal-tract information and are commonly used in speech synthesis **to capture spectral characteristics and resonance properties**.
* BAP (Band Aperiodicity) features: band-wise aperiodicity features of a speech signal. They are commonly used in speech synthesis to **describe the noise component of speech**, helping to better synthesize the high-frequency content of the signal.
:::

---

### Architecture

![image](https://hackmd.io/_uploads/BkKRTl1JR.png)

:::success
1. A **musical score encoder** converts the phoneme name, note duration and pitch sequences into a dense vector sequence.
    * They are embedded **separately** into dense vectors of the same dimension and then **added together with position encoding**. The resulting vector is passed to the encoder, which contains multiple FFT (Feed-Forward Transformer) blocks.
    * Each FFT block consists of a **self-attention network** and a **two-layer 1-D convolution network** with **ReLU** activation.
2. A **duration predictor** derives phoneme durations from the encoder vector sequence, and a **length regulator** expands the encoder vector sequence according to the predicted phoneme durations. The **predictor** and **regulator** are based on the [FastSpeech model](https://blog.csdn.net/weixin_42721167/article/details/118226439):
![image](https://hackmd.io/_uploads/H12YsmykC.png)
3. A **decoder** generates acoustic features from the expanded encoded vectors.
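The length-regulator expansion can be sketched as below. This is a minimal illustration, not the authors' implementation; the round-half-up rule for scaled durations is an assumption:

```python
def length_regulator(h, durations, alpha=1.0):
    """Expand encoder states `h` by repeating each state d_i times.

    `alpha` scales the durations to control voice speed
    (alpha < 1 -> shorter output -> faster speech).
    Rounding half up is an assumption of this sketch.
    """
    scaled = [int(d * alpha + 0.5) for d in durations]  # round half up
    return [h[i] for i, d in enumerate(scaled) for _ in range(d)]
```

For example, `length_regulator(['h1', 'h2', 'h3', 'h4'], [2, 2, 3, 1])` returns `['h1', 'h1', 'h2', 'h2', 'h3', 'h3', 'h3', 'h4']`, and passing `alpha=0.5` halves the durations before rounding, giving `['h1', 'h2', 'h3', 'h3', 'h4']`.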
:::

![image](https://hackmd.io/_uploads/SJCHq-yyC.png)

### Encoder

* Example at 120 BPM (120 beats per minute):
    * **Beats**: 1 -> a quarter note (one beat)
    * **Beats in seconds**: 0.5 -> 0.5 s (60 s ÷ 120 beats = 0.5 s per beat)
    * **Beats in 15 ms frame units**: 33 -> 33 frames (0.5 s × 1000 ms/s ÷ 15 ms/frame ≈ 33.33)
* **Position Encoding**
    * To **let the model exploit the order of the sequence**, the authors inject information about the relative or absolute position of tokens in the sequence, hence "positional encoding". The positional encoding is added to the token embedding and fed into the bottom of the encoder and decoder stacks.

---

### Duration Predictor

* At inference time it predicts phoneme durations from the input text.
* It consists of two 1-D convolution layers with ReLU activation, each followed by layer normalization and dropout.
* The module is stacked on top of the FFT blocks on the phoneme side and trained jointly with the FastSpeech model.
* Ground-truth phoneme durations are extracted from a teacher TTS model and used to train the duration predictor.

:::info
Training stage:
* An autoregressive Transformer TTS model (the teacher) is trained with the full phoneme sequence, pitch, phoneme durations and the corresponding spectrograms.
* Attention alignments are extracted from the teacher and used as phoneme-duration targets.
* The duration predictor is trained jointly with the teacher model.

Inference stage:
* Only text is available; there is no ground-truth spectrogram.
* The duration predictor predicts phoneme durations from the text.
* The predicted durations are then used together with the text to generate speech.
:::

* **Length Regulator**
    * The **length of the phoneme sequence** is usually ==smaller== than the **length of its mel-spectrogram sequence**: each phoneme corresponds to several mel-spectrogram frames. The length of the mel-spectrogram corresponding to a phoneme is called the phoneme duration.
    * Input sequence and phoneme duration:
        * The input is the hidden-state sequence of phonemes $\mathcal{H}_{pho}=[h_1,h_2,...,h_n]$, where $n$ is the sequence length.
        * The phoneme-duration sequence $\mathcal{D}=[d_1,d_2,...,d_n]$ gives the duration of each phoneme, where $\sum_{i=1}^n d_i = m$ and $m$ is the total length of the spectrogram sequence.
    * Operation:
    ![image](https://hackmd.io/_uploads/H126n411C.png)
        * The length regulator ${LR}$ takes the phoneme sequence ${H}_{pho}$, its duration sequence ${D}$, and a ==hyperparameter $\alpha$== that controls the length of the expanded sequence and hence the voice speed.
        * Its output is the expanded phoneme sequence $\mathcal{H}_{mel}$, whose length equals the spectrogram-sequence length $m$.
        * Given phonemes $H_{pho} = [h_1, h_2, h_3, h_4]$ with durations $D = [2, 2, 3, 1]$:
            * With $\alpha = 1$ (normal speed), the expanded sequence $H_{mel}$ is $[h_1, h_1, h_2, h_2, h_3, h_3, h_3, h_4]$.
            * With $\alpha = 0.5$ (faster), $D_{\alpha=0.5} = [1, 1, 1.5, 0.5] \approx [1, 1, 2, 1]$, giving $H_{mel} = [h_1, h_2, h_3, h_3, h_4]$.
* **Loss Function**
    * $L_{dur} = w_{pd} \times L_{pd} + w_{sd} \times L_{sd}$
    * Focusing only on the phoneme-level duration loss $L_{pd}$ is not enough to achieve a good rhythmic pattern.
    * They propose to add a control for
the **syllable-level duration** loss $L_{sd}$; one syllable may correspond to one or more notes.

:::info
Difference between phoneme and syllable duration ("**Hello /hɛloʊ/**" as an example):
* **Phoneme**: /h/ for 0.1 s, /ɛ/ for 0.2 s, /l/ for 0.3 s, /oʊ/ for 0.4 s
* **Syllable**: /hɛl/ for 0.6 s, /oʊ/ for 0.4 s (the sum of the predicted durations of all phonemes in the syllable)
:::

![image](https://hackmd.io/_uploads/SyaemGxJR.png)

---

### Decoder

* The decoder **predicts MGC and BAP features** instead of a mel-spectrogram.
* The **WORLD vocoder** is used to generate the waveform.
* Loss function for spectral parameters:
    * $L_{spec} = w_{m} \times L_{m} + w_{b} \times L_{b}$
    * $L_{m}$ and $L_{b}$ are the losses for MGC and BAP.
* Singing has a more complicated and sensitive F0 contour (e.g. vibrato and overshoot):
    * even a small deviation from the standard pitch impairs the listening experience
    * the training data can hardly cover the full pitch range with sufficient cases (F0 prediction **may fail if the input note pitch is unseen or rare in the training data**)
    * solution:
        * a **residual connection between input and output pitch** (logF0, the log scale of F0, is used here)
        * In this way, the decoder only needs to predict the human bias from the standard note pitch (the model predicts a relatively small deviation instead of the exact F0 value).
* F0 prediction is accompanied by a binary V/UV (voiced/unvoiced) decision.
* Loss function for the decoder:
    * $L_{dec} = L_{spec} + w_{f} \times L_{f} + w_{u} \times L_{u}$
    * $L_{f}$ and $L_{u}$ are the losses for logF0 and the V/UV decision.

---

## XiaoiceSing2

![image](https://hackmd.io/_uploads/Hy6OAxyyA.png)

1. The mel-spectrogram generated by XiaoiceSing is over-smoothed in the middle- and high-frequency areas.
2. XiaoiceSing2 reconstructs the full-band mel-spectrogram better by **adding ConvFFT blocks and a multi-band discriminator**.

## ==CrossSinger==

![image](https://hackmd.io/_uploads/SJNKKxx1R.png)

1. Language Embedding
2. Singer Embedding
3.
GRL (Gradient Reversal Layer)

---

## ==BiSinger==

![image](https://hackmd.io/_uploads/S1GntelJA.png)

:dart: [Code](https://github.com/BiSinger-SVS/BiSinger)

---

## [DeepSinger](https://zhuanlan.zhih u.com/p/341982191)

![image](https://hackmd.io/_uploads/r1AY1Hky0.png)

(to be updated)

:::warning
1. The model is likewise based on **FastSpeech**.
2. A web crawler collects singing data online; the vocals are then separated from the accompaniment, duration information is extracted, and the data are filtered for training.
:::

---

## Taiwanese (Tâi-gí) references

* 臺灣言語工具
    * [Documentation](https://i3thuan5.github.io/tai5-uan5_gian5-gi2_kang1-ku7/index.html) \ [Github](https://github.com/i3thuan5/tai5-uan5_gian5-gi2_kang1-ku7)
    * [Taiwanese-romanization conversion table](https://github.com/kfcd/pheng-im/blob/master/pheng-im-pio)
    * [Demo](https://colab.research.google.com/drive/1SdJF0mk1hflgmfrY4xm0mPA--yULxfg-#scrollTo=ucXtep-kIx9C)
* Taiwanese speech data
    * [台灣媠聲 2.0](https://suisiann-dataset.ithuan.tw/)
* Taiwanese lyrics
    * [台語歌真正正字歌詞](https://hackmd.io/@Et47FKHKRS2m83n-aEjwAA/r1A-z8obE)
    * [台語歌詞共同編修平台](https://kuasu.tgb.org.tw/)

### Taiwanese dataset creation (following **DeepSinger**)

* Convert lyrics to IPA
* Align songs with lyrics
* Start by collecting songs from the same singer (?
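The DeepSinger-style preparation above ends with filtering the aligned data before training. A minimal sketch of such a filtering step is given below; the segment fields (`start`, `end`, `score`), the threshold values, and the demo data are all illustrative assumptions, not values from the paper:

```python
def filter_segments(segments, min_score=0.8, min_dur=0.5, max_dur=10.0):
    """Keep segments whose alignment confidence and duration are acceptable.

    Each segment is a dict with `start`/`end` times in seconds and an
    alignment-confidence `score` in [0, 1]; all thresholds are illustrative.
    """
    kept = []
    for seg in segments:
        dur = seg["end"] - seg["start"]
        if seg["score"] >= min_score and min_dur <= dur <= max_dur:
            kept.append(seg)
    return kept

# Hypothetical segments from a lyrics-to-audio alignment step:
demo = [
    {"start": 0.0, "end": 2.4, "score": 0.93},  # kept
    {"start": 2.4, "end": 2.6, "score": 0.95},  # dropped: too short
    {"start": 2.6, "end": 7.0, "score": 0.41},  # dropped: poor alignment
]
```

Here `filter_segments(demo)` keeps only the first segment; the second is shorter than `min_dur` and the third falls below `min_score`.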