# X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning
> Interspeech 2024, 1-5 September 2024, Kos, Greece

:+1: [Demo](https://jisang93.github.io/x-singer/)

## Introduction
### Background
![截圖 2025-01-12 凌晨2.49.01](https://hackmd.io/_uploads/Hkcmp4gPkg.png)
* Most existing methods require phoneme-level annotation during training
* Multi-lingual SVS datasets in which a single vocalist sings songs in multiple languages are scarce
* Converting different grapheme- or phoneme-based lyrics into the International Phonetic Alphabet (IPA) produces imprecise results (*Fig. (b)*)

---

### Previous work
#### CrossSinger
![截圖 2025-01-12 凌晨2.43.44](https://hackmd.io/_uploads/SkGe3Eewyl.png =500x)
* Employed a global language embedding for cross-lingual SVS and a gradient reversal layer to eliminate singer bias in the lyrics

#### BiSinger
![截圖 2025-01-12 凌晨2.42.37](https://hackmd.io/_uploads/SJaoiElPJx.png =500x)
* Adopted language code-switching
* Explored the use of bi-lingual speech corpora to learn pronunciation

---

## X-Singer
### Overall architecture
![截圖 2025-01-12 凌晨2.23.49](https://hackmd.io/_uploads/rJyrwExDJx.png =500x)

### Musical score encoder
> Reduces the dependency on phoneme-level annotation

![截圖 2025-01-12 凌晨3.06.28](https://hackmd.io/_uploads/HJNBZBxDJx.png =500x)

The MS encoder comprises a lyrics encoder, a melody encoder, a phoneme-to-note encoder, and phoneme-to-note cross-attention.

#### Lyrics encoder
* Unified lyrics annotation (converted to IPA symbols)
![截圖 2025-01-12 清晨5.06.57](https://hackmd.io/_uploads/BkeKTUgDJx.png)
* The lyrics representation may be associated with the identity of the singer (singer bias)
* Uses **mix-layer normalization** (Mix-LN) to remove singer bias and improve generalization to unseen scenarios (e.g., Korean lyrics with a Chinese singer); a sketch follows at the end of this section
![截圖 2025-01-12 清晨5.04.33](https://hackmd.io/_uploads/HJbl6UgPkl.png)
![截圖 2025-01-12 清晨5.00.44](https://hackmd.io/_uploads/BknWnIlPJe.png)
* $e_{\text{spk}}$: embedding of the current speaker
* $\tilde{e}_{\text{spk}}$: embedding of another speaker, obtained via a shuffle operation (introduces randomness)
* $\lambda \sim \text{Beta}(\alpha, \alpha)$:
    * Mixing ratio sampled from a Beta distribution
    * Controls the relative weights of $e_{\text{spk}}$ and $\tilde{e}_{\text{spk}}$

#### Melody encoder
* Encodes pitch, duration, and tempo embeddings at the note level
* Combines them into a note-level melody representation

#### Phoneme-to-note encoder
* Applies average pooling to compress phoneme-level features to the note level

#### Phoneme-to-note cross-attention
* Encourages a close-to-diagonal alignment between phonemes and notes
* Guided attention loss (see the sketch below)
![截圖 2025-01-12 上午8.53.52](https://hackmd.io/_uploads/rk6jf5xDkx.png)
* $A_{nt}$: attention matrix between the phoneme and note sequences
* $W_{nt}$: ideal alignment matrix; encourages close-to-diagonal attention
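For concreteness, a minimal PyTorch sketch of the guided attention loss above, assuming the standard formulation $W_{nt} = 1 - \exp(-(n/N - t/T)^2 / (2g^2))$ from Tachibana et al. (2018); the function name and the width parameter `g = 0.2` are my assumptions, not values from the paper.

```python
import torch


def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """attn: (batch, N, T) attention matrix A_nt over note/phoneme positions."""
    _, N, T = attn.shape
    n = torch.arange(N, device=attn.device, dtype=attn.dtype) / N
    t = torch.arange(T, device=attn.device, dtype=attn.dtype) / T
    # W_nt = 1 - exp(-(n/N - t/T)^2 / (2 g^2)): ~0 near the diagonal, ~1 far
    # from it, so attention mass off the diagonal is penalized.
    w = 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2.0 * g * g))
    # Mean over all (n, t) positions and the batch: (1 / NT) * sum(A * W).
    return (attn * w.unsqueeze(0)).mean()
```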
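And a minimal sketch of the Mix-LN operation from the lyrics encoder above, assuming it is realized as a conditional layer norm whose scale and shift are predicted from the mixed speaker embedding; the module and argument names (`MixLayerNorm`, `spk_dim`, `alpha = 0.2`) are mine, not the paper's.

```python
import torch
import torch.nn as nn


class MixLayerNorm(nn.Module):
    """Conditional LayerNorm whose scale/shift come from a speaker embedding
    randomly mixed with another speaker's embedding during training."""

    def __init__(self, hidden_dim: int, spk_dim: int, alpha: float = 0.2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(spk_dim, hidden_dim)
        self.to_shift = nn.Linear(spk_dim, hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, e_spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); e_spk: (batch, spk_dim)
        if self.training:
            # \tilde{e}_spk: other speakers' embeddings via an in-batch shuffle.
            perm = torch.randperm(e_spk.size(0), device=e_spk.device)
            e_tilde = e_spk[perm]
            # \lambda ~ Beta(alpha, alpha): one mixing ratio per utterance.
            lam = self.beta.sample((e_spk.size(0), 1)).to(e_spk)
            e_mix = lam * e_spk + (1.0 - lam) * e_tilde
        else:
            e_mix = e_spk  # no mixing at inference
        scale = self.to_scale(e_mix).unsqueeze(1)  # (batch, 1, hidden_dim)
        shift = self.to_shift(e_mix).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + shift
```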
---

### CFM-based decoder
> Improves the quality of the synthesized audio

#### Prior encoder
* Encodes the musical score representation $h_{ms}$ into an aligned hidden representation $h_{align}$
* $h_{align}$ serves as the averaged acoustic feature that conditions the conditional flow matching (CFM)-based decoder
![截圖 2025-01-12 上午9.01.21](https://hackmd.io/_uploads/rkTDNqeD1e.png =500x)
* $h_{align}$ is aligned with the target Mel-spectrogram $x$ through the loss $L_p$

#### Decoder
* Models a conditional flow that generates Mel-spectrograms from Gaussian noise (training and sampling sketches appear at the end of these notes)
![截圖 2025-01-12 上午9.24.04](https://hackmd.io/_uploads/SkHat9xPkg.png =600x)
![截圖 2025-01-12 上午9.28.27](https://hackmd.io/_uploads/HyDTc5lwke.png =700x)

#### Total loss
![截圖 2025-01-12 上午9.00.20](https://hackmd.io/_uploads/HJJ44cxvJg.png =700x)
* $L_p = \text{MSE}(h_{align}, x)$: alignment loss; keeps the hidden representation consistent with the target Mel-spectrogram
* $L_{cfm}$: conditional flow matching loss; learns the generative process
* $L_{ga}$: improves the alignment between input and output
* $\lambda_{ga}$ is set to 10.0 in the experiments

---

## Experiment
> We trained X-Singer for 0.8M steps on ==four NVIDIA V100 GPUs== with a batch size of 16 per GPU

### Dataset
![截圖 2025-01-12 凌晨2.28.01](https://hackmd.io/_uploads/HyHrOVlPJg.png)
:::success
CH = 21.59 (from GTSinger, Opencpop)
EN = 13.13
JP = 6.45
Total = 41.17
:::

**Datasets offering phoneme-level annotation**
- [ ] Multi-Speaker Singing Dataset
- [x] M4Singer
- [x] Ofuton-P Database

| **Subset** | **Samples** | **Percentage (%)** | **KO (%)** | **ZH (%)** | **JP (%)** |
|--------------|-------------|--------------------|------------|------------|------------|
| Training | 92,253 | 89.04 | 85.18 | 14.27 | 0.55 |
| Validation | 7,338 | 7.08 | 85.85 | 14.01 | 0.16 |
| Testing | 3,973 | 3.83 | 74.51 | 24.36 | 1.08 |

## Evaluation
| **Metric type** | **Metric** | **Purpose** |
|----------------|------------------------------|-----------------------------------|
| Subjective | Mean Opinion Score ($MOS$) | Naturalness of the synthesized singing |
| Objective | Mel-Cepstral Distortion ($MCD$) | Spectral similarity |
| Objective | F0 Root-Mean-Square Error ($RMSE_{F0}$) | Pitch (F0) accuracy |
| Objective | Equal Error Rate ($EER$) | Timbre similarity |

![截圖 2025-01-12 清晨5.16.40](https://hackmd.io/_uploads/HyUp1vlDkx.png)

### Intra-Lingual SVS
* X-Singer outperforms baseline models (e.g., BigVGAN, FFT-Singer, DiffSinger) in $nMOS$
* Achieves the lowest $RMSE_{F0}$

### Cross-Lingual SVS
* Significant improvements in naturalness

## Ablation study
![截圖 2025-01-12 清晨5.17.32](https://hackmd.io/_uploads/H1Olewxw1l.png)
* Removing the CFM decoder:
    * Significant decline in cMOS (naturalness score)
* Replacing Mix-LN with FFT blocks:
    * Decline in cMOS for both intra- and cross-lingual SVS
    * Slight increase in EER, indicating reduced timbre similarity
* Removing the guided attention loss:
    * Decreases cMOS, confirming its importance for accurate alignment
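For concreteness, here is a sketch of a CFM training step for the decoder described above. Since the note shows the equations only as images, this follows the generic straight-line (optimal-transport) CFM formulation popularized by Matcha-TTS rather than X-Singer's exact parameterization; `vector_field` stands in for the decoder network, and `sigma_min` is a common default, not a value from the paper.

```python
import torch
import torch.nn.functional as F


def cfm_loss(vector_field, x1, h_align, sigma_min: float = 1e-4):
    """x1: target Mel-spectrogram (batch, frames, n_mels); h_align: condition."""
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # flow time t ~ U(0, 1)
    x0 = torch.randn_like(x1)                           # Gaussian noise sample
    # Straight-line probability path from noise (t = 0) to data (t = 1).
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u = x1 - (1.0 - sigma_min) * x0                     # target velocity field
    v = vector_field(xt, t.view(-1), h_align)           # predicted velocity
    return F.mse_loss(v, u)
```

Per the total-loss bullets above, this term enters training together with $L_p$ and $\lambda_{ga} L_{ga}$.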
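At inference, the decoder generates a Mel-spectrogram by integrating the learned vector field from Gaussian noise to data. A minimal Euler-solver sketch follows; the solver choice and step count `n_steps` are my assumptions, not values from the paper.

```python
import torch


@torch.no_grad()
def sample_mel(vector_field, h_align, shape, n_steps: int = 10, device="cpu"):
    """Integrate dx/dt = v(x, t | h_align) from t = 0 (noise) to t = 1 (Mel)."""
    x = torch.randn(shape, device=device)          # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * vector_field(x, t, h_align)   # Euler update
    return x                                       # predicted Mel-spectrogram
```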