# X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning
> Interspeech 2024
1-5 September 2024, Kos, Greece
:+1: [Demo](https://jisang93.github.io/x-singer/)
## Introduction
### Background

* Most existing methods require phoneme-level annotation during training
* Multi-lingual SVS datasets in which a single vocalist sings songs in multiple languages are scarce
* Converting different grapheme- or phoneme-based lyrics into the International Phonetic Alphabet (IPA) produces imprecise results *Fig. (b)*
---
### Previous work
#### CrossSinger

* Employed a global language embedding for cross-lingual SVS and a gradient reversal layer to eliminate singer bias in the lyrics
#### BiSinger

* Adopted language code-switching
* Explored the use of bilingual speech corpora to learn pronunciation
---
## X-Singer
### Overall architecture

### Musical score encoder
> Reduce the dependency on phoneme-level annotation
* The musical score (MS) encoder comprises a lyrics encoder, a melody encoder, a phoneme-to-note encoder, and phoneme-to-note cross-attention.
#### Lyrics encoder
* Unified lyrics annotation (converted to IPA symbols)

* The lyrics representation may be entangled with the identity of the singer (singer bias)
* Use **mix-layer normalization** (Mix-LN) to mitigate singer bias and improve generalization to unseen combinations (e.g., Korean lyrics with a Chinese singer)


* $e_{\text{spk}}$: embedding of the current speaker
* $\tilde{e}_{\text{spk}}$: embedding of another speaker, obtained by shuffling the batch (introduces randomness)
* $\lambda \sim \text{Beta}(\alpha, \alpha)$:
  * Mixing ratio sampled from a Beta distribution
  * Controls the relative weights of $e_{\text{spk}}$ and $\tilde{e}_{\text{spk}}$
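The mixing step above can be sketched as follows. This is a minimal NumPy sketch, assuming Mix-LN normalizes hidden states and then conditions the scale and shift on the mixed speaker embedding; the projection matrices `W_gamma` and `W_beta` are hypothetical stand-ins for the learned conditioning layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_ln(h, e_spk, alpha=0.2, eps=1e-5):
    """Mix-layer normalization sketch: layer-normalize hidden states,
    then scale/shift them with a speaker embedding mixed across the batch."""
    B, T, D = h.shape
    # Shuffle the batch to obtain other speakers' embeddings (e~_spk).
    e_tilde = e_spk[rng.permutation(B)]
    # lambda ~ Beta(alpha, alpha): one mixing ratio per batch item.
    lam = rng.beta(alpha, alpha, size=(B, 1))
    e_mix = lam * e_spk + (1.0 - lam) * e_tilde
    # Standard layer normalization over the feature dimension.
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    # Hypothetical learned projections mapping e_mix to scale and shift.
    E = e_spk.shape[-1]
    W_gamma, W_beta = rng.standard_normal((2, E, D)) * 0.01
    gamma = 1.0 + e_mix @ W_gamma   # scale, initialized near 1
    beta = e_mix @ W_beta           # shift, initialized near 0
    return h_norm * gamma[:, None, :] + beta[:, None, :]

h = rng.standard_normal((4, 10, 8))   # (batch, frames, dim)
e = rng.standard_normal((4, 16))      # speaker embeddings
out = mix_ln(h, e)
print(out.shape)  # (4, 10, 8)
```

Because the scale/shift depend on a randomly mixed speaker embedding, the lyrics representation cannot rely on any one speaker's statistics, which is the intended regularization.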
#### Melody encoder
* Encodes pitch, duration, and tempo embeddings at the note level.
* Combines them into a note-level melody representation
#### Phoneme-to-note encoder
* Applies average pooling to compress phoneme-level features to note-level
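The pooling step can be sketched as below, assuming each phoneme carries the index of the note it belongs to (the grouping scheme here is illustrative):

```python
import numpy as np

def pool_phonemes_to_notes(h_ph, note_ids):
    """Average phoneme-level features that share a note index
    into a single note-level feature vector."""
    note_ids = np.asarray(note_ids)
    return np.stack([h_ph[note_ids == n].mean(axis=0)
                     for n in np.unique(note_ids)])

h_ph = np.arange(12, dtype=float).reshape(6, 2)  # 6 phonemes, dim 2
note_ids = [0, 0, 1, 1, 1, 2]                    # 3 notes
h_note = pool_phonemes_to_notes(h_ph, note_ids)
print(h_note.shape)  # (3, 2)
```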
#### Phoneme-to-note cross-attention
* Learns the alignment between phonemes and notes
* Guided Attention Loss

* $A_{nt}$: Attention matrix between lyrics and Mel-frames
* $W_{nt}$: Ideal alignment matrix, encourages close-to-diagonal attention.
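A guided attention loss in this style can be sketched as follows; the Gaussian width `g` and the exact normalization are assumptions here, following the common Tacotron-style formulation.

```python
import numpy as np

def guided_attention_loss(A, g=0.2):
    """Penalize attention mass far from the diagonal.
    A: (N, T) attention matrix (rows: symbols, cols: frames)."""
    N, T = A.shape
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    # W_nt is ~0 near the diagonal and ~1 far from it.
    W = 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))
    return float((A * W).mean())

diag = np.eye(8)               # perfectly diagonal attention
flat = np.full((8, 8), 1 / 8)  # uniform (unaligned) attention
print(guided_attention_loss(diag) < guided_attention_loss(flat))  # True
```

Minimizing $L_{ga} = \text{mean}(A_{nt} \odot W_{nt})$ pushes the attention toward the near-diagonal region, i.e., a monotonic alignment.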
---
### CFM-based decoder
> Improve the quality of the synthesized audio
#### Prior Encoder
* Encodes the musical score representation $h_{ms}$ to an aligned hidden representation $h_{align}$
* $h_{align}$ is used to condition a conditional flow matching (CFM)-based decoder as the averaged acoustic feature

* $h_{align}$ is aligned with the target Mel-spectrogram $x$ via the loss $L_p$
#### Decoder
* Models a conditional flow that generates Mel-spectrograms from Gaussian noise
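A minimal sketch of how a conditional flow matching training example is built, assuming the common straight-line (optimal-transport) probability path; the conditioning on $h_{align}$ and the network $v_\theta$ itself are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1):
    """One CFM training example: a point x_t on the straight-line path
    from Gaussian noise x0 to data x1, and the target velocity u_t."""
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1         # linear interpolation path
    u_t = x1 - x0                       # target vector field u_t(x_t | x1)
    return t, x_t, u_t

x1 = rng.standard_normal((80, 100))     # stand-in target Mel-spectrogram
t, x_t, u_t = cfm_training_pair(x1)
# L_cfm = E || v_theta(x_t, t, h_align) - u_t ||^2 for a learned network v_theta;
# at inference, an ODE solver integrates v_theta from noise to a Mel-spectrogram.
```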


#### Total loss

$$L_{\text{total}} = L_p + L_{cfm} + \lambda_{ga} L_{ga}$$

* $L_p = \mathrm{MSE}(h_{align}, x)$: prior loss that keeps the hidden representation consistent with the target Mel-spectrogram
* $L_{cfm}$: conditional flow matching loss that learns the generative process
* $L_{ga}$: guided attention loss that improves input-output alignment
* $\lambda_{ga}$ is set to 10.0 in the experiments
---
## Experiment
> We trained X-Singer for 0.8M steps on ==four NVIDIA V100 GPUs== with a batch size of 16 per GPU
### Dataset

:::success
CH = 21.59 (from GTSinger, opencpop)
EN = 13.13
JP = 6.45
Total = 41.17
:::
**Offered phoneme-level annotation**
- [ ] Multi-Speaker Singing Dataset
- [x] M4Singer
- [x] Ofuton-P Database
| **Subset** | **Samples** | **Percentage (%)** | **KO (%)** | **ZH (%)** | **JP (%)** |
|--------------|-------------|--------------------|------------|------------|------------|
| Training | 92,253 | 89.04 | 85.18 | 14.27 | 0.55 |
| Validation | 7,338 | 7.08 | 85.85 | 14.01 | 0.16 |
| Testing | 3,973 | 3.83 | 74.51 | 24.36 | 1.08 |
## Evaluation
| **Metric type** | **Metric** | **Purpose** |
|-----------------|------------|-------------|
| Subjective | Mean Opinion Score ( $MOS$ ) | Naturalness of the synthesized singing voice |
| Objective | Mel-Cepstral Distortion ( $MCD$ ) | Spectral similarity |
| Objective | F0 Root-Mean-Square Error ( $RMSE_{F0}$ ) | Pitch accuracy |
| Objective | Equal Error Rate ( $EER$ ) | Timbre similarity |
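The two signal-level metrics can be sketched as follows; this is a simplified version that assumes frame-aligned inputs (real evaluations typically add DTW alignment and drop the 0th cepstral coefficient for MCD).

```python
import numpy as np

def mcd(c_ref, c_syn):
    """Simplified Mel-Cepstral Distortion in dB between
    frame-aligned cepstral sequences of shape (T, D)."""
    diff2 = ((c_ref - c_syn) ** 2).sum(axis=1)
    # Standard MCD scaling constant: 10 / ln(10) * sqrt(2).
    return float((10.0 / np.log(10)) * np.sqrt(2.0 * diff2).mean())

def rmse_f0(f0_ref, f0_syn):
    """F0 root-mean-square error over frames voiced in both signals."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    err = f0_ref[voiced] - f0_syn[voiced]
    return float(np.sqrt((err ** 2).mean()))

f0_a = np.array([220.0, 221.0, 0.0, 219.0])  # 0 marks unvoiced frames
f0_b = np.array([222.0, 220.0, 0.0, 218.0])
print(round(rmse_f0(f0_a, f0_b), 3))  # 1.414
```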

### Intra-Lingual SVS
* X-Singer outperforms baseline models (e.g., BigVGAN, FFT-Singer, DiffSinger) in $nMOS$.
* Achieves the lowest $RMSE_{F0}$.
### Cross-Lingual SVS
* Significant improvements in naturalness
## Ablation study

* Removing the CFM decoder:
  * Significant decline in cMOS (naturalness score).
* Replacing Mix-LN with FFT blocks:
  * Decline in cMOS for both intra- and cross-lingual SVS.
  * Slight increase in EER, indicating reduced timbre similarity.
* Removing the guided attention loss:
  * Decreases cMOS, showing its importance for accurate alignment.