# CONTENTVEC: An Improved Self-Supervised Speech Representation by Disentangling Speakers - ICML2022

###### tags: `Yang`

##### Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Jeff Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

### 1. Introduction

In recent years, self-supervised learning (SSL) has emerged as a state-of-the-art solution to many speech processing problems with relatively little annotated data. While speech SSL has demonstrated advantages in a surprisingly wide range of tasks, one of its primary foci is on tasks that process the content of speech, such as speech recognition/phone classification and speech content generation. For these tasks, the most desirable speech representations are those that disentangle the content information in speech from other interfering variations, such as speaker variations. However, it is widely acknowledged that disentangling speakers is very challenging: since no text annotations are available while training the speech representation network, any attempt to remove speaker variations from the representation can easily lead to a loss of content information. In most content-related downstream tasks, the cost of losing content information far outweighs the benefit of disentangling speakers. In this paper, we investigate the following two research questions.

- First, is there a way to disentangle speaker variations during SSL training without significant content loss?
- Second, how much performance gain, if any, can speaker disentanglement in SSL features contribute to downstream tasks?

### 2. Approach

![](https://mllab.asuscomm.com:12950/hackmd/uploads/upload_98c084a7a2813f1f7fc77e24af1393e2.png)

The CONTENTVEC framework builds upon the mask-prediction framework of HUBERT. Specifically, there are three components in the framework:
1) the speech representation network $f(\cdot)$;
2) the predictor $p(\cdot)$;
3) the teacher label generator $g(\cdot)$.

#### 2.1 Problem Formulation

Denote $\boldsymbol{X} = [X_1, \cdots, X_T]$ as a sequence of speech features, where $X_t$ is the speech feature vector at frame $t$ and $T$ is the total number of frames. Our goal is to learn a speech representation network $\boldsymbol{R} = f(\boldsymbol{X})$, where $\boldsymbol{R} = [R_1, \cdots, R_T]$ and $R_t$ is the representation for frame $t$.

#### 2.2 Disentanglement in Teachers

Disentanglement in teachers aims to remove the speaker information in the teacher labels. As shown in Figure 1(c), the teacher labels, $L = g(\boldsymbol{X})$, are generated via the following three steps (a minimal sketch of the quantization step follows this list).
1) First, all the utterances $\boldsymbol{X}$ in the training set are converted to a single speaker using a competent unsupervised voice conversion system.
2) Second, the converted utterances are passed through a pre-trained unsupervised speech representation network, in our case HUBERT, to generate a set of speech representations, which should contain very little speaker information.
3) Finally, the speech representations are quantized into discrete teacher labels using k-means clustering.
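To make the label-generation pipeline concrete, here is a minimal sketch of the final quantization step. This is not the authors' implementation: it assumes the HUBERT features of the voice-converted utterances have already been extracted into an array `features` (filled here with random placeholder data), and the cluster count of 100 is illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical setup: `features` holds frame-level HUBERT representations of the
# voice-converted utterances, stacked into one array of shape (num_frames, dim).
rng = np.random.default_rng(0)
features = rng.standard_normal((10000, 768)).astype(np.float32)  # placeholder data

# Quantize the continuous representations into a discrete codebook with k-means.
n_clusters = 100  # illustrative codebook size; the actual value is a design choice
kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=1024, random_state=0)
kmeans.fit(features)

# Each frame's teacher label is the index of its nearest cluster centroid.
teacher_labels = kmeans.predict(features)  # shape: (num_frames,)
print(teacher_labels[:10])
```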
#### 2.3 Disentanglement in Students

As shown in Figure 1(a), each speech utterance $\boldsymbol{X}$ is passed through two random transformations that alter only the speaker information, before it is masked. Denote the two masked, transformed copies of $\boldsymbol{X}$ as $\tilde{\boldsymbol{X}}^{(1)}$ and $\tilde{\boldsymbol{X}}^{(2)}$. This pair of utterances is then passed through the speech representation network $f(\cdot)$ to generate the representations $\boldsymbol{R}^{(1)}$ and $\boldsymbol{R}^{(2)}$, and the following contrastive loss is introduced to penalize dissimilarity between $\boldsymbol{R}^{(1)}$ and $\boldsymbol{R}^{(2)}$:

$$
\begin{aligned}
\mathcal{L}_{\text{contr}} &= \sum_{t=1}^{T} -\log \frac{\exp\left(\operatorname{cossim}\left(\boldsymbol{R}_{t}^{(1)}, \boldsymbol{R}_{t}^{(2)}\right) / k\right)}{\sum_{\tau \in\{t\} \cup \mathcal{I}_{t}} \exp\left(\operatorname{cossim}\left(\boldsymbol{R}_{t}^{(1)}, \boldsymbol{R}_{\tau}^{(1)}\right) / k\right)} \\
&+ \sum_{t=1}^{T} -\log \frac{\exp\left(\operatorname{cossim}\left(\boldsymbol{R}_{t}^{(2)}, \boldsymbol{R}_{t}^{(1)}\right) / k\right)}{\sum_{\tau \in\{t\} \cup \mathcal{I}_{t}} \exp\left(\operatorname{cossim}\left(\boldsymbol{R}_{t}^{(2)}, \boldsymbol{R}_{\tau}^{(2)}\right) / k\right)},
\end{aligned}
$$

where $k$ is a temperature and $\mathcal{I}_t$ is a set of negative (random) time indices for frame $t$.

#### 2.4 Speaker Conditioning

Although disentanglement in teachers can remove the majority of the speaker information from the teacher labels, some speaker information remains. As a result, the student representations are undesirably forced to carry the same amount of speaker information as the teachers in order to predict the teacher labels well. To break this entailment between the speaker information in students and in teachers, we feed speaker embeddings to the predictor. The speaker embeddings are produced by a speaker embedding network, in our case a pre-trained GE2E model. Formally, the masked prediction loss becomes

$$
\begin{aligned}
\mathcal{L}_{\text{pred}} = \mathbb{E}\Big[ &\ell_{m}\left(p\left(f\left(\tilde{\boldsymbol{X}}^{(1)}\right), s(\boldsymbol{X})\right), g(\boldsymbol{X})\right) \\
+ &\ell_{m}\left(p\left(f\left(\tilde{\boldsymbol{X}}^{(2)}\right), s(\boldsymbol{X})\right), g(\boldsymbol{X})\right) \Big],
\end{aligned}
$$

where $s(\boldsymbol{X})$ denotes the speaker embeddings. The final loss is the superposition of the prediction and contrastive losses (a minimal code sketch of this objective is given at the end of this note):

$$
\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \mathcal{L}_{\text{contr}}.
$$

### 3. Experiments

![](https://mllab.asuscomm.com:12950/hackmd/uploads/upload_a93cfcedecbcbcfbe4ecbcb85e3b8c9e.png)
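Returning to the training objective of Section 2, the following is a minimal PyTorch-style sketch of the contrastive term and of how it combines with the masked-prediction loss. It follows the equation in Section 2.3 literally; the temperature `k`, the number of sampled negatives, the weight `lam`, and the placeholder prediction loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_term(ra, rb, k=0.1, num_negatives=100):
    """One direction of the contrastive loss for a single utterance.

    ra, rb: (T, D) frame representations of the two speaker-perturbed copies.
    Negatives for frame t are drawn from other frames of `ra` itself,
    matching the denominator of the equation in Section 2.3.
    """
    T = ra.size(0)
    pos = F.cosine_similarity(ra, rb, dim=-1) / k                       # (T,) positive logits
    neg_idx = torch.randint(0, T, (T, num_negatives))                   # sampled I_t
    cand_idx = torch.cat([torch.arange(T).unsqueeze(1), neg_idx], 1)    # {t} U I_t, shape (T, 1+N)
    den = F.cosine_similarity(ra.unsqueeze(1), ra[cand_idx], dim=-1) / k  # (T, 1+N) denominator logits
    # -log( exp(pos_t) / sum_tau exp(den_{t,tau}) ), summed over frames
    return -(pos - torch.logsumexp(den, dim=1)).sum()

# Toy usage with random representations (T=50 frames, D=256 dims).
r1, r2 = torch.randn(50, 256), torch.randn(50, 256)
l_contr = contrastive_term(r1, r2) + contrastive_term(r2, r1)

# Final objective: masked-prediction loss plus the weighted contrastive loss.
l_pred = torch.tensor(0.0)   # placeholder for the masked prediction loss
lam = 1.0                    # illustrative weight; the paper's value may differ
total_loss = l_pred + lam * l_contr
print(total_loss.item())
```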