# Experiments and Results --- ## Contents | # | Experiment | | --- | ------------------------------------------------------------------------------------------- | | 1 | [Speaker Identification](#Speaker-Identification) | | 2 | [Speaker Verification](#Speaker-Verification) | | 3 | [Cross-Lingual Transfer](#Cross-Lingual-Transfer) | | 3a | [Using a Weighted Classifier *or* Fine-Tuning](#Using-a-Weighted-Classifier-or-Fine-Tuning) | | 4? | Effect of Linguistic Similarity | | 5? | Measuring Multilingual Phonotactic/Acoustic Surprisal | ## Context - [Situating the Investigation](#Situating-the-Investigation) --- ## Speaker Identification **Purpose:** to evaluate the upper bounds on the classifiers, to gauge what to expect from classifiers and ASR model in speaker verification probing task. **Data:** GlobalPhone audio .wav files **Experiment setup:** 3 conditions (based on meta-data for gender): - male: 6 speakers - female: 6 speakers - mixed: 3 male, 3 female speakers **Languages:** start with German, then French, Russian **Metrics and variables:** - For each speaker: 80 utterances (60 training, 20 evaluation) - X = matrix of speaker utterances, y = labels of speaker id - identification accuracy = correct label assigned to each speaker utterance vector **Classifiers:** from [scikit-learn](https://scikit-learn.org/stable/glossary.html#term-multiclass). Note that all estimators supporting binary classification (2 labels) also support multiclass classification by way of a one-vs-rest assigment. ### German: Results ![de_id_all](https://hackmd.io/_uploads/B1BVpbba2.png) ``` male, mean layer accuracy, accuracy per layer MLPClassifier: 0.9940972222222223 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9833333333333333, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9666666666666667, 0.9666666666666667] LogisticRegression_lbfgs: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] LogisticRegression_liblinear: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] SGDClassifier_squared_hinge: 0.9993055555555556 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667] female, mean layer accuracy, accuracy per layer MLPClassifier: 0.9993055555555556 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] LogisticRegression_lbfgs: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] LogisticRegression_liblinear: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] SGDClassifier_squared_hinge: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] mixed, mean layer accuracy, accuracy per layer MLPClassifier: 0.9986111111111112 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667] LogisticRegression_lbfgs: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 
LogisticRegression_liblinear: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] SGDClassifier_squared_hinge: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] ``` ### French: Results ![fr_id_all](https://hackmd.io/_uploads/SJVerCHCT.png) ``` male, mean layer accuracy, accuracy per layer MLPClassifier: 0.9968750000000001 [0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9833333333333333, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667] LogisticRegression_lbfgs: 0.998263888888889 [0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667] LogisticRegression_liblinear: 0.9975694444444444 [0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667] SGDClassifier_squared_hinge: 0.9972222222222222 [0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667] female, mean layer accuracy, accuracy per layer MLPClassifier: 0.9986111111111112 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9833333333333333, 0.9833333333333333] LogisticRegression_lbfgs: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] LogisticRegression_liblinear: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] SGDClassifier_squared_hinge: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] mixed, mean layer accuracy, accuracy per layer MLPClassifier: 0.9968750000000001 [0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.975] LogisticRegression_lbfgs: 0.9968749999999998 [0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667] LogisticRegression_liblinear: 0.9961805555555555 [0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667] SGDClassifier_squared_hinge: 0.9958333333333332 [0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667] ``` ### Russian: Results 
![ru_id_all](https://hackmd.io/_uploads/r1WhJMWpn.png) ``` male, mean layer accuracy, accuracy per layer MLPClassifier: 0.9951388888888889 [0.9833333333333333, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0] LogisticRegression_lbfgs: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] LogisticRegression_liblinear: 1.0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] SGDClassifier_squared_hinge: 0.9986111111111112 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667] female, mean layer accuracy, accuracy per layer MLPClassifier: 0.9888888888888889 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.975, 0.9833333333333333, 0.975, 0.9916666666666667, 1.0, 0.9833333333333333, 0.9916666666666667, 0.9833333333333333, 0.9916666666666667, 0.9916666666666667, 0.9833333333333333, 0.9833333333333333, 0.975] LogisticRegression_lbfgs: 0.9979166666666668 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667] LogisticRegression_liblinear: 0.9989583333333334 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9833333333333333] SGDClassifier_squared_hinge: 0.9979166666666668 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9833333333333333, 0.9833333333333333] mixed, mean layer accuracy, accuracy per layer MLPClassifier: 0.9954861111111111 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 0.9916666666666667, 0.9833333333333333, 0.9833333333333333] LogisticRegression_lbfgs: 0.9993055555555556 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667] LogisticRegression_liblinear: 0.9996527777777778 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0] SGDClassifier_squared_hinge: 0.9993055555555556 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 0.9916666666666667] ``` ### Conclusions: - German: Classifiers with the highest accuracy (1.0) for all layers in all conditions: - LogisticRegression with either 'lbfgs' or 'liblinear' as the optimization algorithm (LogisticRegression_lbfgs and LogisticRegression_liblinear) - To note, all 4 tested classifiers show nearly 1.0 accuracy across all 24 layers in all 3 conditions. 
- French: LogisticRegression_lbfgs
    - = the classifier with the highest mean accuracy across all layers, across all conditions
- Russian: LogisticRegression_liblinear and LogisticRegression_lbfgs
    - = these 2 classifiers share the highest mean accuracies across layers in all 3 conditions.
    - While "liblinear" gains a mean accuracy of .0003 more than lbfgs in the mixed gender condition, the difference may be too small to significantly impact the next probing task (a hypothesis).
- The LogisticRegression_liblinear and LogisticRegression_lbfgs classifiers seem best suited for the next task of speaker verification. Since all tested classifiers displayed high accuracies above 0.97, for all layers and in all test conditions, any of them may perform well in the next task. ==> consider all

### Questions:
- Current results show: model representations at each layer encode a dimension of speaker identity that can be learned by a classifier when a set of speaker labels is given. How is speaker identity learned when new speakers are presented to a classifier? Use speaker verification to approach the problem of new speakers?

___

## Speaker Verification

<kbd>Version 2.0</kbd>

**In this version:**
- new method for taking the mean of the features over each frame tensor to produce a single utterance tensor when gathering feature representations (embeddings) of an utterance at each layer --> use average pooling augmented with the standard deviation (see the `mean_std()` function inside `gather_embeddings.py`)
- updated u,v vector concatenation formula to take the absolute value. This computation combines the u,v vectors into the single vectors that make up the elements of the X matrix classifier input (a sketch of both operations is given after the experiment setup below).
    - # concatenate u and v vectors: [|u − v| ; u ⊙ v]
      `concat = np.concatenate([abs(np.subtract(u, v)), np.multiply(u, v)])`
- use the most accurate classifiers from the speaker identification experiments: LogisticRegression_liblinear and LogisticRegression_lbfgs. Also try the remaining high-performing classifiers: MLPClassifier, SGDClassifier_squared_hinge
- pre-process the training data by standardizing features using StandardScaler() from scikit-learn
- test different numbers of training samples (1000 and 2000 pair vectors) to compare accuracy
- note the reported gender data for each speaker (since speaker identification controlled for this, perhaps speaker verification should at least note this statistic too?)

**Purpose:** speaker verification probing task. --> does the XLS-R model encode a dimension of speaker identity? If so, at what layers? How does this differ across languages?

**Data:** GlobalPhone audio .wav files

**Experiment setup:**
- 2000 samples total = 1000 same-speaker, 1000 different-speaker utterance pairs
- classify each utterance pair as same/different {1,0}
- 10 unique speakers for training, 10 for evaluation (speaker ID is not needed to determine same or different...)
- same size train and eval set (pairs in matrix X)

**Languages:** start with German, then French, ...
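
For reference, a minimal sketch of the two operations described in the bullets above (mean+std pooling and the u,v pair combination). `mean_std` here is a reconstruction of the role of the function of that name in `gather_embeddings.py`, not the original code, and `pair_features` is a hypothetical name for the concatenation step:

```python
import numpy as np

def mean_std(frame_features: np.ndarray) -> np.ndarray:
    """Average pooling augmented with standard deviation: collapse a
    (num_frames, dim) layer representation into one utterance vector."""
    return np.concatenate([frame_features.mean(axis=0), frame_features.std(axis=0)])

def pair_features(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Combine two utterance vectors into a single classifier input:
    [|u - v| ; u * v] (absolute element-wise difference, element-wise product)."""
    return np.concatenate([np.abs(u - v), u * v])
```

Both u and v are assumed to be pooled vectors from the same layer, so each row of X ends up with four times the layer's hidden dimensionality (mean + std from pooling, then |u − v| and u ⊙ v from the pair combination).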
**Metrics and variables:**
- Use 10 speakers for training, 10 speakers for evaluation
- For each speaker, collect 50 utterances (.wav files)
    - each speaker varies in the # of .wav utterances [for German: 50 to 204] --> use the minimum as the number of utterances per speaker
- X = matrix of 2000 combined (u,v) utterance vectors, where u,v are from the same speaker or from different speakers; y = a one-dimensional vector of binary labels {1,0} for {same,different}
- Training: 1000 same-speaker audio pairs, 1000 different-speaker audio pairs
- Selecting utterances (see the sketch after the German notes below):
    - 1000 same pairs / 10 speakers in training = *100 same pairs per speaker (where a pair can reuse audio that appears in other pairs)*
    - 1000 different pairs * 2 speakers per pair = 2000 speaker slots; 2000 speaker slots / 10 speakers = 200 slots per speaker
    - --> combine randomly selected utterances per speaker into 1000 unique u,v pairs
- Verification accuracy = correct label {0,1} assigned to each vector pair

**Classifiers:** from scikit-learn, including [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with the "liblinear" and "lbfgs" algorithms.

**Notes:**
- plot accuracy for each layer
- make the experiment code more modular for easy changing and scaling
- once a working system with enough data is figured out, apply it to more languages
- plot all languages to see layer accuracy, using the figure from the paper as a guide; compare to (Chen et al., 2022)
    - ![chen-et-al_2022_fig2](https://hackmd.io/_uploads/SyPAnkHph.png)

### German: Results Notes
- selected speaker ids: train_sp = [010, ... , 019], eval_sp = [020, ... , 029]
- All German speakers' reported metadata:
    - male: 001-007, 009, 012-017, 019-025, 027-045, 047, 048, 050-077
    - female: 008, 010, 011, 018, 026, 046, 049
- train: {"female": 3, "male": 7}
- eval: {"female": 1, "male": 9}
- considering there are only 7 female speakers out of 77 speakers, this might be an okay distribution.
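
A sketch of the pair sampling and the per-layer probe described under "Metrics and variables" above, before the German-specific experiments. `utts_by_speaker` (speaker ID → list of pooled utterance vectors for one layer) is a hypothetical structure, `pair_features` comes from the earlier sketch, and utterances may be reused across pairs, as noted above:

```python
import random
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def sample_pairs(utts_by_speaker, n_pairs, same, rng):
    """Draw n_pairs (u, v) utterance pairs, same-speaker or different-speaker."""
    speakers = list(utts_by_speaker)
    pairs = []
    while len(pairs) < n_pairs:
        if same:
            spk = rng.choice(speakers)
            u, v = rng.sample(utts_by_speaker[spk], 2)
        else:
            s1, s2 = rng.sample(speakers, 2)
            u, v = rng.choice(utts_by_speaker[s1]), rng.choice(utts_by_speaker[s2])
        pairs.append((u, v))
    return pairs

def build_xy(utts_by_speaker, n_per_class=1000, seed=0):
    """X: stacked pair features; y: 1 for same-speaker, 0 for different-speaker.
    pair_features() is the combination step from the earlier sketch."""
    rng = random.Random(seed)
    pairs = (sample_pairs(utts_by_speaker, n_per_class, True, rng)
             + sample_pairs(utts_by_speaker, n_per_class, False, rng))
    X = np.stack([pair_features(u, v) for u, v in pairs])
    y = np.array([1] * n_per_class + [0] * n_per_class)
    return X, y

# Per-layer probe with feature standardization (one probe per layer):
probe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
# probe.fit(*build_xy(train_utts_for_layer)); probe.score(X_eval, y_eval)
```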
--> *Q: What would happen if trained on only male or female speakers (so that gender is not used to determine same/diff in any case) ?* #### Experiment 1: train on 1000 samples ((u,v) pairs in X) ``` classifier | max layer accuracy | at layer(s) | accuracy per layer Experiment: German, 1000 samples - MLPClassifier: 0.886 | [3] [0.75, 0.812, 0.886, 0.8805, 0.8115, 0.8785, 0.849, 0.8195, 0.8525, 0.7775, 0.793, 0.764, 0.762, 0.761, 0.758, 0.758, 0.7865, 0.776, 0.7905, 0.808, 0.82, 0.7855, 0.7805, 0.752] - LogisticRegression_lbfgs: 0.814 | [6] [0.692, 0.79, 0.785, 0.787, 0.778, 0.814, 0.7615, 0.774, 0.784, 0.723, 0.7555, 0.7225, 0.712, 0.7165, 0.724, 0.7125, 0.74, 0.746, 0.73, 0.7735, 0.7565, 0.751, 0.7245, 0.69] - LogisticRegression_liblinear: 0.793 | [6] [0.6785, 0.755, 0.756, 0.738, 0.7415, 0.793, 0.7275, 0.745, 0.7615, 0.71, 0.742, 0.7115, 0.7065, 0.709, 0.718, 0.7035, 0.7255, 0.724, 0.6965, 0.743, 0.725, 0.7135, 0.7055, 0.683] - SGDClassifier_squared_hinge: 0.851 | [4] [0.683, 0.812, 0.8305, 0.851, 0.8475, 0.819, 0.801, 0.7965, 0.7915, 0.781, 0.758, 0.7445, 0.7385, 0.736, 0.7285, 0.7295, 0.747, 0.7925, 0.805, 0.8185, 0.8335, 0.774, 0.7615, 0.704] ``` #### Experiment 2: train on 2000 samples => highest accuracy ``` Experiment: German, 2000 samples - MLPClassifier: 0.911 | [6] [0.7785, 0.8615, 0.8875, 0.8915, 0.8275, 0.911, 0.879, 0.8435, 0.83, 0.783, 0.801, 0.7525, 0.784, 0.779, 0.776, 0.755, 0.7905, 0.79, 0.859, 0.8475, 0.85, 0.8235, 0.8235, 0.7875] - LogisticRegression_lbfgs: 0.8535 | [6] [0.711, 0.7885, 0.8085, 0.811, 0.794, 0.8535, 0.8065, 0.798, 0.769, 0.7445, 0.7895, 0.748, 0.753, 0.744, 0.7435, 0.7345, 0.757, 0.7555, 0.822, 0.8335, 0.827, 0.8025, 0.7565, 0.702] - LogisticRegression_liblinear: 0.8145 | [6] [0.7025, 0.7255, 0.772, 0.763, 0.7535, 0.8145, 0.7725, 0.77, 0.743, 0.7305, 0.7785, 0.728, 0.737, 0.7335, 0.7365, 0.7205, 0.732, 0.739, 0.785, 0.812, 0.8075, 0.775, 0.7415, 0.695] - SGDClassifier_squared_hinge: 0.88 | [6] [0.741, 0.833, 0.835, 0.8335, 0.8635, 0.88, 0.875, 0.8105, 0.833, 0.8065, 0.825, 0.801, 0.765, 0.7525, 0.757, 0.746, 0.799, 0.774, 0.8285, 0.8515, 0.796, 0.8225, 0.7415, 0.7405] ``` ![verification_de](https://hackmd.io/_uploads/B123ukra2.png) #### Experiment 3: train on 3000 samples => not as high as with 2000 *but, the size of the test set also increased (it should have stayed the same for all experiments of 1000/2000/3000 samples)* ``` Experiment: German, 3000 samples - MLPClassifier: 0.8876666666666667 | [6] [0.7626666666666667, 0.8753333333333333, 0.8656666666666667, 0.8583333333333333, 0.7476666666666667, 0.8876666666666667, 0.8593333333333333, 0.821, 0.8303333333333334, 0.7916666666666666, 0.8236666666666667, 0.7876666666666666, 0.7936666666666666, 0.7836666666666666, 0.7683333333333333, 0.7696666666666667, 0.8046666666666666, 0.7976666666666666, 0.8386666666666667, 0.79, 0.7886666666666666, 0.7566666666666667, 0.7823333333333333, 0.7706666666666667] - LogisticRegression_lbfgs: 0.8293333333333334 | [6] [0.676, 0.8046666666666666, 0.81, 0.7726666666666666, 0.728, 0.8293333333333334, 0.8, 0.7353333333333333, 0.77, 0.7623333333333333, 0.757, 0.7333333333333333, 0.7563333333333333, 0.7683333333333333, 0.738, 0.7313333333333333, 0.737, 0.7516666666666667, 0.7663333333333333, 0.6953333333333334, 0.689, 0.707, 0.6893333333333334, 0.6703333333333333] - LogisticRegression_liblinear: 0.807 | [6] [0.6653333333333333, 0.756, 0.795, 0.71, 0.678, 0.807, 0.7716666666666666, 0.7173333333333334, 0.7496666666666667, 0.751, 0.7513333333333333, 0.716, 0.7496666666666667, 
0.7593333333333333, 0.733, 0.7226666666666667, 0.7253333333333334, 0.7426666666666667, 0.7436666666666667, 0.657, 0.6613333333333333, 0.6786666666666666, 0.6576666666666666, 0.6583333333333333] - SGDClassifier_squared_hinge: 0.861 | [2] [0.7156666666666667, 0.861, 0.8506666666666667, 0.781, 0.8326666666666667, 0.8583333333333333, 0.8473333333333334, 0.842, 0.808, 0.7956666666666666, 0.7833333333333333, 0.7906666666666666, 0.763, 0.7776666666666666, 0.765, 0.7746666666666666, 0.757, 0.7673333333333333, 0.8426666666666667, 0.8193333333333334, 0.8013333333333333, 0.819, 0.7756666666666666, 0.7016666666666667] ``` ### French: Results Notes: - using the optimal experimental condition as tested with German: 2000 training samples - speaker metadata: - "SEX:male" = {"in_train": [1, 3, 7, 8, 9], "in_test": [11, 14, 16, 19, 20]} - "SEX:female" = {"in_train": [2, 4, 5, 6, 10], "in_test": [12, 13, 15, 17, 18]} ``` classifier | max layer accuracy | at layer(s) | accuracy per layer Experiment: Language: French, Num. Samples: 2000 - MLPClassifier: 0.94 | [5] [0.8585, 0.927, 0.9395, 0.9335, 0.94, 0.9355, 0.938, 0.934, 0.917, 0.901, 0.9015, 0.886, 0.8845, 0.89, 0.872, 0.888, 0.8985, 0.908, 0.927, 0.938, 0.924, 0.93, 0.8925, 0.849] - LogisticRegression_lbfgs: 0.9135 | [7] [0.823, 0.8735, 0.895, 0.905, 0.909, 0.883, 0.9135, 0.8925, 0.8725, 0.8495, 0.8375, 0.8605, 0.8545, 0.861, 0.859, 0.8675, 0.8615, 0.8705, 0.8925, 0.9095, 0.8835, 0.9045, 0.8455, 0.819] - LogisticRegression_liblinear: 0.902 | [20] [0.821, 0.834, 0.853, 0.8915, 0.8895, 0.8655, 0.9015, 0.8765, 0.8535, 0.834, 0.8175, 0.8495, 0.8455, 0.8545, 0.855, 0.862, 0.8495, 0.8525, 0.8725, 0.902, 0.865, 0.895, 0.833, 0.8165] - SGDClassifier_squared_hinge: 0.9225 | [3] [0.813, 0.891, 0.9225, 0.889, 0.916, 0.917, 0.9085, 0.9075, 0.904, 0.8935, 0.8705, 0.865, 0.859, 0.8685, 0.844, 0.866, 0.8675, 0.891, 0.9015, 0.9195, 0.9085, 0.898, 0.846, 0.8215] ``` ![verification_fr](https://hackmd.io/_uploads/r1aZu4Ha2.png) ### Czech: Results Notes: - speaker metadata: - "SEX:male" = {"in_train": [5, 7, 8, 9, 10], "in_test": [12, 13, 15, 16, 19]} - "SEX:female" = {"in_train": [1, 2, 3, 4, 6], "in_test": [11, 14, 17, 18, 20]} ``` classifier | max layer accuracy | at layer(s) | accuracy per layer Experiment: Language: Czech, Num. 
Samples: 2000 - MLPClassifier: 0.9265 | [5] [0.8425, 0.8985, 0.9105, 0.9095, 0.9265, 0.9215, 0.9105, 0.9135, 0.8685, 0.865, 0.847, 0.837, 0.8175, 0.8085, 0.819, 0.8195, 0.8465, 0.8755, 0.903, 0.9205, 0.91, 0.9035, 0.8715, 0.835] - LogisticRegression_lbfgs: 0.9005 | [5] [0.814, 0.8405, 0.876, 0.894, 0.9005, 0.8935, 0.8535, 0.8705, 0.831, 0.8225, 0.799, 0.8035, 0.7935, 0.78, 0.794, 0.786, 0.7935, 0.8365, 0.857, 0.861, 0.8415, 0.85, 0.8155, 0.799] - LogisticRegression_liblinear: 0.8925 | [5] [0.8095, 0.8315, 0.8525, 0.876, 0.8925, 0.873, 0.82, 0.8535, 0.814, 0.8035, 0.7785, 0.793, 0.782, 0.7675, 0.774, 0.777, 0.794, 0.808, 0.827, 0.8625, 0.8335, 0.828, 0.807, 0.7975] - SGDClassifier_squared_hinge: 0.9085 | [6] [0.814, 0.85, 0.892, 0.8995, 0.888, 0.9085, 0.8775, 0.872, 0.872, 0.837, 0.8325, 0.823, 0.8155, 0.812, 0.803, 0.7765, 0.8055, 0.832, 0.8705, 0.8395, 0.8375, 0.851, 0.838, 0.8005] ``` ![verification_cz](https://hackmd.io/_uploads/SJ0HUHSa2.png) ### Polish: Results Notes: - speaker metadata: - "SEX:male" = {"in_train": [2, 3, 6, 7, 8], "in_test": [12, 13, 15, 18, 19, 20]} - "SEX:female" = {"in_train": [1, 4, 5, 9, 10], "in_test": [11, 14, 16, 17]} ``` classifier | max layer accuracy | at layer(s) | accuracy per layer Experiment: Language: Polish, Num. Samples: 2000 - MLPClassifier: 0.871 | [7] [0.7445, 0.85, 0.7985, 0.8435, 0.804, 0.8475, 0.871, 0.848, 0.83, 0.8125, 0.842, 0.8095, 0.787, 0.792, 0.805, 0.8115, 0.829, 0.835, 0.8365, 0.844, 0.818, 0.812, 0.7925, 0.7485] - LogisticRegression_lbfgs: 0.823 | [4] [0.7055, 0.8165, 0.7655, 0.823, 0.75, 0.7915, 0.8075, 0.7885, 0.74, 0.737, 0.7705, 0.719, 0.725, 0.7205, 0.763, 0.757, 0.74, 0.741, 0.712, 0.779, 0.767, 0.7805, 0.7025, 0.707] - LogisticRegression_liblinear: 0.8 | [4] [0.695, 0.795, 0.743, 0.8, 0.7045, 0.7595, 0.7755, 0.766, 0.7145, 0.7245, 0.75, 0.704, 0.709, 0.712, 0.7575, 0.744, 0.7235, 0.714, 0.693, 0.759, 0.75, 0.7585, 0.6795, 0.697] - SGDClassifier_squared_hinge: 0.86 | [2] [0.7165, 0.86, 0.7915, 0.8335, 0.807, 0.8515, 0.8445, 0.818, 0.8, 0.742, 0.7765, 0.7325, 0.73, 0.7535, 0.7675, 0.76, 0.799, 0.7745, 0.7815, 0.801, 0.805, 0.8015, 0.72, 0.7115] ``` ![verification-pl.png](https://hackmd.io/_uploads/ryi224B0h.png) ___ ### Within-Language Results Compared #### MLPClassifier (highest performing classifier, non-linear) | layer | German | French | Czech | Polish | | ------:| ---------:| ---------:| ----------:| ----------:| | 1 | 0.7785 | 0.8585 | 0.8425 | 0.7445 | | **2** | 0.8615 | 0.927 | 0.8985 | **0.85** | | 3 | 0.8875 | 0.9395 | 0.9105 | 0.7985 | | 4 | 0.8915 | 0.9335 | 0.9095 | 0.8435 | | **5** | 0.8275 | **0.94** | **0.9265** | 0.804 | | **6** | **0.911** | 0.9355 | 0.9215 | **0.8475** | | **7** | **0.879** | **0.938** | 0.9105 | 0.871 | | 8 | 0.8435 | 0.934 | 0.9135 | **0.848** | | 9 | 0.83 | 0.917 | 0.8685 | 0.83 | | 10 | 0.783 | 0.901 | 0.865 | 0.8125 | | 11 | 0.801 | 0.9015 | 0.847 | 0.842 | | 12 | 0.7525 | 0.886 | 0.837 | 0.8095 | | 13 | 0.784 | 0.8845 | 0.8175 | 0.787 | | 14 | 0.779 | 0.89 | 0.8085 | 0.792 | | 15 | 0.776 | 0.872 | 0.819 | 0.805 | | 16 | 0.755 | 0.888 | 0.8195 | 0.8115 | | 17 | 0.7905 | 0.8985 | 0.8465 | 0.829 | | 18 | 0.79 | 0.908 | 0.8755 | 0.835 | | **19** | **0.859** | 0.927 | 0.903 | 0.8365 | | **20** | 0.8475 | **0.938** | **0.9205** | **0.844** | | 21 | 0.85 | 0.924 | 0.91 | 0.818 | | 22 | 0.8235 | 0.93 | 0.9035 | 0.812 | | 23 | 0.8235 | 0.8925 | 0.8715 | 0.7925 | | 24 | 0.7875 | 0.849 | 0.835 | 0.7485 | ![verification_de_fr_cz](https://hackmd.io/_uploads/Hk3guSHa2.png) #### SDGClassifier using 
squared hinge (highest performing linear classifier) | layer | German | French | Czech | Polish | | ------:| --------------------------------------------:| --------------------------------------------:| --------------------------------------------:| --- | | 1 | 0.7765 | 0.8015 | 0.7845 | 0.7165 | | 2 | 0.8435 | 0.874 | 0.878 | <span style="color: blue;">**0.86**</span> | | 3 | 0.8215 | 0.865 | 0.875 | 0.7915 | | **4** | **0.8955** | <span style="color: blue;">**0.9125**</span> | 0.885 | **0.8335** | | 5 | 0.882 | **0.883** | 0.8815 | **0.807** | | **6** | 0.8895 | **0.87** | <span style="color: blue;">**0.9015**</span> | **0.8515** | | 7 | 0.863 | 0.857 | 0.8925 | **0.8445** | | 8 | 0.845 | 0.837 | 0.879 | **0.818** | | 9 | 0.8355 | 0.851 | 0.879 | 0.8 | | 10 | 0.8125 | 0.8395 | 0.863 | 0.742 | | 11 | 0.804 | 0.8315 | 0.8695 | 0.7765 | | 12 | 0.791 | 0.828 | 0.861 | 0.7325 | | 13 | 0.7935 | 0.84 | 0.8165 | 0.73 | | 14 | 0.7835 | 0.8525 | 0.821 | 0.7535 | | 15 | 0.788 | 0.82 | 0.818 | 0.7675 | | 16 | 0.778 | 0.8245 | 0.8275 | 0.76 | | 17 | 0.7905 | 0.8515 | 0.8445 | 0.799 | | 18 | 0.8005 | 0.806 | 0.865 | 0.7745 | | 19 | 0.867 | **0.87** | 0.844 | 0.7815 | | **20** | <span style="color: blue;">**0.8965**</span> | 0.8545 | 0.8945 | 0.801 | | 21 | 0.849 | 0.845 | **0.895** | **0.805** | | 22 | 0.8095 | 0.8055 | 0.889 | 0.8015 | | 23 | 0.815 | 0.827 | 0.846 | 0.72 | | 24 | 0.7805 | 0.779 | 0.763 | 0.7115 | ![verification_de_fr_cz_pl](https://hackmd.io/_uploads/rJs5eHBRn.png) ___ ### Conclusions (in-progress) German: - best results with **2000 training samples** (vs. 1000 or 3000) - peak accuracy at **layer 6** for all 4 classifiers, with an additional peak around **layers 19-20**. - ==> approximately consistent (in terms of having 2 peaks at lower and higher layers) with results found in Chen et. al., where layers 1-4 of XLSR model and 17-19 contained the highest normalized weight values, or contribution to the encoding of speaker-related information - most accurate classifier: **MLPClassifier**, with highest layer accuracy of **0.911** - note that this classifier performed with the lowest accuracy (although still above .97) in the speaker identification task, compared to the other classifiers used --- ## Cross-Lingual Transfer *in speaker verification probing task* - test speaker verification in more languages --> Polish [done] - test cross-lingual transfer (training in one language, testing in another) - determine how to build a weighted classifier for each layer - build a validation set for each language and gather embeddings [done] **Probing task:** classify a pair of utterances as being from the same speaker or different speakers. **Purpose:** evaluate cross-lingual transfer of encoded dimension of speaker identity. **Output:** Plot classifier accuracy per language tested given the training language. **XLS-R model:** used to retrieve utterance feature representations. 
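
For reference, a minimal sketch of how per-layer utterance representations can be retrieved with the HuggingFace `transformers` API. The checkpoint name is an assumption (the 24 hidden-state layers of the 300M model match the 24-layer plots above), and the audio loading is illustrative rather than a record of the project's actual `gather_embeddings.py`:

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the notes do not state which XLS-R size was used.
CHECKPOINT = "facebook/wav2vec2-xls-r-300m"  # 24 transformer layers

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

def layer_representations(wav_path: str):
    """Return a list of 24 (num_frames, hidden_dim) arrays, one per transformer layer."""
    waveform, sr = sf.read(wav_path)  # GlobalPhone .wav, assumed 16 kHz mono
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    # out.hidden_states[0] is the pre-transformer feature projection; skip it.
    return [h.squeeze(0).numpy() for h in out.hidden_states[1:]]
```

Each returned array can then be pooled with the `mean_std`-style step described in the Speaker Verification section before being combined into classifier inputs.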
**Languages:** German (de), French (fr), Czech (cz), Polish (pl) **Metrics and variables:** - For each language: a training set of 2000 samples of same or different-speaker (u,v) utterance pairs, and a testing set of the same form with different samples from the test speakers - Each train/test set contains embedded .wav audio from 10 speakers x 50 utterances per speaker - Use 10 speakers for training, 10 speakers for evaluation, per language - Verification accuracy = correct label {0,1} assigned to each vector pair out of all pairs - (See [previous](#Speaker-Verification) experiment for more design information) **Classifiers:** using best-performing (highest accuracy) classifiers from [Speaker Verification](#Speaker-Verification) experiment. Available from scikit-learn: - MLPClassifier and SGDClassifier using squared_hinge loss. **Method:** - Train classifiers in each language using embeddings from the highest-accuracy test layer for the given language, as revealed in within-language speaker verification experiments that used the SGDClassifier. - Chosen layers for training: - de: 4, fr: 4, cz: 6, pl:2 - Test each classifier on each layer of each test language dataset. - For each test language, plot the maximum layer accuracy achieved by each trained classifier. ### Results ![cross-lingual-de-fr-cz-pl-v2](https://hackmd.io/_uploads/BJlviEyIp.png) (max accuracy layer bolded when layer aligns with training layer) | test_language | train_language | clf_name | max_accuracy | train_layer_acc | within_layer_acc | max_acc_layer | |:------------- |:-------------- |:--------------------------- | ------------:| ---------------:| ----------------:| -------------:| | CZ | CZ | MLPClassifier | 0.8835 | 0.8835 | 0.8835 | **6** | | CZ | CZ | SGDClassifier_squared_hinge | 0.8315 | 0.778 | 0.778 | 2 | | CZ | DE | MLPClassifier | 0.883 | 0.7315 | 0.883 | 6 | | CZ | DE | SGDClassifier_squared_hinge | 0.8345 | 0.7325 | 0.5 | 3 | | CZ | FR | MLPClassifier | 0.8515 | 0.8235 | 0.8365 | 3 | | CZ | FR | SGDClassifier_squared_hinge | 0.7645 | 0.7495 | 0.5 | 3 | | CZ | PL | MLPClassifier | 0.7435 | 0.7435 | 0.5005 | **2** | | CZ | PL | SGDClassifier_squared_hinge | 0.727 | 0.727 | 0.5 | **2** | | DE | CZ | MLPClassifier | 0.67 | 0.654 | 0.5545 | 3 | | DE | CZ | SGDClassifier_squared_hinge | 0.725 | 0.725 | 0.6495 | **6** | | DE | DE | MLPClassifier | 0.835 | 0.835 | 0.835 | **4** | | DE | DE | SGDClassifier_squared_hinge | 0.7845 | 0.7845 | 0.7845 | **4** | | DE | FR | MLPClassifier | 0.8155 | 0.7485 | 0.7485 | 2 | | DE | FR | SGDClassifier_squared_hinge | 0.693 | 0.693 | 0.693 | **4** | | DE | PL | MLPClassifier | 0.6225 | 0.6225 | 0.5 | **2** | | DE | PL | SGDClassifier_squared_hinge | 0.6095 | 0.6095 | 0.5 | **2** | | FR | CZ | MLPClassifier | 0.8565 | 0.831 | 0.806 | 3 | | FR | CZ | SGDClassifier_squared_hinge | 0.9105 | 0.8725 | 0.8135 | 3 | | FR | DE | MLPClassifier | 0.9295 | 0.9295 | 0.9295 | **4** | | FR | DE | SGDClassifier_squared_hinge | 0.8665 | 0.8665 | 0.8665 | **4** | | FR | FR | MLPClassifier | 0.931 | 0.931 | 0.931 | **4** | | FR | FR | SGDClassifier_squared_hinge | 0.7785 | 0.7785 | 0.7785 | **4** | | FR | PL | MLPClassifier | 0.7195 | 0.613 | 0.5 | 19 | | FR | PL | SGDClassifier_squared_hinge | 0.6375 | 0.6375 | 0.5 | **2** | | PL | CZ | MLPClassifier | 0.81 | 0.7895 | 0.8055 | 3 | | PL | CZ | SGDClassifier_squared_hinge | 0.8495 | 0.805 | 0.8495 | **2** | | PL | DE | MLPClassifier | 0.851 | 0.748 | 0.8355 | 3 | | PL | DE | SGDClassifier_squared_hinge | 0.786 | 0.786 | 0.7825 | **4** | | PL | FR | 
MLPClassifier | 0.861 | 0.842 | 0.861 | 2 |
| PL | FR | SGDClassifier_squared_hinge | 0.8075 | 0.8075 | 0.7565 | **4** |
| PL | PL | MLPClassifier | 0.8085 | 0.8085 | 0.8085 | **2** |
| PL | PL | SGDClassifier_squared_hinge | 0.827 | 0.827 | 0.827 | **2** |

![cross-lingual-avg-train](https://hackmd.io/_uploads/BJ1j2VyUT.png)
![cross-lingual-avg-clf](https://hackmd.io/_uploads/HkcsnE1IT.png)

### Discussion

**Analysis**
- The MLP classifier outperforms the SGD classifier on average for all test languages.
- For all test languages except Polish (PL), the MLP classifier trained on data from the same language as the test language performs best.
- For PL, performance is highest when trained on FR, then DE, then CZ, then PL. However, when the PL-trained MLP classifier is evaluated across all test languages, it still performs best on PL (PL > CZ > FR > DE); this suggests that although the classifiers trained on other languages may simply be stronger models overall, the PL-trained MLP classifier is still best matched to PL test data. This result would be expected if the classifier encodes some language-specific information during training.
- Ranked languages from the MLP classifier performance are as follows for each test language:

> | Test Language | Performance Ranking of Trained Classifiers |
> | ------------- | ------------------------------------------ |
> | DE | DE > FR > CZ > PL |
> | FR | FR > DE > CZ > PL |
> | CZ | CZ > DE > FR > PL |
> | PL | FR > DE > CZ > PL |

**Questions**
- Can the ranked classifier performance be interpreted as indicating a scale of language similarity?
    - If so, how can this scale be verified or further validated?
- normalizing for capped ability:
    - If so, should each classifier's maximum achievable performance be taken into account as a normalization factor? (Consider the PL-trained MLP, which ranks lowest in all cases because it performs worse than all the other MLP classifiers.)
    - (related to the question above) Can the classifiers be accurately ranked for relative language differences given their different performance ceilings?
- what modulates the ranking:
    - If ranked classifier performance on a cross-lingual speaker verification task does indicate a scale of language similarity, what type or measure of similarity is reflected?
    - What relationship between the training and testing language, if any, modulates ranked classifier performance on a cross-lingual speaker verification task?

**What can be explored further**
- Plotting only the accuracy on the same layer as the training layer, then comparing results
- Using a weighted classifier or fine-tuning (see the next section) in order to capture relevant information from all layer embeddings during training and testing
- Training and testing on more languages

___

## Using a Weighted Classifier *or* Fine-Tuning

https://machinelearningmastery.com/weighted-average-ensemble-for-deep-learning-neural-networks/

https://huggingface.co/docs/transformers/training#fine-tune-a-pretrained-model

___

# Context

## Situating the Investigation

(before I understood interpretability research)

- in multilingual self-supervised transformer models (such as XLS-R, wav2vec 2.0), are speaker identity and phone identity encoded in a language-independent way?
- Research on speaker and phonetic information in an English-trained model (trained on English LibriSpeech), from (Liu et al. 2023, "Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces"): "In analyses of SSL speech representations based on predictive coding models, we showed that speaker information and phonetic information are encoded in orthogonal dimensions of the representation space, indicating that these models are implicitly disentangling the two sources of information."
    - Liu et al. use this finding to normalize speaker information and produce better results in ABX phone discrimination tasks.
    - -->
        - (1) To what extent does encoded phonetic information influence speaker discrimination, and could this be revealed in tasks tested on other languages?
        - (2) Is this information similarly encoded in models trained on other languages?
- is there a common dimension for speaker identity in models encoding speech representations?
- how does learning the dimension for speaker verification in one language affect transfer abilities to other languages?
- given empirical evidence in human speech perception that speaker discrimination is affected by native-language similarity (e.g. the Language Familiarity Effect (LFE); de Seyssel et al., 2022), can similar effects be observed in probing tasks using these models?
    - if so, what measure of "language similarity" is pertinent to the transfer abilities of the model?

**Other Considerations:**
- Is this thesis identifying and solving a problem?
- Is this thesis identifying an unknown area and choosing a specific approach to investigating that area?
    - e.g. "Towards better understanding of the underlying mechanism of multilingual modeling,..." (Gonen et al., 2022, "Analyzing Gender Representation in Multilingual Models") and how dimensions or information are shared across languages
- Is there a way to incorporate a language similarity measure that considers surprisal on continuous speech input, in a way that is comparable across multiple languages?
    - Pimentel et al. (2020) compare "phonotactic complexity", i.e. bits per phoneme, using within-language measures, and note that the measure can be used for "cross-linguistic comparison"; however, it relies on a phoneme-based encoding scheme that maps sound onto pre-existing categories. Would this capture the important characteristics of a stream of speech that are available and represented in the embeddings processed by ASR models? (A toy illustration of the bits-per-phoneme measure is sketched at the end of these notes.) https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00296/43538/Phonotactic-Complexity-and-Its-Trade-offs
    - a possible example of notable markers of speech other than categorical phoneme segmentation: Malisz et al. (2018) assess the relationship of surprisal and prosody to phonetic encoding (the way the speech is produced).