###### tags: `PaperReview`

# An evaluation of word-level confidence estimation for end-to-end automatic speech recognition

> University POLITEHNICA of Bucharest, Romania
> Technical University of Cluj-Napoca, Romania
> SLT 2021

## Introduction

- In the context of automatic speech recognition (ASR), **confidence estimation can be of crucial importance** for many end-user applications.
- There are **two main challenges** in developing confidence scoring methods for ASR systems: the **structured output and the granular predictions** (e.g., tokens or graphemes versus words).
    - Structured output means that **ASR is a seq2seq mapping**, unlike flat classification tasks such as image classification.
    - Granular predictions mean that the model **predicts piece by piece** (token by token), while users care about confidence at the word level.
- The paper therefore contributes:
    - Several **state-of-the-art uncertainty estimation methods** adapted to the end-to-end ASR pipeline
    - **Aggregation techniques** to obtain user-relevant confidence estimates (i.e., word-level)

## Related Works

- **Confidence scoring for speech recognition**:
    - Most prior work on confidence scoring for ASR targets classical systems based on the **HMM-GMM paradigm**.
    - These methods **extract a set of features from the decoding lattice (log-likelihood, language model score, etc.)** and then **train a classifier** to predict whether the transcription is correct or not.
- **Confidence scoring in end-to-end systems**:
    - The baseline method for confidence estimation in neural networks is to **directly use the probability of the most-likely prediction**.
    - **Temperature scaling**: scale the predictions with a temperature $\tau$.
    - **Dropout during testing**: run inference several times with different dropout masks and average the predictions.
    - **Deep ensembles**: average the predictions of multiple independently trained models.

## Methodology

- Consider a seq2seq model that maps an **audio sequence** $\textbf{a}$ to a **sequence of tokens** $\textbf{t} = (t_1, \cdots, t_T)$; the **model, with parameters** $\theta$, is trained by minimizing losses such as CTC or KL divergence.
- The **probability of the next output token** is given by $p(t_k | \hat{\textbf{t}}_{\lt k},\textbf{a};\theta)$ and is **used for performing beam search** to obtain the most likely sequence of tokens.
- Note that the **conditional output probability is a distribution over the** $V$ **tokens** in the vocabulary, denoted by the $V$-dimensional vector $\textbf{p}_k$.
- The main assumption is **the availability of a probability distribution over each token**.

### Confidence Estimation

- Achieved in **two steps** (a code sketch of both follows the figure below):
    - Using the posterior probabilities $\textbf{p}_k$ at each timestep, **extract features that encode the confidence** of each token, $s_k^{(t)}$.
    - **Aggregate the token-level scores into word-level** confidence scores $s_j^{(w)}$, based on the **word boundaries**.
- **Feature extraction.** Two variants are used to measure confidence at the token level:
    - **Log probability** (log-proba) $\rightarrow s^{(t)}=\log \max \textbf{p}$
    - **Negative entropy** (neg-entropy) $\rightarrow s^{(t)}=\textbf{p}^\top \log \textbf{p}$
- **Aggregation.** To obtain word-level features from token-level ones, three methods are considered: **sum, average, and minimum**.
    - **Sum is desirable** since all the features are negative, so a **smaller value means lower confidence**.
    - **Minimum** implies that the **confidence should be low if there exists at least one token with low confidence**.

![image](https://hackmd.io/_uploads/HkdGmNY3p.png)
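Not from the paper's code release: a minimal NumPy sketch of the two-step procedure, assuming the token posteriors are stacked into a `(T, V)` array and that word boundaries are available as `(start, end)` token spans (the function names are mine).

```python
import numpy as np

def token_features(probs, feature="log-proba"):
    """Token-level confidence features s^(t) from posteriors.

    probs: (T, V) array with one probability distribution p_k per decoded
    token. Both features are <= 0; smaller values mean lower confidence.
    """
    eps = 1e-12  # guard against log(0)
    if feature == "log-proba":
        return np.log(probs.max(axis=1) + eps)            # log max p
    if feature == "neg-entropy":
        return (probs * np.log(probs + eps)).sum(axis=1)  # p^T log p
    raise ValueError(f"unknown feature: {feature}")

def word_scores(token_scores, word_boundaries, agg="sum"):
    """Aggregate token-level scores into word-level scores s^(w).

    word_boundaries: one (start, end) token-index span per word.
    """
    agg_fn = {"sum": np.sum, "avg": np.mean, "min": np.min}[agg]
    return np.array([agg_fn(token_scores[a:b]) for a, b in word_boundaries])

# Toy example: four decoded tokens over a three-token vocabulary, two words.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.60, 0.30, 0.10],
                  [0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25]])
s_t = token_features(probs, "log-proba")
s_w = word_scores(s_t, [(0, 2), (2, 4)], agg="min")
print(s_w)  # the second word receives the lower confidence
```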
### Improving the Token Probabilities

- **Temperature scaling**
    - Given a temperature variable $\tau$, temperature scaling updates the probabilities as:
    $$ \textbf{p}'_k = \text{softmax}(\log \textbf{p}_k/\tau) $$
    - Then **extract** $s^{(t)}$ from the updated probabilities $\textbf{p}'$, **aggregate** them into the word-level score $s^{(w)}$, and **classify** the word as either correct or incorrect:
    $$ P(\text{correct}) = \sigma(\alpha \cdot s^{(w)} + \beta) $$
    where $\alpha$, $\beta$, and $\tau$ are **parameters** learnt by optimizing the cross-entropy loss on a validation set (see the fitting sketch at the end of this section).
- **Dropout**
    - A technique that **masks out random parts** of the activations in a network.
    - Dropout **induces a probability distribution over the weights** of the network and can **consequently be used for approximate Bayesian inference**:
    $$ \textbf{p}'_k = \frac{1}{N}\sum_{n=1}^{N}\hat{\textbf{p}}_k^{(n)} $$
    where $\hat{\textbf{p}}_k^{(n)}$ denotes the prediction of the $n$-th forward pass with dropout kept active at test time.
    - In this paper $\textbf{p}'$ **is used to extract both the log-proba and the neg-entropy features**.
- **Ensembles**
    - Similar in concept to dropout, except that the **weights come from independently trained networks**:
    $$ \textbf{p}'_k = \frac{1}{N}\sum_{n=1}^{N} p(t_k | \hat{\textbf{t}}_{\lt k},\textbf{a};\theta_n) $$
    where $\left\{ \theta_n \right\}_{n=1}^N$ denotes the ensemble of models.
- Note that these **three approaches can be combined**:
    - first update the probabilities using temperature scaling,
    - then average them over dropout samples,
    - then average over multiple such models, following the ensemble principle.
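To make the calibration step concrete, here is a rough sketch (mine, not the authors' implementation; the function names, the log-proba/sum feature choice, and the Nelder-Mead optimizer are all assumptions) of fitting $\alpha$, $\beta$, $\tau$ by cross-entropy on a validation set. Dropout or ensemble averaging would simply replace the raw posteriors passed in.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

def word_scores_scaled(token_probs, boundaries, tau):
    """Temperature-scale the token posteriors, extract log-proba features,
    and sum-aggregate them into word-level scores s^(w)."""
    log_p = log_softmax(np.log(token_probs + 1e-12) / tau, axis=1)
    s_t = log_p.max(axis=1)  # log max p'_k for each token
    return np.array([s_t[a:b].sum() for a, b in boundaries])

def fit_calibration(val_probs, val_boundaries, val_labels):
    """Fit (alpha, beta, tau) by minimizing the binary cross-entropy of
    P(correct) = sigmoid(alpha * s^(w) + beta) on a validation set.

    val_probs / val_boundaries: per-utterance posteriors and word spans.
    val_labels: per-utterance binary arrays (1 = word recognized correctly).
    """
    y = np.concatenate(val_labels)

    def bce(params):
        alpha, beta, log_tau = params  # tau = exp(log_tau) keeps tau > 0
        s = np.concatenate([word_scores_scaled(p, b, np.exp(log_tau))
                            for p, b in zip(val_probs, val_boundaries)])
        z = alpha * s + beta
        # Numerically stable form of -[y log sig(z) + (1-y) log(1-sig(z))].
        return np.mean(np.logaddexp(0.0, z) - y * z)

    res = minimize(bce, x0=np.zeros(3), method="Nelder-Mead")
    alpha, beta, log_tau = res.x
    return alpha, beta, np.exp(log_tau)
```

Parametrizing the temperature as $\tau = e^{\text{log-}\tau}$ avoids constrained optimization while keeping $\tau$ positive.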
## Experiment

### Dataset

- **LibriSpeech, TED-LIUM, and CommonVoice**
- **10%** of each dataset is selected for the dev and test sets, respectively.

![image](https://hackmd.io/_uploads/ByvttPtha.png)

### ASR Systems

- An **AED-based ASR model pretrained on LibriSpeech**, provided by ESPnet, is used.
- Another **four models with the same structure but different seeds** are trained in order to apply ensembles.

### Evaluation Metrics

- A **confidence score should be correlated with the accuracy** of the transcription.
- The metrics used are the **Area Under the Precision-Recall curve (AUPR)** and the **Area Under the Receiver Operating Characteristic curve (AUROC)**; a small sketch of computing them closes these notes.
- **AUPR has two variants depending on which class is treated as positive**: **AUPR$_e$** (errors are the positive class) and **AUPR$_s$** (correct words are the positive class).
- Note that **calibration is not evaluated**, because the methodology is **not designed to necessarily yield a probability**, but rather a **score that is correlated with the label**.

## Results

### Features and Aggregation

![image](https://hackmd.io/_uploads/BJJlawtn6.png)

- Comparison of features:
    - **log-proba is better** compared to neg-entropy in most cases.
- Comparison of aggregations:
    - **Sum aggregation is better for log-proba** while **min aggregation is better for neg-entropy**, implying that sum and min are comparably good overall.
    - **Averaging generally underperforms** for both features, suggesting that **length-invariant measures are detrimental**: the figure below shows that the error rate varies with utterance length, so normalizing out length discards useful signal.

![image](https://hackmd.io/_uploads/SyeM1Oth6.png)
**Y-axis is error % while X-axis is utterance length.**

- Comparison across datasets:
    - The **in-domain data has the best performance** (2.7% WER on Libri clean and 6.0% WER on Libri other).
    - In each of these settings the number of correctly classified words changes, going from **more on the Libri splits** to **fewer on TED-LIUM and CommonVoice**, **mirroring the observations on AUPR$_e$ and AUPR$_s$**.

### Temperature Scaling and Dropout

![image](https://hackmd.io/_uploads/Bklqxdt26.png)

- **log-proba** features benefit more from **dropout**.
- **neg-entropy** features yield more improvements when **temperature scaling** is used.
- In this setup the **best result is obtained by neg-entropy with sum aggregation (row 16)**.

![image](https://hackmd.io/_uploads/Bk4Gb_YhT.png)

### Ensembles

![image](https://hackmd.io/_uploads/rJJqZuY3p.png)

- For the rows that do not use an ensemble, the **different models were evaluated separately and their results averaged**.
- The pretrained model (table 3, row 13) generally performs better than the retrained ones (table 4, row 1), suggesting that the **predictive performance of a model correlates with its confidence scoring performance**.
- Temperature scaling gives the largest performance boost on all three metrics (row 2), while dropout improves only AUPR$_s$.
- **Combining temperature scaling and ensembling obtains the best performance.**

## Conclusion

- The paper introduces an approach for word-level confidence scoring in end-to-end speech recognition systems.
- A thorough ablation study was conducted on the features and their aggregation, using three well-known speech databases (LibriSpeech, TED-LIUM, and CommonVoice).
- The study also evaluated improved methods that modify the token probabilities, as well as their combinations.
- The main observation is that temperature scaling improves both the uncertainty features (log-probabilities and neg-entropy) and the other methods (dropout and ensembles).
- Using a pre-trained model enhances replicability and facilitates comparisons with future confidence scoring methods that use the same ASR system.
- The approach aims for simplicity by relying on a compact feature set based on readily available token posteriors.
- Future work could augment these features with complementary information, such as token durations extracted from the attention weights.
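As a closing aside (not part of the paper): the three evaluation metrics can be computed with scikit-learn, using average precision as the usual estimate of AUPR. The scores and labels below are made-up word-level values.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical word-level confidence scores and labels (1 = correct word).
scores = np.array([-0.1, -2.3, -0.4, -5.0, -0.2])
correct = np.array([1, 0, 1, 0, 1])

# AUROC is symmetric in the choice of positive class.
auroc = roc_auc_score(correct, scores)

# AUPR_s: correct words are the positive class; higher score = more positive.
aupr_s = average_precision_score(correct, scores)

# AUPR_e: errors are the positive class, so flip both labels and scores.
aupr_e = average_precision_score(1 - correct, -scores)

print(f"AUROC={auroc:.3f}  AUPR_s={aupr_s:.3f}  AUPR_e={aupr_e:.3f}")
```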