###### tags: `PaperReview`
# An evaluation of word-level confidence estimation for end-to-end automatic speech recognition
> University POLITEHNICA of Bucharest, Romania
> Technical University of Cluj-Napoca, Romania
> SLT 2021
## Introduction
- In the context of automatic speech recognition (ASR), **confidence estimation can be of crucial importance** for many end-user applications.
- But there are **two main challenges** to developing confidence scoring methods for ASR systems: the **structured output and the granular predictions** (e.g., tokens or graphemes versus words).
- **Structured output** means that **ASR is a seq2seq mapping** (unlike image classification)
- **Granular** means the model **predicts piece by piece** (token-level prediction), while the user-relevant unit is the word
- To address this, the authors propose:
- Applying several **state-of-the-art uncertainty estimation methods** to the end-to-end ASR pipeline
- **Aggregation techniques** to obtain user-relevant confidence estimates (i.e., word-level)
## Related Works
- **Confidence scoring for speech recognition**:
- Most prior work on confidence scoring for ASR targets classical systems based on the **HMM-GMM paradigm**.
- These methods **extract a set of features from the decoding lattice (log-likelihood, language model score, etc.)** and then **train a classifier** to predict whether the transcription is correct or not.
- **Confidence scoring in end-to-end systems**:
- The baseline method for confidence estimation in neural networks is to use **directly the probability of the most-likely prediction**.
- **Temperature scaling**: scale the predictions with a temperature $\tau$
- **Dropout during testing**: run inference several times with different dropout masks and average the predictions.
- **Deep ensemble**: average the predictions of multiple models.
## Methodology
- Consider a seq2seq model with **parameters** $\theta$ that maps an **audio sequence** $\textbf{a}$ to a **sequence of tokens** $\textbf{t} = (t_1, \cdots, t_T)$, trained by minimizing losses such as CTC or KL divergence.
- The **probability of the next output token** is given by $p(t_k | \hat{\textbf{t}}_{\lt k},\textbf{a};\theta)$, which is **used to perform beam search** to obtain the most likely sequence of tokens.
- Note that the **conditional output probability is a distribution over the** $V$ **tokens** in the vocabulary, denoted as the $V$-dimensional vector $\textbf{p}_k$
- Their main assumption is **the availability of a probability distribution over each token**
### Confidence Estimation
- Achieved with **two steps**:
- Using the posterior probabilities at each timestep $\textbf{p}_k$, **extract features to encode the confidence score** of each token $s_k^{(t)}$
- **Aggregate token-level scores into word-level** confidence scores $s_j^{(w)}$, based on the **word boundaries**.
- **Feature Extraction.** Two variants are used to measure confidence at the token level:
- **Log probability** (log-proba) $\rightarrow s^{(t)}=\log \max \textbf{p}$
- **Negative entropy** (neg-entropy) $\rightarrow s^{(t)}=\textbf{p}^\top \log \textbf{p}$
- **Aggregation.** To obtain word-level features from token-level ones, three methods are proposed: **sum, average, and minimum** (see the sketch after this list).
- **Sum can be desirable** since all the features are negative: accumulating over tokens can only decrease the score, and a **smaller value means lower confidence**.
- **Minimum** implies that the **confidence should be low if there exists at least one low-confidence token**.
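A minimal NumPy sketch of these two steps, assuming posteriors arrive as a `(K, V)` array and word boundaries as `(start, end)` token-index pairs (both conventions are mine, not the paper's):

```python
import numpy as np

def token_features(probs, feature="log-proba"):
    """Token-level confidence features from per-token posteriors.

    probs: (K, V) array; row k is the distribution p_k over the vocabulary.
    Both variants are <= 0, and smaller means less confident.
    """
    eps = 1e-12  # numerical floor; an implementation detail, not from the paper
    if feature == "log-proba":
        return np.log(probs.max(axis=1) + eps)             # log max p
    if feature == "neg-entropy":
        return (probs * np.log(probs + eps)).sum(axis=1)   # p^T log p
    raise ValueError(f"unknown feature: {feature}")

def word_scores(token_scores, word_boundaries, agg="sum"):
    """Aggregate token scores into word scores given word boundaries."""
    agg_fn = {"sum": np.sum, "avg": np.mean, "min": np.min}[agg]
    return np.array([agg_fn(token_scores[s:e]) for s, e in word_boundaries])

# toy usage: 5 tokens forming 2 words
probs = np.array([[0.7, 0.2, 0.1],
                  [0.9, 0.05, 0.05],
                  [0.4, 0.4, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.6, 0.3, 0.1]])
s_t = token_features(probs, "log-proba")
s_w = word_scores(s_t, [(0, 2), (2, 5)], agg="min")
```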

### Improving the Token Probabilities
- **Temperature Scaling**
- Define a temperature variable $\tau$, the temperature scaling will have the following formula:
$$
\textbf{p}'_k = \text{softmax}(\log \textbf{p}_k/\tau)
$$
- Then **extract** $s^{(t)}$ from the updated probabilities $\textbf{p}'$, **aggregate** them into the word-level score $s^{(w)}$, and **classify** the word as either correct or incorrect:
$$
P(\text{correct}) = \sigma(\alpha \cdot s^{(w)} + \beta)
$$
where $\alpha$, $\beta$, and $\tau$ are **parameters** learnt by optimizing the cross-entropy loss on a validation set.
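A short sketch of these two formulas; the values of $\alpha$, $\beta$, $\tau$ are placeholders here, since the paper fits them on held-out data:

```python
import numpy as np

def temperature_scale(probs, tau):
    """p'_k = softmax(log p_k / tau), applied row-wise to a (K, V) array."""
    logits = np.log(probs + 1e-12) / tau
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def p_correct(word_score, alpha, beta):
    """sigma(alpha * s^(w) + beta): probability that a word is correct."""
    return 1.0 / (1.0 + np.exp(-(alpha * word_score + beta)))
```

In practice $\alpha$, $\beta$, and $\tau$ would be fit jointly by minimizing the binary cross-entropy of `p_correct` against word-correctness labels on the validation set, e.g. with a few steps of gradient descent.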
- **Dropout**
- A technique that **masks out random parts** of the activations in a network
- Dropout **induces a probability distribution over the weights** of the network and can be **consequently used for approximate Bayesian inference**.
$$
\textbf{p}'_k = \frac{1}{N}\sum_{n=1}^{N}\hat{\textbf{p}}_k^{(n)}
$$
where $\hat{\textbf{p}}_k^{(n)}$ denotes the prediction from the $n$-th dropout pass.
- In this paper $\textbf{p}'$ **is used to extract both log-proba and neg-entropy features**.
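A PyTorch sketch of test-time (Monte Carlo) dropout; `model(inputs)` returning `(K, V)` per-token posteriors is an illustrative assumption, not ESPnet's actual interface:

```python
import torch

@torch.no_grad()
def mc_dropout_probs(model, inputs, n_passes=5):
    """p'_k = (1/N) sum_n p_hat_k^(n): average N stochastic forward passes."""
    model.eval()
    for m in model.modules():                 # keep only dropout stochastic
        if isinstance(m, torch.nn.Dropout):
            m.train()
    passes = torch.stack([model(inputs) for _ in range(n_passes)])
    return passes.mean(dim=0)
```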
- **Ensembles**
- Similar in concept to dropout, but the **weights come from independently trained networks**.
$$
\textbf{p}'_k = \frac{1}{N}\sum_{n=1}^{N} p(t_k | \hat{\textbf{t}}_{\lt k},\textbf{a};\theta_n)
$$
where $\left\{ \theta_n \right\}_{n=1}^N$ denotes the ensemble of models.
- Note that these **three approaches can be combined** (see the sketch below):
- First update the probabilities using temperature scaling
- Then average them over dropout passes
- Then average across multiple such models following the ensemble principle
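A sketch of that combination in the order described above; as before, `models` being a list of networks mapping `inputs` to `(K, V)` posteriors is assumed for illustration:

```python
import torch

@torch.no_grad()
def combined_probs(models, inputs, tau=1.0, n_dropout=5):
    """Temperature-scale each stochastic pass, average over dropout passes,
    then average across the ensemble."""
    per_model = []
    for model in models:
        model.eval()
        for m in model.modules():             # enable dropout only
            if isinstance(m, torch.nn.Dropout):
                m.train()
        passes = []
        for _ in range(n_dropout):
            p = model(inputs)                                      # (K, V)
            passes.append(torch.softmax(torch.log(p + 1e-12) / tau, dim=-1))
        per_model.append(torch.stack(passes).mean(dim=0))          # dropout avg
    return torch.stack(per_model).mean(dim=0)                      # ensemble avg
```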
## Experiment
### Dataset
- **Librispeech, TED-LIUM, and CommonVoice**
- **10%** of the data is chosen for the dev set and another **10%** for the test set

### ASR Systems
- Use an **AED-based ASR model pretrained on LibriSpeech**, provided by ESPnet.
- Another **4 models with the same architecture but different seeds** are trained for the ensembles.
### Evaluation Metrics
- The **confidence score should be correlated with the accuracy** of the transcription.
- Use the **Area Under the Precision-Recall curve (AUPR)** and the **Area Under the Receiver Operating Characteristic curve (AUROC)**.
- **AUPR has two variants depending on which class is treated as positive**: **AUPR$_e$** (errors as the positive class) and **AUPR$_s$** (correct words as the positive class)
- Note that **calibration is not evaluated**, because the methodology here is **not designed to necessarily yield a probability**, but rather a **score that is correlated with the label**.
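These metrics can be reproduced with scikit-learn, treating word correctness as the binary label and $s^{(w)}$ as the score (`average_precision_score` is the usual step-wise approximation of AUPR); the data below is made up for illustration:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# hypothetical labels (1 = word is correct) and word-level confidence scores
word_correct = np.array([1, 1, 0, 1, 0, 1])
scores = np.array([-0.1, -0.3, -4.2, -0.2, -2.9, -0.5])

auroc = roc_auc_score(word_correct, scores)
# AUPR_s: correct words are the positive class
aupr_s = average_precision_score(word_correct, scores)
# AUPR_e: errors are the positive class, so flip labels and negate scores
aupr_e = average_precision_score(1 - word_correct, -scores)
```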
## Result
### Features and Aggregation

- Comparison of features:
- **log-proba is better** compared to neg-entropy in most cases.
- Comparison of aggregations:
- **Sum aggregation is better for log-proba**, while **min aggregation is better for neg-entropy**, implying sum and min are comparably good.
- **Averaging generally underperforms** for both features, suggesting that **length-invariant measures are detrimental**.

*(Figure: y-axis is error %, x-axis is utterance length.)*
- Comparison across datasets:
- **In-domain data has the best performance** (2.7% WER on Libri clean and 6.0% on Libri other).
- In each of these settings, the number of correctly classified words changes, going from **more on the Libri splits** to **fewer on TED and CommonVoice**, **consistent with the AUPR$_e$ and AUPR$_s$ observations**.
### Temperature Scaling and Dropout

- **log-proba** features benefit more from **dropout**.
- The **neg-entropy** feature yields larger improvements when **temperature scaling** is used.
- In this setup, the **best result is obtained from neg-entropy with sum aggregation (row 16)**.

### Ensembles

- For the rows that do not use ensembles, the **different models were evaluated separately and the results averaged**.
- The pretrained model (Table 3, row 13) generally performs better than the retrained ones (Table 4, row 1), suggesting that the **predictive performance of a model can correlate with its confidence scoring performance**.
- Temperature scaling gives the largest performance boost on all three metrics (row 2), while dropout improves only AUPR$_s$.
- Combining **temperature scaling and ensembles yields better performance**.
## Conclusion
- The paper introduces an approach for word-level confidence scoring in end-to-end speech recognition systems.
- A thorough ablation study was conducted on features and their aggregation, using three well-known speech databases (LibriSpeech, TED-LIUM, and CommonVoice).
- The study also evaluated improved methods that modify token probabilities and their combinations.
- The main observation indicates that temperature scaling enhances uncertainty features such as log-probabilities and neg-entropy, as well as other methods like dropout and ensemble.
- Using a pre-trained model enhances replicability and facilitates comparisons with future confidence scoring methods that use the same ASR system.
- The approach aims for simplicity by utilizing a compact feature set based on readily-available token posteriors.
- Future work is proposed to augment these features with complementary information, such as token durations extracted from attention.