# Multilabel considerations

###### tags: `Definitions and References`

## Definition

- $X$ is a $d$-dimensional input space of numerical or categorical features.
- $L = \{\lambda_1, \lambda_2, \ldots, \lambda_q\}$ is an output space of $q$ labels, with $q>1$. Each subset of $L$ is called a labelset.
- $(x,Y)$, where $x = (x_1, \ldots, x_d) \in X$, is a $d$-dimensional instance with an associated set of labels $Y \subseteq L$. Label associations can also be represented as a $q$-dimensional binary vector $y = (y_1, \ldots, y_q) \in \{0, 1\}^q$, where each element is 1 if the label is relevant and 0 otherwise.

![](https://i.imgur.com/7E7IkTk.png)

## Evaluation metrics

The characteristics of a multilabel dataset, as well as the quality of multilabel classifiers, are assessed with different metrics. The main goal is to describe the behaviour of the dataset and of the classification itself. These metrics fall into two groups: dataset characterization metrics and evaluation metrics [^f_2].

### Dataset metrics

To characterize the dataset, two measures are frequently used:

1. **Label Cardinality (LC)**: the average number of single labels associated with each example
   $LC(D)=\frac{1}{\left | D \right |}\sum_{i=1}^{\left | D \right |}\left | {Y_{i}} \right |$
2. **Label Density (LD)**: the cardinality normalized by the number of labels
   $LD(D)=\frac{1}{\left | D \right |}\sum_{i=1}^{|D|}\frac{|Y_{i}|}{|L|}=\frac{LC(D)}{|L|}$

### Evaluation metrics

The evaluation of singlelabel classifiers has only two possible outcomes: **correct** or **incorrect**. The evaluation of multilabel classifiers, in contrast, may also take partially correct classifications into account. The metrics presented below use different criteria to evaluate the partial correctness of multilabel classifiers and are divided into [^f_1]:

1. **Example-based metrics**
   They consider the difference between $Y_i$ (the true labelset) and $Z_i$ (the predicted labelset). All these performance measures range in the interval [0, 1].
   - **Hamming Loss**[^fm_4]
     Evaluates how many times, on average, an example-label pair is misclassified, i.e. the fraction of labels that are incorrectly predicted.
     $HammingLoss=\frac{1}{t}\sum_{i=1}^{t} \frac{1}{q}\left | Z_i \Delta Y_i \right |$
     where $\Delta$ is the symmetric difference of two sets and $\frac{1}{q}$ is the factor that normalizes the value to [0, 1].
     > [color=#351de5] The smaller the value, the better the multilabel classifier performs.
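As a concrete illustration of the measures above (a minimal sketch, not part of the Conabio_ML API; the toy matrices and variable names are assumptions, and scikit-learn is assumed to be available), label cardinality, label density and Hamming loss can be computed on binary indicator matrices as follows:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Toy data: 4 examples, q = 3 labels, encoded as binary indicator matrices.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# Dataset metrics: label cardinality (average labels per example)
# and label density (cardinality normalized by |L|).
lc = Y_true.sum(axis=1).mean()
ld = lc / Y_true.shape[1]

# Example-based metric: fraction of misclassified example-label pairs.
hl = hamming_loss(Y_true, Y_pred)

print(f"LC = {lc:.2f}, LD = {ld:.2f}, Hamming loss = {hl:.2f}")
```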
2. **Label-based metrics**
   It is also common to take binary metrics from the information retrieval area and apply them to the performance per label. Any binary evaluation metric can be used with this approach, most commonly **precision, recall, accuracy and F1-score**.

   *Literally from* [^f_1]: the idea is to compute a singlelabel metric for each label, based on the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Since there are several labels per pattern, there is one contingency table per label, so an average value has to be computed. Two different approaches can be used: **macro** and **micro**.

   Let $B$ be a binary evaluation measure. The macro approach computes the metric for each label and then averages the values over all the categories:
   $B_{macro}=\frac{1}{q}\sum_{i=1}^{q} B(TP_i, FP_i, TN_i, FN_i)$
   The micro approach considers the predictions from all instances together (aggregating the values of all the contingency tables) and then calculates the measure across all labels:
   $B_{micro}=B(\sum_{i=1}^{q} TP_i, \sum_{i=1}^{q}FP_i, \sum_{i=1}^{q}TN_i, \sum_{i=1}^{q} FN_i)$
   > [color=#351de5] Macro-averaged scores give equal weight to every category, regardless of its frequency (per-category averaging), and are more influenced by the performance on rare categories. This approach is preferable when the system must perform consistently across all classes regardless of class frequency.
   > [color=#351de5] Micro-averaged scores give equal weight to every example (per-example averaging) and tend to be dominated by the performance on the most common categories; they may be preferable when the density of the class is important.

We will use **precision**, **recall** and **F1-score**, since they are the most widespread and expressive metrics for information retrieval systems [^f_3]. They are defined as follows:

- **Precision**[^fm_2]
  The proportion of predicted positive labels that are correctly classified.
  $precision=\frac{1}{t}\sum_{i=1}^{t} \frac{ \left | Z_i\bigcap Y_i \right |}{\left | Z_i \right |}$
- **Recall**[^fm_3]
  The fraction of the actual labels that are correctly predicted.
  $recall=\frac{1}{t}\sum_{i=1}^{t} \frac{ \left | Z_i\bigcap Y_i \right |}{\left | Y_i \right |}$
- **F1-score**[^fm_1]
  Combines precision and recall as their harmonic mean:
  $F1\_score=\frac{1}{t}\sum_{i=1}^{t} \frac{2 \left | Z_i\bigcap Y_i \right |}{\left |Y_i \right | + \left | Z_i \right |}$

### Final notes

If a more granular analysis of a specific metric is needed, a simple approach is to compute the metric batch-wise. As a result, you obtain its variability over representative portions of the predicted samples.

## Default metrics

The following set of metrics will be calculated as the default option in the Conabio_ML API:

- Hamming loss
- Precision (with its corresponding average)
- Recall (with its corresponding average)
- F1-score (with its corresponding average)

Configured as follows:

```python
"eval_config": {
    'metrics_set': {
        ClassificationDefaults.MULTILABEL: {
            'average': 'micro|macro|weighted',
            'batch_ratio': float,
            'thresshold': float,
            'per_sample': bool
        }
    }
    [ . . . ]
}
```

And the resulting metrics are reported as follows:

```python
{
    'per_class': {
        'label1': {
            'precision': float,
            'recall': float,
            'f1': float
        },
        [ . . . ]
    },
    'one_class': {
        'precision': numpy.ndarray,
        'recall': numpy.ndarray,
        'f1_score': numpy.ndarray,
        'hamming_loss': numpy.ndarray
        # The size of each array corresponds to the batch length
        # configured in batch_ratio, according to [Notes]
    }
}
```

## Notes

Note that all the metrics explained above assume that a specific classification threshold has previously been chosen or calculated. If no threshold is set, an analysis of operating characteristics should be performed. A receiver operating characteristic (ROC) curve plots the performance, in terms of TPR and FPR, for **all the thresholds**. The more the classification threshold is lowered, the more observations are classified as positive, so the FP:TP ratio rises.

![](https://i.imgur.com/jdUHId4.png)
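As a complement to the notes above, the following minimal sketch (independent of the Conabio_ML API; the scores, the threshold value and the variable names are assumptions, and scikit-learn is assumed to be available) applies a fixed classification threshold to predicted scores and then computes the default metrics with both averaging strategies:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, hamming_loss

# Toy ground truth and predicted scores (e.g. sigmoid outputs)
# for 4 examples and q = 3 labels.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
scores = np.array([[0.9, 0.2, 0.4],
                   [0.1, 0.8, 0.3],
                   [0.7, 0.4, 0.2],
                   [0.3, 0.6, 0.9]])

# Apply a previously chosen classification threshold (hypothetical value).
threshold = 0.5
Y_pred = (scores >= threshold).astype(int)

# Label-based metrics with micro and macro averaging, plus Hamming loss.
for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        Y_true, Y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"hamming_loss={hamming_loss(Y_true, Y_pred):.2f}")
```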
[^f_1]: [A Tutorial on Multi-Label Learning](https://www.researchgate.net/publication/270337594_a_tutorial_on_multi-label_learning)
[^f_2]: [Lazy Multi-label Learning Algorithms Based on Mutuality Strategies](https://www.researchgate.net/publication/277666089_Lazy_Multi-label_Learning_Algorithms_Based_on_Mutuality_Strategies?enrichId=rgreq-62b3785fe05012162ac93a890b505365-XXX&enrichSource=Y292ZXJQYWdlOzI3NzY2NjA4OTtBUzoyMzYzOTEwMTE0NTA4ODBAMTQzMzM3MTQxNzI3NA%3D%3D&el=1_x_3&_esc=publicationCoverPdf)
[^f_3]: [Evaluating Information Retrieval System Performance Based on User Preference](https://www.researchgate.net/publication/250680760_Criteria_for_Evaluating_Information_Retrieval_Systems_in_Highly_Dynamic_Environments/link/56b0553408ae8e37214d2ecf/download)
[^f_4]: [The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets](https://www.researchgate.net/publication/273155496_The_Precision-Recall_Plot_Is_More_Informative_than_the_ROC_Plot_When_Evaluating_Binary_Classifiers_on_Imbalanced_Datasets/link/55967b3b08ae21086d20c882/download)
[^fm_1]: [F1 score (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
[^fm_2]: [Precision (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html?highlight=precision#sklearn.metrics.precision_score)
[^fm_3]: [Recall (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html?highlight=recall#sklearn.metrics.recall_score)
[^fm_4]: [Hamming loss (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html?highlight=hamming#sklearn.metrics.hamming_loss)