# Jaccard Index

1. *Definition*: Given two finite sample sets $A$ and $B$, the **Jaccard index** between $A$ and $B$ is defined by:
$$ J(A,B) = \dfrac{\abs{A \cap B}}{\abs{A \cup B}} = \dfrac{\abs{A \cap B}}{\abs{A} + \abs{B} - \abs{A \cap B}} $$
*'The Jaccard index is widely used in computer science, ecology, genomics, and other sciences, where binary or binarized data are used.'* - [Wikipedia](https://en.wikipedia.org/wiki/Jaccard_index#Overview)
2. *Property*: By design, the Jaccard index between two sets is symmetric and takes values in the interval $[0,1]$.
3. *Meaning*: The Jaccard coefficient measures the similarity between finite sample sets.
4. *Variant 1*: We can measure the *dissimilarity* between two finite sets $A$ and $B$ with the **Jaccard distance**, defined by:
$$ d_J(A,B) = 1 - J(A,B) = \dfrac{\abs{A \cup B} - \abs{A \cap B}}{\abs{A \cup B}} $$
5. *Usage*: The Jaccard distance is commonly used to compute an $n \times n$ distance matrix as the input for clustering and multidimensional scaling (MDS) of $n$ sample sets.
6. *Variant 2*: If ${\bf x}=(x_1,x_2,...,x_n)$ and ${\bf y}=(y_1,y_2,...,y_n)$ are two vectors with all real $x_i, y_i \geq 0$, then their (weighted) **Jaccard similarity coefficient** is defined as:
$$J_\mathcal{W}({\bf x}, {\bf y}) = \dfrac{\sum_{i} \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$
Similarly, the (weighted) Jaccard distance is $d_{J \mathcal{W}}({\bf x}, {\bf y}) = 1 - J_\mathcal{W}({\bf x}, {\bf y})$.

# Sørensen–Dice coefficient

1. *Definition*: Also known as the F1 score, given two finite sets $X$ and $Y$, the **Sørensen–Dice coefficient** is defined by:
$$DSC = \dfrac{2 \abs{X \cap Y}}{\abs{X} + \abs{Y}}$$
2. *Property*:
    - The F1 score between two finite sets is symmetric and takes values in the interval $[0,1]$.
    - In relation to the Jaccard index $J$ between two finite sets, the Sørensen–Dice coefficient $S$ satisfies $S = \dfrac{2J}{1+J}$ and $J = \dfrac{S}{2-S}$.
3. *Meaning*: F1 scores are used to gauge the similarity of two samples.
4. *Variant 1*: Given two binary vectors ${\bf a}$ and ${\bf b}$, the Sørensen–Dice index can be formulated as
$$s_v = \dfrac{2 \abs{{\bf a} \cdot {\bf b}}}{\abs{{\bf a}}^2 + \abs{{\bf b}}^2}$$
5. *Variant 2*: For a discrete ground truth and continuous measures, the **continuous Dice coefficient** is defined by:
$$ cDC = \dfrac{2 \abs{X \cap Y}}{c \ast \abs{X} + \abs{Y}} $$
where $c$ can be computed as follows:
$$c = \dfrac{\sum a_i b_i}{\sum a_i \hspace{.1cm} \text{sign} \hspace{.1cm}(b_i)}$$
If $\sum a_i \hspace{.1cm} \text{sign} \hspace{.1cm}(b_i) = 0$, which means there is no overlap between $X$ and $Y$, $c$ is set to $1$.
6. *Usage*: The F1 score is useful for ecological community data, the justification for which is primarily empirical. Other uses include measuring the lexical association score of two given words in computational lexicography and comparing algorithm output against reference masks in medical applications of image segmentation.

# Overlap coefficient

1. *Definition*: Given two finite sets $X$ and $Y$, the **overlap coefficient** between them is calculated by:
$$\text{overlap} \hspace{.1cm} (X,Y) = \dfrac{\abs{X \cap Y}}{\min(\abs{X}, \abs{Y})}$$
2. *Meaning*: The overlap coefficient is a similarity measure which measures the overlap between two finite sets.
3. *Property*:
    - The overlap coefficient, by definition, is symmetric and takes values in the interval $[0,1]$.
    - If one set is a subset of the other, then the overlap coefficient is $1$.
4. *Usage*: This similarity measure has seen some use in text mining, e.g. in [this paper](https://www.aircconline.com/mlaij/V3N1/3116mlaij03.pdf), though other similarity measures serve there as well.

# Tversky index

1. *Definition*: For finite sets $X$ and $Y$, the **Tversky index** is calculated by:
$$ S(X,Y) = \dfrac{\abs{X \cap Y}}{\abs{X \cap Y} + \alpha \abs{X-Y} + \beta \abs{Y-X}} $$
where $X - Y$ denotes the relative complement of $Y$ in $X$, and $\alpha, \beta \geq 0$ are parameters.
2. *Meaning*: The Tversky index is a similarity measure on sets that is symmetric or asymmetric depending on the parameters.
3. *Property*:
    - If $\alpha \neq \beta$, then the Tversky index is indeed asymmetric.
    - Setting $\alpha=\beta=1$ gives the [Tanimoto coefficient](http://www.biotech.fyicenter.com/1000134_What_Is_Tanimoto_coefficient.html), i.e. the Jaccard index.
    - Setting $\alpha=\beta=0.5$ gives the F1 score.
4. *Variant*: If symmetry is needed, a variant of the original formulation has been proposed using the max and min functions:
$$S(X,Y) = \dfrac{\abs{X \cap Y}}{\abs{X \cap Y} + \beta(\alpha a + (1-\alpha)b)}$$
where $a = \min(\abs{X - Y}, \abs{Y - X})$ and $b = \max(\abs{X-Y}, \abs{Y-X})$.
5. *Usage*: The Tversky index is a generalization of the F1 score and the Jaccard index, so any application of those two is an application of the Tversky index.
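
The set-based definitions above (Jaccard index, Sørensen–Dice coefficient, overlap coefficient) can be sketched directly on Python sets. This is a minimal illustration, not library code; the function names and the convention of returning $1$ for two empty sets are my own choices.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index J(A,B) = |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:                       # convention: J(∅, ∅) = 1
        return 1.0
    return len(a & b) / len(union)

def dice(a: set, b: set) -> float:
    """Sørensen–Dice coefficient DSC = 2|X ∩ Y| / (|X| + |Y|)."""
    if not a and not b:                 # same convention for two empty sets
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a: set, b: set) -> float:
    """Overlap coefficient |X ∩ Y| / min(|X|, |Y|)."""
    m = min(len(a), len(b))
    return 1.0 if m == 0 else len(a & b) / m

x, y = {1, 2, 3, 4}, {3, 4, 5, 6}
j, s = jaccard(x, y), dice(x, y)
print(j)                                # 2/6 ≈ 0.3333
print(s)                                # 4/8 = 0.5
# Check the stated identity S = 2J / (1 + J):
assert abs(s - 2 * j / (1 + j)) < 1e-12
# A subset always has overlap coefficient 1:
assert overlap({1, 2}, {1, 2, 3}) == 1.0
```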
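
The weighted Jaccard similarity (*Variant 2* of the Jaccard section) admits an equally short sketch; on $0/1$ indicator vectors, $\min$ and $\max$ act as AND and OR, so it reduces to the plain Jaccard index of the supports. The function name and the zero-vector convention are again illustrative choices.

```python
def weighted_jaccard(x, y) -> float:
    """Weighted Jaccard J_W(x, y) = Σ min(x_i, y_i) / Σ max(x_i, y_i),
    for nonnegative real vectors of equal length."""
    assert len(x) == len(y)
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 if den == 0 else num / den   # convention for two zero vectors

# min terms: 0.5, 0, 0, 1 → 1.5; max terms: 0.5, 1, 1, 2 → 4.5
print(weighted_jaccard([0.5, 1.0, 0.0, 2.0], [0.5, 0.0, 1.0, 1.0]))  # 1.5/4.5 ≈ 0.3333
# On 0/1 indicator vectors it equals the set Jaccard of the supports:
print(weighted_jaccard([1, 1, 0, 0], [0, 1, 1, 0]))                  # 1/3
```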
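
Finally, a sketch of the Tversky index that makes the generalization concrete: with $\alpha=\beta=1$ it recovers the Jaccard index, with $\alpha=\beta=0.5$ the Dice coefficient, and with $\alpha \neq \beta$ it is asymmetric. The function name and the empty-denominator convention are illustrative.

```python
def tversky(x: set, y: set, alpha: float, beta: float) -> float:
    """Tversky index |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|)."""
    inter = len(x & y)
    den = inter + alpha * len(x - y) + beta * len(y - x)
    return 1.0 if den == 0 else inter / den

x, y = {1, 2, 3, 4}, {3, 4, 5, 6}
# α = β = 1 recovers the Jaccard index (here 2/6):
print(tversky(x, y, 1, 1))              # ≈ 0.3333
# α = β = 0.5 recovers the Dice coefficient (here 4/8):
print(tversky(x, y, 0.5, 0.5))          # 0.5
# α ≠ β weights the two set differences unequally, breaking symmetry:
print(tversky({1, 2}, {2, 3, 4}, 1, 0))  # 1/(1+1) = 0.5
print(tversky({2, 3, 4}, {1, 2}, 1, 0))  # 1/(1+2) ≈ 0.3333
```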