# Correlation Coefficients
###### tags: `Data Science`, `Metrics`
## Dataset
Suppose we have two vectors of continuous random variables $\pmb{x}$ and $\pmb{y}$, both with $n$ elements.
## Pearson's Correlation Coefficient
Pearson's correlation coefficient measures the __linear__ relationship between two random variables. It is __sensitive to outliers__ and requires the data to be measured on an __interval__ or __ratio scale__.
Pearson's correlation coefficient is defined as follows:
$r_{pearson}=\frac{\sum_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_{i=1}^n(x_i-\overline{x})^2}\sqrt{\sum_{i=1}^n(y_i-\overline{y})^2}}$
Under the null hypothesis that the two variables are uncorrelated ($\rho=0$), and assuming the data are approximately bivariate normal, the quantity
$t=\frac{r_{pearson}\sqrt{(n-2)}}{\sqrt{(1-r_{pearson}^2)}}$
follows a Student's t distribution with $n-2$ degrees of freedom.
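The two formulas above can be checked by hand with a minimal sketch in plain Python. The function names `pearson_r` and `t_statistic` and the sample data are illustrative assumptions, not part of any library.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed directly from the definition."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]  # deviations from the mean of x
    dy = [yi - mean_y for yi in y]  # deviations from the mean of y
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator

def t_statistic(r, n):
    """t statistic with n - 2 degrees of freedom for testing rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Made-up sample data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(f"r = {r:.4f}, t = {t:.4f}")
```

For this sample the cross-product sum is 8 and both sums of squares are 10, so $r = 0.8$. Note that `t_statistic` is undefined when $|r| = 1$.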
## Spearman's Rank Correlation Coefficient
Spearman's rank correlation coefficient is a __nonparametric__ measure of the __monotonic__ (not necessarily linear) relationship between two random variables. It does not require the data to be measured on an interval or ratio scale, so it can also be used for ordinal data. However, because the data must be ranked first, it is more computationally intensive than Pearson's correlation coefficient on large-scale data.
To calculate Spearman's rank correlation coefficient, first compute the ranks of $\pmb{x}$ and $\pmb{y}$, denoted as $\pmb{rk^{(x)}}$ and $\pmb{rk^{(y)}}$. Tied values are assigned the average of the ranks they span. Spearman's correlation coefficient is then Pearson's correlation coefficient applied to the ranks:
$r_{spearman}=\frac{\sum_{i=1}^{n}(rk^{(x)}_i-\overline{rk^{(x)}})(rk_i^{(y)}-\overline{rk^{(y)}})}{\sqrt{\sum_{i=1}^n(rk^{(x)}_i-\overline{rk^{(x)}})^2}\sqrt{\sum_{i=1}^n(rk_i^{(y)}-\overline{rk^{(y)}})^2}}$
Under the null hypothesis that the two variables are independent ($\rho=0$), the quantity
$t=\frac{r_{spearman}\sqrt{(n-2)}}{\sqrt{(1-r_{spearman}^2)}}$
approximately follows a Student's t distribution with $n-2$ degrees of freedom.
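As a rough sketch of the ranking step and the resulting coefficient, the following plain-Python example assigns average ranks to ties and then applies the Pearson formula to the ranks. The helper names `average_ranks` and `spearman_r` and the sample data are assumptions made for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    return ranks

def spearman_r(x, y):
    """Spearman's coefficient: the Pearson formula applied to the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / n
    dx = [a - mean_rx for a in rx]
    dy = [b - mean_ry for b in ry]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator

# Made-up sample with one tie in x (the two 20s share rank 2.5).
x = [10, 20, 20, 30, 40]
y = [1, 3, 2, 5, 4]
ranks_x = average_ranks(x)
rho = spearman_r(x, y)
print(ranks_x, round(rho, 4))
```

The sort in `average_ranks` is where the extra cost relative to Pearson's coefficient comes from.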
## Kendall's Rank Correlation Coefficient
Kendall's rank correlation coefficient is a __nonparametric__ measure of the __monotonic__ (not necessarily linear) relationship between two random variables. In most situations, Kendall's and Spearman's rank correlation coefficients are interpreted very similarly and usually lead to the same inferences. However, the p-values for Kendall's rank correlation coefficient tend to be more accurate with smaller sample sizes.
To calculate Kendall's rank correlation coefficient, consider every pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ with $i < j$. The pair is __concordant__ if (1) $x_i > x_j \land y_i > y_j$ or (2) $x_i < x_j \land y_i < y_j$; it is __discordant__ if (1) $x_i > x_j \land y_i < y_j$ or (2) $x_i < x_j \land y_i > y_j$; and it is __tied__ if $x_i = x_j \lor y_i = y_j$. The coefficient is then defined as:
$\tau=\frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\binom{n}{2}}$
For the adjustment for tied values, please refer to [Wikipedia (Kendall's Rank Correlation Coefficient: Accounting for Ties)](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient#Accounting_for_ties)
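The pair counting can be sketched directly in plain Python. This is the unadjusted form of the coefficient (tied pairs count toward neither total and the denominator is not corrected), and the function name `kendall_tau` and sample data are illustrative assumptions. The loop visits all $\binom{n}{2}$ pairs, so it is quadratic in $n$.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau without tie adjustment, by brute-force pair counting."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether x and y move in the same direction.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means the pair is tied in x or y; it is not counted.
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Made-up sample data: 8 concordant and 2 discordant pairs out of 10.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)
print(tau)
```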
## Biweight Midcorrelation Coefficient
The biweight midcorrelation coefficient measures the __linear__ relationship between two random variables. It is median-based and is therefore less sensitive to outliers. It is widely used in weighted correlation network analysis and can be considered a robust alternative to Pearson's correlation coefficient.
To calculate the biweight midcorrelation coefficient, we first define the median absolute deviation as:
$mad(\pmb{x})=med(|\pmb{x}-med(\pmb{x})|)$
Then, calculate $\pmb{u}$ and $\pmb{v}$ for $\pmb{x}$ and $\pmb{y}$ respectively:
$u_i=\frac{x_i-med(\pmb{x})}{9mad(\pmb{x})}$
$v_i=\frac{y_i-med(\pmb{y})}{9mad(\pmb{y})}$
Based on the value $\pmb{u}$ and $\pmb{v}$, further define the weight to be:
$w^{(x)}_i=(1-u^2_i)^2I(1-|u_i|)$
$w^{(y)}_i=(1-v^2_i)^2I(1-|v_i|)$
where the indicator function $I(1-|u_i|)$ equals $1$ if $1-|u_i| > 0$ and $0$ otherwise.
Finally, the biweight midcorrelation coefficient can be calculated as follows:
$bicor(\pmb{x}, \pmb{y})=\frac{\sum^n_{i=1}(x_i-med(\pmb{x}))w^{(x)}_i(y_i-med(\pmb{y}))w^{(y)}_i}{\sqrt{\sum^n_{j=1}[(x_j-med(\pmb{x}))w^{(x)}_j]^2}\sqrt{\sum^n_{k=1}[(y_k-med(\pmb{y}))w^{(y)}_k]^2}}$
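The steps above (median, MAD, scaled deviations, weights, weighted correlation) can be sketched in plain Python as follows. The function name `bicor`, the inner helper, and the sample data are assumptions for illustration. The sketch raises an error when the MAD is zero, a case the formulas above leave undefined.

```python
import math
from statistics import median

def bicor(x, y):
    """Biweight midcorrelation following the median/MAD weighting scheme."""
    def weighted_deviations(values):
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            raise ValueError("MAD is zero; the weights are undefined")
        result = []
        for v in values:
            u = (v - med) / (9 * mad)
            # Points with |u| >= 1 get weight 0 and drop out of the sums.
            w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
            result.append((v - med) * w)
        return result

    a = weighted_deviations(x)
    b = weighted_deviations(y)
    numerator = sum(p * q for p, q in zip(a, b))
    denominator = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return numerator / denominator

# Made-up sample: y equals x except for one extreme value in the last position.
x = [1, 2, 3, 4, 5, 6, 7]
y_outlier = [1, 2, 3, 4, 5, 6, 100]
b_same = bicor(x, x)
b_outlier = bicor(x, y_outlier)
print(round(b_same, 4), round(b_outlier, 4))
```

In this sample the value 100 lies more than $9\,mad(\pmb{y})$ from the median, so it receives zero weight and contributes nothing to the numerator.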
## Lin's Concordance Correlation Coefficient
The concordance correlation coefficient measures the agreement between two random variables.
Lin's concordance correlation coefficient is defined as follows:
$r_{concordance}=\frac{2s_{xy}}{s_x^2+s_y^2+(\overline{x}-\overline{y})^2}$
where
$\overline{x}=\frac{1}{n}\sum_{i=1}^nx_i$,
$s_x^2=\frac{1}{n}\sum_{i=1}^n(x_i-\overline{x})^2$,
$s_{xy}=\frac{1}{n}\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})$,
and $\overline{y}$ and $s_y^2$ are defined analogously for $\pmb{y}$.
Note that the concordance correlation coefficient may be computed slightly differently between implementations.
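A minimal plain-Python sketch of the definition above, using the $\frac{1}{n}$ normalization for the variances and covariance. The function name `concordance_cc` and the sample data are illustrative assumptions. Implementations that normalize by $n-1$ instead will give slightly different values.

```python
def concordance_cc(x, y):
    """Lin's concordance correlation coefficient with 1/n normalization."""
    n = len(x)
    if n != len(y) or n < 1:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

# Made-up sample: y_shifted is x plus a constant offset of 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_shifted = [xi + 1.0 for xi in x]
ccc_same = concordance_cc(x, x)
ccc_shifted = concordance_cc(x, y_shifted)
print(ccc_same, ccc_shifted)
```

The shifted series is perfectly linearly related to `x`, so Pearson's coefficient would be 1, but the constant offset lowers the concordance to $\frac{2 \cdot 2}{2 + 2 + 1} = 0.8$. This illustrates that the coefficient penalizes disagreement in location as well as lack of correlation.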
## Reference
* [Wikipedia (Pearson Correlation Coefficient)](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
* [Wikipedia (Spearman's Rank Correlation Coefficient)](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
* [Wikipedia (Biweight Midcorrelation)](https://en.wikipedia.org/wiki/Biweight_midcorrelation)
* [Wikipedia (Concordance Correlation Coefficient)](https://en.wikipedia.org/wiki/Concordance_correlation_coefficient)
* [Lumen Introduction to Statistic: Testing the Significance of the Correlation Coefficient](https://courses.lumenlearning.com/introstats1/chapter/testing-the-significance-of-the-correlation-coefficient/)
* [Statistics Solutions: Kendall’s Tau and Spearman’s Rank Correlation Coefficient](https://www.statisticssolutions.com/kendalls-tau-and-spearmans-rank-correlation-coefficient/)
* Kumari, S., J. Nie, H. S. Chen, H. Ma, R. Stewart, X. Li, M. Z. Lu, W. M. Taylor, and H. Wei. 2012. 'Evaluation of gene association methods for coexpression network construction and biological knowledge discovery', PLoS One, 7: e50411.
* Zheng, C. H., L. Yuan, W. Sha, and Z. L. Sun. 2014. 'Gene differential coexpression analysis based on biweight correlation and maximum clique', BMC Bioinformatics, 15 Suppl 15: S3