---
title: "ClustCheck"
author: "O + A + F"
date: "7 december 2020"
output: md_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width=12, fig.height=6)
```
# ClustCheck
The ClustCheck package provides tools to check and analyse the results of a clustering produced by a machine learning algorithm. It combines evaluation metrics and graphical visualisations to show which variables drive the structure of the clusters and to assess the quality of the clustering.
## Installing the package
```{r}
# Install from GitHub (uncomment to run; requires the devtools package)
#devtools::install_github("adrienPAVOINE/ClustCheck")
```
## Tutorial for package usage
### Loading the library
Once the package is installed, the library can be loaded using the standard R command.
```{r}
library(ClustCheck)
```
### Dataset Import
First of all, you need to import a dataset (with numerical variables, categorical variables, or both). In this example, we'll be using the BankCustomer dataset, which is included in the ClustCheck package.
```{r}
BankCustomer <- ClustCheck::BankCustomer
```
The BankCustomer dataset includes information about bank customers and contains 4 categorical variables and 4 numerical variables. To test the package with this dataset, call the *Dataset()* function to create your class object. The function takes a data frame and a vector of predicted cluster groups as input. A vector of real cluster groups can be passed as an optional input (it is used by the *EvaluateC()* function).
```{r}
cbank <- ClustCheck::Dataset(BankCustomer, BankCustomer$Cluster)
```
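If you do not yet have a vector of cluster groups, any clustering algorithm can produce one. The sketch below is an illustration only: it uses base R's `kmeans()` on the built-in `iris` data (not part of ClustCheck), and the names `toy_df`, `toy_fit` and `toy_clusters` are made up for this example. The commented call at the end is an assumption that simply mirrors the *Dataset()* call shown above.

```{r}
# Hypothetical example data: the four numerical columns of iris
set.seed(42)
toy_df <- iris[, 1:4]

# Cluster the standardised data into 3 groups with k-means
toy_fit <- kmeans(scale(toy_df), centers = 3, nstart = 10)
toy_clusters <- toy_fit$cluster

# One cluster label per row of the data frame
table(toy_clusters)

# Assumed usage, mirroring the call shown above:
# toy_obj <- ClustCheck::Dataset(toy_df, toy_clusters)
```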
Now that our dataset has been loaded into the object *cbank*, we can proceed with the cluster analysis. If you have any doubt about a function, use **help(name_function)**, for example **help(Dataset)**.
**Note: In the following code, *cbank* is the object instantiated by the Dataset() function shown above.**

---
### Analysing the clustering with single variables
Let's start our analysis by looking at how single variables influence the clustering.
We can deal with two possible variable types: categorical variables and numerical ones. We'll start with the former.
#### Categorical variables
- Cramer's V
Here, the Cramer's V values will be computed for the 4 categorical variables of our dataset.
```{r}
ClustCheck::vcramer(cbank)
```
To return the Cramer's V values only between the cluster groups and the variable *profession* for example, the variable needs to be entered as a function input.
```{r}
ClustCheck::vcramer(cbank, var = BankCustomer$profession)
```
A bar graph of the values can be plotted with the function *plotVCramer()*.
```{r}
ClustCheck::plotVCramer(cbank)
```
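For readers who want to see what the statistic measures, here is a minimal base R sketch of Cramer's V, defined as the square root of chi-squared divided by n times (min(rows, columns) - 1). It runs on made-up toy vectors, and the helper `cramers_v()` is hypothetical: it is not the package's implementation, which may differ in detail.

```{r}
# Toy data (hypothetical, not the BankCustomer dataset)
toy_cluster <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
toy_job     <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "c", "a"))

# Cramer's V between a categorical variable x and the cluster groups g
cramers_v <- function(x, g) {
  tab  <- table(x, g)
  # Small expected counts trigger a warning that is irrelevant here
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab)) - 1
  sqrt(as.numeric(chi2) / (n * k))
}

toy_v <- cramers_v(toy_job, toy_cluster)
toy_v
```

A value of 0 means the variable is independent of the clusters, and a value of 1 means the variable determines the cluster perfectly.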
- V-Test
The test values can be computed either against the modes of a categorical variable (e.g. profession) with the function *tvalue_cat()* or against the numerical variables with the function *tvalue_num()*. We use *tvalue_cat()* in our example below.
```{r}
ClustCheck::tvalue_cat(cbank, var = BankCustomer$profession)
```
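As an illustration of the idea, the sketch below computes a test value for each (cluster, mode) pair using the usual normal approximation to the hypergeometric distribution: the observed count is compared with the count expected if the mode were spread evenly across clusters. The helper `vtest_cat()` and the toy vectors are hypothetical, and we assume rather than know that the package follows the same formula.

```{r}
# Toy data (hypothetical, not the BankCustomer dataset)
toy_cluster <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
toy_job     <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "c", "a"))

vtest_cat <- function(x, g) {
  tab <- table(g, x)                 # rows: clusters, columns: modes
  n   <- sum(tab)
  n_k <- rowSums(tab)                # cluster sizes
  p_j <- colSums(tab) / n            # overall share of each mode
  expected <- outer(n_k, p_j)
  # Hypergeometric variance of the count of mode j in cluster k
  variance <- outer(n_k * (n - n_k) / (n - 1), p_j * (1 - p_j))
  (tab - expected) / sqrt(variance)
}

toy_vt <- vtest_cat(toy_job, toy_cluster)
round(toy_vt, 2)
```

Large positive values flag modes that are over-represented in a cluster, and large negative values flag under-represented ones.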
- Phi-value
The Phi values can be computed against the modes of a categorical variable (e.g. profession) with the function *phivalue()*.
```{r}
ClustCheck::phivalue(cbank, var = BankCustomer$profession)
```
Bar graphs can be plotted with the function *plotphi()*.
```{r}
ClustCheck::plotphi(cbank, var = BankCustomer$profession)
```
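The phi coefficient of a 2x2 table equals the Pearson correlation between the two underlying 0/1 indicators, which gives a compact way to sketch the computation in base R. The helper `phi_matrix()` and the toy vectors below are hypothetical and are not taken from the package.

```{r}
# Toy data (hypothetical, not the BankCustomer dataset)
toy_cluster <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
toy_job     <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "c", "a"))

# Phi between "belongs to cluster k" and "has mode j", for every k and j
phi_matrix <- function(x, g) {
  sapply(levels(x), function(j) {
    sapply(levels(g), function(k) {
      cor(as.numeric(g == k), as.numeric(x == j))
    })
  })
}

toy_phi <- phi_matrix(toy_job, toy_cluster)   # rows: clusters, columns: modes
round(toy_phi, 2)
```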
- Correspondence Analysis
The clustering can be analysed visually by plotting the clusters against the modes of a categorical variable (e.g. profession). The function shows the frequency of the modes in each cluster as well as a plot of the coordinates of the cluster centers against the modes of the variable.
```{r}
ClustCheck::vizAFC(cbank, var = BankCustomer$profession)
```
#### Numerical Variables
Let's now focus on the numerical variables of our dataset.
- Correlation
Correlation ratios can be computed for all numerical variables.
```{r}
ClustCheck::corr_ratios(cbank)
```
A bar graph of the ratios can be plotted with the function *plotcorr()*.
```{r}
ClustCheck::plotcorr(cbank)
```
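The correlation ratio (eta squared) is the share of a variable's total variance that is explained by the cluster groups: between-cluster sum of squares divided by total sum of squares. Here is a minimal base R sketch on made-up data. The helper `corr_ratio()` is hypothetical and only illustrates the definition.

```{r}
# Toy data (hypothetical, not the BankCustomer dataset)
toy_y <- c(10, 12, 11, 20, 22, 21, 35, 33, 34, 36)
toy_g <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

corr_ratio <- function(y, g) {
  grand <- mean(y)
  # Between-cluster sum of squares: cluster size times squared mean shift
  ssb <- sum(tapply(y, g, function(v) length(v) * (mean(v) - grand)^2))
  # Total sum of squares
  sst <- sum((y - grand)^2)
  ssb / sst
}

toy_eta2 <- corr_ratio(toy_y, toy_g)
toy_eta2
```

The result matches the R-squared of a one-way ANOVA, `summary(lm(toy_y ~ toy_g))$r.squared`, which is a convenient cross-check.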
- V-Test
Similar to the categorical variables, test values can be computed for the numerical variables with the function *tvalue_num()*.
```{r}
ClustCheck::tvalue_num(cbank)
```
A bar graph of the values can be plotted with the function *plottvalue()*.
```{r}
ClustCheck::plottvalue(cbank)
```
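For a numerical variable, a common form of the test value compares the cluster mean with the overall mean, scaled by the standard error the cluster mean would have if the cluster were a random sample drawn without replacement from the whole dataset. The sketch below uses that form on made-up data. The helper `vtest_num()` is hypothetical, and the package may use a slightly different variance estimate.

```{r}
# Toy data (hypothetical, not the BankCustomer dataset)
toy_y <- c(10, 12, 11, 20, 22, 21, 35, 33, 34, 36)
toy_g <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

vtest_num <- function(y, g) {
  n      <- length(y)
  n_k    <- tapply(y, g, length)     # cluster sizes
  mean_k <- tapply(y, g, mean)       # cluster means
  # Standard error of a cluster mean under random sampling without replacement
  se_k   <- sqrt((n - n_k) / (n - 1) * var(y) / n_k)
  (mean_k - mean(y)) / se_k
}

toy_vt_num <- vtest_num(toy_y, toy_g)
round(toy_vt_num, 2)
```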
- Effect size
Effect size is another way to measure the strength of the relationship between variables and cluster groups. Cohen's magnitude descriptors (small, medium, large) provide a convenient scale to evaluate this strength. In our example below, we can see the overwhelming influence of revenue in cluster group 1.
https://en.wikipedia.org/wiki/Effect_size
```{r}
ClustCheck::effectsize(cbank)
```
Bar graphs can be plotted with the function *plotsizeeff()*.
```{r}
ClustCheck::plotsizeeff(cbank)
```
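One widely used effect size is Cohen's d, the difference between two means divided by their pooled standard deviation. The sketch below applies it to each cluster against the rest of the data. This "one cluster versus the others" setup and the helper `cohens_d()` are assumptions made for illustration; the package's own definition may differ.

```{r}
# Toy data (hypothetical, not the BankCustomer dataset)
toy_y <- c(10, 12, 11, 20, 22, 21, 35, 33, 34, 36)
toy_g <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Cohen's d between the observations flagged in_group and all the others
cohens_d <- function(y, in_group) {
  a <- y[in_group]
  b <- y[!in_group]
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                      (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled_sd
}

toy_d <- sapply(levels(toy_g), function(k) cohens_d(toy_y, toy_g == k))
round(toy_d, 2)
```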
---
### Analysing the clustering with multiple variables
Now we want to look at how combinations of multiple variables influence our clustering. This relies on the principles of Principal Component Analysis.
https://en.wikipedia.org/wiki/Principal_component_analysis
#### Categorical variables
The function *get_MCA()* offers a graphic visualisation of a Multiple Correspondence Analysis on the categorical variables of the dataset.
```{r}
ClustCheck::get_MCA(cbank)
```
#### Numerical variables
The function *get_PCA()* offers a graphic visualisation of a Principal Component Analysis on the numerical variables of the dataset.
```{r}
ClustCheck::get_PCA(cbank)
```
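To show the underlying idea without the package, the sketch below runs a PCA with base R's `prcomp()` on the built-in `iris` data and colours the observations by group on the first two components. The `iris` species stand in for cluster groups here, and the name `toy_pca` is made up. This is not how *get_PCA()* is implemented, only an illustration of the kind of plot it is meant to provide.

```{r}
# PCA on standardised numerical variables (iris is used as stand-in data)
toy_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance carried by the first two components
summary(toy_pca)$importance["Proportion of Variance", 1:2]

# Observations on the first factorial plane, coloured by group
plot(toy_pca$x[, 1:2],
     col = as.integer(iris$Species), pch = 19,
     main = "Groups on the first two principal components")
```

Groups that separate cleanly on this plane are well explained by the combination of variables that load on the first two components.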
#### Mixed variables
The function *get_FAMD()* offers a graphic visualisation of a Factor Analysis of Mixed Data. This function is intended for datasets that contain both numerical and categorical variables, as in our example.
```{r}
ClustCheck::get_FAMD(cbank)
```
---
### Evaluation metrics
The package offers a choice of 3 metrics to evaluate the quality of the clustering:
- Silhouette coefficient
- Davies-Bouldin index
- Dunn index
The *silhouetteC()* function returns the silhouette coefficient for each cluster as well as a computed mean for the whole partition. The silhouette coefficient varies from -1 to +1. A coefficient close to +1 indicates a good clustering. Conversely, a coefficient close to -1 indicates a poor clustering.
https://en.wikipedia.org/wiki/Silhouette_(clustering)
```{r}
ClustCheck::silhouetteC(cbank)
```
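For intuition, the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the points of the nearest other cluster. The base R sketch below computes it on a tiny made-up dataset with Euclidean distances. The helper `silhouette_widths()` is hypothetical and is not the package's code.

```{r}
# Toy data (hypothetical): two tight, well separated groups
toy_X <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
toy_g <- c(1, 1, 2, 2)

silhouette_widths <- function(X, g) {
  d <- as.matrix(dist(X))            # pairwise Euclidean distances
  g <- as.factor(g)
  sapply(seq_len(nrow(d)), function(i) {
    own <- g == g[i]
    own[i] <- FALSE                  # exclude the point itself
    if (!any(own)) return(0)         # usual convention for singleton clusters
    a <- mean(d[i, own])
    other <- g != g[i]
    # Mean distance to each other cluster, keeping the nearest one
    b <- min(tapply(d[i, other], droplevels(g[other]), mean))
    (b - a) / max(a, b)
  })
}

toy_sil <- silhouette_widths(toy_X, toy_g)
tapply(toy_sil, toy_g, mean)         # mean silhouette per cluster
mean(toy_sil)                        # mean silhouette for the whole partition
```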
The Davies-Bouldin index is never negative. Zero is the best possible score, and the higher the score, the poorer the clustering.
https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index
```{r}
ClustCheck::davies_bouldinC(cbank)
```
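The index averages, over clusters, the worst ratio of within-cluster scatter to between-centroid distance: for each cluster i, take the maximum over j of (S_i + S_j) / M_ij, where S is the mean distance of a cluster's points to its centroid and M_ij is the distance between centroids. A minimal base R sketch on made-up data follows. The helper `davies_bouldin()` is hypothetical and assumes Euclidean distances and at least two clusters.

```{r}
# Toy data (hypothetical): two tight, well separated groups
toy_X <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
toy_g <- c(1, 1, 2, 2)

davies_bouldin <- function(X, g) {
  X <- as.matrix(X)
  g <- as.factor(g)
  # Cluster centroids: one row per cluster, one column per variable
  centers <- apply(X, 2, function(col) tapply(col, g, mean))
  # Scatter: mean distance of each cluster's points to its centroid
  scatter <- sapply(levels(g), function(k) {
    pts <- X[g == k, , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centers[k, ])^2)))
  })
  M <- as.matrix(dist(centers))      # distances between centroids
  R <- outer(scatter, scatter, "+") / M
  diag(R) <- -Inf                    # ignore a cluster compared with itself
  mean(apply(R, 1, max))
}

toy_db <- davies_bouldin(toy_X, toy_g)
toy_db
```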
The Dunn index is similar to the Davies-Bouldin index in that it measures clustering quality, but its value is interpreted the other way round. In this case, the higher the score, the better the clustering.
https://simple.wikipedia.org/wiki/Dunn_index
```{r}
ClustCheck::dunn_indexC(cbank)
```
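In its classic form, the Dunn index is the smallest distance between two points in different clusters divided by the largest distance between two points in the same cluster. Several variants exist, and the sketch below implements only this classic form on made-up data. The helper `dunn_index()` is hypothetical and may not match the variant used by the package.

```{r}
# Toy data (hypothetical): two tight, well separated groups
toy_X <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
toy_g <- c(1, 1, 2, 2)

dunn_index <- function(X, g) {
  d    <- as.matrix(dist(X))         # pairwise Euclidean distances
  same <- outer(g, g, "==")          # TRUE when two points share a cluster
  # Smallest between-cluster distance over largest within-cluster distance
  min(d[!same]) / max(d[same])
}

toy_dunn <- dunn_index(toy_X, toy_g)
toy_dunn
```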
---
### That's it!
You've completed an overview of the package's main functions.
Hopefully it provides useful tools for evaluating your clusterings!