# Assignment 4
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Feature / Bug Selection](#feature---bug-selection)
+ [Bisecting K-means \[#14214\]](#bisecting-k-means-14214)
+ [Add Sparse Matrix Support For HistGradientBoostingClassifier \[#15336\]](#add-sparse-matrix-support-for-histgradientboostingclassifier-15336)
- [Feature Investigation](#feature-investigation)
+ [Bisecting K-means \[#14214\]](#bisecting-k-means-14214-1)
+ [Add Sparse Matrix Support For HistGradientBoostingClassifier \[#15336\]](#add-sparse-matrix-support-for-histgradientboostingclassifier-15336-1)
- [Feature Implementation](#feature-implementation)
- [Acceptance Testing](#acceptance-testing)
+ [Basic Test Case](#test-1-basic-test-case)
+ [Handwritten Digits](#test-2-handwritten-digits)
+ [Bisection Iteration](#test-3-bisection-iteration)
+ [Performance](#test-4-performance)
+ [Accuracy (Inertia / Entropy)](#test-5-accuracy-inertia--entropy)
- [User Guide](#user-guide)
## Feature / Bug Selection
### [Bisecting K-means \[#14214\]](https://github.com/scikit-learn/scikit-learn/issues/16426)
K-means is a classification algorithm that clusters a set of data based on a specific hyperparameter K.
For example, a set of data below has been classified according to 5 individual clusters.
<p align="center">
<br>
<img width="50%" height="50%" src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/LloydsMethod15.svg/600px-LloydsMethod15.svg.png">
<br><br>
Source: <a href="https://en.wikipedia.org/wiki/Lloyd%27s_algorithm">Wikipedia</a>
</p>
``scikit-learn`` has a collection of clustering algorithms under ``sklearn/cluster``. For this issue, we will solely focus on the K-means clustering implementation. K-means forms clusters by attempting to select centroids in specific positions. The objective of K-means is to reduce the sum of squared errors calculated from the distance between any individual data point to their closest centroids. In the image above, we can see that 5 centroids have been placed on each of the 5 different data points, reducing SSE close to 0.
``scikit-learn`` only has 2 implementation of K-means: [Lloyd's Algorithm](https://en.wikipedia.org/wiki/Lloyd%27s_algorithm) and [Elkan's Algorithm](https://link.springer.com/chapter/10.1007/978-3-642-16773-7_2). The author of the issue ticket requests for a 3rd implementation of K-means: [The Bisecting K-means Algorithm](http://www.philippe-fournier-viger.com/spmf/bisectingkmeans.pdf).
### [Add Sparse Matrix Support For HistGradientBoostingClassifier \[#15336\]](https://github.com/scikit-learn/scikit-learn/issues/15336)
``HistGradientBoostingClassifier`` is the ``scikit-learn`` class they used to represent their Histogram-based Gradient Boosting Classification Tree machine learning estimation model.
Currently trying to fit a sparse matrix with the ``HistGradientBoostingClassifier`` will result an the following error being thrown:
```
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
```
a sparse matrix is a matrix which the majority of elements being zero

Source: [cmdlinetips](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/)
Some of these transformations and estimations may not work and yeild useful data when used with sparse data. This issue wants to add sparse matrix support to make the data we get back from functions like fit and transform to be more useful. In order to do so we have to kind of convert the sparse matrix into a dense one, or pack the 0s with temporary data to kind of force more useful results.
## Feature Investigation:
### [Bisecting K-means \[#14214\]](https://github.com/scikit-learn/scikit-learn/issues/16426)
In order to implement a new algorithm for scikit-learn's K-means module, we must first understand how the two existing algorithms are implemented.
**Sample Code:**
The following is a code sample taken off of the scikit-learn's [K-means documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to showcase the execution of data being clustered.
```
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, algorithm="elkan").fit(X)
>>> kmeans.cluster_centers_
[[10. 2.]
[ 1. 2.]]
```
<p align="center">
<br>
<img src="https://kthisiscvpv.com/E8k3806MYBakghvK9H.png">
<br><br>
Plot generated with <a href="https://www.geogebra.org/m/Adc44ZZq">GeoGebra</a>
</p>
---
**UML Diagram:**
The following is the UML diagram for ``sklearn/cluster/_kmeans.py``. It shows the underlying functions and interactions between scikit-learn's implementation of K-means and it's neighbouring resources.
<p align="center">
<br>
<img src="https://kthisiscvpv.com/dcFKZSraNqQ1MYq7Nn.png">
<br><br>
UML Diagram generated with <a href="https://www.pynsource.com/">Pynsource</a>
</p>
---
**Sequence Diagram**:
From the sample code above, we note that a single call to the ``fit(...)`` function inside the ``KMeans`` class is enough to extrapolate all the data necessary for identifying the locations of the different centroids. The following is a sequence diagram that traces the timeline of all relevant actions inside that specific function.
<p align="center">
<br>
<img src="https://kthisiscvpv.com/WDmokEgBsyuByG0FxA.png">
<br><br>
Sequence Diagram drawn with <a href="https://app.diagrams.net/">Diagrams</a>
</p>
### [Add Sparse Matrix Support For HistGradientBoostingClassifier \[#15336\]](https://github.com/scikit-learn/scikit-learn/issues/15336)
**UML Diagram:**
The following is the UML for the `HistGradientBoostingClassifier` in `sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py`. In order to add support for sparse matrices we need to first dig deeper to find out where the issue is.

`HistGradientBoostingClassifier` inherits `BaseHistGradientBoosting`, which uses an instance of the `_BinMapper` class inside functions like `fit(..)`. `_BinMapper` is where all the binning functionality takes place with the matrix. We can work with `_BinningMapper` and if the matrix is sparse we can apply some tranformations to it so it does not consume too much memory. To do this we would need to work with the `_BinMapper` class aswell as make a few modification to the helper function `_find_binning_thresholds`, which is in the same file but outside the actual class.
## Feature Implementation: [Bisecting K-means \[#14214\]](https://github.com/scikit-learn/scikit-learn/issues/16426)
**Bisecting K-means:**
Bisecting K-means is an algorithm that forms clusters with repeated calls to K-means using hyperparameter ``K = 2``. The complete formulation of the bisecting K-means algorithm can be found within the [original research paper](http://www.philippe-fournier-viger.com/spmf/bisectingkmeans.pdf) from the Department of Computer Science and Engineering at the University of Minnesota.
Suppose we have a set of data points ``S = ( P1, P2, P3, P4, P5, P6, P7, P8, P9, P10 )``.
Bisecting K-means will always start with a single centroid that groups all sets of data at their mean. At each step, the centroid belonging to the largest cluster is selected to be split up into 2 clusters using 2-means.
Suppose a call to Bisecting K-means is made with ``K = 3``. The procedure would look something like this.
<p align="center">
<br>
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20200827122613/CC.png">
<br><br>
Source: <a href="https://www.geeksforgeeks.org/bisecting-k-means-algorithm-introduction/">GeeksForGeeks</a>
</p>
```
A: All data is grouped in a single cluster G.
B: Cluster G is split using 2-means as it is the only cluster and has the largest size. The end result is 2 clusters, G1 and G2.
C: Without loss of generality, cluster G1 is split using 2-means as it is a cluster of largest size. The end result is 3 clusters, G1', G1'', and G2.
D: The algorithm terminates as we have now reached 3 clusters, where K = 3.
```
**Code Design:**
Combining all the information that the digrams above provide, we now know how an algorithm is both selected and ran before the final resulting centroids are returned. When implementing an additional algorithm, we will spend the majority of our time exploring and modifying these critical sections.
| Task | Detailed Description |
| --- | --- |
| Validate Parameters | Check that all input parameters are valid. For example, check that the specified algorithm is in {``auto``, ``full``, ``elkan``}. |
| Identify Algorithm | Set the algorithm to the corresponding function. Algorithm ``full`` is handled by ``_kmeans_single_lloyd(...)`` while ``elkan`` is handled by ``_kmeans_single_elkan(...)``. |
From a top-down perspective, algorithms can be implemented inside K-means by writing a function to perform a single iteration of the respective algorithm, similar to how ``_kmeans_single_lloyd(...)`` and ``_kmeans_single_elkan(...)`` are currently implemented. To implement Bisecting K-means, we will follow the exact same formula: to write a similar function, ``_kmeans_single_bisect(...)`` and hook it into ``fit(...)`` where it validates the parameters and maps the location of the algorithm's function.
The UML diagram of the final result should only have a single addition from the original, as shown in ``red`` below.
<p align="center">
<img src="https://kthisiscvpv.com/AsydLGtJEkvYB4mUCZ.png">
<br>
UML Diagram generated with <a href="https://www.pynsource.com/">Pynsource</a>
</p>
## Acceptance Testing
We test the acceptance of our module by comparing the computed results of the Bisecting K-means Algorithm with the two existing Lloyd's Algorithm and Elkan's Algorithm.
### Test 1: Basic Test Case
We test the output of the three algorithms with the data set provided in the sample code above.
```
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
```
**Lloyd's Algorithm:**
```
>>> KMeans(n_clusters=2, algorithm="full").fit(X).cluster_centers_
array([[10., 2.],
[ 1., 2.]])
```
**Elkan's Algorithm:**
```
>>> KMeans(n_clusters=2, algorithm="elkan").fit(X).cluster_centers_
array([[10., 2.],
[ 1., 2.]])
```
**Bisecting Algorithm:**
```
>>> KMeans(n_clusters=2, algorithm="bisect").fit(X).cluster_centers_
array([[10., 2.],
[ 1., 2.]])
```
We note that the results from all algorithms are equivalent and produce optimal cluster centers for a simple dataset.
### Test 2: Handwritten Digits
We test the output of the three algorithms based on scikit-learn's collection of handwritten digits and classify each point according to K-means using hyperparameter ``K=10`` (because there are 10 digits in total). The script to run the sample file can be [found here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html).
<p align="center">
<img src="https://kthisiscvpv.com/U4n5nF6GuttKWZurrg.png">
<br>
Diagram generated with <a href="https://matplotlib.org/">matplotlib</a>
</p>
We note that the results, while not equivalent, each algorithm generates roughly the same set of clusters.
### Test 3: Bisection Iteration
The primary difference between the Bisection Algorithm versus any other K-means algorithm is the fact that centroids with the highest entropy are split further according to K-means with hyperparameter ``K = 2``. We test the iterative step by mapping the process that our algorithm takes to reach ``K = 10`` using the same handwritten digits data.
<p align="center">
<img src="https://kthisiscvpv.com/O37DTh7Kvx9PzI7hYh.gif">
<br>
Diagram generated with <a href="https://matplotlib.org/">matplotlib</a>
</p>
We note from the animation above that between each step, there is only at most two drastic changes occuring in the location of the centroids. This is expected by each iterative call to the 2-means algorithm.
### Test 4: Performance (Time)
From scikit-learn's K-means documentation, we note that the worst case complexity of K-means is bounded by ``O(n^(k+2/p))`` where ``n = elements``, ``p = features per element``, and ``k = number of clusters``. Using Python's ``time`` library, we calculate the running time (the lower the better) of our Bisecting Algorithm implementation and compare it against the current Lloyd's and Elkan's Algorithm.
The code used to test run-time performance can be [found here](./a4/test-cases/test-performance.py).
| Features | Elements | Clusters | Lloyd | Elkan | Bisecting |
| --- | --- | --- | --- | --- | --- |
| 10 | 100 | 2 | 107.097ms | 20.861ms | 76.393ms |
| 10 | 100 | 5 | 84.932ms | 23.378ms | 302.476ms |
| 10 | 100 | 10 | 90.949ms | 29.634ms | 658.481ms |
| 10 | 1000 | 2 | 106.069ms | 59.662ms | 95.067ms |
| 10 | 1000 | 5 | 106.598ms | 80.146ms | 346.854ms |
| 10 | 1000 | 10 | 157.510ms | 83.725ms | 761.749ms |
| 10 | 10000 | 2 | 119.004ms | 74.524ms | 120.352ms |
| 10 | 10000 | 5 | 202.134ms | 263.154ms | 423.205ms |
| 10 | 10000 | 10 | 312.527ms | 560.376ms | 904.111ms |
| 100 | 100 | 2 | 82.086ms | 16.965ms | 83.351ms |
| 100 | 100 | 5 | 86.091ms | 23.043ms | 309.091ms |
| 100 | 100 | 10 | 94.155ms | 29.724ms | 673.187ms |
| 100 | 1000 | 2 | 109.103ms | 66.111ms | 109.025ms |
| 100 | 1000 | 5 | 115.507ms | 73.496ms | 436.755ms |
| 100 | 1000 | 10 | 137.315ms | 88.272ms | 809.500ms |
| 100 | 10000 | 2 | 317.349ms | 260.744ms | 243.177ms |
| 100 | 10000 | 5 | 580.998ms | 591.368ms | 700.779ms |
| 100 | 10000 | 10 | 678.042ms | 645.668ms | 1308.553ms |
| 1000 | 100 | 2 | 82.979ms | 20.414ms | 90.196ms |
| 1000 | 100 | 5 | 96.644ms | 27.976ms | 330.556ms |
| 1000 | 100 | 10 | 116.727ms | 52.404ms | 718.313ms |
| 1000 | 1000 | 2 | 143.409ms | 85.589ms | 217.682ms |
| 1000 | 1000 | 5 | 165.132ms | 102.691ms | 592.300ms |
| 1000 | 1000 | 10 | 187.865ms | 120.803ms | 1097.845ms |
| 1000 | 10000 | 2 | 2866.064ms | 1829.593ms | 2775.521ms |
| 1000 | 10000 | 5 | 3340.756ms | 2361.407ms | 5584.730ms |
| 1000 | 10000 | 10 | 3282.576ms | 3317.328ms | 9663.712ms |
We note that the Bisecting Algorithm is the slowest of the bunch. This is **expected behaviour** as the Bisecting Algorithm runs K-means for a total of K times. Therefore, we expect the worst time complexity to also grow exponentially by a factor of K.
### Test 5: Accuracy (Internal Indexing Inertia)
[Internal indexing](http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf) is used to measure the goodness of a clustering structure without external information. A measure of internal indexing is Sum of Squared Errors (SSE).
The code used to test accuracy can be [found here](./a4/test-cases/test-inertia.py).
| Features | Elements | Clusters | Lloyd | Elkan | Bisecting |
| --- | --- | --- | --- | --- | --- |
| 10 | 100 | 2 | 73.371 | 73.487 | 73.371 |
| 10 | 100 | 5 | 61.918 | 61.185 | 62.137 |
| 10 | 100 | 10 | 48.343 | 47.666 | 48.666 |
| 10 | 1000 | 2 | 774.143 | 772.141 | 771.953 |
| 10 | 1000 | 5 | 651.540 | 649.125 | 662.643 |
| 10 | 1000 | 10 | 571.694 | 574.553 | 596.835 |
| 10 | 10000 | 2 | 7703.832 | 7696.776 | 7697.166 |
| 10 | 10000 | 5 | 6788.716 | 6790.343 | 6915.192 |
| 10 | 10000 | 10 | 5937.797 | 5936.636 | 6259.366 |
| 100 | 100 | 2 | 802.977 | 801.959 | 799.928 |
| 100 | 100 | 5 | 755.649 | 758.886 | 757.295 |
| 100 | 100 | 10 | 715.896 | 711.576 | 712.094 |
| 100 | 1000 | 2 | 8230.157 | 8237.135 | 8230.255 |
| 100 | 1000 | 5 | 8072.491 | 8072.855 | 8088.982 |
| 100 | 1000 | 10 | 7882.990 | 7881.862 | 7912.239 |
| 100 | 10000 | 2 | 82691.645 | 82690.190 | 82695.594 |
| 100 | 10000 | 5 | 81589.005 | 81607.048 | 81811.436 |
| 100 | 10000 | 10 | 80547.635 | 80563.899 | 81036.810 |
| 1000 | 100 | 2 | 8136.267 | 8124.461 | 8124.144 |
| 1000 | 100 | 5 | 7876.410 | 7873.343 | 7872.733 |
| 1000 | 100 | 10 | 7422.554 | 7410.799 | 7437.763 |
| 1000 | 1000 | 2 | 82980.482 | 82985.398 | 82979.943 |
| 1000 | 1000 | 5 | 82558.679 | 82557.988 | 82533.452 |
| 1000 | 1000 | 10 | 82070.562 | 82070.155 | 82030.494 |
| 1000 | 10000 | 2 | 832171.899 | 832188.425 | 832177.489 |
| 1000 | 10000 | 5 | 830844.377 | 830836.480 | 831011.734 |
| 1000 | 10000 | 10 | 828978.115 | 828955.640 | 829347.268 |
We conclude that the accuracy of Bisecting K-means is as good as the Lloyd and Elkan algorithms.
### Test 6: Entropy (External Indexing Entropy)
[External indexing](http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf) is used to measure the extent to which cluster labels match externally supplied expected class labels. A measure of external indexing is entropy.
The key benefit of Bisecting K-means is lower entropy (better) at the cost of higher running time.
Computing the entropy involves the following:
1. For each cluster j, compute P<sub>ij</sub>, the probability that a member of cluster j belongs to class i
1. For each cluster j, compute , the entropy of cluster j
1. The total entropy of the set of all clusters is then 
As the computation of entropy is extremely statistically and mathematically heavy, the computation is out of the scope of CSCD01.
## User Guide
`sklearn.cluster.KMeans`
| Parameter Name | Type | Details |
| -------------- | ---------------- | ------- |
| n_clusters | int, default=8 | The number of clusters to form as well as the number of centroids to generate |
| init | {'k-means++', 'random'} OR array-like of shape (n_clusters, n_features), default='k-means++' | Method of initializing model centroids OR array of initial model centroids (only for 'full', 'elkan' algorithm) |
| n_init | int, default=10 | Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia |
| max_iter | int, default=300 | Maximum number of iterations of the k-means algorithm for a single run |
| tol | float, default=1e-4 | Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence |
| precompute_distances | {'auto', True, False}, default='auto' | Precomputed distances from n_samples to every cluster center |
| verbose | int, default=0 | Verbosity mode |
| random_state | int, default=0 | Determines the random number generation for centroid initialization |
| copy_x | bool, default=True | Disable modification of original input parameter X during model fiting |
| n_jobs | int, default=None | The number of OpenMP threads to use for parallel computation |
| algorithm | {'auto', 'full', 'elkan', 'bisect'}, default='auto' | K-means algorithm to use. 'auto' defaults to the elkan algorithm |
| Attribute Name | Type | Details |
| -------------- | ---- | ------- |
| cluster_centers_ | ndarray of shape (n_clusters, n_features) | The values of the cluster centers |
| labels_ | ndarray of shape (n_samples_) | labels of each data point |
| inertia_ | float | Sum of squared errors of samples to their closest cluster center |
| n_iter_ | int | Number of iterations run |
| Method | Returns | Parameters | Details |
| ------ | ------- | ---------- | ------- |
| fit | self | **matrix X**: Training instances to cluster <br> **array-like sample_weight**: The weights for each observation in X | Compute K-Means clustering |
| fit_predict | ndarray (n_samples,)| **matrix X**: Training instances to cluster <br> **array-like sample_weight**: The weights for each observation in X | Compute cluster centers and predict cluster index for each sample |
| fit_transform | ndarray (n_samples, n_clusters) | **matrix X**: Training instances to cluster <br> **array-like sample_weight**: The weights for each observation in X | Compute clustering and transform X to cluster-distance space |
| get_params | dict | **bool deep**: If True, return the parameters for the estimator | Get parameters for this estimator |
| predict | ndarray (n_samples,) | **matrix X**: Training instances to cluster <br> **array-like sample_weight**: The weights for each observation in X | Predict the closest cluster each sample in X belongs to |
| score | float | **matrix X**: Training instances to cluster <br> **array-like sample_weight**: The weights for each observation in X | Computes the opposite of the value of X on the K-Means objective |
| set_params | self | **dict params**: Estimator parameters | Set the parameters of the estimator |
| transform | ndarray (n_samples, n_clusters) | **matrix X**: New data to transform | Transform X to a cluster-distance space |