Assignment 3 - HackMD

# Assignment 3 ## Table of Contents - [Assignment 3](#assignment-3) * [Table of Contents](#table-of-contents) * [Feature / Bug Selection](#feature---bug-selection) + [Missing features removal with SimpleImputer #16426](https://github.com/scikit-learn/scikit-learn/issues/16426) + [Fibonacci heap for Ball Tree #597](https://github.com/scikit-learn/scikit-learn/issues/597) * [Acceptance Testing](#acceptance-testing) + [Test 1: Basic Test](#test-1--basic-test) + [Test 2: Performance Test](#test-2--performance-test) * [Performance Improvements](#performance-improvements) + [Complexity Analysis](#complexity-analysis) + [Performance Benchmarks](#performance-benchmarks) * [User Guide](#user-guide) ## Feature / Bug Selection ### [Missing features removal with SimpleImputer #16426](https://github.com/scikit-learn/scikit-learn/issues/16426) For background, the process of imputing data involves filling in the blanks with existing data. In ``scikit-learn``, imputing data is possible through several different imputing modules such as ``SimpleImputer`` under ``sklearn.impute``. There are also several different imputing techniques such as ``mean``, ``median``, ``most_frequent``, and ``constant``. The bug documented in this section occurs when attempting to impute multi-dimensional arrays. **Bug:** ``` >>> from sklearn.impute import SimpleImputer >>> import numpy as np >>> X = np.array([[0, np.nan], [1, np.nan]]) >>> SimpleImputer(strategy='median').fit_transform(X) array([[0.], [1.]]) ``` **Expected:** ``` >>> SimpleImputer(strategy='median').fit_transform(X) array([[0., nan], [1., nan]]) ``` In multi-dimensional arrays, all columns are treated as individual sets of data. Unfortunately, in the example above, ``X`` has an entirely blank second column ``[np.nan, np.nan]``. The blank elements cannot be imputed as there is no existing data to refer to. The column is silently trucanted from the output, changing the dimensions of the input matrix. This is undocumented behaviour that only occurs with imputing strategies ``mean``, ``median``, and ``most_frequent``. An unwarrented change in the matrix's dimensions can cause problems when it is pipelined across multiple instructions. **UML Diagram:** <img src="https://kthisiscvpv.com/OyrjtZDY4sohNRB35Z.png"> UML Diagram generated with <a href="https://www.pynsource.com/">Pynsource</a> **Investigation:** Unfortunately, creating a patch to this issue comes with several different complications. We begin by addressing the fact that there exists 3 different imputers inside ``scikit-learn``; after reading the issue's ticket, it is evident that the maintainers of the ``scikit-learn`` expect equivalent behavior across all imputers. Furthermore, after investigating why the problem occurs, we note that while this behaviour is undocumented, it is actually intentionally handled as an edge case. For example, the code sample below handles the masking of invalid columns inside the ``SimpleImputer`` module inside its ``transform(...)`` function. ``` $ sklearn/impute/_base.py def transform(self, X): ... # Delete the invalid columns if strategy is not constant if self.strategy == "constant": ... else: ... ``` However, it is also observed that the masking of input data is different for all of the 3 different imputers; deploying a patch to ``SimpleImputer`` will not patch the issue inside ``KNNImputer`` nor ``IterativeImputer``. **Patch Design:** This patch must be backwards compatible; projects that already handle this issue on their own must not be broken as a result of this specific patch. This concern is also noted inside the issue's ticket as well, under several different comments. An additional parameter should be added to the inputs of the different imputer modules. This parameter should act as a toggle for this patch. In the investigation of the legacy code base shown above, we note that this masking of invalid columns is actually intended behaviour. To simply prevent this from happening, a check can be made before deletion of columns occur. Unfortunately, also mentioned in the investigations above, all modules handle the masking of invalid columns differently. Therefore, we cannot make use of the inheritance hierarchy that all imputers inherit off of ``class _BaseImputer``. All modules will be modified individually. Interactions between the legacy modules should not change. However, instructions inside some of the functions in each imputer may change. The UML diagram below highlights the expected changes to be made to both classes and their functions in this patch. <img src="https://kthisiscvpv.com/NYpAJsbSZH0ZTA3lgC.png"> ### [Fibonacci heap for Ball Tree #597](https://github.com/scikit-learn/scikit-learn/issues/597) For context, this fix involves optimizing the neighbour search for `BallTree`. Currently, the neighbour search algorithm uses priority queue and binary heaps which are considerably slower when compared to a fibonacci heap especially for large `n_neighbors` and `n_features`. The fix for this involves refactoring out existing fibonacci heap implementation and using it inside `graph_shortest_path` module as shown in the UML below. **UML Diagram:** ![](https://i.imgur.com/0h22O7q.png) ## Acceptance Testing This features involves improving the performance of the BallTree class. Currently BallTree is inherting BinaryTree, however it is proposed using a Fibonacci Heap would result in better performance. Since the refactoring we are doing cannot be directly interacted with by a user, we will be using NearestNeighbors as our class to test upon. It is the lowest level class that we can modify as users to run the code we have improved. This can be seen below in the following snipet of code.  ``` 1| from sklearn.neighbors import NearestNeighbors 2| import numpy as np 3| X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) 4| nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X) 5| distances, indices = nbrs.kneighbors(X) ``` What this code will do is take the coordinates and find the closest pair values. ![](https://i.imgur.com/uBLBqsY.png) The above image is an example of how it would the `ball_tree` algorithm may be used to slowly focus in on groups and make the circles smaller until you have k points per circle. ### Test 1: Basic Test First test we can run is to see if NearestNeighbors is working with the BallTree algorithm. This will test our underlying changes. We can run steps to imitate the code above. 1) Import the NearestNeighbors class aswell as numpy (line 1-2 from above) 2) Create an array of coordinates inside an array notation with numpy (line 3 from above) 3) Create the NearestNeighbors class objects with the `n_neighbors` parameter being the number of neighbors you want grouped together. For example 2 means you want to find the 2 nearest coordinates, 3 would mean you want 3 coordinates which which are the closest to eachother, ext. We will choose 2 for `n_neighbors`. For the `algorithm` parameter you must select `'ball_tree'`, since that is the algorithm class in which we have added our new Fibonacci Heap implementation (line 4 from above) 4) Run the NearestNeighbors objects `kneighbors` function with the array of cordinates from step 2 (line 5 from above) 5) The expected outcome will be a list of distances and a list of indices. The distances array will contain the distances to the closest point(s) (since we choose `n_neighbors` as 2 it will show the distance to the neighbor nearest point). The indices contains the index of the coordinates involved from the original array passed in for the respective distances. You are basically provided the closest neighbor point of every point in the original array. ### Test 2: Performance Test Since the point of this refactor is to see improvements in performance we can try a performance test. We can run the following steps on both the old and new implementation to hopefully see performance improvements. 1) Import the NearestNeighbors class aswell as numpy 2) Create an array with over 500000 cordinates 3) Create a NearestNeighbors class object and set the `n_neighbors` to 100, and select the `algorithm` to be `'ball_tree'` 4) Run the`kneighbors` function on the array from step 2. Note down the time it took to finish this computation 5) Compare the runtime of step 4 using the old and new implementation and confirm the new implementation has a better runtime ## Performance Imporvements ### Complexity Analysis As stated previously, the purpose of this enhancement is to improve the performance of methods in the `BallTree` class by using a Fibonacci Heap instead of a regular Binary Heap. Let us examine the complexity of operations for both data structures. | Operations | Binary Heap | Fibonacci Heap| | ------------- |:-------------:| :-----: | | make-heap | O(1) | O(1) | | isempty | O(1) | O(1) | | insert | O(log n) | **O(1)** | | delete-min (or max) | O(log n) | O(log n) | | decrease-key (or increase) | O(log n) | **O(1)** | Table obtained from: [Princeton University](https://www.cs.princeton.edu/~wayne/teaching/fibonacci-heap.pdf) Notice that insertion operation and decreasing/increasing key operation have their complexity reduced from logarithmic to constant. Such an improvement can have drastic improvements for models using the `BallTree` class, as these are elementary operations that are used frequently. Caveat: the column shown for the Fibonacci Heap is its amortized complexity, whereas the column for the Binary Heap is the worst-case complexity. However, this doesn't detract from the value of using Fibonacci Heap as Binary Heaps don't implement any form of amortization, thus making Fibonacci Heaps significantly faster. ### Performance Benchmarks Here we will anaylze the actual run-times of `BallTree` methods and run-times other classes that use `BallTree`. First examining the second acceptance test we get the following results: |Trial Number | Initial Run-Time | Run-Time after Update | |-------------| ------------- |:-------------:| |1| 92.27815200000000012s | 88.87392354011536s | |2| 96.310639999999998s | 84.058232199999999984s | |3| 95.0502696s | 91.43743699999999996s | |4| 95.69870156484651s | 92.034031900000000004s | This example shows how a more efficient `BallTree` has effects on methods in other classes that use it (in this case it is the `NearestNeighbors` class). While it may seem like the actual time has improved by an insignificant amount, one must keep in mind that the example we are using is quite small and such an operation occurs innumerable times in actual machine learning pipelines. We also measured the run-time multiple times to gain an accurate understanding. Now we will test specifically the `query()` method in the `BallTree` class. This method is responsible for finding x-number of nearest neighbours, where x is an input. We will vary the sample size and number of features to gain a holistic view of the difference in run-times. |Sample Size | Number of Features | Initial Run-Time | Run-Time after Update | | ------------- | ------------- |:-------------:|:-------------:| |100000 | 1 | 1.032856700000000016s | 0.8860089999999996s | |100000 | 2 | 3.70338561s | 3.242743299999999s | |100000 | 3 | 10.034818900000000014s | 7.261899s | |100000 | 4 | 24.07591970000000003s | 19.4624235s | |200000 | 1 | 1.01295156s | 1.3102832000000006s | |200000 | 2 | 10.913645199999998s | 10.03558810000000001s | |200000 | 3 | 33.03558810000000001s | 29.04188570000001s | |200000 | 4 | 78.0441502s | 65.31413970000001s | |300000 | 1 | 8.0347606s | 9.0776396s | |300000 | 2 | 15.066563100000025s | 13.066563100000025s | |300000 | 3 | 49.9066350000000001s | 42.383289399999995s | |300000 | 4 | 331.30973699999999993s | 204.73572170000003s | Notice how there is a general trend, that as the initial run-time increases, the run-time after the update is much lower. This shows the effect of reducing the complexity from logarithmic to constant. Note that there are some cases, where the run-time after the update is worse, this is likely because Fibonacci heap require more computation to initilaize. Note that timing code is machine-depentent and results can vary depending on the machine the tests are ran on, the number of proceess running and several other factors. However, the trend remains clear that the run-time is significatly lowered after the update. ## User Guide As this feature is a performance enhancement and not a tangible user feature, it is not possible to produce a guide for the user. However, the following is a guide for fellow developers on the newly created `FibonacciHeap` data structure in the `sklearn/utils` package. ### sklearn.utils.fib ```struct FibonacciHeap``` | Attribute Name | Type | Details | | -------------- | :----------------: | :--------------------------------------: | | min_node |FibonacciNode* | The node in the heap with minimum weight | | roots_by_rank | pFibonacciNode[100] | The roots of the fibonacci heap based on the rank values. The rank of a node denotes the number of children that the node has. No two roots in the fibonacci heap have the same rank. **i.e:** roots_by_rank[0] is the root with 0 children, roots_by_rank[1] is the root with 1 child, etc. | | Method | Returns | Parameters | Details | | ------------ | :------------: | :--------: | :-----: | | insert_node | void | __FibonacciHeap* heap:__ The heap to perform the operation on __FibonacciNode* node:__ The node to insert | Inserts a given node into the heap | | decrease_val | void | __FibonacciHeap* heap:__ The heap to perform the operation on __FibonacciNode* node:__ The node to decrease **DType_t (np.float64_t) newval:** The new value of the node | Decrease the value of a given node in the heap | | link | void | __FibonacciHeap* heap:__ The heap to perform the operation on __FibonacciNode* node:__ The node to link | Links a node as a root if possible, else insert the node into the tree that occupies the root spot | | remove_mid | FibonacciNode* | __FibonacciHeap* heap:__ The heap to perform the operation on | Removes the node with the lowest weight | ```struct FibonacciNode``` | Attribute Name | Type | Details | | -------------- | :--------------------: | :---------------------------------------------------------------: | | index | unsigned int | The location of the node in the associated graph | | rank | unsigned int | The number of children the node has | | state | unsigned int | The state of the node (0 = Not in heap, 1 = In heap, 2 = Scanned) | | val | DTYPE_t (np.float64_t) | The value stored inside the node | | i1 | ITYPE_t (np.intp_t) | | | i2 | ITYPE_t (np.intp_t) | | | parent | FibonacciNode* | The pointer to the current node's parent | | left_sibling | FibonacciNode* | The pointer to the current node's left sibling | | right_sibling | FibonacciNode* | The pointer to the current node's right sibling | | children | FibonacciNode* | The pointer to the current node's child | | Method | Returns | Parameters | Details | | ---------------- | :------------: | :--------: | :-----: | |initialize_node |void | __FibonacciNode* node:__ The `FibonacciNode` to initialize **unsigned int index:** The location of node to initialize in the associated graph **DTYPE_t val:** The value stored inside of node to initialize **ITYPE_t i1:** **ITYPE_t i2:** | Initialize the `FibonacciNode` to default values | |rightmost_sibling |FibonacciNode* | __FibonacciNode* node:__ The `FibonacciNode` to perform this operation on | Retrun the right-most sibling of the `FibonacciNode` | |leftmost_sibling |FibonacciNode* | __FibonacciNode* node:__ The `FibonacciNode` to perform this operation on |Retrun the left-most sibling of the `FibonacciNode`| |add_child |void |__FibonacciNode* node:__ The `FibonacciNode` to add a child to __FibonacciNode* new_child:__ The `FibonacciNode` that is the child| Add a child to a `FibonacciNode` | |add_sibling |void |__FibonacciNode* node:__ The `FibonacciNode` to add a sibling to __FibonacciNode* new_sibling:__ The `FibonacciNode` that is the sibling|Add a sibling to a `FibonacciNode`| |remove|void|__FibonacciNode* node:__ The `FibonacciNode` to remove|Remove `FibonacciNode` from its associated graph| The one of the usages of `FibonacciHeap` is within `sklearn/neighbors/_binary_tree.pxi`, however, the heap is fully encapsulated within the `Binary Tree` so there is no change in developer usage.

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.