Gaurav Sharma
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # Assignment 3 ## Table of Contents - [Assignment 3](#assignment-3) * [Table of Contents](#table-of-contents) * [Feature / Bug Selection](#feature---bug-selection) + [Missing features removal with SimpleImputer #16426](https://github.com/scikit-learn/scikit-learn/issues/16426) + [Fibonacci heap for Ball Tree #597](https://github.com/scikit-learn/scikit-learn/issues/597) * [Acceptance Testing](#acceptance-testing) + [Test 1: Basic Test](#test-1--basic-test) + [Test 2: Performance Test](#test-2--performance-test) * [Performance Improvements](#performance-improvements) + [Complexity Analysis](#complexity-analysis) + [Performance Benchmarks](#performance-benchmarks) * [User Guide](#user-guide) ## Feature / Bug Selection ### [Missing features removal with SimpleImputer #16426](https://github.com/scikit-learn/scikit-learn/issues/16426) For background, the process of imputing data involves filling in the blanks with existing data. In ``scikit-learn``, imputing data is possible through several different imputing modules such as ``SimpleImputer`` under ``sklearn.impute``. There are also several different imputing techniques such as ``mean``, ``median``, ``most_frequent``, and ``constant``. The bug documented in this section occurs when attempting to impute multi-dimensional arrays. **Bug:** ``` >>> from sklearn.impute import SimpleImputer >>> import numpy as np >>> X = np.array([[0, np.nan], [1, np.nan]]) >>> SimpleImputer(strategy='median').fit_transform(X) array([[0.], [1.]]) ``` **Expected:** ``` >>> SimpleImputer(strategy='median').fit_transform(X) array([[0., nan], [1., nan]]) ``` In multi-dimensional arrays, all columns are treated as individual sets of data. Unfortunately, in the example above, ``X`` has an entirely blank second column ``[np.nan, np.nan]``. The blank elements cannot be imputed as there is no existing data to refer to. The column is silently trucanted from the output, changing the dimensions of the input matrix. This is undocumented behaviour that only occurs with imputing strategies ``mean``, ``median``, and ``most_frequent``. An unwarrented change in the matrix's dimensions can cause problems when it is pipelined across multiple instructions. **UML Diagram:** <p align="center"> <br> <img src="https://kthisiscvpv.com/OyrjtZDY4sohNRB35Z.png"> <br><br> UML Diagram generated with <a href="https://www.pynsource.com/">Pynsource</a> </p> **Investigation:** Unfortunately, creating a patch to this issue comes with several different complications. We begin by addressing the fact that there exists 3 different imputers inside ``scikit-learn``; after reading the issue's ticket, it is evident that the maintainers of the ``scikit-learn`` expect equivalent behavior across all imputers. Furthermore, after investigating why the problem occurs, we note that while this behaviour is undocumented, it is actually intentionally handled as an edge case. For example, the code sample below handles the masking of invalid columns inside the ``SimpleImputer`` module inside its ``transform(...)`` function. ``` $ sklearn/impute/_base.py def transform(self, X): ... # Delete the invalid columns if strategy is not constant if self.strategy == "constant": ... else: ... ``` However, it is also observed that the masking of input data is different for all of the 3 different imputers; deploying a patch to ``SimpleImputer`` will not patch the issue inside ``KNNImputer`` nor ``IterativeImputer``. **Patch Design:** This patch must be backwards compatible; projects that already handle this issue on their own must not be broken as a result of this specific patch. This concern is also noted inside the issue's ticket as well, under several different comments. An additional parameter should be added to the inputs of the different imputer modules. This parameter should act as a toggle for this patch. In the investigation of the legacy code base shown above, we note that this masking of invalid columns is actually intended behaviour. To simply prevent this from happening, a check can be made before deletion of columns occur. Unfortunately, also mentioned in the investigations above, all modules handle the masking of invalid columns differently. Therefore, we cannot make use of the inheritance hierarchy that all imputers inherit off of ``class _BaseImputer``. All modules will be modified individually. Interactions between the legacy modules should not change. However, instructions inside some of the functions in each imputer may change. The UML diagram below highlights the expected changes to be made to both classes and their functions in this patch. <p align="center"> <br> <img src="https://kthisiscvpv.com/NYpAJsbSZH0ZTA3lgC.png"> </p> ### [Fibonacci heap for Ball Tree #597](https://github.com/scikit-learn/scikit-learn/issues/597) For context, this fix involves optimizing the neighbour search for `BallTree`. Currently, the neighbour search algorithm uses priority queue and binary heaps which are considerably slower when compared to a fibonacci heap especially for large `n_neighbors` and `n_features`. The fix for this involves refactoring out existing fibonacci heap implementation and using it inside `graph_shortest_path` module as shown in the UML below. **UML Diagram:** ![](https://i.imgur.com/0h22O7q.png) ## Acceptance Testing This features involves improving the performance of the BallTree class. Currently BallTree is inherting BinaryTree, however it is proposed using a Fibonacci Heap would result in better performance. Since the refactoring we are doing cannot be directly interacted with by a user, we will be using NearestNeighbors as our class to test upon. It is the lowest level class that we can modify as users to run the code we have improved. This can be seen below in the following snipet of code. <!-- Note for Gaurav: This is some code that u can use, ur actually working with NearestNeighbors and not even BallTree --> ``` 1| from sklearn.neighbors import NearestNeighbors 2| import numpy as np 3| X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) 4| nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X) 5| distances, indices = nbrs.kneighbors(X) ``` What this code will do is take the coordinates and find the closest pair values. ![](https://i.imgur.com/uBLBqsY.png) The above image is an example of how it would the `ball_tree` algorithm may be used to slowly focus in on groups and make the circles smaller until you have k points per circle. ### Test 1: Basic Test First test we can run is to see if NearestNeighbors is working with the BallTree algorithm. This will test our underlying changes. We can run steps to imitate the code above. 1) Import the NearestNeighbors class aswell as numpy (line 1-2 from above) 2) Create an array of coordinates inside an array notation with numpy (line 3 from above) 3) Create the NearestNeighbors class objects with the `n_neighbors` parameter being the number of neighbors you want grouped together. For example 2 means you want to find the 2 nearest coordinates, 3 would mean you want 3 coordinates which which are the closest to eachother, ext. We will choose 2 for `n_neighbors`. For the `algorithm` parameter you must select `'ball_tree'`, since that is the algorithm class in which we have added our new Fibonacci Heap implementation (line 4 from above) 4) Run the NearestNeighbors objects `kneighbors` function with the array of cordinates from step 2 (line 5 from above) 5) The expected outcome will be a list of distances and a list of indices. The distances array will contain the distances to the closest point(s) (since we choose `n_neighbors` as 2 it will show the distance to the neighbor nearest point). The indices contains the index of the coordinates involved from the original array passed in for the respective distances. You are basically provided the closest neighbor point of every point in the original array. ### Test 2: Performance Test Since the point of this refactor is to see improvements in performance we can try a performance test. We can run the following steps on both the old and new implementation to hopefully see performance improvements. 1) Import the NearestNeighbors class aswell as numpy 2) Create an array with over 500000 cordinates 3) Create a NearestNeighbors class object and set the `n_neighbors` to 100, and select the `algorithm` to be `'ball_tree'` 4) Run the`kneighbors` function on the array from step 2. Note down the time it took to finish this computation 5) Compare the runtime of step 4 using the old and new implementation and confirm the new implementation has a better runtime ## Performance Imporvements ### Complexity Analysis As stated previously, the purpose of this enhancement is to improve the performance of methods in the `BallTree` class by using a Fibonacci Heap instead of a regular Binary Heap. Let us examine the complexity of operations for both data structures. | Operations | Binary Heap | Fibonacci Heap| | ------------- |:-------------:| :-----: | | make-heap | O(1) | O(1) | | isempty | O(1) | O(1) | | insert | O(log n) | **O(1)** | | delete-min (or max) | O(log n) | O(log n) | | decrease-key (or increase) | O(log n) | **O(1)** | Table obtained from: [Princeton University](https://www.cs.princeton.edu/~wayne/teaching/fibonacci-heap.pdf) Notice that insertion operation and decreasing/increasing key operation have their complexity reduced from logarithmic to constant. Such an improvement can have drastic improvements for models using the `BallTree` class, as these are elementary operations that are used frequently. Caveat: the column shown for the Fibonacci Heap is its amortized complexity, whereas the column for the Binary Heap is the worst-case complexity. However, this doesn't detract from the value of using Fibonacci Heap as Binary Heaps don't implement any form of amortization, thus making Fibonacci Heaps significantly faster. ### Performance Benchmarks Here we will anaylze the actual run-times of `BallTree` methods and run-times other classes that use `BallTree`. First examining the second acceptance test we get the following results: |Trial Number | Initial Run-Time | Run-Time after Update | |-------------| ------------- |:-------------:| |1| 92.27815200000000012s | 88.87392354011536s | |2| 96.310639999999998s | 84.058232199999999984s | |3| 95.0502696s | 91.43743699999999996s | |4| 95.69870156484651s | 92.034031900000000004s | This example shows how a more efficient `BallTree` has effects on methods in other classes that use it (in this case it is the `NearestNeighbors` class). While it may seem like the actual time has improved by an insignificant amount, one must keep in mind that the example we are using is quite small and such an operation occurs innumerable times in actual machine learning pipelines. We also measured the run-time multiple times to gain an accurate understanding. Now we will test specifically the `query()` method in the `BallTree` class. This method is responsible for finding x-number of nearest neighbours, where x is an input. We will vary the sample size and number of features to gain a holistic view of the difference in run-times. |Sample Size | Number of Features | Initial Run-Time | Run-Time after Update | | ------------- | ------------- |:-------------:|:-------------:| |100000 | 1 | 1.032856700000000016s | 0.8860089999999996s | |100000 | 2 | 3.70338561s | 3.242743299999999s | |100000 | 3 | 10.034818900000000014s | 7.261899s | |100000 | 4 | 24.07591970000000003s | 19.4624235s | |200000 | 1 | 1.01295156s | 1.3102832000000006s | |200000 | 2 | 10.913645199999998s | 10.03558810000000001s | |200000 | 3 | 33.03558810000000001s | 29.04188570000001s | |200000 | 4 | 78.0441502s | 65.31413970000001s | |300000 | 1 | 8.0347606s | 9.0776396s | |300000 | 2 | 15.066563100000025s | 13.066563100000025s | |300000 | 3 | 49.9066350000000001s | 42.383289399999995s | |300000 | 4 | 331.30973699999999993s | 204.73572170000003s | Notice how there is a general trend, that as the initial run-time increases, the run-time after the update is much lower. This shows the effect of reducing the complexity from logarithmic to constant. Note that there are some cases, where the run-time after the update is worse, this is likely because Fibonacci heap require more computation to initilaize. Note that timing code is machine-depentent and results can vary depending on the machine the tests are ran on, the number of proceess running and several other factors. However, the trend remains clear that the run-time is significatly lowered after the update. ## User Guide As this feature is a performance enhancement and not a tangible user feature, it is not possible to produce a guide for the user. However, the following is a guide for fellow developers on the newly created `FibonacciHeap` data structure in the `sklearn/utils` package. ### sklearn.utils.fib ```struct FibonacciHeap``` | Attribute Name | Type | Details | | -------------- | :----------------: | :--------------------------------------: | | min_node |FibonacciNode* | The node in the heap with minimum weight | | roots_by_rank | pFibonacciNode[100] | The roots of the fibonacci heap based on the rank values. The rank of a node denotes the number of children that the node has. No two roots in the fibonacci heap have the same rank. <br> **i.e:** roots_by_rank[0] is the root with 0 children, roots_by_rank[1] is the root with 1 child, etc. | | Method | Returns | Parameters | Details | | ------------ | :------------: | :--------: | :-----: | | insert_node | void | __FibonacciHeap* heap:__ The heap to perform the operation on <br> __FibonacciNode* node:__ The node to insert | Inserts a given node into the heap | | decrease_val | void | __FibonacciHeap* heap:__ The heap to perform the operation on <br> __FibonacciNode* node:__ The node to decrease <br> **DType_t (np.float64_t) newval:** The new value of the node | Decrease the value of a given node in the heap | | link | void | __FibonacciHeap* heap:__ The heap to perform the operation on <br> __FibonacciNode* node:__ The node to link | Links a node as a root if possible, else insert the node into the tree that occupies the root spot | | remove_mid | FibonacciNode* | __FibonacciHeap* heap:__ The heap to perform the operation on | Removes the node with the lowest weight | ```struct FibonacciNode``` | Attribute Name | Type | Details | | -------------- | :--------------------: | :---------------------------------------------------------------: | | index | unsigned int | The location of the node in the associated graph | | rank | unsigned int | The number of children the node has | | state | unsigned int | The state of the node (0 = Not in heap, 1 = In heap, 2 = Scanned) | | val | DTYPE_t (np.float64_t) | The value stored inside the node | | i1 | ITYPE_t (np.intp_t) | | | i2 | ITYPE_t (np.intp_t) | | | parent | FibonacciNode* | The pointer to the current node's parent | | left_sibling | FibonacciNode* | The pointer to the current node's left sibling | | right_sibling | FibonacciNode* | The pointer to the current node's right sibling | | children | FibonacciNode* | The pointer to the current node's child | | Method | Returns | Parameters | Details | | ---------------- | :------------: | :--------: | :-----: | |initialize_node |void | __FibonacciNode* node:__ The `FibonacciNode` to initialize <br> **unsigned int index:** The location of node to initialize in the associated graph <br> **DTYPE_t val:** The value stored inside of node to initialize <br> **ITYPE_t i1:** <br> **ITYPE_t i2:** <br> | Initialize the `FibonacciNode` to default values | |rightmost_sibling |FibonacciNode* | __FibonacciNode* node:__ The `FibonacciNode` to perform this operation on | Retrun the right-most sibling of the `FibonacciNode` | |leftmost_sibling |FibonacciNode* | __FibonacciNode* node:__ The `FibonacciNode` to perform this operation on |Retrun the left-most sibling of the `FibonacciNode`| |add_child |void |__FibonacciNode* node:__ The `FibonacciNode` to add a child to <br> __FibonacciNode* new_child:__ The `FibonacciNode` that is the child| Add a child to a `FibonacciNode` | |add_sibling |void |__FibonacciNode* node:__ The `FibonacciNode` to add a sibling to <br> __FibonacciNode* new_sibling:__ The `FibonacciNode` that is the sibling|Add a sibling to a `FibonacciNode`| |remove|void|__FibonacciNode* node:__ The `FibonacciNode` to remove|Remove `FibonacciNode` from its associated graph| The one of the usages of `FibonacciHeap` is within `sklearn/neighbors/_binary_tree.pxi`, however, the heap is fully encapsulated within the `Binary Tree` so there is no change in developer usage.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully