# CSCD01 Assignment 1
### Team Name: bestteam
### Team Members:
- Anand Karki
- Charles Xu
- Donnie Siu
- Gaurav Sharma
- Jun Zheng
- Syed Sharjeel Haider
- Smit Patel
## Overall Code Structure
The entry point of the package is `__init__.py`, in which all subpackages are listed.
Most subpackages are defined in their own sub-directory, but some exceptions exist, such as multiclass, which is defined in the single file `multiclass.py`.
Within each subpackage, the classes it exports are declared again in that subpackage's own `__init__.py`. Most packages are independent of one another, but nearly all of them depend on the base classes in `base.py` and on the utils package, which contains common utility methods.
## Visual Paradigm UML
The UML diagram was generated with Visual Paradigm. The image is too large to embed in this document, so it can instead be found [here on GitHub](https://github.com/UTSCCSCD01/course-project-bestteam/blob/master/a1/scikit-learn-vp-uml.jpg).
## High Level Overview

Looking at the scikit-learn packages from a high level, we can generalize them into 3 core components: estimators, predictors, and transformers. You can think of an estimator as any object that learns from a dataset through its `fit` method. All the base classes live in `base.py`, including the mixin classes for the different types of estimators. We found that all estimators extend `BaseEstimator` and are built up by combining it with different mixins. The two most common types of classes created from `BaseEstimator` and mixin combinations are predictors and transformers.
- Predictor classes usually implement the `fit`, `predict`, and `score` methods, and in some cases `fit_predict`.
- Predictors are models you fit on an input dataset to get some sort of prediction.

- Transformer classes, meanwhile, implement the `transform`, `fit`, and `fit_transform` methods.
- Transformers are operations you run on an input dataset to transform it in some way.
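
The predictor and transformer descriptions above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's code: the two base classes are simplified stand-ins that only reuse the library's names, `MeanCenterer` and `MajorityClassifier` are hypothetical toy classes, and this `get_params` is a simplification (the real library inspects the constructor signature).

```python
class BaseEstimator:
    """Simplified stand-in for the shared base class."""

    def get_params(self):
        # Simplification: treat attributes without a trailing underscore
        # as constructor parameters; learned state ends with "_".
        return {k: v for k, v in vars(self).items() if not k.endswith("_")}


class TransformerMixin:
    """Adds fit_transform to any class that defines fit and transform."""

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the mean learned during fit."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X) if self.with_mean else 0.0
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class MajorityClassifier(BaseEstimator):
    """Hypothetical predictor: always predicts the most common label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Accuracy: fraction of predictions that match the true labels.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


centered = MeanCenterer().fit_transform([1.0, 2.0, 3.0])
classifier = MajorityClassifier().fit([0, 1, 2], ["a", "b", "a"])
```

The point of the sketch is the division of labour: the mixin supplies a method that is identical for every transformer, while each concrete class only writes the `fit` and `transform` logic that is specific to it.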

Besides the three core components, scikit-learn provides three other components, each of which is a composition of core components with additional functionality:
- Meta-estimators combine multiple estimators into a single estimator.
- Pipelines chain multiple transformers and a predictor into a single estimator.
- Model selectors are meta-estimators that run an estimator multiple times with different parameter values and pick the best result.

All these components are crucial in helping users build a workflow that suits their needs.
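
To make the model selector idea concrete, here is a minimal sketch in plain Python. `ThresholdClassifier` and `SimpleGridSearch` are hypothetical names invented for this illustration, and the sketch scores candidates on the training data for brevity, whereas a real model selector would normally use cross-validation.

```python
from itertools import product


class ThresholdClassifier:
    """Hypothetical predictor: predicts 1 when the input exceeds a threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn; the threshold is a hyperparameter

    def predict(self, X):
        return [int(x > self.threshold) for x in X]

    def score(self, X, y):
        return sum(p == t for p, t in zip(self.predict(X), y)) / len(y)


class SimpleGridSearch:
    """Hypothetical model selector: tries every parameter combination."""

    def __init__(self, estimator_class, param_grid):
        self.estimator_class = estimator_class
        self.param_grid = param_grid

    def fit(self, X, y):
        names = sorted(self.param_grid)
        self.best_score_ = -1.0
        for values in product(*(self.param_grid[n] for n in names)):
            params = dict(zip(names, values))
            candidate = self.estimator_class(**params).fit(X, y)
            score = candidate.score(X, y)
            if score > self.best_score_:
                self.best_score_ = score
                self.best_params_ = params
                self.best_estimator_ = candidate
        return self

    def predict(self, X):
        # Delegating to the winner lets the selector be used like any predictor.
        return self.best_estimator_.predict(X)


X = [0.1, 0.4, 0.6, 0.9]
y = [0, 0, 1, 1]
search = SimpleGridSearch(ThresholdClassifier, {"threshold": [0.0, 0.5, 1.0]})
search.fit(X, y)
```

Because the selector exposes `fit` and `predict` itself, it can be dropped in anywhere a plain predictor is expected, which is what makes it a meta-estimator rather than a separate tool.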
## Design Patterns
### Builder Design Pattern
The ``tree`` package in scikit-learn encapsulates tree creation using the builder design pattern. The following class UML diagram displays a few sample classes that interact with the TreeBuilder. Some class attributes and methods were omitted in order to isolate the builder pattern.
Class UML

The following sequence diagram illustrates a sample event that invokes the builder pattern. The builder is invoked when the fit() method is called for the DecisionTreeClassifier class.
Sequence Diagram

Here are the associated classes defined on the UML diagram:
- Tree: https://github.com/scikit-learn/scikit-learn/blob/8ea176ae0ca535cdbfad7413322bbc3e54979e4d/sklearn/tree/_tree.pyx#L495
- TreeBuilder: https://github.com/scikit-learn/scikit-learn/blob/8ea176ae0ca535cdbfad7413322bbc3e54979e4d/sklearn/tree/_tree.pyx#L82
- DepthFirstTreeBuilder: https://github.com/scikit-learn/scikit-learn/blob/8ea176ae0ca535cdbfad7413322bbc3e54979e4d/sklearn/tree/_tree.pyx#L121
- BaseDecisionTree: https://github.com/scikit-learn/scikit-learn/blob/8ea176ae0ca535cdbfad7413322bbc3e54979e4d/sklearn/tree/_classes.py#L80
- DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/8ea176ae0ca535cdbfad7413322bbc3e54979e4d/sklearn/tree/_classes.py#L607
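
The interaction described above can be sketched in plain Python. The class names echo the ones in the list, but everything here is a simplified illustration rather than the Cython implementation: `ToyDecisionTree` is a hypothetical client, and splitting each node at the median is a placeholder assumption standing in for the real split criterion.

```python
class Tree:
    """Toy product: a flat list of (depth, values) nodes."""

    def __init__(self):
        self.nodes = []

    def add_node(self, depth, values):
        self.nodes.append((depth, values))


class TreeBuilder:
    """Abstract builder: subclasses decide the order in which nodes are grown."""

    def build(self, tree, X):
        raise NotImplementedError


class DepthFirstTreeBuilder(TreeBuilder):
    """Grows the tree depth-first using an explicit stack."""

    def __init__(self, max_depth):
        self.max_depth = max_depth

    def build(self, tree, X):
        stack = [(0, sorted(X))]
        while stack:
            depth, values = stack.pop()
            tree.add_node(depth, values)
            if depth < self.max_depth and len(values) > 1:
                mid = len(values) // 2  # placeholder split rule
                # Push the right half first so the left half is expanded first.
                stack.append((depth + 1, values[mid:]))
                stack.append((depth + 1, values[:mid]))


class ToyDecisionTree:
    """Hypothetical client: fit() creates the product and delegates construction."""

    def __init__(self, max_depth=1):
        self.max_depth = max_depth

    def fit(self, X):
        self.tree_ = Tree()
        builder = DepthFirstTreeBuilder(self.max_depth)
        builder.build(self.tree_, X)
        return self


model = ToyDecisionTree(max_depth=1).fit([3, 1, 2, 4])
```

The client never adds nodes itself. Swapping in a different `TreeBuilder` subclass would change how the tree is grown without touching `fit()` or `Tree`.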
### Factory Design Pattern
Because ``scikit-learn`` contains several different implementations of various data modelling techniques, these implementations are connected and generalized using the factory design pattern.
As one specific instance, we look at how neural networks are implemented in scikit-learn and how the use of the factory design pattern makes constructing both the MLPClassifier and the MLPRegressor clean and easy for others to use and understand.


From the UML diagram above, we note that most, if not all, classifiers and regressors are subclasses of ClassifierMixin or RegressorMixin respectively, each with its own required functions such as ``predict()`` and ``score()``. We also note that MLPClassifier builds a classifier on top of BaseMultilayerPerceptron, while MLPRegressor builds a regressor on top of the very same BaseMultilayerPerceptron. The way the children of ClassifierMixin and RegressorMixin produce their output is not predetermined; it is left as a decision for the subclasses to make. Here, we can see that a single neural network implementation is built to both classify and regress data.
Here are the associated classes defined on the UML diagram:
- BaseEstimator: https://github.com/scikit-learn/scikit-learn/blob/dac560551c5767d9a8608f86e3f253e706026189/sklearn/base.py#L141
- ClassifierMixin: https://github.com/scikit-learn/scikit-learn/blob/dac560551c5767d9a8608f86e3f253e706026189/sklearn/base.py#L470
- RegressorMixin: https://github.com/scikit-learn/scikit-learn/blob/dac560551c5767d9a8608f86e3f253e706026189/sklearn/base.py#L506
- BaseMultilayerPerceptron: https://github.com/scikit-learn/scikit-learn/blob/dac560551c5767d9a8608f86e3f253e706026189/sklearn/neural_network/_multilayer_perceptron.py#L42
- MLPClassifier: https://github.com/scikit-learn/scikit-learn/blob/dac560551c5767d9a8608f86e3f253e706026189/sklearn/neural_network/_multilayer_perceptron.py#L701
- MLPRegressor: https://github.com/scikit-learn/scikit-learn/blob/dac560551c5767d9a8608f86e3f253e706026189/sklearn/neural_network/_multilayer_perceptron.py#L1127
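
The structure described above, one shared base class specialized into a classifier and a regressor, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the mixins are simplified stand-ins, and `BaseToyModel`, `ToyClassifier`, and `ToyRegressor` are hypothetical classes that fit a one-parameter line through the origin instead of a neural network.

```python
class ClassifierMixin:
    """Simplified stand-in: scores with accuracy."""

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


class RegressorMixin:
    """Simplified stand-in: scores with the R^2 coefficient."""

    def score(self, X, y):
        predictions = self.predict(X)
        mean_y = sum(y) / len(y)
        ss_res = sum((t - p) ** 2 for p, t in zip(predictions, y))
        ss_tot = sum((t - mean_y) ** 2 for t in y)
        return 1 - ss_res / ss_tot


class BaseToyModel:
    """Hypothetical shared base: least-squares line through the origin."""

    def fit(self, X, y):
        targets = self._encode_targets(y)  # decided by the subclass
        numerator = sum(x * t for x, t in zip(X, targets))
        self.slope_ = numerator / sum(x * x for x in X)
        return self

    def _raw_output(self, X):
        return [self.slope_ * x for x in X]

    def _encode_targets(self, y):
        raise NotImplementedError


class ToyClassifier(ClassifierMixin, BaseToyModel):
    def _encode_targets(self, y):
        return [1.0 if label else -1.0 for label in y]

    def predict(self, X):
        # Turn the shared raw output into a class label.
        return [int(out > 0) for out in self._raw_output(X)]


class ToyRegressor(RegressorMixin, BaseToyModel):
    def _encode_targets(self, y):
        return list(y)

    def predict(self, X):
        # Use the shared raw output directly as the prediction.
        return self._raw_output(X)


clf = ToyClassifier().fit([-2, -1, 1, 2], [0, 0, 1, 1])
reg = ToyRegressor().fit([1, 2, 3], [2, 4, 6])
```

The fitting logic is written once in the base class, and each subclass only decides how targets are encoded and how raw output becomes a prediction.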
### Dependency Injection Pattern
Dependency injection is used throughout scikit-learn as a means of passing complex parameters into an object while maintaining modularity.
In the example here, we look at the Pipeline class, whose responsibility is to apply a list of transformations to some data. That list of transformations is injected as a list of tuples, each pairing a string with a Transform object. By Transform objects we mean instances of classes that subclass ``TransformerMixin``.
Since a lot of classes fall into this Transform category (over 100 classes), we will simply look at the top-level parent class, ``TransformerMixin``, and some of its subclasses.
Class UML

Sequence Diagram

Here are the associated classes defined on the UML diagram:
- Pipeline: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/pipeline.py
- TransformerMixin: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py
- The rest of the classes: https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/preprocessing
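
The injection of named steps into a pipeline can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `TransformerMixin` is a simplified stand-in, and `Center`, `Scale`, and `ToyPipeline` are hypothetical classes made up for the example.

```python
class TransformerMixin:
    """Simplified stand-in: adds fit_transform on top of fit and transform."""

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class Center(TransformerMixin):
    """Hypothetical step: subtracts the mean learned during fit."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Scale(TransformerMixin):
    """Hypothetical step: multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return [x * self.factor for x in X]


class ToyPipeline:
    """Applies injected steps in order; it never constructs a step itself."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) tuples from the caller

    def fit_transform(self, X):
        for _name, transformer in self.steps:
            X = transformer.fit_transform(X)
        return X


pipeline = ToyPipeline([("center", Center()), ("scale", Scale(2))])
result = pipeline.fit_transform([1.0, 2.0, 3.0])

# Injecting a different list changes the behaviour without editing ToyPipeline.
scale_only = ToyPipeline([("scale", Scale(10))]).fit_transform([1.0, 2.0])
```

Because the pipeline only relies on each step offering `fit_transform`, any new transformer can be added to the list without changing the pipeline class.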