# BD1003_P17 Machine Learning with Python
# Day 3
## Recap on Day 2
## Basic scikit-learn
- workflow for supervised learning (see the code sketch after the list)
1. Define the problem
2. prepare the data / understand the data
3. split the data into training / testing sets
4. pre-process the data (the training and testing sets must undergo the same preprocessing steps, fitted on the training set only)
5. choose a model/algorithm
6. select/tune the model hyperparameters
7. train the model with training set
8. evaluate the model with the testing set
9. deploy / use the model
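A minimal sketch of steps 2-8, assuming the built-in iris dataset and a kNN classifier as placeholder choices (any estimator with the same `fit`/`predict` API would do):

```python
# Minimal sketch of the workflow above (iris + kNN are placeholder choices)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                      # 2. prepare the data

# 3. split BEFORE fitting any preprocessing, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. fit the scaler on the training set only, apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 5-7. choose a model, set its hyperparameters, train on the training set
model = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# 8. evaluate on the testing set (mean accuracy)
print(model.score(X_test_s, y_test))
```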
- k-nearest neighbours algorithm
- a simple "toy" model, useful as a first baseline
- balance the trade-off of model accuracy vs model robustness (generalisation)
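A sketch of that trade-off, assuming the built-in breast-cancer dataset: very small `n_neighbors` overfits, very large underfits.

```python
# Sketch: n_neighbors controls the accuracy vs generalisation trade-off
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 15, 51]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: train={knn.score(X_train, y_train):.3f} "
          f"test={knn.score(X_test, y_test):.3f}")
# k=1 memorises the training set (train score 1.0) but generalises worse;
# very large k smooths too much and both scores fall
```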
- linear models
- $\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \dots + c$, or in matrix form $\hat{y} = X\beta$
- linear regression - $\arg\min_\beta \|y - X\beta\|^2$
- Ridge regression - $\arg\min_\beta \|y - X\beta\|^2 + \alpha \|\beta\|_2^2$
- Lasso regression - $\arg\min_\beta \|y - X\beta\|^2 + \alpha \|\beta\|_1$
- Logistic regression - adapts the linear model to classification tasks
1. $z = \beta_1 x_1 + \beta_2 x_2 + c$
2. $p = \frac{1}{1+e^{-z}}$
3. classify as positive if $p > 0.5$ (default threshold)
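All of these models share the same `fit`/`predict`/`score` API; a sketch on the built-in diabetes dataset (the `alpha` values here are arbitrary examples, not recommendations):

```python
# Sketch: the linear models above, side by side
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # R^2 score

# LogisticRegression works the same way on a classification target;
# its regularisation strength is C (the inverse of alpha)
```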
- decision tree
- decision trees overfit easily, so you MUST tune/prune the model
- DTs use an impurity measure to choose partitions: Gini or entropy
- pruning the tree:
- manual pruning: max_depth, min_samples_leaf, min_samples_split
- cost-complexity pruning: $\min_T \mathrm{Error}(T) + \alpha |T|$, where $|T|$ is the number of leaves
- grid search to find the optimal hyperparameters
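A sketch combining the pruning approaches above, again assuming the built-in breast-cancer dataset; the parameter grids are illustrative, not recommendations:

```python
# Sketch: manual pruning, cost-complexity pruning, and grid search
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# manual pruning via hyperparameters
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

# candidate alphas for cost-complexity pruning come from the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# grid search over the pruning hyperparameters
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 4, 5, None],
                "min_samples_leaf": [1, 5, 10],
                "ccp_alpha": path.ccp_alphas[::10]},
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```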
# Ensemble
# Neural Network
# Feature Selection
# Model Selection
# Unsupervised learning: clustering
# Unsupervised learning: dimensionality reduction
<hr>
# Day 2
## Recap on Day 1
- Introduction to Machine Learning
- Supervised vs Unsupervised learning
- two types of tasks: Regression and Classification
- various algorithms for regression/classification
- trends and tools: scikit-learn (ML), pandas (data wrangling), matplotlib
- Datasets
- where we can get datasets: Kaggle or UCI repo
- tidy datasets:
- 1 column 1 feature
- 1 row 1 sample
- description text for your dataset
- Exploratory Data Analysis (EDA)
- when building ML models, we implicitly assume many things:
- all features are on a similar range/scale
- the data are (roughly) normally distributed
- there are some relationships between features and target
- understand the data first:
- histogram/distribution
- scatter plot
- Data Pre-processing
- types of data: categorical and numeric
- categorical:
- binary -> encoded into 0 and 1
- nominal -> one-hot encoding
- ordinal -> encode into ascending or descending integers
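A sketch of the three encodings; the columns (`smoker`, `colour`, `size`) are made-up examples:

```python
# Sketch: encoding binary, nominal, and ordinal columns
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "smoker": ["yes", "no", "no"],           # binary
    "colour": ["red", "blue", "red"],        # nominal
    "size":   ["small", "large", "medium"],  # ordinal
})

df["smoker"] = (df["smoker"] == "yes").astype(int)   # binary -> 0/1
df = pd.get_dummies(df, columns=["colour"])          # nominal -> one-hot
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size"] = enc.fit_transform(df[["size"]])         # ordinal -> 0, 1, 2
print(df)
```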
- numeric:
- scaling - brings all features into roughly the same range
- standard scaler: $(x - \text{mean}) / \text{std}$
- robust scaler: $(x - \text{median}) / \text{IQR}$
- min-max scaler: $(x - x_{\min}) / (x_{\max} - x_{\min})$
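A sketch of the three scalers on a toy column with one outlier, to show how they differ:

```python
# Sketch: the three scalers on the same data
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])          # note the outlier

print(StandardScaler().fit_transform(X).ravel())      # (x - mean) / std
print(RobustScaler().fit_transform(X).ravel())        # (x - median) / IQR
print(MinMaxScaler().fit_transform(X).ravel())        # (x - min) / (max - min)
# the robust scaler is least distorted by the outlier;
# min-max squashes the other values towards 0
```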
- transformation
- used to convert non-normally distributed data into (approximately) normally distributed data
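A sketch on right-skewed synthetic data; a log transform is the classic choice, and scikit-learn's `PowerTransformer` fits a suitable exponent automatically:

```python
# Sketch: pulling skewed data towards a normal distribution
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(1000, 1))   # right-skewed data

x_log = np.log1p(x)                              # simple log transform
x_yj = PowerTransformer().fit_transform(x)       # Yeo-Johnson, fitted exponent
```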
- missing values
- in `pandas` the missing values are represented by `None` or `np.nan`
- drop the missing values
- impute / fill in the missing values with representative data
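A sketch of both strategies on a made-up DataFrame (`age` and `income` are placeholder columns):

```python
# Sketch: dropping vs imputing missing values
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [3000, 5000, np.nan]})

dropped = df.dropna()                        # strategy 1: drop rows with NaN
filled = df.fillna(df.median())              # strategy 2: fill per-column median

imputer = SimpleImputer(strategy="median")   # sklearn version: fit on the
imputed = imputer.fit_transform(df)          # training data, reuse on the test set
```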
## Scikit-learn
- scikit-learn (sklearn) is popular due to its consistent and easy-to-use API
- Supervised Learning workflow
1. Define your problem
2. Prepare data / understand data (EDA)
3. split data into training and testing sets
4. data pre-processing: scaling, transformation, encoding (fitted on the training set, applied to both)
5. select an algorithm
6. select hyperparameter of the algorithm
7. train the model using the training set
8. evaluate the performance of the model by using the testing set
9. deploy the model
## K-nearest Neighbour
## Linear Models
## Decision Trees
## Ensemble
<hr>
# Day 1
## Setup Python
- Two options: Anaconda or Official Python
- Setup for Official Python
## Introduction to Machine Learning
- [x] What types of machine learning are there?
- supervised vs unsupervised
- deep learning
- predictive modelling and generative modelling
- reinforcement learning
- [x] Supervised vs unsupervised learning?
- supervised learning: input data, X and target output, y
- algorithms: k-nearest neighbour, linear models, decision trees, neural network, support vector machines, random forest, etc ...
- unsupervised learning: input data, X
- clustering: K-means, DBSCAN
- outliers/anomaly detection: DBSCAN
- dimensionality reduction: PCA, NMF
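A sketch of those unsupervised algorithms, using the iris features as a stand-in for unlabelled data:

```python
# Sketch: clustering, outlier detection, and dimensionality reduction
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import NMF, PCA

X, _ = load_iris(return_X_y=True)                    # ignore y: unsupervised

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # clustering
noise = DBSCAN(eps=0.5).fit_predict(X)               # label -1 marks outliers
X2 = PCA(n_components=2).fit_transform(X)            # dimensionality reduction
W = NMF(n_components=2, max_iter=500).fit_transform(X)   # needs non-negative X
```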
- [x] Regression tasks and Classification tasks
- Regression: y is continuous, e.g. predicting a housing price
- Classification: y is discrete, e.g. cat or not cat
- [x] trends and tools
- tools: scikit-learn (model building), pandas (data processing), matplotlib (visualisation)
## Basic Data Exploration
### Setup Official Python
- [x] create a virtual environment on the Desktop
- find out where your Python executable is
- `C:\Users\gohyk\AppData\Local\Programs\Python\Python312\python.exe`
- create a virtual env
- change directory to where you want to keep virtualenv
- `cd Desktop`
- `<path-to-python> -m venv base`
- [x] create a shortcut to activate virtual environment
- create run_base.bat file using notepad.exe
- `cmd.exe /K C:\Users\gohyk\Desktop\base\Scripts\activate.bat`
- [x] install relevant libraries: pandas, matplotlib, scikit-learn, jupyter
- start the virtual env and pip install
- `pip install pandas matplotlib scikit-learn jupyter`
### Using Jupyter Notebook
- start Jupyter notebook
- open a ipynb file
- the working regions are cells (fragments of code or text)
- two types of cells: code cell (python), markdown cell (documentation)
- markdown notations
- "simplified html"
- provides minimal formatting for your text
- shortcuts in Jupyter notebook:
- `Esc` / `Enter`: switch between command mode and edit mode
- `A` / `B`: insert a cell above / below
- `M` / `Y`: change a cell to markdown / code
- `Shift+Enter`: run the current cell
- hyperlinks: `[text](url)`
- reference for markdown: [guide](https://www.markdownguide.org/basic-syntax/)
- reference for math equations: [latex](https://en.wikibooks.org/wiki/LaTeX/Mathematics)
- mathematical equations
- $y = \beta x + \alpha$
$$e^{i\pi} + 1 = 0$$
### Using `pandas`
- dataframe = spreadsheet
- series = a column / a row
- load csv
- pick a column, several columns
- pick a row, several rows
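A sketch of those operations; `data.csv` and the column names are placeholders:

```python
# Sketch: basic pandas operations (file and column names are placeholders)
import pandas as pd

df = pd.read_csv("data.csv")         # load csv -> DataFrame ("spreadsheet")

col = df["price"]                    # one column -> Series
cols = df[["price", "area"]]         # several columns -> DataFrame

row = df.iloc[0]                     # one row by position -> Series
rows = df.iloc[10:20]                # several rows by position
rows2 = df.loc[df["price"] > 100]    # rows matching a condition
```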
# Exploratory Data Analysis
## Types of data
- Categorical: binary, nominal, ordinal
- binary: 0/1
- nominal: order does not matter
- eg: colour of a car
- ordinal: order does matter, you can "order" your labels
- eg: baby, kids, adult, elderly
- Numerical
- understanding the type of data affects how you encode your categorical data
- nominal data -> use one-hot encoding
- ordinal data -> use ordinal encoding
- Caution: data that looks numeric might actually be categorical
- data: `[1, 3, 4, 1, 1, 2]`
- the above data are answers to an MCQ survey: the numbers are labels, not quantities
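A sketch of how you might handle such a column in pandas, treating the answers as labels rather than quantities (`q1` is an illustrative prefix):

```python
# Sketch: numeric-looking MCQ answers treated as categorical
import pandas as pd

answers = pd.Series([1, 3, 4, 1, 1, 2])        # answer codes, not quantities

cat = answers.astype("category")               # mark as categorical
onehot = pd.get_dummies(answers, prefix="q1")  # then one-hot like any nominal
```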
## Statistics
- ML models usually have certain assumptions
- models built on the concept of distance/similarity are affected by the range of the data
- statistics often assume data are normally distributed
=> we are interested in the **distribution** of the data
- We want to remove redundant features or overly represented features
- you can identify these with a **scatter plot** (see the sketch below)
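A sketch of both plots, assuming the built-in iris data; in the scatter matrix, two features lying on a near-straight line suggest one of them is redundant:

```python
# Sketch: histogram and scatter plots for EDA
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

df.hist(figsize=(8, 6))                                     # distribution per feature
pd.plotting.scatter_matrix(df.iloc[:, :4], figsize=(8, 8))  # pairwise scatter
plt.show()
```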