# Fundamentals of Machine Learning

---

<!-- Linear regression with sklearn recap -->

### Machine learning models learn from the past and predict the future

---

### 1: Supervised vs. Unsupervised Learning

----

**Supervised learning**

<span> modelling <!-- .element: class="fragment" data-fragment-index="1" --></span> <span>a relationship <!-- .element: class="fragment" data-fragment-index="2" --></span> <span>between input and output data<!-- .element: class="fragment" data-fragment-index="3" --></span>

----

**Unsupervised learning**

<span> grouping and interpretation of data <!-- .element: class="fragment" data-fragment-index="1" --></span> <span>that does not have a target <!-- .element: class="fragment" data-fragment-index="2" --></span>

---

### 2: Classification vs. Regression

----

**Classification**

predicting a label (e.g. hot or cold)

----

**Regression**

predicting a continuous value (e.g. temperature)

---

### 3: Stages of Machine Learning

----

![](https://hackmd.io/_uploads/rJ8_fp2In.png)

Note:
1. We've covered parts of this in previous weeks: sourcing data through web scraping, SQL, and APIs
2. The next step is preparing the data and ensuring it is good to use, which we will cover on Saturday
3. Then we train the model and check its performance; if it's not performing well, we go back to data preparation to further preprocess our data, or collect more data
4. Once we're happy with the model's performance, it can be deployed and used in real life
5. In this course, we're not coding the models ourselves; we will use models from sklearn's library, with some tweaks

---

### 4: Using Scikit-learn (Sklearn)
Note:
Today we will be using the linear regression model from sklearn; let's review how to import this module.
Use Google search to show how I copy-paste.

----

```python
# Import the model
from sklearn.linear_model import LinearRegression

# Instantiate the model
model = LinearRegression()
```

----

A review of the workflow

----

Load the dataset and display its head

```python
import pandas as pd

data = pd.read_csv('path/to/file')
data.head()
```

----

Defining X and y

```python
# y is a Series
y = data['SalePrice']

# X is a DataFrame
X = data[['feature1', 'feature2']]
```

----

Train the model on the data

```python
# The model learns intercepts and coefficients in the background
model.fit(X, y)
```

----

Evaluate the model

```python
# Evaluate using the default metric (R2)
model.score(X, y)
```

----

Making predictions

```python
# New features as a 2D array: one row per observation,
# one value per feature the model was trained on
new_data = [[1000, 2]]
model.predict(new_data)
```

---

### 5: Holdout Method

----

It involves splitting the dataset into two sets: train and test. It is used to evaluate a model's ability to generalize.

----

Usually, the dataset is split into 70% training data and 30% test data.

----

Example

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train on the training set only, then score on the unseen test set
model.fit(X_train, y_train)
model.score(X_test, y_test)
```

----

**Problem with this method**

- Data in the test set is not used to train the model
- If you have a small dataset, the information loss could be significant

---

### Solution: Cross-Validation

----

Cross-validation generates less biased performance estimates than the holdout method.
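----

A quick sketch of why: the holdout score depends on which rows happen to land in the test set. Using a hypothetical toy dataset generated with `make_regression` (not the course data), repeating the split with different random seeds gives noticeably different R2 scores:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 60 samples, 2 features, noisy target
X, y = make_regression(n_samples=60, n_features=2, noise=30, random_state=0)

# Same model, same data, five different 70/30 splits
scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print([round(s, 3) for s in scores])  # five different R2 values
```

Cross-validation averages over many such splits, so no single lucky or unlucky split dominates the estimate.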
----

![](https://hackmd.io/_uploads/S1VGha2I2.png)

Note:
* The dataset is split into K folds; here, we have three
* Each fold is used both for training and for testing across K splits (sub-models)
* The average score of all splits is the cross-validated score of the model

----

```python
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, X, y, cv=5)

# Scores
cv_results['test_score']

# Mean of scores
cv_results['test_score'].mean()
```

---

### Choosing the K Value

----

Choosing K is a tradeoff between trustworthy performance evaluation and computational expense.

Rule of thumb: 5

---

### Learning Curves (last part!)

----

Learning curves are used to diagnose overfitting and underfitting in a model. They visualize the effect of training set size on model performance.

----

![](https://hackmd.io/_uploads/S1nlTpnIn.png)

----

**Good signs**

1. The score for both the training and validation curves is relatively high
2. Both curves have converged

----

![](https://hackmd.io/_uploads/Bk8S6p3U2.png)

---

Thank you!
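---

Bonus: the learning curves above can be computed with sklearn's `learning_curve` helper. A minimal sketch, again on a hypothetical toy dataset from `make_regression` rather than the course data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Hypothetical toy dataset: 200 samples, 2 features
X, y = make_regression(n_samples=200, n_features=2, noise=10, random_state=0)

# Scores at 5 increasing training-set sizes,
# each estimated with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

print(train_sizes)                # number of training samples at each step
print(train_scores.mean(axis=1))  # average training score at each step
print(val_scores.mean(axis=1))    # average validation score at each step
```

Plotting the two mean-score arrays against `train_sizes` reproduces the kind of curves shown in the slides above.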