Scikit-learn, Pandas, and Kaggle
=====
(2020/12/25 Roy_Chao)

[Kaggle's Titanic part 1](https://colab.research.google.com/drive/1eAmXeDqfIkjc9YQiffwj7ZGZe4R3BCsx?usp=sharing)
[Google Colab File/Folder Structure](https://colab.research.google.com/drive/1UZVmVLPufWLrAoQ7a0ntK38eTpbgj6BG?usp=sharing)
[Scikit-learn practice](https://colab.research.google.com/drive/1xMmmNlAGrKU4aerq98qDQVsPSEnu6t1u?usp=sharing)

Scikit-learn
-----
[Getting Started](https://scikit-learn.org/stable/getting_started.html)
[User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide) NY
[API](https://scikit-learn.org/stable/modules/classes.html#api-ref) NY
[Examples](https://scikit-learn.org/stable/auto_examples/index.html#general-examples) NY
[Tutorials](https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu) NY

[Installation](https://scikit-learn.org/stable/install.html#installation-instructions)
Installation (macOS with conda): `conda install -c conda-forge scikit-learn`
Installation (macOS with pip): `pip install -U scikit-learn`
Installation (Linux with conda): `conda install -c conda-forge scikit-learn`
Installation (Linux with pip3): `pip3 install -U scikit-learn`

Introduction:
- **Estimators**: the algorithms and models used for ML
  `from sklearn.ensemble import RandomForestClassifier`
  `clf = RandomForestClassifier(random_state=0)`
- **Fit**: the method that fits an estimator to data
  `clf.fit(X, y)`
    - Generally accepts 2 inputs
        - Sample matrix **X**
            - Size of X: *(n_samples, n_features)*
        - Target values **y**
            - Regression: real numbers
            - Classification: integers / a discrete set of values
            - Unsupervised: `y` is not needed
            - Usually a 1D array; the i-th entry is the target of the i-th sample (row) of X
- **Predict**: predict target values for a sample matrix (here, the training data itself)
  `clf.predict(X)`
- **Transformers**: expose a `transform` method instead of a `predict` method
  `from sklearn.preprocessing import StandardScaler`
  `StandardScaler().fit(X).transform(X)`
- **Pipelines**: chain transformers and a final estimator into a single object
  `from sklearn.pipeline import make_pipeline`
  `from sklearn.linear_model import LogisticRegression`
  `pipe = make_pipeline(StandardScaler(), LogisticRegression())`
- **Load and Split data**: load a dataset and split it into training and test sets
  `from sklearn.model_selection import train_test_split`
  `from sklearn.datasets import load_iris`
  `X, y = load_iris(return_X_y=True)`
  `X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)`
- **Accuracy score**: fraction of correct predictions on the test set (the pipeline must be fitted first)
  `from sklearn.metrics import accuracy_score`
  `pipe.fit(X_train, y_train)`
  `accuracy_score(pipe.predict(X_test), y_test)`
- **Cross Validation / Model**
  `from sklearn.datasets import make_regression`
  `from sklearn.linear_model import LinearRegression`
  `from sklearn.model_selection import cross_validate`
  `X, y = make_regression(n_samples=1000, random_state=0)`
  `lr = LinearRegression()`
  `result = cross_validate(lr, X, y)`
  `result['test_score']`
- **Automatic parameter search**
  `from scipy.stats import randint`
  `from sklearn.ensemble import RandomForestRegressor`
  `from sklearn.model_selection import RandomizedSearchCV`
  `param_distributions = {'n_estimators': randint(1, 5), 'max_depth': randint(5, 10)}`
  `search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5, param_distributions=param_distributions, random_state=0)`
  `search.fit(X_train, y_train)`
  `search.best_params_`
  `search.score(X_test, y_test)`
- **Classification Report**
    - When a confusion matrix is also wanted:
      ```
      from sklearn.metrics import confusion_matrix, classification_report
      print(confusion_matrix(y_test, predictions))
      print(classification_report(y_test, predictions))
      ```
    - Otherwise:
      ```
      from sklearn.metrics import classification_report
      print(classification_report(y_test, predictions))
      ```
    - Get a specific value:
      ```
      from sklearn.metrics import f1_score
      f1_score(y_true, y_pred, average='macro')
      ```
- Library
    - KNN
      `from sklearn.neighbors import KNeighborsClassifier`
      `neigh = KNeighborsClassifier(n_neighbors=3)`
    - **sklearn.preprocessing**
        - StandardScaler
            - `StandardScaler().fit(X).transform(X)`
            - `pipe = make_pipeline(StandardScaler(), LogisticRegression())`
    - **sklearn.datasets**
        - make_regression
            - `X, y = make_regression(n_samples=1000, random_state=0)`
        - load_iris
            - `X, y = load_iris(return_X_y=True)`
        - fetch_california_housing
            - `X, y = fetch_california_housing(return_X_y=True)`
    - **sklearn.linear_model**
        - LogisticRegression
            - `pipe = make_pipeline(StandardScaler(), LogisticRegression())`
        - LinearRegression
            - `lr = LinearRegression()`
    - **sklearn.model_selection**
        - train_test_split
            - `X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)`
        - RandomizedSearchCV
            - `search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5, param_distributions=param_distributions, random_state=0)`
    - **sklearn.ensemble**
        - RandomForestClassifier
            - `clf = RandomForestClassifier(random_state=0)`
            - `clf.fit(X, y)`
        - RandomForestRegressor
            - `search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5, param_distributions=param_distributions, random_state=0)`
    - **sklearn.pipeline**
        - make_pipeline
            - `pipe = make_pipeline(StandardScaler(), LogisticRegression())`
            - `pipe.fit(X_train, y_train)`
            - `accuracy_score(pipe.predict(X_test), y_test)`
    - **sklearn.metrics**
        - accuracy_score
            - `accuracy_score(pipe.predict(X_test), y_test)`
    - **sklearn.model_selection**
        - cross_validate
            - `result = cross_validate(lr, X, y) # defaults to 5-fold CV`
            - `result['test_score'] # r_squared score is high because dataset is easy`

![](https://i.imgur.com/hIjWXIt.png)

Jupyter Practice: [**Scikit-learn --Introduction_01**](https://colab.research.google.com/drive/1xMmmNlAGrKU4aerq98qDQVsPSEnu6t1u?usp=sharing)
[Scikit-learn pronunciation](http://www.howtopronounce.cc/scikit-learn)

Pandas
-----
[Installation](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)
Installation (macOS with conda): `conda install pandas`
Installation (macOS with pip): `pip install pandas`
Installation (Linux with conda): `conda install pandas`
Installation (Linux with pip3): `pip3 install pandas`

- Import
    - `import pandas as pd`
- Read .csv
    - `train = pd.read_csv("train.csv") # load the data from the local file system`
- Drop column
    - `train = train.drop(['Cabin'], axis=1) # first drop the 'Cabin' column because it has a lot of null values`
    - `X = train.drop(['Survived', 'PassengerId', 'Name', 'Ticket'], axis=1) # drop the irrelevant columns and keep the rest (with inplace=True, drop returns None)`
- Delete empty rows
    - `train = train.dropna() # delete the rows with empty values`
- Fill empty values
    - `train['Age'] = train['Age'].fillna(train['Age'].median()) # impute missing Age values with the median`
- Search and select column
    - `y = train['Survived'] # select the column representing survival`
- Convert non-numerical variables to dummy variables
    - `X = pd.get_dummies(train) # convert non-numerical variables to dummy variables`
- Combine two data sheets
    - `combine = pd.concat([train, test]) # concat takes a list of DataFrames`
- Turn 'a' into 'b' in a column
    - `d = {1:'1st', 2:'2nd', 3:'3rd'} # dictionary to convert Passenger Class from 1,2,3 to 1st,2nd,3rd`
    - `train['Pclass'] = train['Pclass'].map(d) # map the column based on the dictionary`
- Check the cleaned version of the data
    - `train.head()`
- Copy
    - `train_copy = train.copy()`
- Concat
    - ![](https://i.imgur.com/Z6QPjNg.png)
- if-else
    - https://datatofish.com/if-condition-in-pandas-dataframe/
- Show specific lines:
    - .iloc
    - ![](https://i.imgur.com/pSAYQpg.png)

Conda (Environment) (Recommended)
-----
[Installation (Linux)](https://phoenixnap.com/kb/how-to-install-anaconda-ubuntu-18-04-or-20-04)
[Installation (macOS)](https://docs.conda.io/projects/conda/en/4.6.1/user-guide/install/macos.html)

version: `conda -V`
update: `conda update conda`
list env: `conda env list`

create env: `conda create --name myenv python=3.7`
- Ex: `conda create --name py37 python=3.7`

Enter env: `conda activate myenv` (older conda versions: `activate myenv` or `source activate myenv`)
- Ex: `conda activate py37`

Leave env: `conda deactivate`

Install packages: `conda install package_name`
- Ex: `conda install numpy`

Remove packages: `conda remove --name myenv package_name`
- Ex: `conda remove --name py37 numpy`

Remove env: `conda remove --name py37 --all`

Kaggle’s Titanic Competition in 10 Minutes
-----
[Kaggle's Competition in 10 minutes - Part 1](https://towardsdatascience.com/kaggles-titanic-competition-in-10-minutes-part-i-e6d18e59dbce)
- How to import Kaggle's dataset into Google Colab?
    - [Colab note](https://colab.research.google.com/drive/1gNbZ-ffGwdVQGbYYtqGum6_Lniw03IAG?usp=sharing)
    - Kaggle API update: `pip install kaggle --upgrade`
    - Force Kaggle API update (Google Colab): `!pip install --upgrade --force-reinstall --no-deps kaggle`
- Colab Practice:
    - [Kaggle's Competition Part 1](https://colab.research.google.com/drive/1eAmXeDqfIkjc9YQiffwj7ZGZe4R3BCsx?usp=sharing)

[Kaggle's Competition in 10 minutes - Part 2](https://towardsdatascience.com/kaggles-titanic-competition-in-10-minutes-part-ii-3ae626bc6519)
- Not read yet

[Kaggle's Competition in 10 minutes - Part 3](https://towardsdatascience.com/kaggles-titanic-competition-in-10-minutes-part-iii-a492a1a1604f)
- Not read yet

Google Colab folder structure
-----
[Colab testing](https://colab.research.google.com/drive/1UZVmVLPufWLrAoQ7a0ntK38eTpbgj6BG?usp=sharing)
- IPython's shell
    - Use `%cd` instead of `!cd`: `!cd` runs in a temporary subshell, so the directory change does not persist
- Structure
    - / `%cd ..`
    - /root `%cd ~`
    - /content `%cd ../content` **Default Folder**

```
/
total 104
drwxr-xr-x 1 root root 4096 Dec 26 02:52 ./
drwxr-xr-x 1 root root 4096 Dec 26 02:52 ../
drwxr-xr-x 1 root root 4096 Dec 21 17:21 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 boot/
drwxr-xr-x 1 root root 4096 Dec 21 17:29 content/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 datalab/
drwxr-xr-x 5 root root 360 Dec 26 02:52 dev/
-rwxr-xr-x 1 root root 0 Dec 26 02:52 .dockerenv*
drwxr-xr-x 1 root root 4096 Dec 26 02:52 etc/
drwxr-xr-x 2 root root 4096 Apr 24 2018 home/
drwxr-xr-x 1 root root 4096 Dec 21 17:23 lib/
drwxr-xr-x 2 root root 4096 Dec 21 17:14 lib32/
drwxr-xr-x 1 root root 4096 Dec 21 17:14 lib64/
drwxr-xr-x 2 root root 4096 Sep 21 17:14 media/
drwxr-xr-x 2 root root 4096 Sep 21 17:14 mnt/
drwxr-xr-x 1 root root 4096 Dec 21 17:24 opt/
dr-xr-xr-x 110 root root 0 Dec 26 02:52 proc/
drwx------ 1 root root 4096 Dec 26 02:52 root/
drwxr-xr-x 1 root root 4096 Dec 21 17:17 run/
drwxr-xr-x 1 root root 4096 Dec 21 17:21 sbin/
drwxr-xr-x 2 root root 4096 Sep 21 17:14 srv/
drwxr-xr-x 4 root root 4096 Dec 21 17:58 swift/
dr-xr-xr-x 12 root root 0 Dec 26 02:54 sys/
drwxr-xr-x 4 root root 4096 Dec 21 17:54 tensorflow-1.15.2/
drwxrwxrwt 1 root root 4096 Dec 26 02:53 tmp/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 tools/
drwxr-xr-x 1 root root 4096 Dec 21 17:24 usr/
drwxr-xr-x 1 root root 4096 Dec 26 02:52 var/
```

```
/root
total 60
drwx------ 1 root root 4096 Dec 26 02:52 ./
drwxr-xr-x 1 root root 4096 Dec 26 02:52 ../
-r-xr-xr-x 1 root root 1169 Jan 1 2000 .bashrc*
drwxr-xr-x 1 root root 4096 Dec 22 17:20 .cache/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 .config/
drwxr-xr-x 3 root root 4096 Dec 21 17:29 .gsutil/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 .ipython/
drwx------ 2 root root 4096 Dec 22 17:18 .jupyter/
drwxr-xr-x 2 root root 4096 Dec 26 02:52 .keras/
drwx------ 1 root root 4096 Dec 22 17:18 .local/
drwxr-xr-x 4 root root 4096 Dec 22 17:18 .npm/
-rw-r--r-- 1 root root 148 Aug 17 2015 .profile
```

Questions
-----
1. What's the difference between TensorFlow and scikit-learn?
    - Machine Learning:
        - scikit-learn (high-level library)
    - Deep Learning:
        - TensorFlow (low-level library)
            - TensorFlow (lower-level)
            - Keras (high-level)
        - PyTorch (lower-level library)
            - PyTorch (lower-level)
            - `torch.nn` (high-level library)
2. What can scikit-learn do?
    - Classification
    - Regression
    - Clustering
    - Dimensionality reduction
    - Model selection
    - Preprocessing
3.

Log
-----
2020.12.25
- Scikit-learn
    - Getting Started
    - [Colab Practice 1](https://colab.research.google.com/drive/1xMmmNlAGrKU4aerq98qDQVsPSEnu6t1u?usp=sharing)
- Conda
- Kaggle's Titanic Competition in 10 minutes I
    - [Colab Import data from Kaggle](https://colab.research.google.com/drive/1gNbZ-ffGwdVQGbYYtqGum6_Lniw03IAG?usp=sharing)
- Question 1, 2

2020.12.26
- Pandas
- Kaggle's Titanic Competition in 10 minutes I
    - [Colab Practice 2](https://colab.research.google.com/drive/1eAmXeDqfIkjc9YQiffwj7ZGZe4R3BCsx?usp=sharing)
- Figure out the structure of Colab
    - [Colab test](https://colab.research.google.com/drive/1UZVmVLPufWLrAoQ7a0ntK38eTpbgj6BG?usp=sharing)

Not-yet:
- Kaggle's Titanic Competition in 10 minutes II
- Kaggle's Titanic Competition in 10 minutes III

Vocabulary
-----
- imputed
    - (v.) to estimate (for example, a missing value)
    - impute sth to sb: to attribute (or blame) something to someone
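
To make the vocabulary entry concrete: in the Pandas notes, "imputing" the missing `Age` values means estimating them from the known ones, there with the column median. The sketch below is a minimal, self-contained illustration of that step followed by the dummy-variable conversion. The small DataFrame is made-up stand-in data (not rows from Kaggle's `train.csv`); only the column names follow the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for Kaggle's train.csv (values are invented)
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, None],
})

# Impute: replace each missing Age with the median of the known ages
median_age = train["Age"].median()
train["Age"] = train["Age"].fillna(median_age)

# Separate the target from the features, then one-hot encode the
# non-numerical column ("Sex") into dummy variables
y = train["Survived"]
X = pd.get_dummies(train.drop(["Survived"], axis=1))

print(median_age)
print(X.columns.tolist())
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on `train['Age']`, is a stylistic choice made here because it does not depend on whether the column selection is a view or a copy.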