Scikit-learn, Pandas, and Kaggle
=====
(2020/12/25 Roy_Chao)
[Kaggle's Titanic part 1](https://colab.research.google.com/drive/1eAmXeDqfIkjc9YQiffwj7ZGZe4R3BCsx?usp=sharing)
[Google Colab File/Folder Structure](https://colab.research.google.com/drive/1UZVmVLPufWLrAoQ7a0ntK38eTpbgj6BG?usp=sharing)
[Scikit-learn practice](https://colab.research.google.com/drive/1xMmmNlAGrKU4aerq98qDQVsPSEnu6t1u?usp=sharing)
Scikit-learn
-----
[Getting Started](https://scikit-learn.org/stable/getting_started.html)
[User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide) (not yet read)
[API](https://scikit-learn.org/stable/modules/classes.html#api-ref) (not yet read)
[Examples](https://scikit-learn.org/stable/auto_examples/index.html#general-examples) (not yet read)
[Tutorials](https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu) (not yet read)
[Installation](https://scikit-learn.org/stable/install.html#installation-instructions)
Installation (macOS with conda):`conda install -c conda-forge scikit-learn`
Installation (macOS with pip):`pip install -U scikit-learn`
Installation (Linux with conda):`conda install -c conda-forge scikit-learn`
Installation (Linux with pip3):`pip3 install -U scikit-learn`
Introduction:
- **Estimators**: the algorithms and models for ML
`from sklearn.ensemble import RandomForestClassifier`
`clf = RandomForestClassifier(random_state=0)`
- **Fit**: method that trains an estimator on data
`clf.fit(X, y)`
    - Generally accepts 2 inputs
        - Sample matrix **X**
            - shape of `X`: *(n_samples, n_features)*
        - Target values **y**
            - Regression: real numbers
            - Classification: integers or any other discrete set of values
            - Unsupervised learning: `y` does not need to be specified
    - `y` is usually a 1D array; its i-th entry is the target of the i-th sample (row) of `X`
- **Predict**: predicts target values for new data; the data does not have to be the training data
`clf.predict(X)`
- **Transformers**: preprocessing objects that provide a `transform` method instead of a `predict` method
`from sklearn.preprocessing import StandardScaler`
`StandardScaler().fit(X).transform(X)`
- **Pipelines**: chain transformers and a final estimator into a single object
`from sklearn.pipeline import make_pipeline`
`from sklearn.linear_model import LogisticRegression`
`pipe = make_pipeline(StandardScaler(), LogisticRegression())`
- **Load and split data**: load a dataset and split it into training and test sets
`from sklearn.model_selection import train_test_split`
`from sklearn.datasets import load_iris`
`X, y = load_iris(return_X_y=True)`
`X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)`
- **Accuracy score**: fraction of predictions that match the true labels
`from sklearn.metrics import accuracy_score`
`pipe.fit(X_train, y_train)`
`accuracy_score(y_test, pipe.predict(X_test))`
- **Cross-validation**: evaluate a model on several train/test splits of the data
`from sklearn.datasets import make_regression`
`from sklearn.linear_model import LinearRegression`
`from sklearn.model_selection import cross_validate`
`X, y = make_regression(n_samples=1000, random_state=0)`
`lr = LinearRegression()`
`result = cross_validate(lr, X, y)`
`result['test_score']`
- **Automatic parameter search**: sample parameter combinations at random and keep the best one
`from scipy.stats import randint`
`from sklearn.ensemble import RandomForestRegressor`
`from sklearn.model_selection import RandomizedSearchCV`
`param_distributions = {'n_estimators': randint(1, 5), 'max_depth': randint(5, 10)}`
`search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5, param_distributions=param_distributions, random_state=0)`
`search.fit(X_train, y_train)`
`search.best_params_`
`search.score(X_test, y_test)`
- **Classification Report**
- With a confusion matrix:
```
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```
- Without a confusion matrix:
```
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
```
- Get a specific value:
```
from sklearn.metrics import f1_score
f1_score(y_true, y_pred, average='macro')
```
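
The three report snippets above assume that `y_test` and `predictions` already exist. A minimal sketch of where they come from is shown here; the iris dataset and the `KNeighborsClassifier` are choices made for this example only, not something the notes prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Example data (an assumption for this sketch): 150 iris samples, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# `predictions` is simply the output of predict() on the held-out test set.
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
predictions = neigh.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)

# Per-class precision, recall, F1 and support as a text table.
print(classification_report(y_test, predictions))

# A single number: the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_test, predictions, average="macro")
print(macro_f1)
```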
- Library reference
- **sklearn.neighbors**
- KNeighborsClassifier (KNN)
- `from sklearn.neighbors import KNeighborsClassifier`
- `neigh = KNeighborsClassifier(n_neighbors=3)`
- **sklearn.preprocessing**
- StandardScaler
- `StandardScaler().fit(X).transform(X)`
- `pipe = make_pipeline( StandardScaler(),LogisticRegression() )`
- **sklearn.datasets**
- make_regression
- `X, y = make_regression(n_samples=1000, random_state=0)`
- load_iris
- `X, y = load_iris(return_X_y=True)`
- fetch_california_housing
- `X, y = fetch_california_housing(return_X_y=True)`
- **sklearn.linear_model**
- LogisticRegression
- `pipe = make_pipeline( StandardScaler(),LogisticRegression() )`
- LinearRegression
- `lr = LinearRegression()`
- **sklearn.model_selection**
- train_test_split
- `X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)`
- RandomizedSearchCV
- `search = RandomizedSearchCV( estimator=RandomForestRegressor(random_state=0),n_iter=5,param_distributions=param_distributions,random_state=0)`
- **sklearn.ensemble**
- RandomForestClassifier
- `clf = RandomForestClassifier(random_state=0)`
- `clf.fit(X, y)`
- RandomForestRegressor
- `search = RandomizedSearchCV( estimator=RandomForestRegressor(random_state=0),n_iter=5,param_distributions=param_distributions,random_state=0)`
- **sklearn.pipeline**
- make_pipeline
- `pipe = make_pipeline(StandardScaler(), LogisticRegression())`
- `pipe.fit(X_train, y_train)`
- `accuracy_score(y_test, pipe.predict(X_test))`
- **sklearn.metrics**
- accuracy_score
- `accuracy_score(y_test, pipe.predict(X_test)) # argument order is (y_true, y_pred)`
- **sklearn.model_selection** (continued)
- cross_validate
- `result = cross_validate(lr, X, y) # defaults to 5-fold CV`
- `result['test_score'] # the R² scores are high because the dataset is easy`
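
The individual snippets for loading, splitting, scaling, fitting, and scoring can be combined into one short script. This is a minimal sketch that follows the scikit-learn Getting Started flow on the iris dataset; the variable names `y_pred` and `acc` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X has shape (n_samples, n_features); y is a 1D array with one label per row of X.
X, y = load_iris(return_X_y=True)

# Hold out part of the data (25% by default) so the score is measured on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain a transformer and an estimator; fit() scales the data and then trains the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# predict() applies the scaling learned on the training set before classifying.
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)

print(X.shape, y.shape)
print(f"test accuracy: {acc:.3f}")
```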

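Cross-validation and randomized parameter search can be tried together in one runnable script. The sketch below uses a small synthetic dataset from `make_regression` (an assumption made so that the example runs offline and quickly); the parameter ranges are the ones listed in the notes.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, cross_validate, train_test_split

# Synthetic regression data; sizes and noise level are arbitrary choices for this sketch.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# cross_validate defaults to 5-fold CV and returns one score per fold (R^2 for regressors).
result = cross_validate(LinearRegression(), X, y)
print(result["test_score"])

# Randomized search: sample 5 parameter combinations and keep the best one.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_distributions = {
    "n_estimators": randint(1, 5),  # integers 1..4 (the upper bound is exclusive)
    "max_depth": randint(5, 10),    # integers 5..9
}
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # R^2 of the refitted best model on the test set
```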
Jupyter Practice:
[**Scikit-learn --Introduction_01**](https://colab.research.google.com/drive/1xMmmNlAGrKU4aerq98qDQVsPSEnu6t1u?usp=sharing)
[Scikit-learn pronunciation](http://www.howtopronounce.cc/scikit-learn)
Pandas
-----
[Installation](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)
Installation (macOS with conda):`conda install pandas`
Installation (macOS with pip):`pip install pandas`
Installation (Linux with conda):`conda install pandas`
Installation (Linux with pip3):`pip3 install pandas`
- Import
- `import pandas as pd`
- Read .csv
- `train = pd.read_csv("train.csv") # load the data from disk`
- Drop columns
- `train = train.drop(columns=['Cabin']) # drop 'Cabin' first because it has a lot of null values`
- `X = train.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket']) # drop the irrelevant columns and keep the rest`
- Delete rows with empty values
- `train = train.dropna() # delete the rows with empty values`
- Fill empty values
- `train['Age'] = train['Age'].fillna(train['Age'].median()) # impute missing Age values with the median`
- Select a column
- `y = train['Survived'] # select the column representing survival`
- Convert non-numerical variables to dummy variables
- `X = pd.get_dummies(train) # convert non-numerical variables to dummy variables`
- Combine two DataFrames
- `combine = pd.concat([train, test]) # pd.concat takes a list of DataFrames`
- Map values in a column (turn 'a' into 'b')
- `d = {1: '1st', 2: '2nd', 3: '3rd'} # dictionary that converts passenger class 1, 2, 3 to 1st, 2nd, 3rd`
- `train['Pclass'] = train['Pclass'].map(d) # map the column through the dictionary`
- Check the cleaned version of the data
- `train.head()`
- Copy
- `train_copy = train.copy()`
- Concat
- 
- if-else
- https://datatofish.com/if-condition-in-pandas-dataframe/
- Show specific rows:
- `.iloc` selects rows by integer position, e.g. `train.iloc[0:5]`
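
The pandas operations listed above can be tried without downloading the Kaggle files. The sketch below builds a tiny, invented stand-in for `train.csv` (the rows and the extra `IsAdult` column are hypothetical, for illustration only) and applies drop, fill, map, an if-else with `np.where`, dummy encoding, `.iloc`, and `pd.concat` in sequence.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the Titanic train.csv (values invented for illustration).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

train = train.drop(columns=["Cabin"])                      # mostly null, so drop it
train["Age"] = train["Age"].fillna(train["Age"].median())  # impute missing ages
train["Pclass"] = train["Pclass"].map({1: "1st", 2: "2nd", 3: "3rd"})

# if-else on a column: np.where(condition, value_if_true, value_if_false)
train["IsAdult"] = np.where(train["Age"] >= 18, "yes", "no")

y = train["Survived"]                                      # target column
X = pd.get_dummies(train.drop(columns=["Survived", "PassengerId"]))

# .iloc selects by integer position: here the first two rows.
first_two = train.iloc[0:2]

# pd.concat takes a list of DataFrames and stacks them row-wise by default.
combined = pd.concat([first_two, train.iloc[2:4]])

print(X.columns.tolist())
print(combined.shape)
```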
Conda (Environment) (Recommended)
-----
[Installation(Linux)](https://phoenixnap.com/kb/how-to-install-anaconda-ubuntu-18-04-or-20-04)
[Installation(MacOS)](https://docs.conda.io/projects/conda/en/4.6.1/user-guide/install/macos.html)
version:`conda -V`
update:`conda update conda`
list env:`conda env list`
.
create env:`conda create --name myenv python=3.7`
- Ex:`conda create --name py37 python=3.7`
Enter env:`conda activate myenv` (conda 4.4 or later; older versions use `activate myenv` or `source activate myenv`)
- Ex:`conda activate py37`
Leave env:`conda deactivate`
.
Install packages:`conda install package_name`
- Ex:`conda install numpy`
Remove packages:`conda remove --name myenv package_name`
- Ex:`conda remove --name py37 numpy`
Remove env:`conda remove --name py37 --all`
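
After activating an environment, it can be useful to confirm from inside Python which one is actually in use. This is a small sketch, assuming conda's usual behaviour of setting the `CONDA_DEFAULT_ENV` environment variable on activation; the fallback string is a placeholder chosen for this example.

```python
import os
import sys

# conda sets CONDA_DEFAULT_ENV when an environment is activated;
# the variable is absent if no conda environment is active in this shell.
env_name = os.environ.get("CONDA_DEFAULT_ENV", "(no conda env active)")

print("active conda env:", env_name)
print("interpreter:", sys.executable)               # path of the running python binary
print("environment prefix:", sys.prefix)            # root folder of the environment
print("python version:", sys.version.split()[0])    # e.g. 3.7.x inside a py37 env
```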
Kaggle’s Titanic Competition in 10 Minutes
-----
[Kaggle's Competition in 10 minutes - Part 1](https://towardsdatascience.com/kaggles-titanic-competition-in-10-minutes-part-i-e6d18e59dbce)
- How to import Kaggle's dataset into Google Colab?
- [Colab note](https://colab.research.google.com/drive/1gNbZ-ffGwdVQGbYYtqGum6_Lniw03IAG?usp=sharing)
- Kaggle API update: `pip install kaggle --upgrade`
- Force Kaggle API update (Google Colab): `!pip install --upgrade --force-reinstall --no-deps kaggle`
- Colab Practice:
- [Kaggle's Competition Part 1](https://colab.research.google.com/drive/1eAmXeDqfIkjc9YQiffwj7ZGZe4R3BCsx?usp=sharing)
[Kaggle's Competition in 10 minutes - Part 2](https://towardsdatascience.com/kaggles-titanic-competition-in-10-minutes-part-ii-3ae626bc6519)
- Not read yet
[Kaggle's Competition in 10 minutes - Part 3](https://towardsdatascience.com/kaggles-titanic-competition-in-10-minutes-part-iii-a492a1a1604f)
- Not read yet
Google Colab folder structure
-----
[Colab testing](https://colab.research.google.com/drive/1UZVmVLPufWLrAoQ7a0ntK38eTpbgj6BG?usp=sharing)
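- IPython's shell
- use `%cd` instead of `!cd`: each `!` command runs in its own subshell, so a directory change made with `!cd` does not persist
- structure
- / `%cd ..` (from /content)
- /root `%cd ~`
- /content `%cd ../content` **Default folder**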
```
/
total 104
drwxr-xr-x 1 root root 4096 Dec 26 02:52 ./
drwxr-xr-x 1 root root 4096 Dec 26 02:52 ../
drwxr-xr-x 1 root root 4096 Dec 21 17:21 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 boot/
drwxr-xr-x 1 root root 4096 Dec 21 17:29 content/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 datalab/
drwxr-xr-x 5 root root 360 Dec 26 02:52 dev/
-rwxr-xr-x 1 root root 0 Dec 26 02:52 .dockerenv*
drwxr-xr-x 1 root root 4096 Dec 26 02:52 etc/
drwxr-xr-x 2 root root 4096 Apr 24 2018 home/
drwxr-xr-x 1 root root 4096 Dec 21 17:23 lib/
drwxr-xr-x 2 root root 4096 Dec 21 17:14 lib32/
drwxr-xr-x 1 root root 4096 Dec 21 17:14 lib64/
drwxr-xr-x 2 root root 4096 Sep 21 17:14 media/
drwxr-xr-x 2 root root 4096 Sep 21 17:14 mnt/
drwxr-xr-x 1 root root 4096 Dec 21 17:24 opt/
dr-xr-xr-x 110 root root 0 Dec 26 02:52 proc/
drwx------ 1 root root 4096 Dec 26 02:52 root/
drwxr-xr-x 1 root root 4096 Dec 21 17:17 run/
drwxr-xr-x 1 root root 4096 Dec 21 17:21 sbin/
drwxr-xr-x 2 root root 4096 Sep 21 17:14 srv/
drwxr-xr-x 4 root root 4096 Dec 21 17:58 swift/
dr-xr-xr-x 12 root root 0 Dec 26 02:54 sys/
drwxr-xr-x 4 root root 4096 Dec 21 17:54 tensorflow-1.15.2/
drwxrwxrwt 1 root root 4096 Dec 26 02:53 tmp/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 tools/
drwxr-xr-x 1 root root 4096 Dec 21 17:24 usr/
drwxr-xr-x 1 root root 4096 Dec 26 02:52 var/
```
```
/root
total 60
drwx------ 1 root root 4096 Dec 26 02:52 ./
drwxr-xr-x 1 root root 4096 Dec 26 02:52 ../
-r-xr-xr-x 1 root root 1169 Jan 1 2000 .bashrc*
drwxr-xr-x 1 root root 4096 Dec 22 17:20 .cache/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 .config/
drwxr-xr-x 3 root root 4096 Dec 21 17:29 .gsutil/
drwxr-xr-x 1 root root 4096 Dec 22 17:18 .ipython/
drwx------ 2 root root 4096 Dec 22 17:18 .jupyter/
drwxr-xr-x 2 root root 4096 Dec 26 02:52 .keras/
drwx------ 1 root root 4096 Dec 22 17:18 .local/
drwxr-xr-x 4 root root 4096 Dec 22 17:18 .npm/
-rw-r--r-- 1 root root 148 Aug 17 2015 .profile
```
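
The reason `%cd` works where `!cd` does not can be reproduced in plain Python. The sketch below is an analogy rather than Colab's actual implementation: `subprocess.run` stands in for `!cd` (a child shell) and `os.chdir` stands in for `%cd` (the current process). The temporary directory is used only as a convenient existing target.

```python
import os
import subprocess
import tempfile

start = os.getcwd()
target = tempfile.gettempdir()  # any existing directory works; typically /tmp on Linux

# Like `!cd`: the `cd` runs in a child shell, which exits right away,
# so the working directory of this Python process is unchanged.
subprocess.run(f'cd "{target}"', shell=True, check=True)
after_subshell = os.getcwd()

# Like `%cd`: os.chdir changes the working directory of the current process.
os.chdir(target)
after_chdir = os.getcwd()

print(start, after_subshell, after_chdir)

os.chdir(start)  # restore the original directory
```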
Questions
-----
1. What's the difference between TensorFlow and scikit-learn?
    - Machine learning:
        - scikit-learn (high-level library)
    - Deep learning:
        - TensorFlow (lower-level library)
            - TensorFlow (lower-level)
            - Keras (high-level)
        - PyTorch (lower-level library)
            - PyTorch (lower-level)
            - `torch.nn` (high-level)
2. What can scikit-learn do?
    - Classification
    - Regression
    - Clustering
    - Dimensionality reduction
    - Model selection
    - Preprocessing
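
Clustering and dimensionality reduction are the two items in this list that the notes have no snippet for. A minimal sketch of both, assuming the iris data and the `KMeans` and `PCA` estimators as examples (neither is mentioned elsewhere in these notes), shows that they use the same fit/transform/predict style as the supervised estimators:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels are ignored: both steps are unsupervised

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the samples into 3 clusters without using any target values.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape)
print(sorted(set(labels.tolist())))
```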
3.
Log
-----
2020.12.25
- Scikit-learn
- Getting Started
- [Colab Practice 1](https://colab.research.google.com/drive/1xMmmNlAGrKU4aerq98qDQVsPSEnu6t1u?usp=sharing)
- Conda
- Kaggle's Titanic Competition in 10 minutes I
- [Colab Import data from Kaggle](https://colab.research.google.com/drive/1gNbZ-ffGwdVQGbYYtqGum6_Lniw03IAG?usp=sharing)
- Question 1, 2
2020.12.26
- Pandas
- Kaggle's Titanic Competition in 10 minutes I
- [Colab Practice 2](https://colab.research.google.com/drive/1eAmXeDqfIkjc9YQiffwj7ZGZe4R3BCsx?usp=sharing)
- Figure out the structure of Colab
- [Colab test](https://colab.research.google.com/drive/1UZVmVLPufWLrAoQ7a0ntK38eTpbgj6BG?usp=sharing)
Not-yet:
- Kaggle's Titanic Competition in 10 minutes II
- Kaggle's Titanic Competition in 10 minutes III
Vocabulary
-----
- imputed
- (v.) to estimate (e.g. to fill in a missing value with an estimate)
- impute sth to sb: to attribute something (often blame) to someone