Machine Learning Day001 ~ 016

###### tags: `Machine Learning` `Python` `Author:John Chen`

# Machine Learning Day001 ~ 016

---

## Day001: Data Introduction and Data Evaluation
![](https://i.imgur.com/Zv90yHh.png)
![](https://i.imgur.com/TO5kiw6.png)

### Day1 Homework

---

Evaluation metrics for regression problems: compute MAE (Mean Absolute Error) and MSE (Mean Squared Error).

### **Addition:**
* [**Regression analysis**](https://zh.wikipedia.org/wiki/%E8%BF%B4%E6%AD%B8%E5%88%86%E6%9E%90)
* [**Linear regression**](https://zh.wikipedia.org/wiki/%E7%B7%9A%E6%80%A7%E5%9B%9E%E6%AD%B8)
* [**Standard score**](https://zh.wikipedia.org/wiki/%E6%A8%99%E6%BA%96%E5%88%86%E6%95%B8)
* [**What is the difference between Data Scientist, Data Analyst, and Data Engineer? (English)**](https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer)
* [**What is the difference between Data Scientist, Data Analyst, and Data Engineer? (Simplified Chinese)**](https://www.zhihu.com/question/23946233)

![](https://i.imgur.com/INKJuRn.png)

---

## Day002: Data Cleaning and Preprocessing --- EDA-1 / Reading Data
![](https://i.imgur.com/RT3vNmL.png)
![](https://i.imgur.com/zbfCPfH.png)
![](https://i.imgur.com/bhZH8Z7.png)

### Addition:
* [**CSV data handling reference**](https://bookdata.readthedocs.io/en/latest/base/01_pandas.html#pandas-%E5%AF%B9%E8%B1%A1)
* [**What is Exploratory Data Analysis?**](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)

---

## Day003

### **Addition:**
* [**try & except**](http://www.runoob.com/python/python-exceptions.html)
* [**Pandas Exercise**](https://github.com/sunnike/pandas_exercises)

---

## Day004
![](https://i.imgur.com/2sA2jnw.png)
![](https://i.imgur.com/4SR4hc4.png)

:::info
If you're new to Machine Learning, you might get confused between these two: Label Encoder and One Hot Encoder. These two encoders are part of the scikit-learn library in Python, and they are used to convert categorical data, or text data, into numbers, which our predictive models can better understand. Today, let's understand the difference between the two with a simple example.

### Label Encoding
To begin with, see the scikit-learn documentation for LabelEncoder.
Now, let's consider the following data:

![](https://i.imgur.com/aGQQSZy.png)

In this example, the first column is the country column, which is all text. As you might know by now, we can't have text in our data if we're going to run any kind of model on it. So before we can run a model, we need to make this data ready for the model.

To convert this kind of categorical text data into model-understandable numerical data, we use the LabelEncoder class. To label encode the first column, import the LabelEncoder class from the sklearn library, fit and transform the first column of the data, and then replace the existing text data with the new encoded data. Let's have a look at the code.

```python=
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
# Replace the text in column 0 with integer codes
x[:, 0] = labelencoder.fit_transform(x[:, 0])
```

We've assumed that the data is in a variable called 'x'. After running this piece of code, if you check the value of x, you'll see that the three countries in the first column have been replaced by the numbers 0, 1, and 2.

![](https://i.imgur.com/a1zaPYk.png)

That's all label encoding is about. But depending on the data, label encoding introduces a new problem. For example, we have encoded a set of country names into numerical data. This is actually categorical data and there is no relation, of any kind, between the rows.

==The problem here is, since there are different numbers in the same column, the model will misunderstand the data to be in some kind of order, 0 < 1 < 2. But this isn't the case at all. To overcome this problem, we use One Hot Encoder.==

### One Hot Encoder
If you're interested, see the scikit-learn documentation for OneHotEncoder.

Now, as we already discussed, depending on the data we have, we might run into situations where, after label encoding, we might confuse our model into thinking that a column has data with some kind of order or hierarchy, when we clearly don't have it.
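
A minimal sketch of the label encoding step described above, assuming a small hypothetical array `x` (the country, age, and salary values are made up for illustration and are not the data in the screenshot):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: country, age, salary (values invented for illustration).
x = np.array(
    [["France", 44, 72000],
     ["Spain", 27, 48000],
     ["Germany", 30, 54000],
     ["Spain", 38, 61000]],
    dtype=object,
)

labelencoder = LabelEncoder()
codes = labelencoder.fit_transform(x[:, 0])
x[:, 0] = codes

# classes_ is sorted, so each label's code is its index in this array.
print(list(labelencoder.classes_))
print(codes.tolist())
```

The integer codes follow the alphabetical order of the labels, so Spain (2) looks "larger" than France (0) even though the countries have no such order. A model that treats the column as numeric may pick up this implied ranking.
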
To avoid implying an order that isn't there, we one-hot encode that column. One hot encoding takes a column of categorical data (here, one that has already been label encoded) and splits it into multiple columns. The numbers are replaced by 1s and 0s, depending on which column has what value. In our example, we'll get three new columns, one for each country: France, Germany, and Spain.

For rows which have the first column value as France, the 'France' column will have a '1' and the other two columns will have '0's. Similarly, for rows which have the first column value as Germany, the 'Germany' column will have a '1' and the other two columns will have '0's.

The Python code for one hot encoding is also pretty simple:

```python=
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed in scikit-learn 0.22; pick column 0 via ColumnTransformer
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)
```

As you can see, we specify which column has to be one hot encoded, [0] in this case. Then we fit and transform the array 'x' with the transformer we just created. And that's it, we now have three new columns in our dataset:

![](https://i.imgur.com/gQoyApK.png)

As you can see, we have three new columns with 1s and 0s, depending on the country that the rows represent.

So, that's the difference between Label Encoding and One Hot Encoding.
:::

![](https://i.imgur.com/wcvTF7u.png)

### Addition:
[**Label Encoder vs.
One Hot Encoder in Machine Learning**](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621)

```python=
import os
import numpy as np
import pandas as pd

# Set the data path and read app_train
dir_data = './data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train = pd.read_csv(f_app_train)

# Homework
# One-hot encode the data slice sub_train below, then compare the number of
# columns (via shape) and the column names (via head) before and after
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
print(sub_train.head())

sub_train = pd.get_dummies(sub_train)
print(sub_train.shape)
print(sub_train.head())
```

---

## Day005: EDA: Data Distribution
![](https://i.imgur.com/X7cop1Q.png)
![](https://i.imgur.com/EaTvzBU.png)
![](https://i.imgur.com/glcACAo.png)

### [**matplotlib**](https://matplotlib.org/gallery/index.html)
### [**seaborn**](https://seaborn.pydata.org/examples/index.html)

### **Addition:**
* [**What is standard deviation**](https://zh.wikipedia.org/wiki/%E6%A8%99%E6%BA%96%E5%B7%AE)
* [**Standard Statistical Distributions**](https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/statistical-distributions)
* [**List of probability distributions**](https://en.wikipedia.org/wiki/List_of_probability_distributions)

---

## Day006
![](https://i.imgur.com/5sl9f2D.png)
![](https://i.imgur.com/DEXcKng.png)

### **Addition:**
* [**Quartile**](https://zh.wikipedia.org/wiki/%E5%9B%9B%E5%88%86%E4%BD%8D%E6%95%B0)
* [**Normal distribution**](https://zh.wikipedia.org/wiki/%E6%AD%A3%E6%80%81%E5%88%86%E5%B8%83)
* [**Empirical distribution function**](https://zh.wikipedia.org/wiki/%E7%BB%8F%E9%AA%8C%E5%88%86%E5%B8%83%E5%87%BD%E6%95%B0)
* [**How to Use Statistics to Identify Outliers in Data**](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/)
* [**Ways to Detect and Remove the Outliers**](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)
* [**Box plot example**](http://estat.ncku.edu.tw/nsc/flash/topic/graph_stat/base/BoxPlot.html)
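
The quartile and box plot references above point at the common 1.5 × IQR rule for flagging outliers. A minimal sketch of that rule, assuming a hypothetical sample (the numbers are invented, and 1.5 is the conventional multiplier rather than anything these notes prescribe):

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype=float)

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged, as in a box plot.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr)
print(outliers)
```

Whether a flagged point should be removed, clipped, or kept depends on the data; the rule only marks candidates for a closer look.
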
---

## Day007: Common Value Replacement: Median and Quantiles; Standardizing Continuous Values
![](https://i.imgur.com/hbilwxG.png)
![](https://i.imgur.com/kmafqkx.png)
![](https://i.imgur.com/LBsVOqH.png)

:::info
### Some times when normalizing is good:
1) Several algorithms, in particular SVMs come to mind, can sometimes converge far faster on normalized data (although why, precisely, I can't recall).
2) When your model is sensitive to magnitude, and the units of two different features are different, and arbitrary. This is like the case you suggest, in which something gets more influence than it should.

But of course -- not all algorithms are sensitive to magnitude in the way you suggest. Linear regression predictions will be identical if you do, or don't, scale your data (the coefficients simply rescale to compensate), because it's looking at proportional relationships between them.

### Some times when normalizing is bad:
1) When you want to interpret your coefficients, and they don't normalize well. Regression on something like dollars gives you a meaningful outcome. Regression on proportion-of-maximum-dollars-in-sample might not.
2) When, in fact, the units on your features are meaningful, and distance does make a difference! Back to SVMs -- if you're trying to find a max-margin classifier, then the units that go into that 'max' matter. Scaling features for clustering algorithms can substantially change the outcome. Imagine four clusters around the origin, each one in a different quadrant, all nicely scaled. Now, imagine the y-axis being stretched to ten times the length of the x-axis. Instead of four little quadrant-clusters, you're going to get the long squashed baguette of data chopped into four pieces along its length! (And, the important part is, you might prefer either of these!)

In what I'm sure is an unsatisfying summary, the most general answer is that you need to ask yourself seriously what makes sense with the data, and model, you're using.
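
As a concrete illustration of the operations this day covers (median fill, then standardization), here is a minimal sketch assuming a hypothetical pandas Series; the values are invented, and min-max scaling is shown only as one common alternative to the z-score:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series([1.0, 2.0, np.nan, 4.0, 13.0])

# Fill the gap with the median of the observed values.
filled = s.fillna(s.median())

# Z-score standardization: zero mean, unit (population) standard deviation.
z = (filled - filled.mean()) / filled.std(ddof=0)

# Min-max scaling: map the observed range onto [0, 1].
mm = (filled - filled.min()) / (filled.max() - filled.min())

print(filled.tolist())
print(z.round(3).tolist())
print(mm.round(3).tolist())
```

Both transforms change only the scale of the column, not the ordering of its values, which is why the choice matters mainly for magnitude-sensitive models.
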
:::

Key takeaways: sometimes it helps, sometimes it doesn't (this is still debated; for reference only)

Good
* Some algorithms (e.g. SVM, DL) are sensitive to weights, or benefit from a smoother loss function
* The magnitudes of the features differ greatly

Bad
* Some metrics, such as correlation, are not suited to a standardized space
* The units of measurement are meaningful for some features

---

## Day008: Data Cleaning and Preprocessing---Common DataFrame Operations
![](https://i.imgur.com/wBy9ANf.png)
![](https://i.imgur.com/7dtfHoD.png)
![](https://i.imgur.com/tQLkUtC.png)
![](https://i.imgur.com/2wNkvwX.png)
![](https://i.imgur.com/DzxinUY.png)

### Addition:
* [**groupby.apply example**](https://blog.csdn.net/u011523796/article/details/74360772)
* [**cut & qcut**](https://medium.com/@morris_tai/pandas%E7%9A%84cut-qcut%E5%87%BD%E6%95%B8-93c244e34cfc)
* [**pandas.DataFrame.boxplot**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html)
* [**matplotlib.pyplot.boxplot**](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html#matplotlib.pyplot.boxplot)

---

## Day009: Exploratory Data Analysis---Introduction to Correlation Coefficients
![](https://i.imgur.com/MJ1xJGn.png)
![](https://i.imgur.com/kRg7GoW.png)
![](https://i.imgur.com/SBzsrbP.png)
![](https://i.imgur.com/qbHYQjX.png)

### Addition:
* [**Random variable**](https://zh.wikipedia.org/wiki/%E9%9A%8F%E6%9C%BA%E5%8F%98%E9%87%8F)
* [**Derivation of the correlation coefficient**](https://www.itread01.com/content/1550172997.html)

---

## Day010: Exploratory Data Analysis---Correlation Coefficients in Practice
![](https://i.imgur.com/9gxePlh.png)
![](https://i.imgur.com/42ir4zg.png)
![](https://i.imgur.com/CNMC55T.png)
![](https://i.imgur.com/c67hxql.png)
![](https://i.imgur.com/gE70Mt9.png)

### Addition:
* [**Covariance (COV)**](https://wiki.mbalib.com/zh-tw/%E5%8D%8F%E6%96%B9%E5%B7%AE)

---

## Day011: Exploratory Data Analysis - Plotting and Styles & Kernel Density Estimation (KDE)
![](https://i.imgur.com/TD5Y53E.png)
![](https://i.imgur.com/i0GPOQq.png)
![](https://i.imgur.com/2nK7BkQ.png)
![](https://i.imgur.com/Ac9PLJF.png)
![](https://i.imgur.com/feQF1NB.png)

### Addition:
* [**KDE**](https://blog.csdn.net/unixtch/article/details/78556499)
* [**KDE example**](https://scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html#sphx-glr-auto-examples-neighbors-plot-kde-1d-py)

---

## Day012: Exploratory Data Analysis - Discretization and EDA
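
A minimal sketch of discretizing a continuous column with the pandas `cut` and `qcut` functions referenced under Day008, assuming a hypothetical `ages` Series (the values, bin edges, and labels are invented for illustration):

```python
import pandas as pd

# Hypothetical continuous variable.
ages = pd.Series([3, 17, 25, 38, 41, 59, 64, 80])

# Equal-width binning: 4 intervals of the same width across the range.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: 4 intervals holding the same number of rows.
equal_freq = pd.qcut(ages, q=4)

# Custom edges with readable labels; intervals are right-closed by default.
custom = pd.cut(
    ages,
    bins=[0, 18, 40, 65, 100],
    labels=["child", "young", "middle", "senior"],
)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
print(custom.tolist())
```

Equal-width bins can end up nearly empty when the data is skewed, while equal-frequency bins keep group sizes balanced at the cost of uneven interval widths.
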
![](https://i.imgur.com/C3y35ow.png)
![](https://i.imgur.com/bRA6nnB.png)
![](https://i.imgur.com/G7y68qg.png)
![](https://i.imgur.com/rCC3KF9.png)
![](https://i.imgur.com/qoAlInM.png)

### Addition:
* [**Discretizing continuous features: when does it give better results? (Zhihu)**](https://www.zhihu.com/question/31989952)

---

## Day013: Exploratory Data Analysis - Discretizing Continuous Variables
![](https://i.imgur.com/9UA4vst.png)
![](https://i.imgur.com/bVPR59G.png)

---

## Day014:
![](https://i.imgur.com/hprN9Nu.png)
![](https://i.imgur.com/U5Nx5Oe.png)
![](https://i.imgur.com/sWzR3b5.png)
![](https://i.imgur.com/euamoRq.png)

### Addition:
* [**subplot layout examples**](https://matplotlib.org/examples/pylab_examples/subplots_demo.html)
* [**Multiple Subplots**](https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html)
* [**seaborn.jointplot**](https://seaborn.pydata.org/generated/seaborn.jointplot.html)

---

## Day015: Common EDA Plots - HeatMap and GridPlot
![](https://i.imgur.com/sHfaYj0.png)
![](https://i.imgur.com/bGppdH7.png)
![](https://i.imgur.com/bhw2sHe.png)
![](https://i.imgur.com/Z60h204.png)
![](https://i.imgur.com/TiJVg2X.png)

### Addition:
* [**Advanced heatmap example**](https://www.jianshu.com/p/363bbf6ec335)
* [**Visualizing Data with Pairs Plots in Python**](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)

---

## Day016:
![](https://i.imgur.com/c7N7LCE.png)