Machine Learning 100-Day Marathon (100Days of ML Coding) Day-003
===
###### tags: `100DaysOfMLCode`

Day-002 of the marathon covered how to read CSV data. In practice, though, not every file we receive during data processing is a CSV. How do we use Python packages to build a CSV-like structure from structured data? And how do we read the many kinds of unstructured data?

## What's a Series ?

Before learning how to create a DataFrame, we first need to know what a Series is. A Series is one of the core structures in pandas. It sits somewhere between a List and a Dictionary: it has an index, it supports lookup, it holds the actual data ( values ), and each piece of data carries a label.

### Creating a Series :

* #### Creating a Series from a list ( List ) :
`pd.Series([values], index=[row labels])`
```python=
>>> import pandas as pd
>>> A=[1,2,3,4,5,6]
>>> pd.Series(A,index=[2,1,8,5,6,4])
2    1
1    2
8    3
5    4
6    5
4    6
dtype: int64
```
If we do not set an index, pandas assigns a default index starting from 0.

* #### Creating a Series from a dictionary ( Dictionary ) :
`pd.Series({row label : value}, index=[keys of dict])`
```python=
>>> pd.Series({"a":1,"b":2,"c":3})
a    1
b    2
c    3
dtype: int64
```
We can also use the `index` parameter to choose which entries are listed and in what order. If a label we specify does not exist in the dictionary, pandas fills its value with NaN.
```python=
>>> pd.Series({"a":1,"b":2,"c":3,"d":4,"e":5},index=["e","b","d","z"])
e    5.0
b    2.0
d    4.0
z    NaN
dtype: float64
```

## What's a DataFrame ?

We can think of a DataFrame as a two-dimensional Series.

![](https://i.imgur.com/MmdIHmb.png =250x)

### Creating a DataFrame :

* #### Creating a DataFrame from a list of Series :
`pd.DataFrame([pd.Series], index=[row labels])`
```python=
A=pd.Series({"a":1,"b":2,"c":3})
B=pd.Series({"a":4,"b":5,"c":6})
# The source also listed "c":7 here; a repeated key is silently overwritten, so only "c":7.5 is kept
C=pd.Series({"c":7.5,"d":8,"e":9,"f":10})
df=pd.DataFrame([A,B,C],index=["Series_1","Series_2","Series_3"])
```
![](https://i.imgur.com/I9AFKSG.png =300x)

One thing to watch out for: a Series on its own is displayed as a column, but when we build a DataFrame from a list of Series, each Series becomes a row. I do get confused by this myself. How the elements (Series, List, Dictionary...) end up arranged inside a DataFrame takes some time to understand.

* #### Creating a DataFrame from a dictionary of Series :
`pd.DataFrame({"column label": pd.Series})`
```python=
A=pd.Series({"a":1,"b":2,"c":3})
B=pd.Series({"a":4,"b":5,"c":6})
# As above, the repeated "c":7 from the source is dropped because "c":7.5 overwrites it
C=pd.Series({"c":7.5,"d":8,"e":9,"f":10})
df=pd.DataFrame({"Series_1":A,"Series_2":B,"Series_3":C})
```
![](https://i.imgur.com/gctJIev.png =230x)

Here the dictionary assigns each Series to a column label, so each Series is displayed as a column. This differs from the list-of-Series approach above. (The two DataFrames are transposes of each other.)

**<font color="#dd0000">If A, B, C in the example above are plain lists ( or dictionaries ) rather than Series, a DataFrame can be created in the same way : </font>**
```python=
A=[1,2,3]
B=[4,5,6]
C=[7,8,9]
df=pd.DataFrame({"Series_1":A,"Series_2":B,"Series_3":C}, index=["a",'b','c'])
```
![](https://i.imgur.com/lpORtE2.png =220x)
```python=
A={"a":1,'b':2,'c':3}
B={'a':4,'b':5,'c':6}
C={'d':7,'e':8,'f':9}
df=pd.DataFrame({"Series_1":A,"Series_2":B,"Series_3":C})
```
![](https://i.imgur.com/yuHWLAL.png =220x)

* #### Supplying the data first, then assigning row and column labels :
```python=
import numpy as np

df=pd.DataFrame(np.random.randint(10,size=(3,2)),
                columns=["column_1","column_2"],
                index=["row_1","row_2","row_3"])
```
![](https://i.imgur.com/ji4IiG2.png =220x)
```python=
df=pd.DataFrame([["A0", "B0", "C0", "D0"],
                 ["A1", "B1", "C1", "D1"],
                 ["A2", "B2", "C2", "D2"],
                 ["A3", "B3", "C3", "D3"]],
                columns=["A","B","C","D"],
                index=['a','b','c','d'])
```
![](https://i.imgur.com/y6ejNqg.png =150x)

## Working with a DataFrame

As mentioned above, a DataFrame is a two-dimensional Series. That two-dimensional shape makes it somewhat harder to work with: how do we select a column or a row? And is the selector we pass the default index or a label we assigned?
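
To make these questions concrete, here is a minimal sketch. The names `a`, `b`, `s` and all values are illustrative and do not come from the examples above. It first checks that a list of Series and a dictionary of Series give transposed layouts, then shows that a label lookup and a position lookup can return different values when integer labels are not in positional order.

```python
import pandas as pd

# Two small Series sharing the same labels (values are illustrative).
a = pd.Series({"a": 1, "b": 2, "c": 3})
b = pd.Series({"a": 4, "b": 5, "c": 6})

# A list of Series becomes rows; a dict of Series becomes columns.
by_rows = pd.DataFrame([a, b], index=["Series_1", "Series_2"])
by_cols = pd.DataFrame({"Series_1": a, "Series_2": b})
same_after_transpose = by_rows.T.equals(by_cols)

# Integer labels that are deliberately out of positional order.
s = pd.Series(["x", "y", "z"], index=[2, 0, 1])
by_label = s.loc[0]      # the value whose label is 0
by_position = s.iloc[0]  # the value stored in the first position

print(same_after_transpose)
print(by_label, by_position)
```

The two lookups on `s` differ only in the accessor, which is why being explicit about labels versus positions matters once the index is numeric.
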

In pandas, plain indexing works on columns by default: `df[column label]`. Note, however, that when the labels are themselves numbers it becomes ambiguous whether a number means the default position or a label we assigned. For that reason, some people recommend using `.loc` or `.iloc` when selecting from or slicing a DataFrame, because they are more explicit.
```python=
df.loc[row_range, column_range]    # based on label
df.iloc[row_range, column_range]   # based on location
```
![](https://i.imgur.com/uyjwm55.png)

Taking this DataFrame as an example, the following operations show the difference :
```python=
>>> df["A"]        # plain indexing selects by column
1    A0
2    A1
3    A2
4    A3
Name: A, dtype: object
>>> df.loc['1']    # '1' is the index label we assigned
A    A0
B    B0
C    C0
D    D0
Name: 1, dtype: object
>>> df.iloc[1]     # 1 is the default integer location
A    A1
B    B1
C    C1
D    D1
Name: 2, dtype: object
>>> df.loc['1':'3','A':'B']
    A   B
1  A0  B0
2  A1  B1
3  A2  B2
>>> df.iloc[1:3,1:3]
    B   C
2  B1  C1
3  B2  C2
```
Now that we know how to select from a DataFrame, we can perform some operations on it. Each example below starts from the original DataFrame :

![](https://i.imgur.com/uyjwm55.png)

* #### Adding data :
We can add a column or a row.
```python=
>>> df.loc[:,"新行"]=["a2","b2","c2","d2"]   # 新行 = "new column"
>>> df
    A   B   C   D  新行
1  A0  B0  C0  D0  a2
2  A1  B1  C1  D1  b2
3  A2  B2  C2  D2  c2
4  A3  B3  C3  D3  d2
```
```python=
>>> df.loc["新列",:]=["a3","b3","c3","d3"]   # 新列 = "new row"
>>> df
      A   B   C   D
1    A0  B0  C0  D0
2    A1  B1  C1  D1
3    A2  B2  C2  D2
4    A3  B3  C3  D3
新列  a3  b3  c3  d3
```
A partial addition is also possible; the cells that receive no data are filled with NaN.
```python=
>>> df.loc["局部","B":"C"]=["a4","b4"]   # 局部 = "partial"; .loc slices by label, so "B":"C" replaces the source's 1:3
>>> df
       A   B   C    D
1     A0  B0  C0   D0
2     A1  B1  C1   D1
3     A2  B2  C2   D2
4     A3  B3  C3   D3
局部  NaN  a4  b4  NaN
```

* #### Deleting data :
    * Use `del df[column label]` to delete a whole column
    ```python=
    >>> del df["D"]
    >>> df
        A   B   C
    1  A0  B0  C0
    2  A1  B1  C1
    3  A2  B2  C2
    4  A3  B3  C3
    ```
    * Use `drop('label', axis)` to delete a column (axis=1) or a row (axis=0)
    ```python=
    >>> df.drop("D",axis=1)
        A   B   C
    1  A0  B0  C0
    2  A1  B1  C1
    3  A2  B2  C2
    4  A3  B3  C3
    ```
    Note in particular that `del` modifies the original DataFrame, so we usually work on a copy when using it. `drop`, on the other hand, leaves the original DataFrame unchanged unless we pass `inplace=True`.

* #### Selecting by condition :
    * Find the row that holds the maximum of a given column : `df.idxmax()`
    ```python=
    >>> # 國家 = country, 人口 = population
    >>> data = {'國家':["Taiwan","American","China","Japan","South Korea"] ,
    ...         '人口':np.random.randint(10**6,10**9,size=5)}
    >>> population = pd.DataFrame(data)
    >>> population
                國家         人口
    0       Taiwan  952376483
    1     American  131554969
    2        China  587742313
    3        Japan   19808293
    4  South Korea  305428287
    >>> population.loc[population["人口"].idxmax(),:]
    國家       Taiwan
    人口    952376483
    Name: 0, dtype: object
    ```
    `idxmax` returns the index label of the row holding the maximum, so the lookup goes through `.loc`. With the default 0-based index, labels and positions coincide, which is why `.iloc` returns the same row in this particular example.

    * For each column, combine conditions with boolean operators to list the rows that match
    ```python=
    >>> df=pd.DataFrame(np.random.randint(100,size=(5,5)),
    ...                 columns=['A','B','C','D','E'],
    ...                 index=['a','b','c','d','e'])
    >>> df
        A   B   C   D   E
    a  55  40  74  22  41
    b  11  78  51  71  87
    c  60  25  23  15  27
    d   8  16  87   8  94
    e  94  40  38  21  92
    >>> df[(df['A']>50) & (df['D']<40)]   # & : and
        A   B   C   D   E
    a  55  40  74  22  41
    c  60  25  23  15  27
    e  94  40  38  21  92
    >>> df[(df['A']<61) | (df['D']>20)]   # | : or
        A   B   C   D   E
    a  55  40  74  22  41
    b  11  78  51  71  87
    c  60  25  23  15  27
    d   8  16  87   8  94
    e  94  40  38  21  92
    ```

    * Apply one hot encoding to categorical ( non-numeric ) columns
    ```python=
    >>> df=pd.DataFrame({'Day':['Mon','Thu','Mon','Sat','Sun'],
    ...                  'Color':['Green','Red','Blue','Blue','Orange']},
    ...                 index=[1,2,3,4,5])
    >>> df
       Day   Color
    1  Mon   Green
    2  Thu     Red
    3  Mon    Blue
    4  Sat    Blue
    5  Sun  Orange
    >>> pd.get_dummies(df)
    ```
    ![](https://i.imgur.com/GcNN4Jr.png =500x)

## [Homework](https://github.com/allen108108/2nd-ML100days/blob/master/Day_003-1_HW.ipynb)

Imagine a dataframe with two columns, one for the country and one for the population ( randomly generated ). Find the country with the largest population.
```python=
import numpy as np
import pandas as pd

# 國家 = country, 人口 = population
data = {'國家':["Taiwan","American","China","Japan","South Korea"] ,
        '人口':np.random.randint(10**6,10**9,size=5)}
population = pd.DataFrame(data)
# idxmax returns an index label, so look the row up with .loc
population.loc[population["人口"].idxmax(),:]
```

---

## Summary of ways to read various formats

* #### Reading a CSV file
```python=
import os
import pandas as pd

# Raw string, so the backslashes in the Windows path are not treated as escapes
dpath=os.path.abspath(r"C:\Python\Group_Patrick\Titanic")  # folder location
fpath=os.path.join(dpath,"train.csv")                      # file location
a_train=pd.read_csv(fpath)                                 # read the file
```

* #### Reading a txt file
```python=
with open('example.txt', 'r') as f:
    data = f.readlines()
print(data)
```

* #### Reading json
```python=
import json
with open('example.json', 'r') as f:
    data = json.load(f)
print(data)
```

* #### Reading a mat matrix file
```python=
import scipy.io as sio
data = sio.loadmat('example.mat')
```

* #### Reading images
```python=
import cv2
image = cv2.imread(...)  # note: cv2 reads images in BGR order
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

from PIL import Image
image = Image.open(...)

import skimage.io as skio
image = skio.imread(...)
```

* #### Reading a Python npy file
```python=
import numpy as np
arr = np.load('example.npy')
```

* #### Reading Pickle (pkl)
```python=
import pickle
with open('example.pkl', 'rb') as f:
    arr = pickle.load(f)
```

* #### Reading a web page
```python=
import urllib.request as ur  # assumed import; the source uses the alias ur without showing it

page=ur.urlopen("http://www.google.tw")
page=page.read()
```

## [Homework-1](https://github.com/allen108108/2nd-ML100days/blob/master/Day_002_HW.ipynb)

```python=
## What if we don't want to download the data to our own machine?
# Fill in the link
target_url ='https://raw.githubusercontent.com/vashineyu/slides_and_others/master/tutorial/examples/imagenet_urls_examples.txt'
```
```python=
import requests
response = requests.get(target_url)
data = response.text

# The text returned by requests is one long string; it is not split at line breaks for us
print(len(data))
data[0:100]
```
```python=
# Find the newline character, split the string on it, and drop it
split_tag = '\n'  # left blank in the source; inferred from the comment above
data = data.split(split_tag)
print(len(data))
data[0]
```

## [Homework-2](https://github.com/allen108108/2nd-ML100days/blob/master/Day_002_HW.ipynb)

```python=
import pandas as pd
# arrange_data is built earlier in the homework notebook and is not shown here
df = pd.DataFrame(arrange_data)
df.head()
```
Reading images: read the first 5 images in the data frame above.
```python=
from PIL import Image
from io import BytesIO
import numpy as np
import matplotlib.pyplot as plt

# Use df.loc[...] to get the link of the first record
# Left blank in the source; column 1 is assumed to hold the URL, as in the call further below
first_link = df.loc[0, 1]

response = requests.get(first_link)
img = Image.open(BytesIO(response.content))

# Convert img to numpy array (left blank in the source)
img = np.array(img)

plt.imshow(img)
plt.show()
```
```python=
def img2arr_fromURLs(url_list, resize = False):
    """
    Please complete this function
    Args
        - url_list: list of URLs
        - resize: bool
    Return
        - list of array
    """
    img_list = []  # TODO (exercise): fetch each URL and append the image as an array
    return img_list
```
```python=
result = img2arr_fromURLs(df[0:5][1].values)
print("Total images that we got: %i " % len(result)) # if this is not 5, some of the links are dead

for im_get in result:
    plt.imshow(im_get)
    plt.show()
```
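
The text, JSON and pickle readers from the format summary earlier can be tried without any existing files by first writing small samples to a temporary directory. The following is only a sketch using the standard library. The file names and sample contents are made up for illustration and are not part of the homework.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Sample content is made up purely for illustration.
sample_lines = ["first line", "second line"]
sample_record = {"name": "example", "values": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Plain text: write two lines, then read them back with readlines().
    txt_path = tmp_dir / "example.txt"
    txt_path.write_text("\n".join(sample_lines), encoding="utf-8")
    with open(txt_path, "r", encoding="utf-8") as f:
        text_data = [line.rstrip("\n") for line in f.readlines()]

    # JSON: dump a dict, then load it back.
    json_path = tmp_dir / "example.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(sample_record, f)
    with open(json_path, "r", encoding="utf-8") as f:
        json_data = json.load(f)

    # Pickle: binary mode is needed for both writing and reading.
    pkl_path = tmp_dir / "example.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump(sample_record, f)
    with open(pkl_path, "rb") as f:
        pkl_data = pickle.load(f)

print(text_data)
print(json_data == sample_record, pkl_data == sample_record)
```

`readlines()` keeps the trailing newline on each line, so the sketch strips it before comparing. The same issue is what Homework-1 handles by splitting on the newline character.
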