
###### tags: `ML100Days`

# Day 3. 3-1 How to create a new DataFrame? 3-2 How to read other data? (non-CSV data)

Day 3 continues the Pandas work from Day 2. To prepare, I went through the [Pandas](https://www.kaggle.com/learn/pandas) basics course on [Kaggle learn](https://www.kaggle.com/learn/overview). I recommend that beginners who want to understand Pandas practice on Kaggle Learn; I personally found it quite helpful!

[Homework 3-1 & 3-2 (link to my GitHub)](https://github.com/lidopypy/2nd-ML100Days/blob/master/homework/Day_003_HW.ipynb) are combined in a single file. Below I record how I worked through the exercises.

Libraries used on Day 3:

```python=
import pandas as pd
import numpy as np
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
import requests
```

## Homework 3-1

Practice producing a `pd.DataFrame` like the one in the figure below.

![](https://i.imgur.com/ZvdlJ9J.png)

Having worked through Kaggle Learn, I started from the example in the Kaggle exercise:

```python=+
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
```

![](https://i.imgur.com/pP9BlJA.png)

I followed the same pattern for the homework. Because the population figures are random, this uses `np.random.randint`. First build a `dict()` of countries and populations (the column names `'國家'` and `'人口'` mean "country" and "population"):

```python=+
country = ['Taiwan', 'United States', 'Thailand']
# Random leading digit 1-9 times 10**6 or 10**7
population = np.random.randint(1, 10, size=3) * (10**np.random.randint(6, 8, size=3))
data = {'國家': country,
        '人口': population}
```

Then pass `data` to `pd.DataFrame`. The index does not need to be specified here, so it simply starts from 0.

```python=+
data = pd.DataFrame(data)
print(data)
```

```
              國家        人口
0         Taiwan   9000000
1  United States   5000000
2       Thailand  70000000
```

While I was at it, I practiced `max()`, `idxmax()`, and similar operations to summarize the population figures.

```python=+
# idxmax()/idxmin() return the index label of the largest/smallest value
population_max_country = data.loc[data['人口'].idxmax(), '國家']
population_min_country = data.loc[data['人口'].idxmin(), '國家']
population_max = data['人口'].max()
population_min = data['人口'].min()
print(f'Country with the largest population: {population_max_country}, {population_max} people in total')
print(f'Country with the smallest population: {population_min_country}, {population_min} people in total')
```

```
Country with the largest population: Thailand, 70000000 people in total
Country with the smallest population: United States, 5000000 people in total
```

## Homework 3-2

Practice reading a txt file and its contents. Please read this [text file](https://raw.githubusercontent.com/vashineyu/slides_and_others/master/tutorial/examples/imagenet_urls_examples.txt).

Plain link for easy copying: https://raw.githubusercontent.com/vashineyu/slides_and_others/master/tutorial/examples/imagenet_urls_examples.txt
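The file is essentially a long list of image links, as shown below:

![](https://i.imgur.com/IgDzXks.png)

The goal is to read the entries one by one and display the images with `plt`. Start by reading the txt file.

```python=+
target_url = 'https://raw.githubusercontent.com/vashineyu/slides_and_others/master/tutorial/examples/imagenet_urls_examples.txt'
```

To fetch a file over HTTP, use the `requests` module:

```python=+
response = requests.get(target_url)
data = response.text

# The text returned by requests is one long string; line breaks are not split out yet
print(len(data))
data[0:100]
```

```
784594
'n00015388_157\thttp://farm1.static.flickr.com/145/430300483_21e993670c.jpg\nn00015388_238\thttp://farm2'
```

The downloaded data is a single string in which tabs (`'\t'`) and newlines (`'\n'`) are just ordinary characters, so `len(data)` counts every character. The next step is to split on those tabs and newlines. First, treat each line as one record by using `split('\n')`:

```python=+
split_tag = '\n'
data = data.split(split_tag)
print(len(data))
data[1]
```

```
9996
'n00015388_238\thttp://farm2.static.flickr.com/1005/3352960681_37b9c1d27b.jpg'
```

Each complete line is now one record. In `n00015388_238\thttp://farm2.static.flickr.com/1005/3352960681_37b9c1d27b.jpg`, `n00015388_238` is the name of the image and the part after the tab is the image link. Next, separate the name from the link with `split('\t')`, store the pieces in two `list()` objects, and build a `dict()` from them so the data can be converted to a `pd.DataFrame`.

```python=+
def split_data(data):
    """Split each 'name<TAB>url' line into a dict of two parallel lists."""
    name = []
    url = []
    for line in data:
        parts = line.split('\t')
        name.append(parts[0])
        # A line without a tab has no URL, so store an empty string instead
        url.append(parts[1] if len(parts) > 1 else '')
    data_dict = {'name': name, 'url': url}
    return data_dict
```

With the splitting written as a function, pass `data` through it and convert the result to a `pd.DataFrame`:

```python=+
data = split_data(data)
df_data = pd.DataFrame(data)
df_data.head()
```

```
            name                                                url
0  n00015388_157  http://farm1.static.flickr.com/145/430300483_2...
1  n00015388_238  http://farm2.static.flickr.com/1005/3352960681...
2  n00015388_304  http://farm1.static.flickr.com/27/51009336_a96...
3  n00015388_327  http://farm4.static.flickr.com/3025/2444687979...
4  n00015388_355  http://img100.imageshack.us/img100/3253/forres...
```

Once the data is in a DataFrame, use DataFrame indexing to fetch the image behind the first link and plot it.

```python=+
first_link = df_data.loc[0, 'url']

response = requests.get(first_link)
# Open the downloaded bytes as a PIL image; plt.imshow accepts it directly
img = Image.open(BytesIO(response.content))

plt.imshow(img)
plt.show()
```

![](https://i.imgur.com/IvQ5DrA.png)

Finally, the homework asks us to use pandas operations to plot the images from the first five links. There is a small trap hidden here: some of the links are broken, so `try` / `except` can be used to skip the links that raise an error.

```python=+
def img2arr_fromURLs(url_list, resize=False):
    # Note: resize is not used in this version
    img_list = []
    for url in url_list:
        try:
            # Keep the request inside try so connection errors are skipped too
            response = requests.get(url, timeout=10)
            img = Image.open(BytesIO(response.content))
        except Exception:
            # Dead link or a response that is not an image: skip it
            continue
        img_list.append(img)
    return img_list
```

```python=+
result = img2arr_fromURLs(df_data['url'][0:5])
print("Total images that we got: %i " % len(result))  # If this is not 5, some links are dead

for im_get in result:
    plt.imshow(im_get)
    plt.show()
```

![](https://i.imgur.com/1YfBJmA.png)
![](https://i.imgur.com/cMs0juE.png)
![](https://i.imgur.com/0b259Cw.png)
![](https://i.imgur.com/jMvnb4W.png)

All done! Homework 3 covered a lot of ground. To summarize:

1. Creating a `pd.DataFrame`
2. Reading data from an HTTP page with `requests`
3. Splitting the data inside a txt file with `split`
4. Exception handling with `try` / `except` (I had not written this in a long time, but when downloading data on Kaggle you really do run into errors!!)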
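
As a side note on the splitting step: each line is just a name and a URL separated by a tab, so the same table could probably be built with `pd.read_csv` instead of a hand-written loop. The sketch below is only an illustration and not part of the homework. It parses a small in-memory sample rather than the real file; `parse_url_list` and `sample_text` are names I made up, and the third sample line (with no URL) is a hypothetical malformed entry added to show how missing fields behave.

```python
from io import StringIO

import pandas as pd

# Two lines copied from the output shown earlier, plus one hypothetical
# malformed line that has a name but no URL.
sample_text = (
    "n00015388_157\thttp://farm1.static.flickr.com/145/430300483_21e993670c.jpg\n"
    "n00015388_238\thttp://farm2.static.flickr.com/1005/3352960681_37b9c1d27b.jpg\n"
    "n00000000_000\n"
)


def parse_url_list(text):
    """Parse 'name<TAB>url' lines into a DataFrame with columns name and url."""
    df = pd.read_csv(
        StringIO(text),         # treat the string as if it were a file
        sep="\t",               # fields are separated by tabs
        header=None,            # the data has no header row
        names=["name", "url"],  # so supply the column names ourselves
    )
    # Lines with no URL come back as NaN; use an empty string instead,
    # which mirrors what split_data does.
    df["url"] = df["url"].fillna("")
    return df


df = parse_url_list(sample_text)
print(df)
```

On the real file, `response.text` would presumably be passed in place of `sample_text`. One design difference from the manual `split` approach is that `read_csv` skips blank lines by default, so an empty trailing line would not become a row.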