Introduction to Python Data Analysis

###### tags: `選修`

# Introduction to Python Data Analysis
+ [Source of some examples: 成為python數據分析達人的第一課(自學課程)](http://moocs.nccu.edu.tw/course/123/intro)

## 1. Data Collection
1. ### Data sources
- Collected yourself
  * Customer records
  * Questionnaires and surveys
  * Self-installed sensors
- Open datasets
  * [政府資料開放平台](https://data.gov.tw) (Taiwan government open data platform)
  * [臺北市資料大平臺](https://data.taipei/) (Taipei City)
  * [臺南市政府資料開放平台](http://data.tainan.gov.tw) (Tainan City)
  * [Kaggle](https://www.kaggle.com/datasets)
- Commercial APIs
  * [Facebook Graph API](https://developers.facebook.com/docs/graph-api/using-graph-api)
  * [YouTube Data API](https://developers.google.com/youtube/v3/getting-started)
  * [Twitter Developers](https://developer.twitter.com/en/docs)
  * [Flickr API](https://www.flickr.com/services/developer/api)
- Web scraping
- Paid data

2. ### Open data platforms
- Fetching a page over the network
```python=
import requests

url = 'https://cs.cysh.cy.edu.tw'
html = requests.get(url)
print(html.text)
```
- Calling an open data platform API
```python=
import requests

# Pet registration station list: 臺北市資料大平臺/資料目錄/主題分類瀏覽/農業/寵物登記站名冊/API
url = 'https://data.taipei/opendata/datalist/apiAccess?scope=resourceAquire&rid=e0241f72-7db0-4d04-91a3-571d1af69f6b'
resp = requests.get(url)
data = resp.json()      # parse the JSON response body
print(data)
vhos_list = data['result']['results']
for item in vhos_list:  # each item is a dict
    if item['地址'][:3] == '大安區':
        print(f"{item['寵物登記機構名稱']}\n{item['電話']}")
```
:::info
EX_01: From the [行政院環境保護署。環境資源資料開放平臺](https://opendata.epa.gov.tw/) (Environmental Protection Administration open data platform), fetch the PM2.5 data for Chiayi.
:::
```python=
import requests

url = 'https://opendata.epa.gov.tw/api/v1/ATM00625?%24skip=0&%24top=1000&%24format=json'
........                     # send the request; add verify=False only if the HTTPS certificate fails verification
........                     # parse the response with json()
print(data)
........                     # loop over every dict in data
    if ........ == '嘉義':   # if the dict's Site equals 嘉義 (Chiayi)
        print(item)
```
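The filtering pattern used with the pet registration API above can be tried without a network connection. This is a minimal sketch, assuming a response shaped like `{'result': {'results': [...]}}`. The field names (`name`, `address`, `phone`) and all records are made-up placeholders, not real registry data.

```python
import json

# Hypothetical payload that mimics the shape data['result']['results'].
# Every name, address, and phone number here is invented for illustration.
raw = '''
{"result": {"results": [
    {"name": "Clinic A", "address": "大安區 Example Rd. 1", "phone": "02-0000-0001"},
    {"name": "Clinic B", "address": "信義區 Example Rd. 2", "phone": "02-0000-0002"},
    {"name": "Clinic C", "address": "大安區 Example Rd. 3", "phone": "02-0000-0003"}
]}}
'''

def filter_by_district(payload, district):
    """Return the records whose address starts with the given district."""
    records = payload['result']['results']
    return [r for r in records if r['address'].startswith(district)]

data = json.loads(raw)   # for a live response, resp.json() returns the same kind of dict
daan = filter_by_district(data, '大安區')
for r in daan:
    print(f"{r['name']}\n{r['phone']}")
```

Keeping the filter in a small function makes it easy to reuse for another district or another dataset with the same layout.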
3. ### Web Crawler
- [文組也看得懂的 - 網路爬蟲](https://www.youtube.com/watch?v=BdRjutf8K0c)
- Use BeautifulSoup to parse HTML
- [Python爬虫利器二之Beautiful Soup的用法](https://cuiqingcai.com/1319.html)
- [爬蟲使用模組介紹-Beautiful Soup 1](https://ithelp.ithome.com.tw/articles/10206419)、[爬蟲使用模組介紹-Beautiful Soup 2](https://ithelp.ithome.com.tw/articles/10206668)
- [网页爬虫教程系列 | 莫烦Python](https://morvanzhou.github.io/tutorials/data-manipulation/scraping/)
- [Python 爬虫学习系列教程](http://wiki.jikexueyuan.com/project/python-crawler-guide/)
- [Python 爬蟲實戰](https://www.slideshare.net/tw_dsconf/python-83977397)
```python=
import requests
from bs4 import BeautifulSoup

url = 'http://cs.cysh.cy.edu.tw'
html = requests.get(url)
# print(html.text)
sp = BeautifulSoup(html.text, 'lxml')
# print(sp.h1)
a_tags = sp.find_all('a')                   # list of every <a> tag
for itm in a_tags:
    print(itm.text)                         # print the link text
ul_tags = sp.find('ul', {'class': 'alt'})   # the <ul> tag whose class is alt
ul_a_tags = ul_tags.find_all('a')
for itm in ul_a_tags:
    print(itm.get('href'))
```
:::info
EX_02: From [《美麗佳人》／Love & Sex／星座運勢](https://www.marieclaire.com.tw/love-sex/astrology) (the Marie Claire Taiwan horoscope page), fetch the weekly horoscope written by 星星教授安格斯.
:::
```python=
import requests
from bs4 import BeautifulSoup

url = "https://www.marieclaire.com.tw/love-sex/astrology"
........    # fetch the page content
........
# ^ blank above: parse the page with BeautifulSoup into a variable named soup
hot = soup.find_all('div', {'class': 'hot'})   # the articles are inside <div class="hot">
# print(hot)
articles = hot[0].find_all('a')
# print(articles)
item = articles[0]
# print(item)
# print(item.attrs)                # attributes of the <a> tag
# print(item['href'])
# print(item['href'].split('/'))   # split the link into its parts
weekly_url = item['href']
article_num = item['href'].split('/')[5]
for item in articles:
    # Keep the link with the largest trailing number.
    # Note: this compares strings, which is reliable only while the numbers have the same number of digits.
    if '星星教授安格斯' in item['title'] and item['href'].split('/')[5] > article_num:
        weekly_url = item['href']
        article_num = item['href'].split('/')[5]

weekly_resp = requests.get(weekly_url)
weekly_soup = BeautifulSoup(weekly_resp.text, 'lxml')
article = weekly_soup.find_all('div', {'class': '........'})[0]   # find out which block holds this week's horoscope
fortune = {}
for item in article.find_all('h2'):
    sign = item.string[:2]            # zodiac sign
    luck = item.next_sibling.string   # the <p> right after the <h2>
    fortune[sign] = luck              # build the dict
# print(fortune['牡羊'])

from ipywidgets import interact_manual   # interactive menu
interact_manual(lambda 星座: fortune[星座], 星座=['魔羯','射手','天蠍','天秤','處女','獅子','巨蟹','雙子','金牛','牡羊','雙魚','水瓶']);
```
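EX_02 picks the newest article by the number at the end of its URL. Numbers taken from a URL are strings, and string comparison places '999' after '1000', so converting to `int` before comparing is safer. A minimal sketch with made-up links follows. The URL layout, the titles, and the helper name `article_id` are assumptions for illustration, not the site's real structure.

```python
# Hypothetical links; the real site's URL layout may differ.
links = [
    {'title': 'Weekly horoscope A', 'href': 'https://example.com/love-sex/astrology/999'},
    {'title': 'Weekly horoscope B', 'href': 'https://example.com/love-sex/astrology/1000'},
    {'title': 'Other article',      'href': 'https://example.com/love-sex/astrology/1001'},
]

def article_id(href):
    """Return the trailing number of the URL as an int."""
    return int(href.rstrip('/').split('/')[-1])

# Keep only the wanted series, then take the link with the largest id.
candidates = [link for link in links if 'Weekly horoscope' in link['title']]
newest = max(candidates, key=lambda link: article_id(link['href']))
print(newest['href'])

# Comparing the raw strings would rank 999 above 1000:
print('999' > '1000')   # strings are compared character by character
```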
4. ### Regular Expressions
- [用 Regular Expression 做字串比對](https://larry850806.github.io/2016/06/23/regex/)
- [JavaScript RegExp 对象](http://www.w3school.com.cn/js/jsref_obj_regexp.asp)
- [Python正则表达式指南](https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html)
- [Online regex tester and debugger](https://regex101.com/)
- [正则表达式 - Python 基础 | 莫烦Python](https://morvanzhou.github.io/tutorials/python-basic/basic/13-10-regular-expression/)
```python=
text = 'dog run'
print('dog' in text)
print('cat' in text)

import re
print(re.search('dog', text))
print(re.search('cat', text))

# [] matches any one of the listed characters
ptn = r'r[ua]n'   # the r prefix makes a raw string, so backslashes are kept as written; matches run or ran
print(re.search(ptn, text))
print(re.search(r'r[A-Z]n', text))
print(re.search(r'r[a-z]n', text))
print(re.search(r'd[0-9a-z]g', 'd0g run'))   # a digit or lowercase letter between d and g
print(re.search(r'd\dg', 'd0g run'))         # \d matches any digit
```
```python=
import requests
from bs4 import BeautifulSoup
import re

url = 'https://pixabay.com/'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})     # send a browser-like User-Agent header
soup = BeautifulSoup(resp.text, 'lxml')
img_links = soup.find_all('img', {'src': re.compile(r'.+\.jpg')})   # regex: image links whose src contains .jpg
for link in img_links:
    print(link.get('src'))
```

## 2. NumPy
+ Data analysis often converts a list into an array. NumPy's ndarray (N-dimensional array) is a fast, memory-efficient multidimensional array that supports vectorized operations and more advanced functionality.
+ All elements of an ndarray must share one data type. Every array has a shape (a tuple with the size of each dimension) and a dtype (the element data type).
  - When mixed values are converted, the precedence is: string > number > boolean.
+ References
  - [numpy 用法 (1)](http://violin-tao.blogspot.com/2017/05/numpy-1.html)
  - [numpy 用法 (2)](http://violin-tao.blogspot.com/2017/06/numpy-2.html)
  - [Data Science — Numpy Basic !](https://medium.com/pyladies-taiwan/data-science-numpy-basic-d3a6ca6c715c)
  - [Python玩數據 (2)：Numpy [1/2]](https://www.ycc.idv.tw/python-play-with-data_2.html)
  - [Python玩數據 (3)：Numpy [2/2]](https://www.ycc.idv.tw/python-play-with-data_3.html)
  - [NumPy User Guide](https://docs.scipy.org/doc/numpy/user/index.html)
  - [NumPy API Reference](https://docs.scipy.org/doc/numpy/reference/)
  - [The N-dimensional array (ndarray) API
Reference](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)

1. ### Creating arrays directly
```python=
import numpy as np

a = np.array([1,2,3])                # 1-D array
a
a.ndim                               # number of dimensions
a.shape                              # shape of the array
a.size                               # number of elements
a.dtype                              # data type of the elements
a_f = a.astype(np.float64)           # convert to floating point
a_f.dtype
b = np.array(['1','2.2','3.33'])
b_f = b.astype(float)                # a string array that holds only numbers can be converted with astype
b_f
b_f.dtype
c = np.array([[1,2,3], [4,5,6]])     # 2-D array
c
c.ndim
c.shape
c.size
```
2. ### [NumPy math and statistics functions](https://docs.scipy.org/doc/numpy/reference/routines.math.html)
- [[python] numpy axis概念整理筆記](http://changtw-blog.logdown.com/posts/895468-python-numpy-axis-concept-organize-notes)
```python=
a = np.array([3,4,5,1,2])
np.sort(a)         # sorted copy
a.sum()            # sum, same as np.sum(a)
a.mean()           # mean
a.max()            # maximum
a.min()            # minimum
a.std()            # standard deviation
np.mod(5,2)        # remainder
np.sin(np.pi/2)    # sine
np.log(100)        # natural logarithm (base e)

a = np.array([[1,2,3], [4,5,6]])
a.sum()            # 21; axis=None (the default) operates on every element, same as np.sum(a)
a.sum(axis=0)      # [1+4, 2+5, 3+6], same as np.sum(a, axis=0)
a.sum(axis=1)      # [1+2+3, 4+5+6], same as np.sum(a, axis=1)
```
3. ### Generating arrays quickly
- np.linspace(start, stop, num, endpoint)
  * Produces num evenly spaced values between start and stop. With endpoint=False, stop is excluded.
- np.arange(start, stop, step)
  * Produces values from start up to, but not including, stop, spaced by step.
  * Similar to Python's range, but returns an ndarray (range returns a lazy range object).
- np.random.randint(low, high, size)
  * Produces the requested number of random integers from low up to, but not including, high.
```python=
import numpy as np

np.zeros(3)                        # 1-D array of three zeros
np.zeros((3,5))                    # 3x5 array of zeros
np.zeros((3,5), dtype=int)         # integer zeros (np.int was removed in NumPy 1.24; use int)
np.ones(3)                         # 1-D array of three ones
np.full((3,5), 7)                  # 3x5 array filled with 7
np.eye(3)                          # 3x3 identity matrix
np.diag([1,2,3])                   # 3x3 array with 1,2,3 on the diagonal
np.random.randint(0, 10, 9)        # 1-D array of 9 random integers from 0 to 9
np.random.randint(0, 10, (3,3))    # 3x3 array of random integers from 0 to 9
np.random.normal(0, 1, (3,3))      # 3x3 array drawn from a normal distribution with mean 0, standard deviation 1
np.arange(3)                       # [0 1 2]
np.arange(10, 20, 2)               # [10 12 14 16 18]
np.linspace(0, 10, 6)              # 6 evenly spaced points from 0 to 10
xy = [[x, y] for x in range(5) for y in range(3)]
np.array(xy)
```
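The `axis` argument and the difference between `np.linspace` and `np.arange` are easier to see on small arrays. A minimal sketch using only calls that appear in the notes above:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = a.sum(axis=0)   # collapse the rows: one sum per column
row_sums = a.sum(axis=1)   # collapse the columns: one sum per row
print(col_sums, row_sums)

# linspace fixes the number of points; arange fixes the step size.
pts = np.linspace(0, 10, 6)                        # stop is included by default
pts_open = np.linspace(0, 10, 5, endpoint=False)   # stop is excluded
steps = np.arange(10, 20, 2)                       # stop is never included
print(pts, pts_open, steps)
```

A practical rule of thumb: use `linspace` when the number of samples matters (for example, points on a plot) and `arange` when the spacing matters.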
4. ### Reshaping arrays
- a.reshape((new_rows, new_columns))
  * Returns the array with the new number of rows and columns.
- a.flatten(), a.ravel()
  * Both reduce a multidimensional array to one dimension.
  * flatten() returns a copy; changes to the copy do not affect the original array.
  * ravel() returns a view whenever possible (similar to a C/C++ reference), so changes made through it affect the original array.
```python=
import numpy as np

a = np.array([1,2,3,4,5,6,7,8])
a
a.shape
a = a.reshape(2,4)    # same effect as a.shape = (2,4)
a
a.shape
b = a.flatten()       # flatten the matrix into a 1-D array
b
b.shape
a.shape = (2,4)
c = a.reshape(1,8)    # still a 2-D array, with shape (1, 8)
c
c.shape
```
5. ### Accessing data
- As with a list, use an index (starting at 0), a slice, or iteration to access the data in an array.
```python=
import numpy as np

a = np.array([1,2,3,4,5])
a[2]          # element at index 2
a[2:5]        # elements at index 2 to 4
a[:5:2]       # elements at index 0, 2, 4
a[::-1]       # elements in reverse order
a[1:3] = 7    # set the elements at index 1 and 2 to 7
b = np.array([[1,2,3], [4,5,6], [7,8,9]])
b[0, 1]       # element in row 0, column 1
b[0][1]       # same as above
b[1, :]       # every element of row 1
b[1]          # same as above
b[1:3]        # rows 1 to 2
b[:, 2]       # column 2 (index 2) of every row
x = np.arange(9).reshape((3,3))    # [0,1,2,3,4,5,6,7,8] reshaped to 3x3
np.diag(x)                         # diagonal elements
a = np.arange(10)
b = np.arange(5)
a[5:] = b[::-1]                    # a is now [0,1,2,3,4,4,3,2,1,0]
```
- Filtering with a condition
```python=
import numpy as np

math = np.array([60,70,55,80,90])
math >= 60                      # whether each element is >= 60
math[math >= 60]                # the elements that are >= 60
np.count_nonzero(math >= 60)    # how many are >= 60
np.sum(math >= 60)              # same count, because True == 1 and False == 0
np.where(math >= 60)[0]         # indices of the scores >= 60
np.any(math >= 60)              # does anyone score >= 60?
np.all(math >= 60)              # does everyone pass?
```
6. ### Element-wise operations (each pair of corresponding elements)
```python=
import numpy as np

us = [100,500,400]
us * 3                  # a list is repeated three times, not multiplied
us_arr = np.array(us)
rate = 30.8
us_arr * rate           # every element is multiplied by rate
grades = np.array([80, 60, 70])
weights = np.array([0.3, 0.3, 0.4])
g = grades * weights    # element-wise product
a = np.array([[1,2], [3,4]])
b = np.array([[5,6], [7,8]])
a+b                     # element-wise addition
c = np.array([1, 2, 3])
c**2
```
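Boolean filtering and element-wise arithmetic combine naturally when computing weighted grades for a whole class. The sketch below is a minimal illustration with made-up scores. It relies on NumPy broadcasting, which the notes above do not cover: the 1-D `weights` array is applied to every row of the 2-D `grades` array.

```python
import numpy as np

# Made-up scores: one row per student, one column per subject.
grades = np.array([[80, 60, 70],
                   [50, 40, 55],
                   [90, 85, 95]])
weights = np.array([0.3, 0.3, 0.4])

# Broadcasting multiplies every row by the weights, element by element.
totals = (grades * weights).sum(axis=1)   # one weighted total per student

passed = totals >= 60                     # boolean mask, one entry per student
print(totals)
print(np.where(passed)[0])                # indices of the students who passed
```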
7. ### Matrix operations
```python=
import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9])
np.cumsum(a)        # cumulative sum of the elements of a
np.diff(a)          # difference between consecutive elements
np.clip(a,3,8)      # limit the elements to 3~8: values above 8 become 8, values below 3 become 3
a = np.array([[1,2], [3,4]])
b = np.array([[5,6], [7,8]])
np.dot(a, b)        # matrix product, same as a.dot(b)
a.T                 # transpose, same as np.transpose(a)
a_rev = np.linalg.inv(a)    # inverse matrix
np.dot(a, a_rev)
np.vstack((a, b))   # stack vertically
np.hstack((a, b))   # stack horizontally
a = np.array([1,2,3])
b = np.array([4,5,6])
z = np.c_[a, b]     # NumPy's zip: np.c_ places 1-D arrays side by side as columns (joined along axis 1)
# array([[1, 4],
#        [2, 5],
#        [3, 6]])
np.r_[a, b]         # np.r_ joins the arrays along axis 0
a = z[:,0]          # unzip
b = z[:,1]
a = np.arange(9)
np.split(a, 3)      # split into 3 parts
a = np.arange(8)
np.array_split(a, 3)    # split raises an error unless the parts are equal; array_split does not
```
8. ### Extending and repeating data
```python=
import numpy as np

np.append([0,0], [1,1,1])       # append(array, values to append) gives [0, 0, 1, 1, 1]
np.insert([0,0], 1, [1,1,1])    # insert(array, position, values to insert) gives [0, 1, 1, 1, 0]
a = np.array([0,1])
np.tile(a, 3)                   # tile() repeats the whole array: [0,1] three times gives [0, 1, 0, 1, 0, 1]
np.tile(a, (2,3))               # 2 rows, each holding [0,1] repeated 3 times: shape (2, 6)
np.repeat(a, 2)                 # repeat every element twice
a = np.array([[1,2], [3,4]])
np.repeat(a, 2)                 # without axis, the array is flattened first
np.repeat(a, 2, axis=0)         # axis chooses the dimension to extend
np.repeat(a, 2, axis=1)
```
9. ### Copying arrays and deleting elements
- An array produced by slicing shares memory with the original array; an array produced by copy() does not.
```python=
import numpy as np

a = np.arange(5)
b = a[::2]
np.may_share_memory(a, b)
a[0] = -1                   # b shares memory with a, so b changes when a changes
a
b
a = np.arange(5)
c = a[::2].copy()           # c is a copy and does not share memory with a
np.may_share_memory(a, c)
a[0] = -1
a
c
a = np.arange(10)
index = [1, 3, 5]
np.delete(a, index)         # remove the elements at the given indices
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
np.delete(a, 1, axis=0)     # delete(array, index to remove, axis): removes row 1
np.delete(a, 1, axis=1)     # removes column 1
```

## 3. Data Visualization
+ References
  - [[資料分析&機器學習] 第2.5講：資料視覺化(Matplotlib, Seaborn, Plotly)](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC2-5%E8%AC%9B-%E8%B3%87%E6%96%99%E8%A6%96%E8%A6%BA%E5%8C%96-matplotlib-seaborn-plotly-75cd353d6d3f)

1. ### Matplotlib
- Line plot
```python=
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 9, 4, 8, 3]
plt.plot(x, y)
plt.xlabel('x label')    # label the axes
plt.ylabel('y label')
```
```python=
import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.random.randn(100))    # randn(100) draws 100 standard normal values (mean 0, standard deviation 1)
```
```python=
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)       # 100 points between 0 and 10
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x), 'bo')      # blue, circle markers
```
- Scatter plot
```python=
import matplotlib.pyplot as plt

x = [2, 1, 3, 4, 5]
y = [9, 2, 4, 8, 3]
plt.scatter(x, y)
```
- Bar plot
```python=
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 9, 4, 8, 3]
plt.bar(x, y)
```
- Histogram
```python=
import numpy as np
import matplotlib.pyplot as plt

normal_samples = np.random.normal(size=1000)    # 1000 standard normal samples (mean 0, standard deviation 1)
plt.hist(normal_samples, bins=100)              # bins is the number of bars in the histogram
```

## 4. Data Analysis with Pandas
+ Often described as Excel for Python, pandas can read JSON, CSV, Excel, HTML, and other formats. Its main data types are Series (one-dimensional) and DataFrame (two-dimensional, similar to a table).
+ Built on NumPy, it adds many functions for data manipulation and statistics, which simplifies NumPy-centric work.
+ References
  - [Python pandas Q&A video series](https://github.com/justmarkham/pandas-videos)
  - [Python資料分析（四）Pandas](https://medium.com/@allaboutdataanalysis/python%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E5%9B%9B-pandas-e2fdeb6808c1?postPublishedType=repub&fbclid=IwAR2RmaITrI5mBFAYQ4pZFZ3TLUbwUTIXs2XV3X4_TmBbs-v-O52prqXHRRk)
  - [[Python] Pandas 基礎教學](https://oranwind.org/python-pandas-ji-chu-jiao-xue/)
  - [Python Pandas 基本操作教學_成績表](https://medium.com/@weilihmen/python-pandas-%E5%9F%BA%E6%9C%AC%E6%93%8D%E4%BD%9C%E6%95%99%E5%AD%B8-%E6%88%90%E7%B8%BE%E8%A1%A8-f6d0ec4f89)
  - [[資料分析&機器學習] 第2.3講：Pandas 基本function介紹(Series, DataFrame, Selection,
Grouping)](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC2-3%E8%AC%9B-pandas-%E5%9F%BA%E6%9C%ACfunction%E4%BB%8B%E7%B4%B9-series-dataframe-selection-grouping-447a3fa90b60)
  - [莫煩Python](https://morvanzhou.github.io/)
  - [pandas 用法 (1)](http://violin-tao.blogspot.com/2017/06/pandas-1-indexing.html)、[pandas 用法 (2)](http://violin-tao.blogspot.com/2017/06/pandas-2-concat-merge.html)
  - [pandas 官方手冊](https://pandas.pydata.org/pandas-docs/stable/index.html)

1. ### Reading data with pandas
- Reading a CSV file
- Reading an HTML page
``` python
import pandas as pd

df = pd.read_html('http://rate.bot.com.tw/xrt?Lang=zh-TW')    # returns a list of DataFrames, one per table found
df[0]
```
2. ### DataFrame
- Creating a DataFrame: build it from a dictionary or an array, or read external data (CSV, databases, and so on).
- Basic DataFrame information

| Method | Description |
|:--------------|:-------------|
| df.info() | Summary of the table |
| df.shape | Number of rows and columns |
| df.columns | Column names |
| df.index | The index |
| df.head(n) | First n rows; defaults to 5 when n is omitted |
| df.tail(n) | Last n rows; defaults to 5 when n is omitted |
| df.describe() | Summary statistics |

- [Selecting data](https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/462517/)

| Selection method | Description |
| -------------| --------- |
| df['column'] | Select a column |
| df.column | Select a column |
| df.loc[] | Select by row and column labels |
| df.iloc[] | Select by integer position |
| df.ix[] | Select by label or position, a mix of loc and iloc (removed in pandas 1.0; use loc or iloc instead) |

- Filtering data: put the condition inside the square brackets
- Sorting data

| Sorting method | Description |
| ------------ | --------- |
| df.sort_index() | Sort by index; axis chooses between the row index and the column index |
| df.sort_values() | Sort by column values; by names the column and ascending chooses ascending or descending order |

:mega: [The axis parameter in pandas](https://stackoverflow.com/questions/25773245/ambiguity-in-pandas-dataframe-numpy-array-axis-definition)

:::info
EX_03: Upload [pandas_grades](https://drive.google.com/open?id=1kXda1ylxi5V8htWYQGc7BDA4m0OhfGwp) to Colab and practice basic DataFrame operations (selection, filtering, sorting, ...).
:::
```python=
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('pandas_grades.csv')
# Basic information
df.info()
df.shape
df.columns
df.index
df.head()
df.tail()
# Selecting data
# Columns
df['國文']
df[['國文','英文']].head()       # for several columns, pass a list of column names
# Rows
df[5:10]                         # rows at index 5 to 9
df.loc[5:10, ['國文','英文']]    # rows labelled 5 to 10 (loc includes the end label)
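# The selection above can also be written as df.loc[5:10][['國文','英文']]
df.loc[5:10, '國文':'數學']
df.iloc[5:11, 1:4]               # same as above, selected by position with iloc
df.國文                          # same as df['國文']
cg = df.國文.values
cg.mean()
cg.std()
df.國文.mean()
df.describe()                    # statistics for every column (mean, max, min, standard deviation, ...)
# Filtering data
test = df.iloc[:3]
test[ [True, False, True] ]      # a boolean list (same length as the data) keeps the rows marked True
mask = df['姓名'].str.contains('陳')
df[mask]
df[ df['數學'] >= 14 ]
df[ df.數學 == 15 ]
df[ (df.數學 == 15) & (df.英文 == 15) ]    # in pandas use & for and, | for or
# Plotting
df.國文.plot()
df.國文.hist(bins=15)
df.corr(numeric_only=True)       # correlation coefficients; numeric_only skips the 姓名 text column (pandas 1.5+)
df.國文.corr(df.數學)
df['總級分'] = df[['國文', '英文', '數學', '社會', '自然']].sum(axis=1)    # df['new column'] = df[list of columns to add up].sum(axis=1)
df.head()
df['主科'] = df.數學*1.5 + df.英文    # add a new column
df.head()
# Sorting
df.sort_values(by='總級分', ascending=False).head(20)
df.sort_values(by=['主科', '總級分'], ascending=False).head(20)
```
- [Merging and joining data](https://zhuanlan.zhihu.com/p/38184619)

| Merge method | Description |
| --------| --------- |
| concat | Concatenate several DataFrames vertically or horizontally |

- Deleting data

| Deletion method | Description |
| --------------| --------- |
| del df['column'] | Delete a single column (modifies df directly) |
| df.drop(['column1', 'column2'], axis=1) | Delete several columns; axis=1 drops columns and axis=0 drops rows. df is not modified unless inplace=True is passed |

- Handling missing values
  * Replacing missing values with the mean, median, mode, or random values gives only mediocre results, because it adds artificial noise to the data.
  * With a large dataset, the rows with missing values can simply be dropped.

| Missing-value method | Description |
| -------- | --------- |
| df.isnull() | Test whether each value is missing |
| df.notnull() | Test whether each value is present |
| df.dropna() | Drop the rows that contain NaN (Not a Number, a missing value) |
| df.fillna(value=0) | Replace NaN with 0 |

- Grouping data

| Grouping method | Description |
| ------------ | --------- |
| sector = df.groupby(by='column') | Group by the column; an aggregation such as sum, count, or mean can follow |
| sector.size() | Size of each group |
| sector.get_group('A') | Rows whose value in the grouping column is 'A' |

:::info
EX_04: Creating DataFrames, merging them, and handling NaN.
:::
```python=
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

mydata = np.random.randn(4,3)                      # 4x3 standard normal random values (mean 0, standard deviation 1)
df1 = pd.DataFrame(mydata, columns=list('ABC'))    # put the array into a DataFrame
df2 = pd.DataFrame(np.random.randn(3,3), columns=list('ABC'))
df3 = pd.concat([df1, df2], axis=0)
df3.index = range(7)
df4 = pd.concat([df1, df2], axis=1)
df4.columns = list('ABCDEF')
df4.info()                                         # columns D, E, F report 3 non-null values, so each holds one NaN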
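# df2 has only 3 rows, so the last row of columns D, E, F is NaN after the column-wise concat
df4.isnull()
df4.notnull()
df4.dropna()
df4.fillna(value=0)
values_to_fill = {'D': 1, 'E': 2, 'F': 3}    # fill column D with 1, column E with 2, column F with 3
df4.fillna(value=values_to_fill)
df4.mean()                                   # compute the mean of every column, then use it to fill the NaN
df4.fillna(df4.mean())
```
:::info
EX_05: Where in the United States are UFOs seen most often? (grouping)
:::
```python=
import pandas as pd

df = pd.read_csv('http://bit.ly/uforeports')
df.head()
sector = df.groupby(by='Shape Reported')    # group by 'Shape Reported'
sector.size()                               # size of each group
sector.get_group('DISK').head()             # rows whose 'Shape Reported' is 'DISK'
df_state = df.groupby(by='State').count()
df_state.sort_values(by='Time', ascending=False)                  # returns a sorted copy; df_state itself is unchanged
df_state.sort_values(by='Time', ascending=False, inplace=True)    # sorts df_state in place
df_state[:10].Time.plot(kind='bar')
```

## 5. Feature Standardization (Normalization) and Selection
+ Feature standardization rescales every feature into a comparable range. This can help gradient descent converge and can improve accuracy.
  - Some classifiers, such as KNN, compute distances between samples. If one feature has a much larger range than the others, it dominates the distance and distorts the result.
+ Feature selection picks the most useful subset of the original features, which can improve accuracy on high-dimensional data.

1. ### Common standardization methods
- Min-Max scaling
  * Rescales feature values proportionally into a fixed interval (0~1 or -1~1)
- Z-score standardization
  * Rescales feature values to mean 0 and standard deviation 1
- References
  * [機器學習：特徵標準化！](https://ithelp.ithome.com.tw/articles/10197357)
  * [【資料科學】 - 資料的正規化與標準化](https://aifreeblog.herokuapp.com/posts/54/data_science_203/)
  * [莫煩-正規化Normalization](https://morvanzhou.github.io/tutorials/machine-learning/sklearn/3-1-normalization/)
  * [4.3.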
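数据预处理](http://sklearn.lzjqsdd.com/modules/preprocessing.html)

:::info
EX_06: How does standardization affect SVM classification?
:::
```python=
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification    # sklearn.datasets.samples_generator was removed in scikit-learn 0.24
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt

# Generate data for a classification model
x, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           random_state=3, scale=100, n_clusters_per_class=1)
plt.scatter(x[:,0], x[:,1], c=y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
clf = SVC()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)
```
```python=
x_scaled = preprocessing.scale(x)    # standardize x
plt.scatter(x_scaled[:,0], x_scaled[:,1], c=y)
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2)
clf = SVC()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)
```
2. ### Feature Selection
- Removing features with low variance
  * Variance measures how spread out a feature is. Features with very small variance (almost all samples share one or two values) can usually be removed, because they explain little of the outcome.
  * If VarianceThreshold is created without a threshold argument, it removes only the columns whose variance is zero.
- Univariate feature selection
  * Univariate feature selection scores each feature with a univariate statistic (such as chi2) and then keeps features according to that score (for example with SelectKBest).
  * Scikit-learn wraps several selection methods, such as:
    (1) SelectKBest: compute the statistic for every feature and keep the k best features.
    (2) SelectPercentile: compute the statistic for every feature and keep the top given percentage.
  * SelectKBest offers three scoring functions for classification, plus one for regression:
    (1) chi2 (the chi-squared statistic χ2) tests whether two variables are independent. The larger the value, the more likely the two variables are related; the smaller, the less related.
    (2) f_classif (ANOVA F-value)
    (3) mutual_info_classif (mutual information for a discrete target)
    (4) f_regression, for regression tasks
- Selecting features with a model
  * Train a model first, then use the feature importances it computes to keep the features with the most explanatory power.
- References
  * [特徵選擇（feature selection）演算法筆記](https://www.itread01.com/content/1546133794.html)
  * [sklearn中的特征提取](http://d0evi1.com/sklearn/feature_selection/)
  * [1.13.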
特征选择(Feature selection)](http://sklearn.lzjqsdd.com/modules/feature_selection.html)

:::info
EX_07:
(1) Set a variance threshold and keep the best features.
(2) Use the chi-squared test to keep the 2 best features.
(3) Select features with a model.
:::
```python=
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier    # extremely randomized trees, used here for feature selection

iris = load_iris()
x, y = iris.data, iris.target
x.shape
sel = VarianceThreshold(threshold=0.5)    # features whose variance is not above this threshold are removed
x_sel_1 = sel.fit_transform(x)            # filter with the VarianceThreshold created above
x_sel_1.shape                             # shape after filtering: the remaining features
```
```python=
x_sel_2 = SelectKBest(chi2, k=2).fit_transform(x, y)
x_sel_2.shape
```
```python=
clf = ExtraTreesClassifier(n_estimators=10)    # number of decision trees
clf = clf.fit(x, y)                            # train the model
clf.feature_importances_                       # after training, an array scoring each feature; larger means more important
clf.feature_importances_.mean()                # mean feature importance
# Create a SelectFromModel selector from the trained trees. The model is already fitted, so prefit must be True.
# With no threshold argument, the mean importance is used as the threshold.
model = SelectFromModel(clf, prefit=True)
x_sel_3 = model.transform(x)                   # filter the features
x_sel_3.shape
```
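The idea behind the variance threshold in EX_07 can be reproduced by hand with NumPy. This is a minimal sketch of the concept on made-up toy data, not scikit-learn's implementation. It assumes the rule "keep a feature only when its population variance is above the threshold".

```python
import numpy as np

# Toy data: 5 samples, 3 features (values invented for illustration).
x = np.array([[1.0, 0.0, 10.0],
              [2.0, 0.0, 10.1],
              [3.0, 0.0,  9.9],
              [4.0, 0.0, 10.0],
              [5.0, 0.0, 10.2]])

threshold = 0.5
variances = x.var(axis=0)      # population variance of each column
keep = variances > threshold   # True for the features that survive
x_selected = x[:, keep]        # keep only the surviving columns

print(variances)
print(x_selected.shape)
```

The second column is constant and the third barely varies, so only the first feature remains. Note that variance depends on the unit of measurement, so a threshold chosen for one scale does not carry over to rescaled data.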