Data Processing and Visualization:

# Data Processing and Visualization:

[Related site](https://hackmd.io/@cube/Bk9bwQppN)
[Pandas python](https://medium.com/@allaboutdataanalysis/python%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E5%9B%9B-pandas-e2fdeb6808c1)

## Matplotlib:
- Import the Matplotlib module:
```python=
import matplotlib.pyplot as plt
```
Matplotlib is mainly used to draw x-y plots. The x and y coordinates are passed to Matplotlib as sequences of equal length, such as lists.

- plot syntax (draws an x-y line plot)
```python=
# listx: list of x coordinates, listy: list of y coordinates
plt.plot(listx, listy)
plt.show()  # display the figure
```
- example:
```python=
import matplotlib.pyplot as plt

listx = [1,5,7,9,13,16]
listy = [15,50,80,40,70,50]
plt.plot(listx, listy)
plt.show()
```
- Parameter settings:
    - color : line color
    - linewidth (lw) : line width
    - linestyle (ls) : solid (-), dashed (--), dash-dot (-.), dotted (:)
    - label : name shown in the legend; must be combined with legend().
```python
import matplotlib.pyplot as plt

listx = [1,5,7,9,13,16]
listy = [15,50,80,40,70,50]
# single and double quotes both simply delimit a string
plt.plot(listx, listy, color='red', lw=5.0, ls='--', label='food')
plt.legend()
plt.show()
```

- Drawing several plots in the same figure:
    - As in Matlab, issue all the plot commands first and show the figure once at the end
    - title : figure title
    - xlabel : x-axis label
    - ylabel : y-axis label
    - xlim : x-axis range
    - ylim : y-axis range
```python=
import matplotlib.pyplot as plt

listx = [1,5,7,9,13,16]
listy = [15,50,80,40,70,50]
plt.plot(listx, listy, color='red', lw=5.0, ls='--', label='class A')

listx1 = [2,6,8,11,14,16]
listy1 = [10,40,30,50,80,60]
plt.plot(listx1, listy1, color='blue', lw=5.0, ls='-.', label='class B')

# title, axis labels, limits and legend only need to be set once
plt.title("class score")
plt.xlabel("school number")
plt.ylabel("score")
plt.xlim(0,20)
plt.ylim(0,100)
plt.legend()
plt.show()
```
- Other plot types:
    - plt.bar: bar chart
```python=
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 9, 4, 8, 3]
plt.bar(x, y)
plt.show()
```
    - plt.scatter: scatter plot
```python=
import matplotlib.pyplot as plt

x = [2, 1, 3, 4, 5]
y = [9, 2, 4, 8, 3]
plt.scatter(x, y)
plt.show()
```

## Numpy :
[Notes on the python numpy axis concept](http://changtw-blog.logdown.com/posts/895468-python-numpy-axis-concept-organize-notes)
#### NumPy's ndarray: the multidimensional array object
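NumPy's core data structure is an n-dimensional array object called ndarray. It lets you run mathematical operations on a whole block of data using the same syntax as operations between scalar values.
```python=
import numpy as np

array1 = np.array([[1,2,3],[4,5,6]])
```
- Important attributes:
    - ndarray.ndim : number of dimensions of the ndarray
    - ndarray.shape : size of each dimension, returned as a tuple
    - ndarray.size : total number of elements; equal to the product of the entries of ndarray.shape
    - ndarray.dtype : type of the elements stored in the ndarray
    - ndarray.itemsize : size of each element in bytes (e.g. int16 => 16/8 = 2 bytes)
    - ndarray.data : the buffer holding the actual array elements. This attribute is rarely needed, because the elements can be accessed by index.
```python=
import numpy as np

array1 = np.array([[1,2,3],[4,5,6]])
print(array1)
print("dim is : ", array1.ndim)
print("size is ", array1.size)
print("shape is " , array1.shape)
print("type is " , array1.dtype)
```
- Specifying the element type:
    - (module).array([data], dtype = module.type), e.g. np.float64
```python=
import numpy as np

array1 = np.array([[1,2,3],[4,5,6]], dtype=np.float64)
```
### NumPy dimensions:
[3-D arrays](https://ithelp.ithome.com.tw/articles/10215056)
- Declaring NumPy arrays:
```python=
import numpy as np

# example 1
array1 = np.array([[(1,2,3)],[(1,4,6)],[(5,8,9)]], dtype=np.float64)
print(array1)
print("dim is : ", array1.ndim)
print("size is ", array1.size)
print("shape is " , array1.shape)
print("type is " , array1.dtype)
print(array1[1][0][2])  # indexes are given as array1[axis 0][axis 1][axis 2]

# example 2
array2 = np.array([(0,1,2,3,4),(1,2,3,4,5),(5,6,7,8,9)])  # () can be replaced by []; the result is the same
print("dim2 is : ", array2.ndim)
print("size2 is ", array2.size)
print("shape2 is " , array2.shape)
print("type2 is " , array2.dtype)
print(array2[2][4])
```
In these examples array1 is a 3-D array while array2 is a 2-D array. The difference comes from how deeply the brackets are nested in the declaration:
```python=
# Schematic only (not runnable): each nesting level corresponds to one axis
# np.array([ [ (x, y, z) ], ... ])  -> outer [] = axis 0, inner [] = axis 1, () = axis 2
```
The outermost [] declares the array itself, the next level of [] lists the entries along the first dimension, and so on inward. As a picture:
![](https://i.imgur.com/osRo6WQ.png)
In short, a plain 2-D array can be declared directly as in example 2. A 3-D array needs one more level of nesting, such as [()], to say what the elements of each dimension are.
- Converting the element type: astype
```python=
array2 = array2.astype(np.float64)  # astype returns a new array, so remember to assign it back
print("type2 is " , array2.dtype)
```
### NumPy statistics and math functions:
```python=
a = np.array([3,4,5,1,2])
np.sort(a)          # sorted copy
a.sum()             # sum, same as np.sum(a)
a.mean()            # mean
a.max()             # maximum
a.min()             # minimum
a.std()             # standard deviation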
np.mod(5,2)         # remainder
np.sin(np.pi/2)     # sine
np.log(100)         # natural logarithm

a = np.array([[1,2,3], [4,5,6]])
a.sum()             # 21; axis=None (the default) reduces over all elements, same as np.sum(a)
a.sum(axis=0)       # [1+4, 2+5, 3+6], same as np.sum(a, axis=0)
a.sum(axis=1)       # [1+2+3, 4+5+6], same as np.sum(a, axis=1)
```
```
ndarray[:, 0:2] keeps every entry along axis 0, but only entries 0 and 1 along axis 1
```
![](https://i.imgur.com/r9FhXFQ.png)
For a 2-D array:
- array.sum(axis = 0) = ndarray[0, :] + ndarray[1, :] + ndarray[2, :] + ...
- array.sum(axis = 1) = ndarray[:, 0] + ndarray[:, 1] + ndarray[:, 2] + ...
```python=
array2 = np.array([(0,1,2,3,4),(1,2,3,4,5),(5,6,7,8,9)])
first_axis_sum = array2.sum(axis=1)
print("sum is {0}".format(first_axis_sum))
# sum is [10 15 35]
```
![](https://i.imgur.com/lMneKAv.png)
```python=
import numpy as np

array1 = np.array([[(1,2,3)],[(1,4,6)],[(5,8,9)],[(8,9,10)]], dtype=np.float64)
print(array1)
print("dim is : ", array1.ndim)
print("size is ", array1.size)
print("shape is " , array1.shape)
print("type is " , array1.dtype)
print(array1[3][0][2])

array2 = np.array([[(6,15,18),(2,5,6),(4,10,12),(1,8,5)],[(7,15,18),(9,5,6),(1,10,12),(5,8,5)]])
# this definition was missing in the original; summing array2 along axis 1 reproduces the printed values
array_sum_axis2 = array2.sum(axis=1)
array_sum_axis2_mean0 = array_sum_axis2[0]/3
print("axis 2 =>0", array_sum_axis2)
print("dim2 is : ", array2.ndim)
print("size2 is ", array2.size)
print("shape2 is " , array2.shape)
print("type2 is " , array2.dtype)
array2 = array2.astype(np.float64)
print("type2 is " , array2.dtype)
first_axis_sum = array2.sum(axis=2)
print("sum is {0}".format(first_axis_sum))

# Output:
# [[[ 1.  2.  3.]]
#  [[ 1.  4.  6.]]
#  [[ 5.  8.  9.]]
#  [[ 8.  9. 10.]]]
# dim is :  3
# size is  12
# shape is  (4, 1, 3)
# type is  float64
# 10.0
# axis 2 =>0 [[13 38 41]
#  [22 38 41]]
# dim2 is :  3
# size2 is  24
# shape2 is  (2, 4, 3)
# type2 is  int32   (the default integer type is platform dependent; int64 is also common)
# type2 is  float64
# sum is [[39. 13. 26. 14.]
#  [40. 20. 23. 18.]]
```
Summing along axis 2 adds the entries along that axis, from index 0 through the last index.

## Summary:
Suppose we have an array like this:
np.array([[(6,15,18),(2,5,6),(4,10,12),(1,8,5)],[(7,15,18),(9,5,6),(1,10,12),(5,8,5)]])
- axis = 0 : look at the outermost [] and count the comma-separated [] groups inside it; that is the number of rows (2 here)
- axis = 1 : then count the [] or () inside each of those groups; that is the number of columns (4 here)
- axis = 2 : finally count the elements inside the innermost group; that is the number of pages (3 here)

### Quick ways to generate arrays:
- np.linspace(start, stop, num, endpoint)
    * Produces num evenly spaced elements over the range start~stop. With endpoint=False, stop is not included.
- np.arange(start, stop, step)
    * Generates elements over the range start~stop, spaced by step
    * Similar to Python's range, but returns an ndarray (in Python 3, range returns a range object)
- np.random.randint(low, high, size)
    * Produces the requested number of random integers from low (inclusive) to high (exclusive)
```python=
import numpy as np

np.zeros(3)                     # 1-D array of 3 zeros
np.zeros((3,5))                 # 3x5 array of zeros
np.zeros((3,5), dtype=int)      # 3x5 array of integer zeros
np.ones(3)                      # 1-D array of 3 ones
np.full((3,5), 7)               # 3x5 array filled with 7
np.eye(3)                       # 3x3 identity matrix
np.diag([1,2,3])                # 3x3 array with 1,2,3 on the diagonal
np.random.randint(0, 10, 9)     # 1-D array of 9 random integers from 0 to 9
np.random.randint(0, 10, (3,3)) # 3x3 array of random integers from 0 to 9
np.random.normal(0, 1, (3,3))   # 3x3 array drawn from a normal distribution, mean 0, standard deviation 1
np.arange(3)                    # [0 1 2]
np.arange(10, 20, 2)            # [10 12 14 16 18]
np.linspace(0, 10, 6)           # 6 evenly spaced points from 0 to 10
xy = [[x, y] for x in range(5) for y in range(3)]
np.array(xy)                    # array of shape (15, 2)
```
A 1-D array produced by linspace can be turned into a 2-D array with reshape, for example:
```python=
array4 = np.linspace(0,50,15).reshape(3,5)
```
### Matrix operations:
Before using these operations, make sure the shapes of the arrays are compatible.
```python=
import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9])
np.cumsum(a)        # cumulative sum of the elements of a
np.diff(a)          # differences between consecutive elements
np.clip(a,3,8)      # limit values to 3~8: above 8 becomes 8, below 3 becomes 3

a = np.array([[1,2], [3,4]])
b = np.array([[5,6], [7,8]])
np.dot(a, b)        # matrix product, same as a.dot(b)
a.T                 # transpose, same as np.transpose(a)
a_rev = np.linalg.inv(a)  # inverse matrix
np.dot(a, a_rev)    # identity matrix, up to floating-point error
np.vstack((a, b))   # stack vertically
#[[1 2]
# [3 4]
# [5 6]
# [7 8]]
np.hstack((a, b))   # stack horizontally
# [[1 2 5 6]
#  [3 4 7 8]]

a = np.array([1,2,3])
b = np.array([4,5,6])
z = np.c_[a, b]     # NumPy's counterpart to zip: np.c_ treats 1-D inputs as columns and joins them along axis 1
# array([[1, 4],
#        [2, 5],
#        [3, 6]])
np.r_[a, b]         # np.r_ joins along axis 0: array([1, 2, 3, 4, 5, 6])
a = z[:,0]          # unzip
b = z[:,1]

a = np.arange(9)
np.split(a, 3)      # split into 3 equal parts
a = np.arange(8)
np.array_split(a, 3)  # split requires an even division and raises ValueError otherwise; array_split does not
```
```python=
a = np.array([[1,2,3]])  # the extra [] makes this a 2-D array of shape (1, 3)
print("shape is " , a.shape)
b = np.array([[4,5,6]])
z = np.c_[a, b]
print(z)
a_1 = np.array([[1,2], [3,4]])
b_1 = np.array([[5,6], [7,8]])
z_1_axis_1 = np.c_[a_1, b_1]
print(z_1_axis_1)
z_2_axis_0 = np.r_[a_1, b_1]  # joined along axis 0, so the result grows in the axis 0 direction
print(z_2_axis_0)
# z_1_axis_1 =
# [[1 2 5 6]
#  [3 4 7 8]]
# z_2_axis_0 =
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]
```
- np.split: the result is a list of arrays, not a single array. If a single array is needed, convert the list with np.array (possible when the pieces have equal size).
```python=
a = np.arange(9)
np.split(a, 3)      # split into 3 parts
a = np.arange(8)
a_split = np.array_split(a, 3)
print(type(a_split))
# <class 'list'>
```

# Pandas :
Python's built-in list can hold values of several different types at once. As the figure below shows, this list contains a string, an int and a float. For the machine, however, a single consistent type is better for both speed and memory efficiency.
![](https://i.imgur.com/GIlNlbc.png)
A NumPy array forces all of its elements to be converted to the same type.
![](https://i.imgur.com/MSlLqzz.png)
Pandas has two data structures:
1. Series: a column (one-dimensional)
2. DataFrame: a table (two-dimensional)

## Pandas one-dimensional data structure: Series
A Pandas Series is similar to a list, but you can specify its index. For example:
```python=
import pandas as pd

array1 = []
name = ['Jeff','kurt','Jia','Mark']
for i in range(len(name)):
    array1.append(name[i])
    print(i)
print(array1)
pd_1 = pd.Series([1,2,3,4], index=array1)  # a custom index must provide one label for every element
print(pd_1)
```
:::warning
### Side note: how "=" differs between Python and C
```python=
a = []
try:
    for i in range(1, 5):
        a[i-1] = i
except IndexError as err:
    print(err)  # list assignment index out of range
# In Python, index assignment with "=" can only modify items that already
# exist in a list; it cannot add new elements.
# There are four ways to add new elements:
# append(), extend(), insert(), and the + operator
# For example:
array1 = []
name = ['Jeff','kurt','Jia','Mark']
for i in range(len(name)):
    array1.append(name[i])
    print(i)
print(array1)
```
:::
A Series can be thought of as a fixed-length, ordered dictionary: a mapping from index labels to data values (it can be used in many functions that expect a dictionary argument).
**If the data is already stored in a Python dictionary, a Series can be created directly from it:**
```python=
import pandas as pd

Alcohol = {'Jeff':'Vodka',"Brian":'NA','Michel': 'Whisky'}  # a dict is created with { }, a list with []
# print(Alcohol)
print(type(Alcohol))  # <class 'dict'>
Alcohol = pd.Series(Alcohol)
print(Alcohol)
print(type(Alcohol))  # <class 'pandas.core.series.Series'>
```
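The dictionary analogy above can be sketched as follows. This is only an illustration under assumed data: the `prices` Series and its labels are hypothetical and do not come from the notes.

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series({"tea": 30, "coffee": 45, "juice": 50})

# Dictionary-style membership test and lookup operate on the index labels
has_tea = "tea" in prices
coffee_price = prices["coffee"]

# Unlike a plain dict, a Series also supports positional access
# and arithmetic on all values at once
first_price = prices.iloc[0]
discounted = prices * 0.9

# A Series can be converted back into a dictionary
as_dict = prices.to_dict()

print(has_tea, coffee_price, first_price)
print(discounted)
print(as_dict)
```

The membership test checks the labels rather than the values, which is the same behaviour a dictionary has for its keys.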
The Series you create carries an index that labels each data point. Compared with a plain NumPy array, you can use those labels to select a single value or a group of values from the Series:
```python=
import pandas as pd

Alcohol = {'Jeff':'Vodka',"Brian":'NA','Michel': 'Whisky'}  # a dict is created with { }, a list with []
# print(Alcohol)
print(type(Alcohol))  # <class 'dict'>
Alcohol = pd.Series(Alcohol)
print(Alcohol)
print(type(Alcohol))
print(Alcohol[['Jeff','Brian']])  # to select several labels, pass them as a list
```
The printed outputs in these notes were recorded with an older pandas release that sorted dictionary keys. Recent releases keep insertion order, so the row and column order you see may differ.

One of the most important features of a Series is that **arithmetic automatically aligns data by index label:**
```python=
sdata = {'P':3500 , 'I':2700,'D':1000,'P_1':1600,'I_1':5000}
sdata2 = ['P','I','D','D_1','I_2']  # a list, so the index order is well defined
obj1 = pd.Series(sdata)
obj2 = pd.Series(sdata, index=sdata2)
print(obj1)
print('***')
print(obj2)
print("result is ", obj1 + obj2)
# Labels are aligned automatically; a label present on only one side gives NaN
"""
D      1000
I      2700
I_1    5000
P      3500
P_1    1600
dtype: int64
***
P      3500.0
I      2700.0
D      1000.0
D_1       NaN
I_2       NaN
dtype: float64
result is  D      2000.0
D_1       NaN
I      5400.0
I_1       NaN   # labels that do not match are NaN as well
I_2       NaN
P      7000.0
P_1       NaN
dtype: float64
"""
```
- The index of a Series can be modified in place by assignment:
```python=
name = ['Jeff','kurt','Jia','Mark']
Alcohol = {'Jeff':'Vodka',"Brian":'NA','Michel': 'Whisky','Elle':'All'}
# print(Alcohol)
print(type(Alcohol))  # <class 'dict'>
Alcohol = pd.Series(Alcohol)
print(Alcohol)
print(type(Alcohol))
print(Alcohol[['Jeff','Brian']])  # to select several labels, pass them as a list
Alcohol.index = name  # the new index must have the same length as the Series
print(Alcohol)
"""
Brian         NA
Elle         All
Jeff       Vodka
Michel    Whisky
dtype: object
<class 'pandas.core.series.Series'>
Jeff     Vodka
Brian       NA
dtype: object
Jeff        NA
kurt       All
Jia      Vodka
Mark    Whisky
dtype: object
"""
```
### Pandas two-dimensional data structure: DataFrame
A DataFrame is a table-like data structure with both a row index and a column index. Row-oriented and column-oriented operations are treated roughly symmetrically. Internally the data is stored as one or more two-dimensional blocks, and with hierarchical indexing a DataFrame can also represent higher-dimensional data. To build a DataFrame, pass in a dictionary of equal-length lists or NumPy arrays.
![](https://i.imgur.com/J8WYnlJ.png)
```python=
import pandas as pd

# named data rather than dict, to avoid shadowing the built-in dict
data = {'name': ['Jeff','Elsa','Michel','Deff'],
        'year': [1955,1996,1992,1988],
        'Food': ['chicken','beef','bacon','vegetable']}
pd_1 = pd.DataFrame(data)
pd_1['debt'] = 16.5        # add a column
print(pd_1)
pd_series = pd_1['name']   # selecting a column gives a Series
print(type(pd_series))
print(pd_1.year)
```
- A DataFrame can be created from a dictionary and then modified by assignment:
```python=
pd_1['debt'] = 16.5  # add a column
'''
        Food    name  year  debt
0    chicken    Jeff  1955  16.5
1       beef    Elsa  1996  16.5
2      bacon  Michel  1992  16.5
3  vegetable    Deff  1988  16.5
'''
```
- Selecting a column creates a new Series:
```python=
pd_series = pd_1['name']  # selecting a column gives a Series
print(type(pd_series))
'''
<class 'pandas.core.series.Series'>
'''
```
- The order of the columns can also be specified:
```python=
data = {'name': ['Jeff','Elsa','Michel','Deff'],
        'year': [1955,1996,1992,1988],
        'Food': ['chicken','beef','bacon','vegetable']}
pd_1 = pd.DataFrame(data, columns=['year','Food','name'])
pd_1['debt'] = 16.5  # added so that the code matches the printed table
print(pd_1)
'''
   year       Food    name  debt
0  1955    chicken    Jeff  16.5
1  1996       beef    Elsa  16.5
2  1992      bacon  Michel  16.5
3  1988  vegetable    Deff  16.5
'''
```
- The index can also be declared when the DataFrame is created:
```python=
import pandas as pd

index_1 = ['one','two','three','four']
data = {'name': ['Jeff','Elsa','Michel','Deff'],
        'year': [1955,1996,1992,1988],
        'Food': ['chicken','beef','bacon','vegetable']}
pd_1 = pd.DataFrame(data, columns=['year','Food','name'], index=index_1)
pd_1['debt'] = 16.5  # added so that the code matches the printed table
print(pd_1)
'''
       year       Food    name  debt
one    1955    chicken    Jeff  16.5
two    1996       beef    Elsa  16.5
three  1992      bacon  Michel  16.5
four   1988  vegetable    Deff  16.5
'''
```
- When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is matched exactly against the DataFrame's index and every gap is filled with a missing value. **Note that this also overwrites the previous values in those rows with missing values (NaN)**:
```python=
import pandas as pd

index_1 = ['one','two','three','four']
data = {'name': ['Jeff','Elsa','Michel','Deff'],
        'year': [1955,1996,1992,1988],
        'Food': ['chicken','beef','bacon','vegetable']}
pd_1 = pd.DataFrame(data, columns=['year','Food','name'], index=index_1)
pd_1['debt'] = 16.5        # add a column
print(pd_1)
pd_series = pd_1['name']   # selecting a column gives a Series
print(type(pd_series))
print(pd_1.year)
val = pd.Series([-1.5,-1.6], index=['one','four'])
pd_1['debt'] = val         # rows 'two' and 'three' have no match in val
print(pd_1)
'''
old:
       year       Food    name  debt
one    1955    chicken    Jeff  16.5
two    1996       beef    Elsa  16.5
three  1992      bacon  Michel  16.5
four   1988  vegetable    Deff  16.5
new:
       year       Food    name  debt
one    1955    chicken    Jeff  -1.5
two    1996       beef    Elsa   NaN
three  1992      bacon  Michel   NaN
four   1988  vegetable    Deff  -1.6
'''
```
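If the intent is to overwrite only the rows that the Series actually covers, one possible approach is to assign through `.loc` using the Series' own index. The sketch below assumes a small hypothetical frame (`df`, `val`, `replaced` and `patched` are illustrative names, not from the notes) and contrasts the two behaviours:

```python
import pandas as pd

# Hypothetical frame, used only for illustration
df = pd.DataFrame(
    {"name": ["Jeff", "Elsa", "Michel", "Deff"], "debt": [16.5, 16.5, 16.5, 16.5]},
    index=["one", "two", "three", "four"],
)
val = pd.Series([-1.5, -1.6], index=["one", "four"])

# Assigning the Series to the whole column replaces every row;
# rows without a matching label become NaN
replaced = df.copy()
replaced["debt"] = val

# Assigning only to the rows named in val.index leaves the other rows untouched
patched = df.copy()
patched.loc[val.index, "debt"] = val

print(replaced)
print(patched)
```

Which of the two is appropriate depends on whether the unmatched rows are meant to be cleared or kept.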
### DataFrame
- Creating a DataFrame: it can be built from a dictionary or an array, or by reading external data (CSV files, databases, and so on).
- Basic DataFrame information

| Method or attribute | Description |
|:--------------|:-------------|
| df.info() | Summary information about the table |
| df.shape | Number of rows and columns |
| df.columns | Column names |
| df.index | The index |
| df.head(n) | First n rows; defaults to 5 when n is omitted |
| df.tail(n) | Last n rows; defaults to 5 when n is omitted |
| df.describe() | Summary statistics (max, min, std, ...) |
| df.values | The data of the DataFrame as a two-dimensional ndarray |

```python=
import numpy as np
import pandas as pd

array1 = np.random.randint(60,90,7)
array2 = np.random.randint(170,184,7)
array3 = np.arange(1,8,1)
dict_1 = {'weigh': array1,
          'heigh': array2,
          'name': ['Jeff','Elsa','Michel','Dell','Connie','hello','frgo']}
pd_2 = pd.DataFrame(dict_1, columns=['name','heigh','weigh'], index=array3)
print(pd_2)
index = pd_2.index
value = pd_2.values
print(value)
print(pd_2.describe())
# The data are random, so the numbers change on every run
'''
pd_2:
     name  heigh  weigh
1    Jeff    182     72
2    Elsa    178     84
3  Michel    181     85
4    Dell    174     82
5  Connie    178     60
6   hello    179     84
7    frgo    178     86
describe:
            heigh      weigh
count    7.000000   7.000000
mean   174.714286  71.714286
std      3.817254   8.976159
min    171.000000  62.000000
25%    172.000000  64.000000
50%    174.000000  72.000000
75%    176.000000  77.000000
max    182.000000  86.000000
'''
```
## DataFrame basics:
### reindex:
reindex creates a new object that conforms to a new index (it reorders or selects existing labels; it does not rename them). For a DataFrame, reindex can change the row index, the column index, or both. If only a single sequence is passed, the rows are reindexed:
- Changing the index: if a new index label does not exist among the original labels, the row is filled with NaN
```python=
pd_2 = pd_2.reindex(index=np.random.randint(0,10,7))  # change the index
print(pd_2)
'''
     name  heigh  weigh
3  Michel  181.0   85.0
1    Jeff  182.0   72.0
7    frgo  178.0   86.0
2    Elsa  178.0   84.0
5  Connie  178.0   60.0
9     NaN    NaN    NaN
7    frgo  178.0   86.0
'''
```
- When NaN appears, `fill_value` can be used to fill in the empty values.
```python=
pd_2 = pd_2.reindex(index=np.random.randint(0,10,7), fill_value=0)  # change the index
'''
4   Dell  173  60
7   frgo  173  74
6  hello  174  61
2   Elsa  182  85
6  hello  174  61
7   frgo  173  74
9      0    0   0
'''
```
- When NaN appears, `method='ffill'` (forward fill) can also be used: if a new index label does not exist, the values from the previous existing label are carried forward. This requires the original index to be monotonically increasing or decreasing.
- Changing the columns:
    - Add a column from a Series
    - Add a column from a condition
```python=
pd_2['tall'] = pd_2['heigh'] > 175
# Remember to pass the index; otherwise rows whose labels do not match are left as NaN
pd_2['BMI'] = pd.Series(np.random.randint(20,28,7), index=array3)
# reindex only reorders or selects existing labels; it cannot give the rows different labels
pd_2 = pd_2.reindex(index=array3)
'''
     name  heigh  weigh   tall  BMI
1    Jeff    179     62   True   21
2    Elsa    170     74  False   23
3  Michel    174     80  False   21
4    Dell    177     86   True   26
5  Connie    175     78  False   23
6   hello    170     78  False   21
7    frgo    170     78  False   24
'''
```
- Indexing a DataFrame with a list of names selects several columns:
```python=
pd_only_heigh_weigh = pd_2[['name','weigh','heigh']]  # select specific columns into another DataFrame
pd_only_heigh_weigh = pd.DataFrame(pd_only_heigh_weigh, columns=['weigh','name','heigh'])
'''
   weigh    name  heigh
0     72    Jeff    170
1     85    Elsa    182
2     75  Michel    177
3     61    Dell    171
4     73  Connie    172
5     75   hello    175
6     69    frgo    177
'''
```
- To process the values of a selected column one row at a time, look each value up by its index label:
```python=
pd_only_heigh_weigh = pd_2[['name','weigh','heigh']]  # select specific columns into another DataFrame
pd_only_heigh_weigh = pd.DataFrame(pd_only_heigh_weigh, columns=['weigh','name','heigh'])  # the column order can be redefined
print(pd_only_heigh_weigh)
heigh_a = 0
for i in range(0,6):  # labels 0..5; the last row (label 6) is not visited
    if pd_only_heigh_weigh.loc[i, 'heigh'] >= 175:
        heigh_a += 1
    else:
        print('shorter than 175')
print('higher than 175', heigh_a)
'''
   weigh    name  heigh
0     72    Jeff    170
1     85    Elsa    182
2     75  Michel    177
3     61    Dell    171
4     73  Connie    172
5     75   hello    175
6     69    frgo    177
shorter than 175
shorter than 175
shorter than 175
higher than 175 3
'''
```
- Selecting data by label and by position. The original notes used `ix`, which was deprecated in pandas 0.20 and removed in pandas 1.0; the examples below use `.loc` (label-based) and `.iloc` (position-based) instead:
```python=
mask = pd_only_heigh_weigh.heigh > 175
print(pd_only_heigh_weigh.loc[mask].iloc[:, :3])  # :3 keeps the first 3 columns; :2 would keep the first 2
"""
   weigh    name  heigh
1     78    Elsa    177
3     62    Dell    180
4     67  Connie    178
6     66    frgo    183
"""
data = [[1,2,3],[4,5,6]]
index = ['d','e']
columns = ['a','b','c']
df = pd.DataFrame(data=data, index=index, columns=columns)
print(df.iloc[:3][['a','c']])  # first 3 rows by position (only 2 exist), columns 'a' and 'c' by label
"""
   a  c
d  1  3
e  4  6
"""
print(df.loc[['d']].iloc[:, :2])  # row 'd' by label, first 2 columns by position
"""
   a  b
d  1  2
"""
```
- Renaming columns or index labels: rename() takes a mapping from each original index or column label to the new label:
```python=
pd_only_heigh_weigh = pd_only_heigh_weigh.rename(index={0:'a',1:'b',2:'c',3:'d',4:'e',5:'f'})
print(pd_only_heigh_weigh)
# rename(index={original: new name, ...})
'''
   weigh    name  heigh
a     65    Jeff    172
b     71    Elsa    182
c     76  Michel    178
d     66    Dell    177
e     65  Connie    182
f     74   hello    179
6     64    frgo    170
'''
# columns work the same way
```
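
As the closing comment says, columns can be renamed the same way as index labels. A minimal sketch, assuming a small hypothetical frame (the data and the new names `height_cm` and `weight_kg` are illustrative only):

```python
import pandas as pd

# Hypothetical frame, used only for illustration
df = pd.DataFrame({"name": ["Jeff", "Elsa"], "heigh": [172, 182], "weigh": [65, 71]})

# rename returns a new DataFrame; labels missing from the mapping stay unchanged
renamed = df.rename(
    columns={"heigh": "height_cm", "weigh": "weight_kg"},
    index={0: "a", 1: "b"},
)

print(renamed)
print(list(df.columns))  # the original frame keeps its column names
```

Because rename returns a new object, the result has to be assigned back (as the notes do with `pd_only_heigh_weigh`) for the change to persist.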