Advanced Python Techniques

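# Advanced Python Techniques

[![](https://img.shields.io/badge/dynamic/json?color=orange&label=總觀看人數&query=%24.viewcount&url=https://hackmd.io/Q9gE7WIhSVCzvjCbqerCgg%2Finfo)]()
> [name=AndyChiang][time=Tue, Feb 2, 2021 5:41 PM][color=#00CDAF]
###### tags: `Python`

> My Python notes grew too long, so the material on modules has been moved into this note.

## Python pip

The earlier note did not explain pip clearly enough, so here is a supplement; see also this [installation tutorial (in Chinese)](https://www.maxlist.xyz/2019/07/13/pip-install-python/). Before downloading anything, check whether pip is already installed:

```
python -m pip --version
```

If a version number is printed, pip is already installed. If not, follow the steps on the [pip website](https://pip.pypa.io/en/stable/installing/). When installing a package with pip, make sure you run it from the right path:

```
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install beautifulsoup4
```

## Python NumPy

NumPy is a Python module for numerical computing. The array type it defines (`ndarray`) can be up to 50 times faster than a regular Python list.

### Importing NumPy

If Python and pip are already installed on your system, installing NumPy is easy.

```
# CMD
C:\Users\Your Name>pip install numpy
```

Once installed, the module can be imported.

```
import numpy as np
```

By convention, numpy is abbreviated to np.

### Creating an ndarray

Use the `array()` function. It accepts any array-like object (such as a list or tuple) and returns an `ndarray`.

```
arr = np.array([1, 2, 3, 4, 5])
```

### Array dimensions

#### 1-D array

```
arr = np.array([1, 2, 3, 4, 5])
```

#### 2-D array

```
arr = np.array([[1, 2, 3], [4, 5, 6]])
```

#### 3-D array

```
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
```

#### Checking the number of dimensions

```
d = arr.ndim
```

### Accessing elements

This works much like indexing a list; only arrays with two or more dimensions are different.

```
arr1[2]     # 1-D
arr2[1, 5]  # 2-D
```

### Slicing arrays

#### 1-D array

```
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
```

This slices from index 1 up to index 4 (index 5 is excluded) with a step of 2.

#### 2-D array

```
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])
```

This takes rows 0 to 1 of the 2-D array, then takes indices 1 to 3 from each row, so the result is a 2x3 array (two rows of length three).

### NumPy data types

* i: integer
* f: float
* S: string
* more...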
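As a small sketch of the type codes listed above, the same codes can be passed to `np.array()` through its `dtype` argument when the array is created. The variable names here are only illustrative.

```python
import numpy as np

# Ask for a specific type code when creating the array.
ints = np.array([1, 2, 3], dtype='i')     # integer
floats = np.array([1, 2, 3], dtype='f')   # single-precision float
strings = np.array([1, 2, 3], dtype='S')  # byte strings

# dtype.kind reports the one-letter category of each array's type.
print(ints.dtype.kind)     # i
print(floats.dtype.kind)   # f
print(strings.dtype.kind)  # S
print(strings)             # [b'1' b'2' b'3']
```

Note that `'S'` produces byte strings, which is why the printed values carry a `b` prefix.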
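#### Checking the data type

```
arr.dtype
```

#### Converting the data type

```
newarr = arr.astype('i')
```

### Copy and view

#### Copy

`copy()` creates a copy. Changing the copy does not affect the original array, and changing the original does not affect the copy.

```
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
```

#### View

`view()` creates a view. Changing the view changes the original array, and changing the original changes the view.

```
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
```

#### Telling a copy from a view

Check the `.base` attribute:

* For a copy, it is `None`.
* For a view, it is the original array.

### Array shape

```
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)  # (2, 4)
```

#### Reshaping

##### 1-D to 2-D

```
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)  # 4x3 matrix
```

The resulting array is a **view** of the original.

##### 1-D to 3-D

```
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)  # 2x3x2 array
```

##### Any number of dimensions to 1-D

```
newarr = arr.reshape(-1)
```

-1 stands for an unknown dimension; NumPy works out the right size for you.

### Iterating over an array

You can use ordinary `for` loops:

```
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    for y in x:
        print(y)
```

With higher-dimensional arrays the nested loops become tedious, so you can use `np.nditer()` instead.

```
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(arr):
    print(x)
```

### Enumerating an array

Use `np.ndenumerate()` to list every element together with its index.

```
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for idx, x in np.ndenumerate(arr):
    print(idx, x)
```

### Joining arrays

Use `np.concatenate()`, which joins arrays along an existing axis chosen with `axis`.

```
arr = np.concatenate((arr1, arr2), axis=1)
```

Use `np.stack()`, which joins arrays along a new axis whose position is set with `axis`.

```
arr = np.stack((arr1, arr2), axis=1)
```

### Splitting arrays

Use `np.array_split()`; the arguments are the array to split and the number of pieces.

```
newarr = np.array_split(arr, 3)
```

### Searching arrays

Use `np.where(condition)`, which returns the indices of the elements that satisfy the condition, as a tuple of index arrays.

```
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
```

Use `np.searchsorted()` on a sorted array to get the index at which a value should be inserted so that the order is preserved. By default (`side='left'`) it returns the leftmost suitable index; pass `side='right'` to get the rightmost one.

```
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7, side='right')
```

### Sorting arrays

Use `np.sort()`; it returns a sorted copy of the array.

```
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
```

### Filtering arrays

In NumPy, filtering is done with a **boolean array**:

```
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)  # [41 43]
```

* Elements whose index maps to **True** are kept.
* Elements whose index maps to **False** are dropped.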
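To make the True/False rule above concrete, here is a minimal sketch that builds the boolean list by hand, one entry per element, and then uses it as the index. The condition (keep even numbers) and the names `mask` and `evens` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

arr = np.array([41, 42, 43, 44])

# Build the mask by hand: one True/False per element, in the same order.
mask = []
for value in arr:
    mask.append(bool(value % 2 == 0))  # keep even numbers only

evens = arr[mask]  # elements paired with True are kept

print(mask)   # [False, True, False, True]
print(evens)  # [42 44]
```

The mask must have exactly as many entries as the array has elements; otherwise NumPy raises an `IndexError`.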
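#### Generating the boolean array

NumPy offers a shortcut: write `mask = arr <condition>` and NumPy builds the boolean array from the condition for you.

```
arr = np.array([41, 42, 43, 44])
filter_arr = arr > 42  # [False False True True]
newarr = arr[filter_arr]
```

### NumPy random

Import the `random` module from numpy:

```
from numpy import random
```

#### Random integer

```
x = random.randint(100)  # random integer from 0 to 99 (100 is excluded)
```

#### Array of random integers

The `size` parameter sets the shape of the array.

```
x = random.randint(100, size=(3, 5))  # 3x5 array of random integers
```

#### Random float

```
x = random.rand()  # random float in [0, 1)
```

#### Array of random floats

```
x = random.rand(3, 5)  # 3x5 array of random floats
```

#### Random choice

```
x = random.choice([3, 5, 7, 9])  # pick one of 3, 5, 7, 9 at random
```

#### Array of random choices

The `size` parameter sets the shape of the array.

```
x = random.choice([3, 5, 7, 9], size=(3, 5))  # 3x5 array drawn from 3, 5, 7, 9
```

#### Probability weights

The `p` array gives the probability of each value. Each probability lies between 0 and 1 (0 means the value never appears, 1 means it always appears), and the probabilities must sum to 1.

```
x = random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(100))
```

#### Shuffling an array

Use `shuffle()` to shuffle an array in place; the original array is changed.

```
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
```

Use `permutation()` to get a shuffled copy; the original array is not changed.

```
arr = np.array([1, 2, 3, 4, 5])
print(random.permutation(arr))
```

#### Probability distributions

There are too many distributions to cover here, and they are not that important for these notes. Look them up at the link below when you need them: [more distributions](https://www.w3schools.com/python/numpy_random_normal.asp)

The distributions can be plotted with Matplotlib, which is covered later.

### NumPy ufuncs

NumPy ufuncs implement vectorization, so they run faster than looping over the elements in plain Python.

#### Defining your own ufunc

Use `frompyfunc(<function>, <number of input arguments (arrays)>, <number of output arrays>)`, as follows:

```
def myadd(x, y):
    return x+y

myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
```

#### Basic arithmetic

Python's own arithmetic applied element by element in a loop also works, but ufuncs run faster. On ndarrays, operators such as `+` call the same ufuncs.

##### Addition

```
newarr = np.add(arr1, arr2)
```

##### Subtraction

```
newarr = np.subtract(arr1, arr2)
```

##### Multiplication

```
newarr = np.multiply(arr1, arr2)
```

##### Division

```
newarr = np.divide(arr1, arr2)
```

P.S. `np.divide` performs true division, so the result is a float array even when the values divide evenly.

##### Power

```
newarr = np.power(arr1, arr2)
```

##### Remainder

```
newarr = np.mod(arr1, arr2)
```

```
newarr = np.remainder(arr1, arr2)
```

##### Absolute value

```
newarr = np.absolute(arr)
```

#### Rounding decimals

##### Truncation

Drops the fractional part and keeps the integer part.

```
arr = np.trunc([-3.1666, 3.6667])
```

##### Rounding

The second argument sets how many decimal places to round to.

```
arr = np.around(3.1666, 2)
```

##### Floor (round down)

Rounds toward negative infinity, so -3.1666 becomes -4.

```
arr = np.floor([-3.1666, 3.6667])
```

##### Ceil (round up)

```
arr = np.ceil([-3.1666, 3.6667])
```

#### Logarithms

##### log2

```
np.log2(arr)
```

##### log10

```
np.log10(arr)
```

##### Natural log (base e)

```
np.log(arr)
```

#### Sum

```
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
newarr = np.sum([arr1, arr2])
print(newarr)  # (1+2+3)+(1+2+3)=12
```

#### Product

```
arr = np.array([1, 2, 3, 4])
x = np.prod(arr)
print(x)  # 1*2*3*4=24
```

#### Difference

```
arr = np.array([10, 15, 25, 5])
newarr = np.diff(arr)
print(newarr)  # [15-10, 25-15, 5-25] = [5, 10, -20]
```

#### Least common multiple (LCM)

##### Numbers

```
num1 = 4
num2 = 6
x = np.lcm(num1, num2)  # 12
```

##### Arrays

```
arr = np.array([3, 6, 9])
x = np.lcm.reduce(arr)  # 18
```

#### Greatest common divisor (GCD)

##### Numbers

```
num1 = 6
num2 = 9
x = np.gcd(num1, num2)  # 3
```

##### Arrays

```
arr = np.array([20, 8, 32, 36, 16])
x = np.gcd.reduce(arr)  # 4
```

#### Trigonometric functions

```
x = np.sin(np.pi/2)
x = np.cos(np.pi/2)
x = np.tan(np.pi/2)
```

The trigonometric functions take their argument in radians, so degrees must be converted to radians first:

```
x = np.deg2rad(90)
```

The reverse conversion is also available:

```
x = np.rad2deg(np.pi/2)
```

To recover the angle (in radians) from a sin, cos, or tan value:

```
x = np.arcsin(1.0)
x = np.arccos(1.0)
x = np.arctan(1.0)
```

#### Sets

Use `unique()` to turn any array into a set; duplicates are removed.

```
arr = np.array([1, 1, 1, 2, 3, 4, 5, 5, 6, 7])
x = np.unique(arr)  # [1 2 3 4 5 6 7]
```

Set operations themselves were covered in the earlier note, so only the commands are listed here.

* `newarr = np.union1d(arr1, arr2)`: union
* `newarr = np.intersect1d(arr1, arr2, assume_unique=True)`: intersection
* `newarr = np.setdiff1d(set1, set2, assume_unique=True)`: difference
* `newarr = np.setxor1d(set1, set2, assume_unique=True)`: symmetric difference
* `assume_unique=True`: speeds up the computation by skipping deduplication. Use it only when the inputs are already unique (for example, after `np.unique()`); otherwise the result can be incorrect.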
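The set commands above can be tried out with a short sketch like the following. It assumes two small example arrays (`arr1`, `arr2`, made up for illustration) and deduplicates them with `np.unique()` first, because `assume_unique=True` is only safe when the inputs really contain no duplicates.

```python
import numpy as np

arr1 = np.array([1, 2, 2, 3, 4])  # note the duplicated 2
arr2 = np.array([3, 4, 4, 5, 6])  # note the duplicated 4

# Deduplicate first so that assume_unique=True is actually true.
set1 = np.unique(arr1)  # [1 2 3 4]
set2 = np.unique(arr2)  # [3 4 5 6]

union = np.union1d(set1, set2)
inter = np.intersect1d(set1, set2, assume_unique=True)
diff = np.setdiff1d(set1, set2, assume_unique=True)
xor = np.setxor1d(set1, set2, assume_unique=True)

print(union)  # [1 2 3 4 5 6]
print(inter)  # [3 4]
print(diff)   # [1 2]
print(xor)    # [1 2 5 6]
```

If you are not sure whether the inputs are unique, leaving `assume_unique` at its default of `False` is the safer choice.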
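## Python Matplotlib

Matplotlib is a Python module for plotting data.

### Importing Matplotlib

If Python and pip are already installed on your system, installing Matplotlib is simple.

```
# CMD
C:\Users\Your Name>pip install matplotlib
```

Most Matplotlib utilities live in the pyplot submodule, which is usually imported under the alias plt:

```
import matplotlib.pyplot as plt
```

### Showing the figure

```
plt.show()
```

### Plotting on axes

Use `plot()` to draw points or lines on the x-y plane.

```
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
```

The first argument of plot() holds the x coordinates and the second holds the y coordinates. If only one array is given, it is used as the y values and the x coordinates default to [0, 1, 2, 3, ...].

#### Points only

Add the extra argument `'o'`:

```
plt.plot(xpoints, ypoints, 'o')
```

#### Format string (fmt)

`marker|line|color` stands, in order, for `marker shape|line style|color of markers and line`.

```
plt.plot(ypoints, 'o:r')  # circle marker | dotted line | red
```

There are many styles; see this link: [Matplotlib marker reference](https://www.w3schools.com/python/matplotlib_markers.asp)

### Marker formatting

#### Marker size

Use the markersize (or ms) parameter to set the marker size.

```
plt.plot(ypoints, marker = 'o', ms = 20)
```

#### Marker color

Use the markeredgecolor (or mec) parameter to set the color of the marker edge.

```
plt.plot(ypoints, marker = 'o', ms = 20, mec = 'r')
```

Use the markerfacecolor (or mfc) parameter to set the fill color inside the marker edge.

```
plt.plot(ypoints, marker = 'o', ms = 20, mfc = 'r')
```

Colors can be given as hexadecimal color codes or as color names.

### Line formatting

#### Line style

Use the linestyle (or ls) parameter to set the line style.

```
plt.plot(ypoints, linestyle = 'dotted')
```

#### Line color

Use the color (or c) parameter to set the line color.

```
plt.plot(ypoints, color = 'r')
```

Colors can be given as hexadecimal color codes or as color names.

#### Line width

Use the linewidth (or lw) parameter to set the line width.

```
plt.plot(ypoints, linewidth = 20.5)
```

### Labels and title

#### Setting axis labels

```
plt.xlabel("This is X axis")
plt.ylabel("This is Y axis")
```

#### Setting the title

```
plt.title("This is title")
```

Use the `loc` parameter to change the position of the title:

```
plt.title("This is title", loc = 'left')
```

* center: centered (default)
* left: left-aligned
* right: right-aligned

#### Changing the label style

```
font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'red','size':15}

plt.title("This is title", fontdict = font1)
plt.xlabel("This is X axis", fontdict = font2)
plt.ylabel("This is Y axis", fontdict = font2)
```

### Grid lines

Use the `axis` parameter to choose which grid lines to draw:

```
plt.grid(axis = 'x')
```

* x
* y
* both (default)

Grid lines can also be styled, in the same way as plotted lines.

### Multiple plots

#### Showing several plots

Use `subplot()` to show several plots in one figure.

```
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(1, 2, 1)
plt.plot(x,y)

#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(1, 2, 2)
plt.plot(x,y)
```

The arguments mean: `subplot(number of rows, number of columns, index of this plot)`

#### Super title

Use `suptitle()` to show a main title for the whole figure.

```
plt.suptitle("MY SHOP")
```

### Scatter plots

#### Creating a scatter plot

`scatter()` needs two arrays of the same length: the first holds the x coordinates and the second holds the y coordinates.

```
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
```

#### Setting the color

Use the color (or c) parameter to set the color of the points.

```
plt.scatter(x, y, color = '#88c999')
plt.scatter(x, y, color = 'hotpink')
```

#### Colormap

Use the cmap parameter to choose a colormap. You also need an array of values, one per point, that is mapped onto the colormap (0 to 100 in this example). It must have the same length as the data, and it has to be passed as `c`, not `color`.

```
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array([0, 10, 20, 30, 40, 45, 50, 55, 60, 70, 80, 90, 100])
plt.scatter(x, y, c=colors, cmap='viridis')
```

Calling `plt.colorbar()` also draws the colormap next to the scatter plot. [More colormaps](https://www.w3schools.com/python/matplotlib_scatter.asp)

#### Setting the point size

Use the s parameter to set the size of the points. To give every point its own size, pass an array of the same length as the data whose values are the point sizes.

```
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes)
```

#### Setting the transparency

Use the alpha parameter to set the transparency of the points. The value ranges from 0 to 1; the smaller the number, the more transparent the points.

```
plt.scatter(x, y, alpha=0.5)
```

### Bar charts

#### Creating a bar chart

Use `bar()` to create a bar chart.

```
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x,y)
```

Use `barh()` to create a horizontal bar chart.

```
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x,y)
```

#### Setting the bar color

Colors are set the same way as for scatter plots.

#### Setting the bar width (or height)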
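Use the width parameter to set the bar width (default 0.8). For horizontal bars, use the height parameter instead (default also 0.8).

### Histograms

Use `hist()` to create a histogram. It is often used to show a distribution, such as a normal (Gaussian) distribution.

```
x = np.random.normal(170, 10, 250)  # 250 samples from a normal distribution
plt.hist(x)
```

### Pie charts

Use `pie()` to create a pie chart. The values do not have to sum to 100; by default each wedge is drawn in proportion to its share of the total.

```
y = np.array([35, 25, 25, 15])
plt.pie(y)
```

#### Adding labels

Use the labels parameter to label the wedges, usually with a list of labels.

```
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels)
```

#### Start angle

Use the startangle parameter to set the angle at which the pie starts; the default is 0.

![](https://i.imgur.com/HOHF8ru.png =60%x)

```
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels, startangle = 90)
```

#### Explode (pulling a wedge out)

Use the explode parameter to make wedges stand out. Each value is the distance of that wedge from the center; the default is 0.

```
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]
plt.pie(y, labels = mylabels, explode = myexplode)
```

#### Shadow

Use the shadow parameter to give the pie chart a shadow; the default is False.

```
plt.pie(y, labels = mylabels, explode = myexplode, shadow = True)
```

#### Setting colors

Colors accept the same values as for scatter plots (hex codes or color names), but `pie()` takes them as a list through the colors parameter, one color per wedge.

#### Legend

Use `legend()` to add a legend to the figure. Its title parameter sets the legend's title.

```
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels)
plt.legend(title = "Four Fruits:")
```

## Python SciPy

SciPy is a Python module for scientific computing.

### Importing SciPy

If Python and pip are already installed on your system, installing SciPy is simple.

```
# CMD
pip install scipy
```

Once installed, it can be imported directly.

```
import scipy
```

Print `scipy.__version__` to check that the installation succeeded.

### SciPy constants

Constants live in the constants submodule of scipy:

```
from scipy import constants
```

There are a great many constants, roughly grouped as follows:

* Metric: metric (SI) prefixes, expressed relative to the meter, e.g. kilo
* Binary: binary prefixes, in bytes
* Mass: in kilograms (kg)
* Angle: in radians
* Time: in seconds
* Length: in meters
* Pressure: in pascals
* Area: in square meters (m^2^)
* Volume: in cubic meters (m^3^)
* Speed: in meters per second
* Temperature: in kelvin (K)
* Energy: in joules (J)
* Power: in watts (W)
* Force: in newtons
* [More constants...](https://www.w3schools.com/python/scipy_constants.asp)

### SciPy optimization

#### Finding the root of an equation

Use `scipy.optimize.root`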
to find a root of an equation. The first argument is the function and the second is an initial guess for the root.

```
from scipy.optimize import root
from math import cos

def eqn(x):
    return x + cos(x)

myroot = root(eqn, 0)
print(myroot.x)  # -0.73908513
```

#### Finding the minimum of a function

Use `scipy.optimize.minimize()` to find the minimum of a function (it minimizes the function; it does not simplify it). The first argument is the function, the second is an initial guess, and `method` selects the minimization method.

```
from scipy.optimize import minimize

def eqn(x):
    return x**2 + x + 2

mymin = minimize(eqn, 0, method='BFGS')
print(mymin)
```

### SciPy sparse data

There are two formats, CSC and CSR; the examples all use CSR.

#### Creating a CSR matrix

Import `scipy.sparse.csr_matrix` and use it to convert an ordinary array into a CSR matrix. The imports shown here are also needed by the next two examples.

```
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])
print(csr_matrix(arr))
# (0, 5) 1
# (0, 6) 1
# (0, 8) 2
```

#### Non-zero values

```
arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])
print(csr_matrix(arr).data)
```

#### Number of non-zero values

```
arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])
print(csr_matrix(arr).count_nonzero())
```

### SciPy graphs

Import the `scipy.sparse.csgraph` module before use.

#### Adjacency matrix

This is the adjacency matrix from data structures, usually stored as a CSR matrix.

#### Connected components

A connected component is a maximal subgraph whose nodes are all connected to one another.

```
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.sparse import csr_matrix

arr = np.array([
  [0, 1, 2],
  [1, 0, 0],
  [2, 0, 0]
])

newarr = csr_matrix(arr)
print(connected_components(newarr))
```

#### Shortest-path and traversal algorithms

##### Dijkstra's algorithm

```
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix

arr = np.array([
  [0, 1, 2],
  [1, 0, 0],
  [2, 0, 0]
])

newarr = csr_matrix(arr)
print(dijkstra(newarr, return_predecessors=True, indices=0))
```

##### Floyd-Warshall algorithm

```
import numpy as np
from scipy.sparse.csgraph import floyd_warshall
from scipy.sparse import csr_matrix

arr = np.array([
  [0, 1, 2],
  [1, 0, 0],
  [2, 0, 0]
])

newarr = csr_matrix(arr)
print(floyd_warshall(newarr, return_predecessors=True))
```

##### Bellman-Ford algorithm

This also finds shortest paths, but it can handle negative edge weights:

```
import numpy as np
from scipy.sparse.csgraph import bellman_ford
from scipy.sparse import csr_matrix

arr = np.array([
  [0, -1, 2],
  [1, 0, 0],
  [2, 0, 0]
])
newarr = csr_matrix(arr)
print(bellman_ford(newarr, return_predecessors=True, indices=0))
```

##### Depth-first search (DFS)

The first argument is the graph and the second is the starting node.

```
import numpy as np
from scipy.sparse.csgraph import depth_first_order
from scipy.sparse import csr_matrix

arr = np.array([
  [0, 1, 0, 1],
  [1, 1, 1, 1],
  [2, 1, 1, 0],
  [0, 1, 0, 1]
])

newarr = csr_matrix(arr)
print(depth_first_order(newarr, 1))
```

##### Breadth-first search (BFS)

```
import numpy as np
from scipy.sparse.csgraph import breadth_first_order
from scipy.sparse import csr_matrix

arr = np.array([
  [0, 1, 0, 1],
  [1, 1, 1, 1],
  [2, 1, 1, 0],
  [0, 1, 0, 1]
])

newarr = csr_matrix(arr)
print(breadth_first_order(newarr, 1))
```

### SciPy Matlab

Import the `scipy.io` module.

#### Exporting in Matlab format

`savemat()` takes three parameters of interest: the file name, a dictionary containing the arrays, and the optional boolean keyword `do_compression`, which controls whether the file is compressed (default False).

```
from scipy import io
import numpy as np

arr = np.arange(10)
io.savemat('arr.mat', {"vec": arr})
```

#### Importing Matlab format

```
mydata = io.loadmat('arr.mat')
```

> The rest of SciPy is too hard for me to follow for now, so these notes stop here. I will add more once I have learned it (?)
> [Link](https://www.w3schools.com/python/scipy_interpolation.asp)

## Python Pandas

### Importing Pandas

Install with pip using this command:

```
pip install pandas
```

Import it in a Python program:

```
import pandas as pd
```

### Creating a Series

Pass a list to `Series()` to turn it into a Series.

```python=
s1 = pd.Series([2, 1, 7, 4])
print(s1)
```

Output:

```
0    2
1    1
2    7
3    4
dtype: int64
```

The index defaults to 0, 1, 2, ...; it can also be set with `index=`.

```python=
s2 = pd.Series([20, 10, 70, 40], index=["小明", "小美", "小王", "小智"])
print(s2)
```

Output:

```
小明    20
小美    10
小王    70
小智    40
dtype: int64
```

### Creating a DataFrame

```python=
data = {
    'name': ['王小郭', '張小華', '廖丁丁', '丁小光'],
    'email': ['min@gmail.com', 'hchang@gmail.com', 'laioding@gmail.com', 'hsulight@gmail.com'],
    'grades': [60, 77, 92, 43]
}
df = pd.DataFrame(data, index=["A", "B", "C", "D"])
```

Each key of the dictionary is a column name, and each value is an iterable holding that column's values. The index defaults to 0, 1, 2, ...; pass `index=list` to change it.

### Printing DataFrame data

#### Print all data

```
print(df)
```

#### Print the first n rows

```
print(df.head(n))
```

#### Print the last n rows

```
print(df.tail(n))
```

#### Print data types and other information

`info()` prints its report itself and returns `None`, so it does not need to be wrapped in `print()`.

```
df.info()
```

#### Print summary statistics

```
print(df.describe())
```

#### Print the index

```
print(df.index)
```

#### Print the columns

```
print(df.columns)
```

### Selecting data

The examples in this section assume a DataFrame that has a numeric column named `num`.

#### `df["column_name"]`

```
print(df["num"])
```

#### `df.column_name`

```
print(df.num)
```

#### `df.iloc[row, col]`

iloc takes integer positions.

```
print(df.iloc[0, 1])  # row 1, column 2
print(df.iloc[3, :])  # all of row 4
print(df.iloc[:, 1])  # all of column 2
```

#### `df.loc[row, col]`

loc takes labels (index labels and column names).

```
print(df.loc[:, "num"])
```

#### loc/iloc + boolean array

Select the rows where num is greater than 10 and less than 20.

```
out_df = df.loc[(df["num"] > 10) & (df["num"] < 20)]
print(out_df)
```

### Sorting data

#### Sort by index

* axis = 0 sorts by row labels; axis = 1 sorts by column labels.
* ascending = True sorts in ascending order; ascending = False sorts in descending order.

```
print(df.sort_index())
```

#### Sort by column values

* by = "col" sorts by the values in column col.

```
print(df.sort_values(by="num"))
```

### Renaming columns

```
rename_dic = {"col 1": "x", "col 2": "10x"}
df = df.rename(rename_dic, axis=1)
```

The dict maps old names to new names. axis=0 refers to rows; axis=1 refers to columns.

### Dropping columns

The first argument is the list of column names to drop. `drop()` returns a new DataFrame, so assign the result if you want to keep it.

```
df = df.drop(['grades'], axis=1)
```

### Handling NA/NaN values

NaN values appear when the data that was read in has missing entries.

#### Filling missing values

```
fill_df1 = df.fillna(0)  # fill every NaN with 0
print(fill_df1)
print("-----")
fill_df2 = df.fillna({"shop name": "None", "market size": 0})  # fill per column with "None" or 0
print(fill_df2)
print("-----")
```

Output:

```
   shop id shop name  maket size
0        1  Wal mart   3000000.0
1        2    Costco   2000000.0
2        3         0   1500000.0
3        4    Pchome    300000.0
4        5     Yahoo         0.0
-----
   shop id shop name  maket size
0        1  Wal mart   3000000.0
1        2    Costco   2000000.0
2        3      None   1500000.0
3        4    Pchome    300000.0
4        5     Yahoo         NaN
-----
```

Note that the column in this data is spelled `maket size`, so the key `"market size"` in the dict does not match it and the NaN in that column is left unfilled in the second result. The dict keys must match the column names exactly.

#### Dropping missing values

```
drop_df = df.dropna()  # drop every row that contains a missing value
print(drop_df)
print("-----")
```

Output:

```
   shop id shop name  maket size
0        1  Wal mart   3000000.0
1        2    Costco   2000000.0
3        4    Pchome    300000.0
-----
```

### Combining rows and combining columns

#### concat (combining rows)

```python=
data_1 = {
    'name': ['王小郭', '張小華', '廖丁丁', '丁小光'],
    'email': ['min@gmail.com', 'hchang@gmail.com', 'laioding@gmail.com', 'hsulight@gmail.com'],
    'grades': [60, 77, 92, 43]
}
data_2 = {
    'name': ['黃明明', '汪新新', '鮑呱呱', '江組組'],
    'email': ['ww@gmail.com', 'cc@gmail.com', 'bb@gmail.com', 'ee@gmail.com'],
    'grades': [70, 17, 32, 43]
}
df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
df_3 = pd.concat([df_1, df_2])
print(df_3)
```

Output:

```
  name               email  grades
0  王小郭       min@gmail.com      60
1  張小華    hchang@gmail.com      77
2  廖丁丁  laioding@gmail.com      92
3  丁小光  hsulight@gmail.com      43
0  黃明明        ww@gmail.com      70
1  汪新新        cc@gmail.com      17
2  鮑呱呱        bb@gmail.com      32
3  江組組        ee@gmail.com      43
```

With `pd.concat([df_1, df_2], ignore_index=True)`, the index is renumbered after the rows are combined. Output:

```
  name               email  grades
0  王小郭       min@gmail.com      60
1  張小華    hchang@gmail.com      77
2  廖丁丁  laioding@gmail.com      92
3  丁小光  hsulight@gmail.com      43
4  黃明明        ww@gmail.com      70
5  汪新新        cc@gmail.com      17
6  鮑呱呱        bb@gmail.com      32
7  江組組        ee@gmail.com      43
```

#### merge (combining columns)

By default, `merge()` joins the two DataFrames on the columns they have in common, here `name`.

```python=
data_1 = {
    'name': ['王小郭', '張小華', '廖丁丁', '丁小光'],
    'email': ['min@gmail.com', 'hchang@gmail.com', 'laioding@gmail.com', 'hsulight@gmail.com'],
    'grades': [60, 77, 92, 43]
}
data_2 = {
    'name': ['王小郭', '張小華', '廖丁丁', '丁小光'],
    'age': [19, 20, 32, 10]
}
df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
df_3 = pd.merge(df_1, df_2)
print(df_3)
```

Output:

```
  name               email  grades  age
0  王小郭       min@gmail.com      60   19
1  張小華    hchang@gmail.com      77   20
2  廖丁丁  laioding@gmail.com      92   32
3  丁小光  hsulight@gmail.com      43   10
```

### Exporting data

Exporting data from Pandas is very simple. The commonly supported file formats are shown below; just pass the file name.

```python=
data = {
    'name': ['王小郭', '張小華', '廖丁丁', '丁小光'],
    'email': ['min@gmail.com', 'hchang@gmail.com', 'laioding@gmail.com', 'hsulight@gmail.com'],
    'grades': [60, 77, 92, 43]
}
df = pd.DataFrame(data)
df.to_csv("data.csv")
df.to_excel("data.xlsx")
df.to_json("data.json")
df.to_html("data.html")
```

### Importing files

Again, just pass the file name. If the data contains Chinese text, pay attention to the encoding and add `encoding="utf-8"`, otherwise the text may come out garbled.

```python=
df1 = pd.read_csv("data.csv")
df2 = pd.read_excel("data.xlsx")
df3 = pd.read_json("data.json")
df4 = pd.read_html("data.html")  # returns a list of DataFrames, one per table
print(df1)
print(df2)
print(df3)
print(df4)
```

### DataFrames for testing

These helpers come from `pandas.util.testing`, which was deprecated in pandas 1.0 and is no longer available in recent releases, so they only work on older versions.

#### Random data

```
pd.util.testing.makeDataFrame()
```

#### Mixed data types

```
pd.util.testing.makeMixedDataFrame()
```

### References

* [A concise introduction to Python Pandas](https://blog.techbridge.cc/2020/09/21/python-pandas-zen-tutorial/)
* [[Python] Pandas basics tutorial](https://oranwind.org/python-pandas-ji-chu-jiao-xue/)
* [The data scientist's practical pandas handbook: 40 useful data techniques](https://leemeng.tw/practical-pandas-tutorial-for-aspiring-data-scientists.html)

### Useful tools

* [tqdm: track the progress of your data processing](https://github.com/tqdm/tqdm)
* [swifter: speed up your data processing](https://github.com/jmcarpenter2/swifter)
* [qgrid: sort, filter, and edit your DataFrame interactively](https://github.com/quantopian/qgrid)
* [pandas-profiling: one-click EDA](https://github.com/pandas-profiling/pandas-profiling)

## Related notes

[Python Fundamentals](/3L8iXhnVR2uFC_4iJs-n3g)

## Reference sites

* [W3Schools Python tutorial](https://www.w3schools.com/python/default.asp)
* [Official Python documentation](https://docs.python.org/zh-tw/3/tutorial/index.html)