# Optimizing Python Code with pandas
###### tags: `Datacamp` `python` `pandas` `data science`

>**Author: 何彥南**
>DataCamp course: [Optimizing Python Code with pandas](https://www.datacamp.com/courses/optimizing-python-code-with-pandas)

**Notes:**
1. `df` is shorthand for a pandas DataFrame.
2. `pd` is shorthand for the pandas package.
3. When in doubt, treat the official [pandas docs](https://pandas.pydata.org/pandas-docs/stable/) as the authority.
4. Mind your pandas version; some of the features below may not work in newer releases.
5. Lines marked with `#` inside the code blocks are output.

[toc]

# [ch1] Select columns and rows efficiently
>This chapter will give you an overview of why efficient code matters and selecting specific and random rows and columns efficiently.

## 1. Why we need efficient code and how to measure it
>Dataset used: [poker.csv](https://assets.datacamp.com/production/repositories/3832/datasets/c715cfae17d00d26693da8e612cb5fbd64e69589/poker_hand.csv)

![](https://i.imgur.com/rj5U9Sz.png)

### How do we measure time?
> We use the `time` module here; `time.time()` returns the current system time, so the difference between two calls gives the elapsed seconds.
```python=
import time

start_time = time.time()
time.sleep(5)
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 5.001059532165527 seconds
```
* Tip: put `%s` inside the string passed to `print()`, then append `% (expression)` after the string; the expression's result is substituted where `%s` appears. This avoids `TypeError` when mixing values of different types in a single `print`.
* Like this: ![](https://i.imgur.com/s1hmSrT.png)
* With the trick above the problem goes away:
```python=
print('%s a' %(1))
```
* This also works, but it scales poorly once several variables are involved:
```python=
print(1,'a')
```
* Another option worth exploring is `'{} {}'.format()` >>> [more](https://pyformat.info/)

### Measuring time using time I | var()
> Compare `np.var()` with pandas' `Series.var()`.

* [pandas.Series.var()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.var.html)
```python=
start_time = time.time()
print(poker_hands['R2'].var())
# 14.190090008220789
print("Results from the pandas method calculated in %s seconds" % (time.time() - start_time))
# Results from the pandas method calculated in 0.001730203628540039 seconds
```
* [numpy.var()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html)
```python=
start_time = time.time()
print(np.var(poker_hands['R2']))
# 14.189522631571121
print("Results from the NumPy method calculated in %s seconds" % (time.time() - start_time))
# Results from the NumPy method calculated in 0.0010731220245361328 seconds
```
* The two variances differ slightly because `Series.var()` uses `ddof=1` (sample variance) while `np.var()` defaults to `ddof=0` (population variance).

### Measuring time using time II
>Here we compare how long two functions take.

>Without a `for` loop
![](https://i.imgur.com/rRYBod7.png)
```python=
start_time = time.time()
first_method = formula()
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 0.000108957290649 seconds
```
>With a `for` loop
![](https://i.imgur.com/4nEx7gf.png)
```python=
start_time = time.time()
second_method = brute_force()
print("Results from the second method calculated in %s seconds" %(time.time() - start_time))
# Results from the second method calculated in 0.174870967865 seconds
```
* Difference in speed: 160,394.967179%
* So avoid `for` loops in your code whenever you can.
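> The exact `formula()` and `brute_force()` definitions live only in the screenshots above. As a rough sketch of the kind of comparison being made (my reconstruction, assuming the task is summing the squares of the first N integers, not the course's verbatim code):
```python=
import time

N = 1_000_000

def formula(n=N):
    # Closed-form sum of the first n squares: n(n+1)(2n+1)/6 -- no loop needed
    return n * (n + 1) * (2 * n + 1) // 6

def brute_force(n=N):
    # Same quantity, accumulated one term at a time in a Python-level loop
    total = 0
    for i in range(1, n + 1):
        total += i ** 2
    return total

for func in (formula, brute_force):
    start_time = time.time()
    func()
    print("%s finished in %s seconds" % (func.__name__, time.time() - start_time))
```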
## 2. Locate rows using the .iloc() and .loc() functions

### Locate targeted rows
>loc
```python=
rows = range(0, 500)

start_time = time.time()
data.loc[rows]
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 0.001951932 seconds
```
>iloc
```python=
start_time = time.time()
data.iloc[rows]
print("Results from the second method calculated in %s seconds" % (time.time() - start_time))
# Results from the second method calculated in 0.0007140636 seconds
```
* Difference in speed: 173.355592654%
* So `iloc` is the more efficient choice here.

### Locate targeted columns
>iloc
```python=
start_time = time.time()
data.iloc[:,:3]
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 0.00125193595886 seconds
```
>[['column_name1','column_name2']]
```python=
start_time = time.time()
data[['S1', 'R1', 'S2']]
print("Results from the second method calculated in %s seconds" % (time.time() - start_time))
# Results from the second method calculated in 0.000964879989624 seconds
```
* Difference in speed: 29.7504324188%
* So when selecting columns, prefer the simple name-based selection.

## 3. Select random rows

### Sampling random rows & columns | sample(), randint()
> For random row sampling, `sample()` is the faster option.
* [pandas.DataFrame.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)
    * `axis=0`: sample rows
    * `axis=1`: sample columns
* [numpy.random.randint()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html#numpy.random.randint)

>sample()
```python=
start_time = time.time()
poker.sample(100, axis=0)
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 0.000750064849854 seconds
```
>iloc + np.random
```python=
start_time = time.time()
poker.iloc[np.random.randint(low=0, high=poker.shape[0], size=100)]
print("Results from the second method calculated in %s seconds" % (time.time() - start_time))
# Results from the second method calculated in 0.00103211402893 seconds
```
* Difference in speed: 37.6033057849%

> `sample()` is also faster when selecting columns.

>sample()
```python=
start_time = time.time()
poker.sample(3, axis=1)
print("Results from the first method calculated in %s seconds" %(time.time() - start_time))
# Results from the first method calculated in 0.000683069229126 seconds
```
>iloc + np.random
```python=
N = poker.shape[1]

start_time = time.time()
poker.iloc[:,np.random.randint(low=0, high=N, size=3)]
print("Results from the second method calculated in %s seconds" %(time.time() - start_time))
# Results from the second method calculated in 0.0010929107666 seconds
```
* Difference in speed: 59.9999999998%
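> A practical addition (not from the course): both approaches accept a seed, so samples can be made reproducible. A minimal sketch, assuming the same `poker` DataFrame:
```python=
# Reproducible sampling: the same seed always returns the same rows/columns
sampled_rows = poker.sample(100, axis=0, random_state=42)
sampled_cols = poker.sample(3, axis=1, random_state=42)

# The NumPy route achieves the same with a seeded random generator
rng = np.random.RandomState(42)
sampled_rows_np = poker.iloc[rng.randint(low=0, high=poker.shape[0], size=100)]
```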
---
# [ch2] Replace values of a DataFrame using the .replace() function
>This chapter shows the usage of the replace() function for replacing one or multiple values using lists and dictionaries.

![](https://i.imgur.com/sRWf2Pk.png)

## 1. Replace scalar values using .replace()

### Replace values in pandas | replace()
> To replace a given value, `replace()` is faster than boolean indexing with `.loc[]`.
* [pandas.Series.str.replace()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html#pandas.Series.str.replace)
* [pandas.DataFrame.replace()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)

>loc[ ]
```python=
start_time = time.time()
names['Gender'].loc[names.Gender=='MALE'] = 'BOY'
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 0.0311849 seconds
```
>replace()
```python=
start_time = time.time()
names['Gender'].replace('MALE', 'BOY', inplace=True)
print("Results from the second method calculated in %s seconds" % (time.time() - start_time))
# Results from the second method calculated in 0.0016758441925 seconds
```
* Difference in speed: 1,704.52411439%

## 2. Replace values using lists
![](https://i.imgur.com/OgqpTaU.png)
> When replacing several values at once, `replace()` is again the more convenient (and faster) option.

>loc
```python=
start_time = time.time()
names['Ethnicity'].loc[(names["Ethnicity"] == 'WHITE NON HISPANIC') |
                       (names["Ethnicity"] == 'WHITE NON HISP')] = 'WNH'
print("Results from the above operation calculated in %s seconds" %(time.time() - start_time))
# Results from the above operation calculated in 0.0276169776917 seconds
```
>replace()
```python=
start_time = time.time()
names['Ethnicity'].replace(['WHITE NON HISPANIC','WHITE NON HISP'], 'WNH', inplace=True)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.00144791603088 seconds
```
* Difference in speed: 2160.68681809%
* `inplace=True` modifies the original data (this matters when you only replace values within a specific subset).

>replace() II
```python=
# Replace one value (given as a one-element list) with a single new value
names['Ethnicity'].replace(['WHITE NON HISP'], 'WHITE NON HISPANIC', inplace=True)
names['Ethnicity'].replace(['BLACK NON HISP'], 'BLACK NON HISPANIC', inplace=True)

# Or replace a list of values with a matching list of new values in one call
names['Ethnicity'].replace(['BLACK NON HISP','WHITE NON HISP'],
                           ['BLACK NON HISPANIC', 'WHITE NON HISPANIC'], inplace=True)
```
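> A side note, not from the course: the `.loc` comparisons above use chained indexing (`names['Gender'].loc[...] = ...`), which pandas may flag with a `SettingWithCopyWarning` and which can silently fail to modify the original frame. If you do assign by label, the single-step `.loc` form is the safer pattern; a minimal sketch:
```python=
# Select the rows and the target column in one .loc call, then assign
names.loc[names['Gender'] == 'MALE', 'Gender'] = 'BOY'

# The list-based condition from above, written the same way
mask = names['Ethnicity'].isin(['WHITE NON HISPANIC', 'WHITE NON HISP'])
names.loc[mask, 'Ethnicity'] = 'WNH'
```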
## 3. Replace values using dictionaries

### Replace single values with dictionaries
>with dictionaries
```python=
start_time = time.time()
names['Gender'].replace({'MALE':'BOY', 'FEMALE':'GIRL'}, inplace=True)
print("Results from the first method calculated in %s seconds" % (time.time() - start_time))
# Results from the first method calculated in 0.00197792053223 seconds
```
>plain replace()
```python=
start_time = time.time()
names['Gender'].replace('MALE', 'BOY', inplace=True)
names['Gender'].replace('FEMALE', 'GIRL', inplace=True)
print("Results from the second method calculated in %s seconds" % (time.time() - start_time))
# Results from the second method calculated in 0.00307083129883 seconds
```
* Difference in speed: 55.2555448407%

### Replace multiple values using dictionaries
> We can use a nested dictionary: the outer key selects the column, and the inner dictionary works exactly like the single-column dictionary replace above.
```python=
start_time = time.time()
names.replace({'Ethnicity': {'ASIAN AND PACI': 'ASIAN',
                             'ASIAN AND PACIFIC ISLANDER': 'ASIAN',
                             'BLACK NON HISPANIC': 'BLACK',
                             'BLACK NON HISP': 'BLACK',
                             'WHITE NON HISPANIC': 'WHITE',
                             'WHITE NON HISP': 'WHITE'}})
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.0028018 seconds
```
* Here `Ethnicity` is the targeted column.
* Note that without `inplace=True` (or assigning the result back), this call does not modify `names`.

---
# [ch3] Speed efficient methods for iterating through a DataFrame
>This chapter presents different ways of iterating through a Pandas DataFrame and why vectorization is the most efficient way to achieve it.

## 1. Looping using the .iterrows() function
![](https://i.imgur.com/FZfp3gz.png)

### Generators in Python
>A generator is an iterable object.
```python=
def city_name_generator():
    yield('New York')
    yield('London')
    yield('Tokyo')
    yield('Sao Paolo')

city_names = city_name_generator()

next(city_names)
# 'New York'
next(city_names)
# 'London'
next(city_names)
# 'Tokyo'
next(city_names)
# 'Sao Paolo'
next(city_names)
# Traceback (most recent call last): File "<stdin>", line 1, in <module>
# StopIteration
```
* We can also confirm it is a generator with `type()`: ![](https://i.imgur.com/mVr6C6o.png)
* And a generator cannot be displayed directly: ![](https://i.imgur.com/Sf3N0Va.png)

### Looping using the .iterrows() function | iterrows()
> Calling `iterrows()` on a DataFrame returns a generator, an iterable object that yields `(index, row)` pairs.
* [pandas.DataFrame.iterrows()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html)
```python=
gen = poker.iterrows()
first_element = next(gen)

first_element[0]
# 0
first_element[1]
# S1     2
# R1    11
# S2     2
# ...
```

### The iterrows() function for looping a DataFrame
> We can also iterate row by row with `iterrows()`:
```python=
data_generator = poker_hands.iterrows()

for index, values in data_generator:
    if index % 2 == 1:
        hand_sum = sum([values[1], values[3], values[5], values[7], values[9]])
```
* Note: never iterate over rows with `for i in range(len(df)): df.iloc[i,:]`; `iterrows()` is far more efficient.
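> The loop above computes `hand_sum` but discards it on every iteration. A minimal sketch (my addition, not course code) that keeps the sums of the odd-indexed hands in a list:
```python=
data_generator = poker_hands.iterrows()

odd_hand_sums = []
for index, values in data_generator:
    if index % 2 == 1:
        # values is a Series; positions 1, 3, 5, 7, 9 hold the rank columns R1-R5
        odd_hand_sums.append(sum([values[1], values[3], values[5], values[7], values[9]]))

print(odd_hand_sums[:5])
```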
## 2. Looping using the .apply() function
![](https://i.imgur.com/FZfp3gz.png)

### The .apply() function for rows | apply()
* [pandas.DataFrame.apply()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
* [pandas.Series.apply()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html)

> Summing these five columns for `each row`, `apply()` clearly outperforms `iterrows()` — it is almost 4x faster.

>apply()
```python=
start_time = time.time()
poker[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=1)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.636334896088 seconds
```
>iterrows()
```python=
start_time = time.time()
for ind, value in poker.iterrows():
    sum([value[1], value[3], value[5], value[7], value[9]])
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 3.15526986122 seconds
```
* Difference in speed: 395.85051529%

### The .apply() function for columns | sum()
> Here we sum each column twice, once with Python's built-in `sum()` via `apply()` and once with pandas' own `sum()`; the pandas built-in is clearly faster.
* [pandas.DataFrame.sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)

>apply() + Python's built-in sum()
```python=
start_time = time.time()
# Sum each column
poker[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=0)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.00490880012 seconds
```
>pandas' sum()
```python=
start_time = time.time()
poker[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=0)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.00279092788 seconds
```
* Difference in speed: 160.310951649%
* For column aggregations, `apply()` is simply redundant.

### Use of .apply() function in every cell
> You can also apply the same operation to every cell of the DataFrame, as long as the operation is valid and independent per cell, i.e. each cell can be mapped to its new value through the function (or lambda).
> Here we square every cell of the DataFrame.
```python=
get_square = lambda x: x**2

data_sum = poker_hands.apply(get_square)
print(data_sum.head())
#...
```
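> Worth keeping in mind: `df.apply(func)` actually passes each column to `func` as a whole Series, and the squaring example works only because `x**2` broadcasts over a Series. A small self-contained sketch on toy data (my own example) contrasting this with a plain vectorized expression:
```python=
import pandas as pd

toy = pd.DataFrame({'R1': [1, 2, 3], 'R2': [4, 5, 6]})

# apply() hands each column (a Series) to the lambda; ** then broadcasts element-wise
squared_apply = toy.apply(lambda x: x ** 2)

# The same result with a direct vectorized expression, no apply() needed
squared_vec = toy ** 2

print(squared_apply.equals(squared_vec))
# True
```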
## 3. Vectorization over Pandas series
![](https://i.imgur.com/XRQWE8B.png)

### How to perform pandas vectorization
> A vector is simply a sequence of numbers. Here we call `sum()` across each row; since it returns a single value per row, the result is a `Series` (a column), in other words a vector.
```python=
start_time = time.time()
poker[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.0026819705 seconds

poker[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1).head()
```
>output

| | |
|-------------|----|
| 0 | 47 |
| 1 | 47 |
| 2 | 47 |
| 3 | 47 |
| 4 | 47 |
| dtype: int64 | |

### Comparison to the previous methods
> Efficiency: vectorization > `.apply()` > `.iterrows()`

>iterrows()
```python=
data_generator = data.iterrows()

start_time = time.time()
for index, value in data_generator:
    sum([value[1], value[3], value[5], value[7]])
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 3.37918996 seconds
```
>apply()
```python=
start_time = time.time()
data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x),axis=1)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.637711048 seconds
```
* Difference between vectorization and the `.iterrows()` function: 111,800.75%
* Difference between vectorization and the `.apply()` function: 20,853%

## 4. Vectorization with NumPy arrays using .values()

### NumPy in pandas | numpy
* [Numpy tutorial](https://www.numpy.org/devdocs/user/quickstart.html)

> Use `np.int8` to give a column a specific dtype:
```python=
df = pd.DataFrame({'Col1':[0, 1, 2]}, dtype=np.int8)
print(df)
```
>output

| | Col1 |
|---------|----|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |

> `np.array()` creates an array, which behaves much like a Python `list` or a pandas `Series`:
```python=
nd = np.array(range(7))
print(nd)
# [0 1 2 3 4 5 6]
```

### How to perform NumPy vectorization | values
* [pandas.Series.values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.values.html)

> In pandas, the `.values` attribute exposes the underlying NumPy array. Below we compare summing with and without it; operating on the raw array is slightly faster than calling `sum()` on the DataFrame directly.

>with .values
```python=
start_time = time.time()
poker[['R1', 'R2', 'R3', 'R4', 'R5']].values.sum(axis=1)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.00157618522644 seconds
```
>without .values
```python=
start_time = time.time()
poker[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.00268197059631 seconds
```
* Difference in time: 39.0482%
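> To tie the chapter together, here is a small self-contained benchmark sketch (toy data and a helper of my own, not course code) that times the four approaches on the same row sums:
```python=
import time
import numpy as np
import pandas as pd

# Toy stand-in for the poker dataset: five rank columns, 10,000 rows
rng = np.random.RandomState(0)
poker = pd.DataFrame(rng.randint(1, 14, size=(10_000, 5)),
                     columns=['R1', 'R2', 'R3', 'R4', 'R5'])

def time_it(label, func):
    # Time a single call of func and report the elapsed seconds
    start_time = time.time()
    func()
    print("%s: %s seconds" % (label, time.time() - start_time))

time_it("iterrows",
        lambda: [sum([v['R1'], v['R2'], v['R3'], v['R4'], v['R5']])
                 for _, v in poker.iterrows()])
time_it("apply",       lambda: poker.apply(lambda x: sum(x), axis=1))
time_it("pandas sum",  lambda: poker.sum(axis=1))
time_it(".values sum", lambda: poker.values.sum(axis=1))
```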
---
# [ch4] Data manipulation for groups using the .groupby() function
>This chapter describes the groupby() function and how we can use it to transform values in place, replace missing values and apply complex functions group-wise.

## 1. Data transformation using .groupby().transform | groupby()
![](https://i.imgur.com/CzB4fkK.png)

> Here is the most basic groupby: it groups on the `smoker` column. Think of `smoker` as a variable that takes the two values `Yes` and `No`, so the whole DataFrame can be split into a `Yes` group and a `No` group based on it. Calling `count()` on the grouped data then shows how many records each of the two groups contains.
* [pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
```python=
restaurant_grouped = restaurant.groupby('smoker')
print(restaurant_grouped.count())
```
>output
![](https://i.imgur.com/OgAO6VN.png)

* `count()` counts records (rows): for each group it reports how many values every column contains. Since it only reflects differences between groups, the counts should agree across columns within a group — `as long as there are no missing values, every column in a group reports the same number of records.`

### Data transformation | transform()
>First we create a lambda:
```python=
zscore = lambda x: (x - x.mean() ) / x.std()
```
* [pandas.DataFrame.transform()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html)
* Compared with `apply()`, `transform()` is more restricted: the function must return either a single value or a result of the same length as the group.

>Next, grouping by `time` gives `restaurant_grouped`, and `transform()` then runs the lambda within each group:
```python=
restaurant_grouped = restaurant.groupby('time')
restaurant_transformed = restaurant_grouped.transform(zscore)
restaurant_transformed.head()
# Results from the first operation calculated in 0.016291141 seconds
```
>output
![](https://i.imgur.com/gfM97r9.png)

### Comparison with native methods
```python=
restaurant.groupby('sex').transform(zscore)
# Results from the first operation calculated in 0.016291141 seconds

mean_female = restaurant.groupby('sex').mean()['total_bill']['Female']
mean_male = restaurant.groupby('sex').mean()['total_bill']['Male']
std_female = restaurant.groupby('sex').std()['total_bill']['Female']
std_male = restaurant.groupby('sex').std()['total_bill']['Male']

# The hand-rolled version: look up each row's group statistics one row at a time
# (note: chained .iloc assignment like this may not even modify restaurant in place)
for i in range(len(restaurant)):
    if restaurant.iloc[i][2] == 'Female':
        restaurant.iloc[i][0] = (restaurant.iloc[i][0] - mean_female)/std_female
    else:
        restaurant.iloc[i][0] = (restaurant.iloc[i][0] - mean_male)/std_male
# Results from the second operation calculated in 3.937326908 seconds
```
* Difference in time: 24,068.5145%

### The min-max normalization using .transform()
```python=
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

restaurant_grouped = restaurant_data.groupby('time')
restaurant_min_max_group = restaurant_grouped.transform(min_max_tr)
print(restaurant_min_max_group.head())
#...
```

### Transforming values to probabilities
```python=
pois_tr = lambda x: np.exp(-x.mean()*x) * x.mean()

restaurant_grouped = restaurant_data.groupby('time')
restaurant_pois_group = restaurant_grouped['tip'].transform(pois_tr)
print(restaurant_pois_group.head())
```

### Validation of normalization transformation
```python=
zscore = lambda x: (x - x.mean()) / x.std()

poker_grouped = poker_hands.groupby('Class')
poker_trans = poker_grouped.transform(zscore)

# Re-group the transformed values: each group should now have mean 0 and std 1
poker_regrouped = poker_trans.groupby(poker_hands['Class'])
print(np.round(poker_regrouped.mean(), 3))
#...
print(poker_regrouped.std())
#...
```
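> Since the restaurant dataset itself is not included in these notes, here is a minimal self-contained sketch of the same group-wise z-score pattern on toy data (my own example):
```python=
import pandas as pd

toy = pd.DataFrame({'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
                    'total_bill': [10.0, 14.0, 20.0, 24.0, 28.0]})

zscore = lambda x: (x - x.mean()) / x.std()

# Each value is standardized against the mean/std of its own 'time' group
toy['bill_z'] = toy.groupby('time')['total_bill'].transform(zscore)
print(toy)
```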
## 2. Missing value imputation using transform()

> We can use `count()` to locate missing values: `count()` only counts non-missing entries, so subtracting the counts of the dataset containing NaNs from the counts of the full dataset gives the number of missing values per group.
```python=
prior_counts = restaurant.groupby('time')['total_bill'].count()
missing_counts = restaurant_nan.groupby('time')['total_bill'].count()
print(prior_counts - missing_counts)
```
>output
![](https://i.imgur.com/RA2kw9d.png)

### Missing value imputation | fillna()
>We use `fillna()` to fill the missing values with the group mean.
* [pandas.DataFrame.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
```python=
missing_trans = lambda x: x.fillna(x.mean())

restaurant_nan_grouped = restaurant_nan.groupby('time')['total_bill']
restaurant_nan_grouped.transform(missing_trans)
# Results from the above operation calculated in 0.00368881225586 seconds
```
>output
![](https://i.imgur.com/tZX5asP.png)

### Comparison with native methods
>The hand-rolled approach takes a long detour and is far slower:
```python=
start_time = time.time()

mean_din = restaurant_nan.loc[restaurant_nan.time == 'Dinner']['total_bill'].mean()
mean_lun = restaurant_nan.loc[restaurant_nan.time == 'Lunch']['total_bill'].mean()

for row in range(len(restaurant_nan)):
    if restaurant_nan.iloc[row]['time'] == 'Dinner':
        restaurant_nan.loc[row, 'total_bill'] = mean_din
    else:
        restaurant_nan.loc[row, 'total_bill'] = mean_lun

print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))
# Results from the above operation calculated in 0.172566890717 seconds
```
* Difference in time: 4,578.115%

### Identifying missing values
```python=
restaurant_nan_grouped = restaurant_nan.groupby('smoker')
restaurant_nan_nval = restaurant_nan_grouped['tip'].count()
print(restaurant_nan_grouped['total_bill'].count() - restaurant_nan_nval)
# smoker
# No     41
# Yes    24
# dtype: int64
```

### Missing value imputation
```python=
missing_trans = lambda x: x.fillna(x.median())

restaurant_grouped = restaurant_data.groupby('time')
restaurant_impute = restaurant_grouped.transform(missing_trans)
print(restaurant_impute.head())
#...
```

## 3. Data filtration using the filter() function

### Purpose of filter()
* Limit results based on an aggregate feature, e.g.:
    1. Number of missing values
    2. Mean of a specific feature
    3. Number of occurrences of the group

### Filter using groupby().filter() | filter()
> `filter()` is also faster than the hand-rolled alternative.
* [pandas.DataFrame.filter()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
* Note: the linked `DataFrame.filter()` selects rows/columns purely by index label; what is used below is `GroupBy.filter()`, which keeps or drops entire groups based on an aggregate condition.
```python=
restaurant_grouped = restaurant.groupby('day')
filter_trans = lambda x : x['total_bill'].mean() > 20

restaurant_filtered = restaurant_grouped.filter(filter_trans)
# Results from the above operation calculated in 0.00414085388184 seconds

print(restaurant_filtered['tip'].mean())
# 3.11527607362
print(restaurant['tip'].mean())
# 2.9982786885245902
```

### Comparison with native methods
```python=
# Collect the tips of every day whose mean total_bill exceeds 20, then stitch them together
t = [restaurant.loc[restaurant['day'] == i]['tip']
     for i in restaurant['day'].unique()
     if restaurant.loc[restaurant['day'] == i]['total_bill'].mean() > 20]

restaurant_filtered = t[0]
for j in t[1:]:
    # Series.append() was removed in pandas 2.0; use pd.concat() there instead
    restaurant_filtered = restaurant_filtered.append(j, ignore_index=True)
# Results from the above operation calculated in 0.00663900375366 seconds

print(restaurant_filtered.mean())
# 3.11527607362
```
* Difference in time: 60.329341317157024%
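> As with the transform examples, here is a tiny self-contained sketch of the same `GroupBy.filter()` pattern on toy data (my own example, not course code):
```python=
import pandas as pd

toy = pd.DataFrame({'day': ['Thur', 'Thur', 'Sat', 'Sat', 'Sun'],
                    'total_bill': [12.0, 14.0, 30.0, 26.0, 18.0],
                    'tip': [2.0, 2.5, 5.0, 4.0, 3.0]})

# Keep only the rows belonging to days whose mean total_bill exceeds 20
busy_days = toy.groupby('day').filter(lambda x: x['total_bill'].mean() > 20)
print(busy_days)
# Only the 'Sat' rows survive: their mean total_bill is 28 > 20
```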