TOPIC - HackMD

# TOPIC - Trouble when using Pandas - Pandas & Polars ## Pandas DataFrame ```python import pandas as pd import numpy as np print(pd.__version__) data = {"x": 2**np.arange(5), "y": 3**np.arange(5), "z": np.array([45, 98, 24, 11, 64])} index = ["a", "b", "c", "d", "e"] df = pd.DataFrame(data=data, index=index) df ``` 2.2.0 <div> </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>b</th> <td>2</td> <td>3</td> <td>98</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> <tr> <th>e</th> <td>16</td> <td>81</td> <td>64</td> </tr> </tbody> </table> </div> Main information contained in the DataFrame: - data - column labels - row labels - some metadata, including shape, data types, and so on ```python mask = df["z"] < 50 print(f"type of mask: {type(mask)}") print("-----------------") print(mask) print("-----------------") df[mask] ``` type of mask: <class 'pandas.core.series.Series'> ----------------- a True b False c True d True e False Name: z, dtype: bool ----------------- <div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> </tbody> </table> </div> ## Why use pandas? 1. 幫助計算分析資料 aggregate function, pivot table 2. 減少hit db次數 ### Where I get into trouble Something related to `generate_webpopup_report.py` Main process of this command: 1. load all data we need one time as a dataframe (`datetime`, `event_action`, `user`, `session`,...) 2. iterate each `template`, filter the dataframe ```python for template_uid, template_info in template_infos.items(): impression_df = base_event_df[ (base_event_df["event_value"] == template_uid) & (base_event_df["event_action"] == EventActionType.WEB_POPUP_IMPRESSION.value) ].drop_duplicates(subset="event_label", keep="first") ... # click_df = .... # cloas_df = .... # calculate conversion data... # 區分：會員 / 匿名 ``` ## Views and Copies in NumPy and pandas sometimes data is copied from one part of memory to another, but in other cases two or more objects can share the same data, saving both time and **memory**. ### Understanding Views and Copies in NumPy ```python import numpy as np arr = np.array([1, 2, 4, 8, 16, 32]) print(arr) print('\n') print(f"arr[1:4:2] -> {arr[1:4:2]}") print('\n') print(f"arr[[1, 3]] -> {arr[[1, 3]]}") ``` [ 1 2 4 8 16 32] arr[1:4:2] -> [2 8] arr[[1, 3]] -> [2 8] #### Different behavior under the surface `arr[1:4:2]` -> shallow copy (`view`) `arr[[1, 3]` -> deep copy (or just `copy`) ##### Check by: `.base`: [Base object if memory is from some other object.](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.base.html) 如果 object array 是擁有自己的 memory -> `None`, 否，則指向參照的 object `.flags.OWNDATA`: [Information about the memory layout of the array.](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html) The array owns the memory it uses or borrows it from another object. ```python print('------arr[1:4:2]------') print(arr[1:4:2].base, f"(base is arr={arr[1:4:2].base is arr})",) print(arr[1:4:2].flags.owndata) print('------arr[[1, 3]]------') print(arr[[1, 3]].base) print(arr[[1, 3]].flags.owndata) ``` ------arr[1:4:2]------ [ 1 2 4 8 16 32] (base is arr=True) False ------arr[[1, 3]]------ None True ### Understanding Views and Copies in pandas - `.copy(deep=True)` # default - `.copy(deep=False)` -------------------------------------------- - `loc[:, :]`: Label-based Indexer - `iloc[:,:]`: Integer-location based Indexer ```python df = pd.DataFrame(data=data, index=index) print('\n-----df-------') print(df) print('\n-----deep_copy_of_df-------') deep_copy_of_df = df.copy(deep=True) print(deep_copy_of_df) print('\n-----shallow_copy_of_df-------') shallow_copy_of_df = df.copy(deep=False) print(shallow_copy_of_df) print('\n-----loc_of_df-------') loc_of_df = df.loc[:, :] print(loc_of_df) print('\n------iloc_of_df------') iloc_of_df = df.iloc[:, :] print(iloc_of_df) ``` -----df------- x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----deep_copy_of_df------- x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----shallow_copy_of_df------- x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----loc_of_df------- x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ------iloc_of_df------ x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ```python df.iloc[0, 0] = 10000 print('\n------df------') print(df) print('\n------deep_copy_of_df------') print(deep_copy_of_df) print('\n------shallow_copy_of_df------') print(shallow_copy_of_df) print('\n-----loc_of_df-------') print(loc_of_df) print('\n-----iloc_of_df-------') print(iloc_of_df) ``` ------df------ x y z a 10000 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ------deep_copy_of_df------ x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ------shallow_copy_of_df------ x y z a 10000 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----loc_of_df------- x y z a 10000 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----iloc_of_df------- x y z a 10000 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ```python # Using np.shares_memory: If the data in the DataFrames is stored as a NumPy array (which is common in pandas), # you can use np.shares_memory to check if two arrays share the same data in memory. print('====Data share same memory?====') print(np.shares_memory(df.values, deep_copy_of_df.values)) print(np.shares_memory(df.values, shallow_copy_of_df.values)) print(np.shares_memory(df.values, loc_of_df.values)) print(np.shares_memory(df.values, iloc_of_df.values)) ``` ====Data share same memory?==== False True True True ### copy_on_write **`CoW` -> use a lazy copy mechanism to defer the copy and ignore the copy keyword** Default for pandas3 Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1. ```python df = pd.DataFrame(data=data, index=index) # This will be True by default for pandas3 with pd.option_context("mode.copy_on_write", True): # same operations as above deep_copy_of_df = df.copy(deep=True) shallow_copy_of_df = df.copy(deep=False) loc_of_df = df.loc[:, :] iloc_of_df = df.iloc[:, :] df.iloc[0, 0] = 10000 ``` ```python print('\n------df------') print(df) print('\n------deep_copy_of_df------') print(deep_copy_of_df) print('\n------shallow_copy_of_df------') print(shallow_copy_of_df) print('\n-----loc_of_df-------') print(loc_of_df) print('\n-----iloc_of_df-------') print(iloc_of_df) ``` ------df------ x y z a 10000 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ------deep_copy_of_df------ x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ------shallow_copy_of_df------ x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----loc_of_df------- x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 -----iloc_of_df------- x y z a 1 1 45 b 2 3 98 c 4 9 24 d 8 27 11 e 16 81 64 ```python print('====Data share same memory?====') print(np.shares_memory(df.values, deep_copy_of_df.values)) print(np.shares_memory(df.values, shallow_copy_of_df.values)) print(np.shares_memory(df.values, loc_of_df.values)) print(np.shares_memory(df.values, iloc_of_df.values)) ``` ====Data share same memory?==== False False False False ### query return copy or view? pandas2.2 (OS use now) ```python import numpy as np import pandas as pd data = {"x": 2**np.arange(5), "y": 3**np.arange(5), "z": np.array([45, 98, 24, 11, 64])} index = ["a", "b", "c", "d", "e"] df = pd.DataFrame(data=data, index=index) df ``` <div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>b</th> <td>2</td> <td>3</td> <td>98</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> <tr> <th>e</th> <td>16</td> <td>81</td> <td>64</td> </tr> </tbody> </table> </div> ```python with pd.option_context("mode.copy_on_write", False): view = df.query('x > 0') print(view.to_numpy().base is df.to_numpy().base) print(np.shares_memory(df.to_numpy(), view.to_numpy())) view ``` False False <div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>b</th> <td>2</td> <td>3</td> <td>98</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> <tr> <th>e</th> <td>16</td> <td>81</td> <td>64</td> </tr> </tbody> </table> </div> ```python with pd.option_context("mode.copy_on_write", True): view = df.query('x > 0') print(view.to_numpy().base is df.to_numpy().base) print(np.shares_memory(df.to_numpy(), view.to_numpy())) view ``` True True <div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>b</th> <td>2</td> <td>3</td> <td>98</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> <tr> <th>e</th> <td>16</td> <td>81</td> <td>64</td> </tr> </tbody> </table> </div> ```python with pd.option_context("mode.copy_on_write", False): view = df.loc[df.x>0,:] print(view.to_numpy().base is df.to_numpy().base) print(np.shares_memory(df.to_numpy(), view.to_numpy())) view ``` False False <div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>b</th> <td>2</td> <td>3</td> <td>98</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> <tr> <th>e</th> <td>16</td> <td>81</td> <td>64</td> </tr> </tbody> </table> </div> ```python with pd.option_context("mode.copy_on_write", True): view = df.loc[df.x>0,:] print(view.to_numpy().base is df.to_numpy().base) print(np.shares_memory(df.to_numpy(), view.to_numpy())) view ``` True True <div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>x</th> <th>y</th> <th>z</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1</td> <td>1</td> <td>45</td> </tr> <tr> <th>b</th> <td>2</td> <td>3</td> <td>98</td> </tr> <tr> <th>c</th> <td>4</td> <td>9</td> <td>24</td> </tr> <tr> <th>d</th> <td>8</td> <td>27</td> <td>11</td> </tr> <tr> <th>e</th> <td>16</td> <td>81</td> <td>64</td> </tr> </tbody> </table> </div> ### Conclusion on Using Pandas - Be careful not to accidentally copy large dataframes. - `.query` - `df[df["A"] == 1]` - Sometimes it's confused, since there are so many way to access the same data. - `df["A"]` - `df.loc[:, "A"]` - `df.iloc[:, 0]` So, that's try other tools! maybe start from [Polars](https://hackmd.io/@WendyYeh/r1UCiG9Y0)?