# TOPIC
- Trouble when using Pandas
- Pandas & Polars
## Pandas DataFrame
```python
import pandas as pd
import numpy as np
print(pd.__version__)
data = {"x": 2**np.arange(5),
"y": 3**np.arange(5),
"z": np.array([45, 98, 24, 11, 64])}
index = ["a", "b", "c", "d", "e"]
df = pd.DataFrame(data=data, index=index)
df
```
2.2.0
<div>
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>b</th>
<td>2</td>
<td>3</td>
<td>98</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
<tr>
<th>e</th>
<td>16</td>
<td>81</td>
<td>64</td>
</tr>
</tbody>
</table>
</div>
Main information contained in the DataFrame:
- data
- column labels
- row labels
- some metadata, including shape, data types, and so on
```python
mask = df["z"] < 50
print(f"type of mask: {type(mask)}")
print("-----------------")
print(mask)
print("-----------------")
df[mask]
```
type of mask: <class 'pandas.core.series.Series'>
-----------------
a True
b False
c True
d True
e False
Name: z, dtype: bool
-----------------
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
</tbody>
</table>
</div>
## Why use pandas?
1. 幫助計算分析資料 aggregate function, pivot table
2. 減少hit db次數
### Where I get into trouble
Something related to `generate_webpopup_report.py`
Main process of this command:
1. load all data we need one time as a dataframe (`datetime`, `event_action`, `user`, `session`,...)
2. iterate each `template`, filter the dataframe
```python
for template_uid, template_info in template_infos.items():
impression_df = base_event_df[
(base_event_df["event_value"] == template_uid)
& (base_event_df["event_action"] == EventActionType.WEB_POPUP_IMPRESSION.value)
].drop_duplicates(subset="event_label", keep="first")
...
# click_df = ....
# cloas_df = ....
# calculate conversion data...
# 區分: 會員 / 匿名
```
## Views and Copies in NumPy and pandas
sometimes data is copied from one part of memory to another,
but in other cases two or more objects can share the same data, saving both time and **memory**.
### Understanding Views and Copies in NumPy
```python
import numpy as np
arr = np.array([1, 2, 4, 8, 16, 32])
print(arr)
print('\n')
print(f"arr[1:4:2] -> {arr[1:4:2]}")
print('\n')
print(f"arr[[1, 3]] -> {arr[[1, 3]]}")
```
[ 1 2 4 8 16 32]
arr[1:4:2] -> [2 8]
arr[[1, 3]] -> [2 8]
#### Different behavior under the surface
`arr[1:4:2]` -> shallow copy (`view`)
`arr[[1, 3]` -> deep copy (or just `copy`)
##### Check by:
`.base`: [Base object if memory is from some other object.](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.base.html)
如果 object array 是擁有自己的 memory -> `None`, 否,則指向 參照的 object
`.flags.OWNDATA`: [Information about the memory layout of the array.](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html)
The array owns the memory it uses or borrows it from another object.
```python
print('------arr[1:4:2]------')
print(arr[1:4:2].base, f"(base is arr={arr[1:4:2].base is arr})",)
print(arr[1:4:2].flags.owndata)
print('------arr[[1, 3]]------')
print(arr[[1, 3]].base)
print(arr[[1, 3]].flags.owndata)
```
------arr[1:4:2]------
[ 1 2 4 8 16 32] (base is arr=True)
False
------arr[[1, 3]]------
None
True
### Understanding Views and Copies in pandas
- `.copy(deep=True)` # default
- `.copy(deep=False)`
--------------------------------------------
- `loc[:, :]`: Label-based Indexer
- `iloc[:,:]`: Integer-location based Indexer
```python
df = pd.DataFrame(data=data, index=index)
print('\n-----df-------')
print(df)
print('\n-----deep_copy_of_df-------')
deep_copy_of_df = df.copy(deep=True)
print(deep_copy_of_df)
print('\n-----shallow_copy_of_df-------')
shallow_copy_of_df = df.copy(deep=False)
print(shallow_copy_of_df)
print('\n-----loc_of_df-------')
loc_of_df = df.loc[:, :]
print(loc_of_df)
print('\n------iloc_of_df------')
iloc_of_df = df.iloc[:, :]
print(iloc_of_df)
```
-----df-------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----deep_copy_of_df-------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----shallow_copy_of_df-------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----loc_of_df-------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
------iloc_of_df------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
```python
df.iloc[0, 0] = 10000
print('\n------df------')
print(df)
print('\n------deep_copy_of_df------')
print(deep_copy_of_df)
print('\n------shallow_copy_of_df------')
print(shallow_copy_of_df)
print('\n-----loc_of_df-------')
print(loc_of_df)
print('\n-----iloc_of_df-------')
print(iloc_of_df)
```
------df------
x y z
a 10000 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
------deep_copy_of_df------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
------shallow_copy_of_df------
x y z
a 10000 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----loc_of_df-------
x y z
a 10000 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----iloc_of_df-------
x y z
a 10000 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
```python
# Using np.shares_memory: If the data in the DataFrames is stored as a NumPy array (which is common in pandas),
# you can use np.shares_memory to check if two arrays share the same data in memory.
print('====Data share same memory?====')
print(np.shares_memory(df.values, deep_copy_of_df.values))
print(np.shares_memory(df.values, shallow_copy_of_df.values))
print(np.shares_memory(df.values, loc_of_df.values))
print(np.shares_memory(df.values, iloc_of_df.values))
```
====Data share same memory?====
False
True
True
True
### copy_on_write
**`CoW` -> use a lazy copy mechanism to defer the copy and ignore the copy keyword**
Default for pandas3
Copy-on-Write was first introduced in version 1.5.0.
Starting from version 2.0 most of the optimizations that become possible through CoW are implemented and supported.
All possible optimizations are supported starting from pandas 2.1.
```python
df = pd.DataFrame(data=data, index=index)
# This will be True by default for pandas3
with pd.option_context("mode.copy_on_write", True):
# same operations as above
deep_copy_of_df = df.copy(deep=True)
shallow_copy_of_df = df.copy(deep=False)
loc_of_df = df.loc[:, :]
iloc_of_df = df.iloc[:, :]
df.iloc[0, 0] = 10000
```
```python
print('\n------df------')
print(df)
print('\n------deep_copy_of_df------')
print(deep_copy_of_df)
print('\n------shallow_copy_of_df------')
print(shallow_copy_of_df)
print('\n-----loc_of_df-------')
print(loc_of_df)
print('\n-----iloc_of_df-------')
print(iloc_of_df)
```
------df------
x y z
a 10000 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
------deep_copy_of_df------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
------shallow_copy_of_df------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----loc_of_df-------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
-----iloc_of_df-------
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
```python
print('====Data share same memory?====')
print(np.shares_memory(df.values, deep_copy_of_df.values))
print(np.shares_memory(df.values, shallow_copy_of_df.values))
print(np.shares_memory(df.values, loc_of_df.values))
print(np.shares_memory(df.values, iloc_of_df.values))
```
====Data share same memory?====
False
False
False
False
### query return copy or view?
pandas2.2 (OS use now)
```python
import numpy as np
import pandas as pd
data = {"x": 2**np.arange(5),
"y": 3**np.arange(5),
"z": np.array([45, 98, 24, 11, 64])}
index = ["a", "b", "c", "d", "e"]
df = pd.DataFrame(data=data, index=index)
df
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>b</th>
<td>2</td>
<td>3</td>
<td>98</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
<tr>
<th>e</th>
<td>16</td>
<td>81</td>
<td>64</td>
</tr>
</tbody>
</table>
</div>
```python
with pd.option_context("mode.copy_on_write", False):
view = df.query('x > 0')
print(view.to_numpy().base is df.to_numpy().base)
print(np.shares_memory(df.to_numpy(), view.to_numpy()))
view
```
False
False
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>b</th>
<td>2</td>
<td>3</td>
<td>98</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
<tr>
<th>e</th>
<td>16</td>
<td>81</td>
<td>64</td>
</tr>
</tbody>
</table>
</div>
```python
with pd.option_context("mode.copy_on_write", True):
view = df.query('x > 0')
print(view.to_numpy().base is df.to_numpy().base)
print(np.shares_memory(df.to_numpy(), view.to_numpy()))
view
```
True
True
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>b</th>
<td>2</td>
<td>3</td>
<td>98</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
<tr>
<th>e</th>
<td>16</td>
<td>81</td>
<td>64</td>
</tr>
</tbody>
</table>
</div>
```python
with pd.option_context("mode.copy_on_write", False):
view = df.loc[df.x>0,:]
print(view.to_numpy().base is df.to_numpy().base)
print(np.shares_memory(df.to_numpy(), view.to_numpy()))
view
```
False
False
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>b</th>
<td>2</td>
<td>3</td>
<td>98</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
<tr>
<th>e</th>
<td>16</td>
<td>81</td>
<td>64</td>
</tr>
</tbody>
</table>
</div>
```python
with pd.option_context("mode.copy_on_write", True):
view = df.loc[df.x>0,:]
print(view.to_numpy().base is df.to_numpy().base)
print(np.shares_memory(df.to_numpy(), view.to_numpy()))
view
```
True
True
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
<th>z</th>
</tr>
</thead>
<tbody>
<tr>
<th>a</th>
<td>1</td>
<td>1</td>
<td>45</td>
</tr>
<tr>
<th>b</th>
<td>2</td>
<td>3</td>
<td>98</td>
</tr>
<tr>
<th>c</th>
<td>4</td>
<td>9</td>
<td>24</td>
</tr>
<tr>
<th>d</th>
<td>8</td>
<td>27</td>
<td>11</td>
</tr>
<tr>
<th>e</th>
<td>16</td>
<td>81</td>
<td>64</td>
</tr>
</tbody>
</table>
</div>
### Conclusion on Using Pandas
- Be careful not to accidentally copy large dataframes.
- `.query`
- `df[df["A"] == 1]`
- Sometimes it's confused, since there are so many way to access the same data.
- `df["A"]`
- `df.loc[:, "A"]`
- `df.iloc[:, 0]`
So, that's try other tools! maybe start from [Polars](https://hackmd.io/@WendyYeh/r1UCiG9Y0)?