owned this note
owned this note
Published
Linked with GitHub
# High-Performance Pandas: eval() and query()
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `Python Data Science Handbook`
[Python Data Science Handbook CH3](https://jakevdp.github.io/PythonDataScienceHandbook/)
[Enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html)
[TOC]
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the `eval()` and `query()` functions, which rely on the [Numexpr](https://github.com/pydata/numexpr) package.
## Motivating query() and eval(): Compound Expressions
> `pandas` 與 `numpy` 都提供許多方便的vectorized operations:
>
```python=
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y
# 100 loops, best of 3: 3.39 ms per loop
```
:new_moon: 請參考[Computation on NumPy Arrays: Universal Functions](/jThpTB_KTTmZqYzP0CRkWg)
> 向量化操作的速度比起迴圈要快的多
```python=
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))
# 1 loop, best of 3: 266 ms per loop
```
> 但如果包含複合地表達式則會降低效率
> 如:
```python=
mask = (x > 0.5) & (y < 0.5)
```
> Because `NumPy` evaluates each subexpression, this is roughly equivalent to the following:
```python=
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2
```
:notebook: 因為每個步驟都會暫存於記憶題,因此如果`array`很大時暫存資料的存儲跟讀寫負擔就會很大。
> `Numexpr` 計算則不需要如此,能直接進行計算(詳細可以參考[numexpr文件](https://github.com/pydata/numexpr))
```python=
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)
# True
```
> The Pandas `eval()` and `query()` tools that we will discuss here are conceptually similar, and depend on the `Numexpr` package.
>
## `pandas.eval()` for Efficient Operations
> The `eval()` function in Pandas uses ==string expressions== to efficiently compute operations using DataFrames.
```python=
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
for i in range(4))
```
> 用pandas的方法加總四個DF
```python=
%timeit df1 + df2 + df3 + df4
# 10 loops, best of 3: 87.1 ms per loop
```
> `pd.eval` by constructing the expression as a string:
```python=
%timeit pd.eval('df1 + df2 + df3 + df4')
# 10 loops, best of 3: 42.2 ms per loop
```
:waning_crescent_moon: eval() version速度快很多,而且記憶體使用也更少
> 兩者的結果一致
```python=
np.allclose(df1 + df2 + df3 + df4,
pd.eval('df1 + df2 + df3 + df4'))
# True
```
### Operations supported by `pd.eval()`
```python=
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
for i in range(5))
```
#### Arithmetic operators
> `pd.eval()` 可以用於數學運算
```python=
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)
# True
```
#### Comparison operators
> `pd.eval()` 用於comparison operators
```python=
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
# True
```
#### Bitwise operators
> `pd.eval()` supports the & and | bitwise operators:
```python=
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)
# True
```
> `and` 與 `or` 的表達式也同樣支持
```python=
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)
# True
```
#### Object attributes and indices
> `pd.eval()` 甚至可以支持如`obj.attr` 或者 `obj[index]`調用attribute/indexes syntax:
```python=
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)
# True
```
#### Other operations
Other operations such as <font color=red>function calls, conditional statements, loops, and other more involved constructs</font> are currently ==not== implemented in `pd.eval()`. If you'd like to execute these more complicated types of expressions, you can use the `Numexpr` library itself.
## `DataFrame.eval()` for Column-Wise Operations
`DataFrames` 本身也支持 `eval()` method
> `eval()` method 最大的特色是可以直接用columna name調用column
```python=
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
```

> Using `pd.eval()` as above, we can compute expressions with the three columns like this:
```python=
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)
# True
```
> `DataFrame.eval()` 進行上面的運算時更佳方便
```python=
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)
# True
```
:notebook: `eval()` method 中,column name在evaluated expression中被視為獨立的variable,可以直接取用
### Assignment in `DataFrame.eval()`
> `DataFrame.eval()` also allows ==assignment to any column==
```python=
df.head()
```

> `df.eval()` 對特定欄位運算後 create column "D"
```python=
df.eval('D = (A + B) / C', inplace=True)
df.head()
```

> 也可也用於modified existed column
```python=
df.eval('D = (A - B) / C', inplace=True)
df.head()
```

### Local variables in `DataFrame.eval()`
> `DataFrame.eval()` 可以用`@`調用local variable
```python=
column_mean = df.mean(1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)
# True
```
:waning_crescent_moon: 藉由`@`可以調用兩個不同的namespaces: namespace of columns; namespace of python objects
:::warning
`@` character 只支持`DataFrame.eval()` method,不支持`pandas.eval()`。因為後者指支持python的namespace
:::
## `DataFrame.query()` Method
```python=
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)
# True
```
> 此例中,因為expression中涉及DF column,無法直接以`DataFrame.eval()`操作。不過此時可以改用`query()` method
```python=
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)
# True
```
> `query()` 也支持`@`調用local variable
```python=
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)
# True
```