High-Performance Pandas: eval() and query()

# High-Performance Pandas: eval() and query() > Lee Tsung-Tang > ###### tags: `python` `pandas` `Python Data Science Handbook` [Python Data Science Handbook CH3](https://jakevdp.github.io/PythonDataScienceHandbook/) [Enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) [TOC] {%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %} As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the `eval()` and `query()` functions, which rely on the [Numexpr](https://github.com/pydata/numexpr) package. ## Motivating query() and eval(): Compound Expressions > `pandas` 與 `numpy` 都提供許多方便的vectorized operations: > ```python= import numpy as np rng = np.random.RandomState(42) x = rng.rand(1000000) y = rng.rand(1000000) %timeit x + y # 100 loops, best of 3: 3.39 ms per loop ``` :new_moon: 請參考[Computation on NumPy Arrays: Universal Functions](/jThpTB_KTTmZqYzP0CRkWg) > 向量化操作的速度比起迴圈要快的多 ```python= %timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x)) # 1 loop, best of 3: 266 ms per loop ``` > 但如果包含複合地表達式則會降低效率 > 如： ```python= mask = (x > 0.5) & (y < 0.5) ``` > Because `NumPy` evaluates each subexpression, this is roughly equivalent to the following: ```python= tmp1 = (x > 0.5) tmp2 = (y < 0.5) mask = tmp1 & tmp2 ``` :notebook: 因為每個步驟都會暫存於記憶題，因此如果`array`很大時暫存資料的存儲跟讀寫負擔就會很大。 > `Numexpr` 計算則不需要如此，能直接進行計算(詳細可以參考[numexpr文件](https://github.com/pydata/numexpr)) ```python= import numexpr mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)') np.allclose(mask, mask_numexpr) # True ``` > The Pandas `eval()` and `query()` tools that we will discuss here are conceptually similar, and depend on the `Numexpr` package. > ## `pandas.eval()` for Efficient Operations > The `eval()` function in Pandas uses ==string expressions== to efficiently compute operations using DataFrames. ```python= import pandas as pd nrows, ncols = 100000, 100 rng = np.random.RandomState(42) df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4)) ``` > 用pandas的方法加總四個DF ```python= %timeit df1 + df2 + df3 + df4 # 10 loops, best of 3: 87.1 ms per loop ``` > `pd.eval` by constructing the expression as a string: ```python= %timeit pd.eval('df1 + df2 + df3 + df4') # 10 loops, best of 3: 42.2 ms per loop ``` :waning_crescent_moon: eval() version速度快很多，而且記憶體使用也更少 > 兩者的結果一致 ```python= np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4')) # True ``` ### Operations supported by `pd.eval()` ```python= df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for i in range(5)) ``` #### Arithmetic operators > `pd.eval()` 可以用於數學運算 ```python= result1 = -df1 * df2 / (df3 + df4) - df5 result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5') np.allclose(result1, result2) # True ``` #### Comparison operators > `pd.eval()` 用於comparison operators ```python= result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4) result2 = pd.eval('df1 < df2 <= df3 != df4') np.allclose(result1, result2) # True ``` #### Bitwise operators > `pd.eval()` supports the & and | bitwise operators: ```python= result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4) result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)') np.allclose(result1, result2) # True ``` > `and` 與 `or` 的表達式也同樣支持 ```python= result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)') np.allclose(result1, result3) # True ``` #### Object attributes and indices > `pd.eval()` 甚至可以支持如`obj.attr` 或者 `obj[index]`調用attribute/indexes syntax: ```python= result1 = df2.T[0] + df3.iloc[1] result2 = pd.eval('df2.T[0] + df3.iloc[1]') np.allclose(result1, result2) # True ``` #### Other operations Other operations such as <font color=red>function calls, conditional statements, loops, and other more involved constructs</font> are currently ==not== implemented in `pd.eval()`. If you'd like to execute these more complicated types of expressions, you can use the `Numexpr` library itself. ## `DataFrame.eval()` for Column-Wise Operations `DataFrames` 本身也支持 `eval()` method > `eval()` method 最大的特色是可以直接用columna name調用column ```python= df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C']) df.head() ``` ![](https://i.imgur.com/0EZUHH6.png) > Using `pd.eval()` as above, we can compute expressions with the three columns like this: ```python= result1 = (df['A'] + df['B']) / (df['C'] - 1) result2 = pd.eval("(df.A + df.B) / (df.C - 1)") np.allclose(result1, result2) # True ``` > `DataFrame.eval()` 進行上面的運算時更佳方便 ```python= result3 = df.eval('(A + B) / (C - 1)') np.allclose(result1, result3) # True ``` :notebook: `eval()` method 中，column name在evaluated expression中被視為獨立的variable，可以直接取用 ### Assignment in `DataFrame.eval()` > `DataFrame.eval()` also allows ==assignment to any column== ```python= df.head() ``` ![](https://i.imgur.com/PiazDB7.png) > `df.eval()` 對特定欄位運算後 create column "D" ```python= df.eval('D = (A + B) / C', inplace=True) df.head() ``` ![](https://i.imgur.com/44TDKQ2.png) > 也可也用於modified existed column ```python= df.eval('D = (A - B) / C', inplace=True) df.head() ``` ![](https://i.imgur.com/pWnGAk4.png) ### Local variables in `DataFrame.eval()` > `DataFrame.eval()` 可以用`@`調用local variable ```python= column_mean = df.mean(1) result1 = df['A'] + column_mean result2 = df.eval('A + @column_mean') np.allclose(result1, result2) # True ``` :waning_crescent_moon: 藉由`@`可以調用兩個不同的namespaces: namespace of columns; namespace of python objects :::warning `@` character 只支持`DataFrame.eval()` method，不支持`pandas.eval()`。因為後者指支持python的namespace ::: ## `DataFrame.query()` Method ```python= result1 = df[(df.A < 0.5) & (df.B < 0.5)] result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]') np.allclose(result1, result2) # True ``` > 此例中，因為expression中涉及DF column，無法直接以`DataFrame.eval()`操作。不過此時可以改用`query()` method ```python= result2 = df.query('A < 0.5 and B < 0.5') np.allclose(result1, result2) # True ``` > `query()` 也支持`@`調用local variable ```python= Cmean = df['C'].mean() result1 = df[(df.A < Cmean) & (df.B < Cmean)] result2 = df.query('A < @Cmean and B < @Cmean') np.allclose(result1, result2) # True ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.