# Vectorized String Operations
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `vectorized` `string manipulation` `Python Data Science Handbook`
>
See also:
- Official pandas documentation: [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)
- http://www.datasciencemadesimple.com/string-compare-in-pandas-python-test-whether-two-strings-are-equal-2/
[TOC]
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
## Introducing Pandas String Operations
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
> As introduced in [Computation on NumPy Arrays: Universal Functions](/jThpTB_KTTmZqYzP0CRkWg), an `array` makes it easy to compute on every element directly through vectorized operations:
```python=
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
# array([ 4, 6, 10, 14, 22, 26])
```
> However, for ==arrays of strings==, NumPy does *not* provide such simple access, and thus you're stuck using a more verbose loop syntax:
```python=
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
# ['Peter', 'Paul', 'Mary', 'Guido']
```
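> This is not only more cumbersome; it also raises an error as soon as the data contains a missing value:
```python=
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
# AttributeError: 'NoneType' object has no attribute 'capitalize'
```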

> `Pandas` includes features to address both this need for <font color=#0099ff>vectorized string operations</font> and for <font color=#0099ff>correctly handling missing data</font> via the `str` attribute of Pandas `Series` and `Index` objects containing strings.
```python=
import pandas as pd
names = pd.Series(data)
names
#0 peter
#1 Paul
#2 None
#3 MARY
#4 gUIDO
#dtype: object
```
> For example, capitalizing through the `str` attribute:
```python=
names.str.capitalize()
#0 Peter
#1 Paul
#2 None
#3 Mary
#4 Guido
#dtype: object
```
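> As a rough comparison (a minimal sketch, not part of the original example; the names `manual` and `vectorized` are chosen here for illustration), the plain-Python version needs an explicit guard for missing values, while the `str` accessor passes them through on its own:

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain Python: every element needs an explicit missing-value guard.
manual = [s.capitalize() if s is not None else None for s in data]

# Pandas: the .str accessor leaves missing entries untouched.
vectorized = pd.Series(data).str.capitalize()

print(manual)  # ['Peter', 'Paul', None, 'Mary', 'Guido']
print(vectorized.dropna().tolist())  # ['Peter', 'Paul', 'Mary', 'Guido']
```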
## Tables of Pandas String Methods
Pandas string operations closely resemble Python's built-in string methods.
```python=
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
```
> Here is a list of Pandas str methods that mirror Python string methods:

> Notice that these have ==various return values==. Some, like `lower()`, return a series of strings:
```python=
monte.str.lower()
#0 graham chapman
#1 john cleese
#2 terry gilliam
#3 eric idle
#4 terry jones
#5 michael palin
#dtype: object
```
> But some others return numbers:
```python=
monte.str.len()
#0 14
#1 11
#2 13
#3 9
#4 11
#5 13
#dtype: int64
```
> Or Boolean values:
```python=
monte.str.startswith('T')
#0 False
#1 False
#2 True
#3 False
#4 True
#5 False
#dtype: bool
```
> Still others return `lists` or other compound values for each element:
```python=
monte.str.split()
#0 [Graham, Chapman]
#1 [John, Cleese]
#2 [Terry, Gilliam]
#3 [Eric, Idle]
#4 [Terry, Jones]
#5 [Michael, Palin]
#dtype: object
```
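> A quick sanity check, sketched under the assumption that `monte` holds no missing values: each of the `str` methods shown above should agree with the corresponding built-in method applied element by element. The `*_ok` names are illustrative only.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Compare each vectorized result with a plain list comprehension.
lower_ok = monte.str.lower().tolist() == [s.lower() for s in monte]
len_ok = monte.str.len().tolist() == [len(s) for s in monte]
starts_ok = monte.str.startswith('T').tolist() == [s.startswith('T') for s in monte]
split_ok = monte.str.split().tolist() == [s.split() for s in monte]

print(lower_ok, len_ok, starts_ok, split_ok)
```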
### Methods using regular expressions
Some `str` methods can match strings with regular expressions (they call functions from the `re` module).

|Method |Description|
|:-:|:-:|
|`match()`|Call `re.match()` on each element, returning a boolean|
|`extract()`|Call `re.search()` on each element, returning the groups of the first match as strings|
|`findall()`|Call `re.findall()` on each element|
|`replace()`|Replace occurrences of pattern with some other string|
|`contains()`|Call `re.search()` on each element, returning a boolean|
|`count()`|Count occurrences of pattern|
|`split()`|Equivalent to `str.split()`, but accepts regexps|
|`rsplit()`|Equivalent to `str.rsplit()`; splits from the right (unlike `split()`, it does not take regexps)|
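
> A small sketch of three methods from the table, applied to the `monte` series defined earlier (the result names are illustrative). It shows that `match()` anchors at the start of each string while `contains()` searches anywhere:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# match() only succeeds when the pattern matches at the start of the string.
starts_with_terry = monte.str.match(r'Terry')
# contains() succeeds when the pattern occurs anywhere in the string.
has_double_l = monte.str.contains(r'll')
# count() reports the number of matches per element (lowercase vowels here).
vowel_counts = monte.str.count(r'[aeiou]')

print(starts_with_terry.tolist())  # [False, False, True, False, True, False]
print(has_double_l.tolist())       # [False, False, True, False, False, False]
print(vowel_counts.tolist())       # [4, 4, 4, 2, 3, 5]
```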
> With these, you can do a wide range of interesting operations. For example, we can extract the ==first name== from each by asking for a <font color=#0099ff>contiguous group of characters at the beginning</font> of each element:
>
```python=
monte.str.extract('([A-Za-z]+)', expand=False)
#0 Graham
#1 John
#2 Terry
#3 Eric
#4 Terry
#5 Michael
#dtype: object
```
`expand` : `bool`, default True
- If `True`, return a `DataFrame` with one column per capture group.
- If `False`, return a `Series`/`Index` when there is one capture group, or a `DataFrame` when there are several.
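
> To make the effect of `expand` concrete, here is a minimal sketch that runs the same single-group pattern both ways (the names `as_series` and `as_frame` are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# One capture group with expand=False gives a Series.
as_series = monte.str.extract(r'([A-Za-z]+)', expand=False)
# The same pattern with expand=True gives a one-column DataFrame.
as_frame = monte.str.extract(r'([A-Za-z]+)', expand=True)

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (6, 1)
```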
More `str.extract` examples appear in the extension section near the end of this note.
> Or we can find all names that <font color=#0099ff>start and end with a consonant</font> with `str.findall()`, making use of the ==start-of-string== (`^`) and ==end-of-string== (`$`) regular expression characters:
```python=
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
#0 [Graham Chapman]
#1 []
#2 [Terry Gilliam]
#3 []
#4 [Terry Jones]
#5 [Michael Palin]
#dtype: object
```
:waning_crescent_moon: `str.findall()` returns a `list` for every element.
More `str.findall()` examples appear in the extension section near the end of this note.
### Miscellaneous methods
> Finally, there are some miscellaneous methods that enable other convenient operations:

|Method|Description|
|:--:|:--:|
|`get()`|Index each element|
|`slice()`|Slice each element|
|`slice_replace()`|Replace slice in each element with passed value|
|`cat()`|Concatenate strings|
|`repeat()`|Repeat values|
|`normalize()`|Return Unicode form of string|
|`pad()`|Add whitespace to left, right, or both sides of strings|
|`wrap()`|Split long strings into lines with length less than a given width|
|`join()`|Join strings in each element of the Series with passed separator|
|`get_dummies()`|Extract dummy variables as a DataFrame|
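
> A brief sketch of a few methods from the table on a toy series (the data and variable names are made up for illustration):

```python
import pandas as pd

s = pd.Series(['ab', 'cde'])

# pad(): fill on the left up to a total width of 5 with '*'.
padded = s.str.pad(5, side='left', fillchar='*')
# repeat(): repeat every string twice.
repeated = s.str.repeat(2)
# slice_replace(): replace the slice [0:1] of every string with 'X'.
patched = s.str.slice_replace(0, 1, 'X')
# join(): each element is a list of strings, joined with the separator.
joined = pd.Series([['a', 'b'], ['c', 'd', 'e']]).str.join('-')

print(padded.tolist())    # ['***ab', '**cde']
print(repeated.tolist())  # ['abab', 'cdecde']
print(patched.tolist())   # ['Xb', 'Xde']
print(joined.tolist())    # ['a-b', 'c-d-e']
```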
#### Vectorized item access and slicing
> `get()` and `slice()` access items within each element, vectorized across the whole array:
```python=
monte.str[0:3]
#0 Gra
#1 Joh
#2 Ter
#3 Eri
#4 Ter
#5 Mic
#dtype: object
```
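:notes: `monte.str[0:3]` is equivalent to `monte.str.slice(0, 3)`; likewise, `monte.str.get(i)` is equivalent to `monte.str[i]`.
> `get()` and `slice()` are not limited to text. Like Python's built-in indexing, they can also pick items out of a `list` by index.
> Combined with `split()` (which returns a `list` for each element), this makes it easy to extract the last name: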
>
```python=
monte.str.split().str.get(-1)
#0 Chapman
#1 Cleese
#2 Gilliam
#3 Idle
#4 Jones
#5 Palin
#dtype: object
```
:notebook: `monte.str.split().str.get(-1)` can also be written as `monte.str.split().str[-1]`.
##### Special case when indexing with `str`
> When `[]` picks the character at a given index, any element too short to have that index returns `NaN`:
>
```python=
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
'CABA', 'dog', 'cat'])
s.str[0]
#0 A
#1 B
#2 C
#3 A
#4 B
#5 NaN
#6 C
#7 d
#8 c
#dtype: object
s.str[1]
#0 NaN
#1 NaN
#2 NaN
#3 a
#4 a
#5 NaN
#6 A
#7 o
#8 a
#dtype: object
```
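> If an empty string is more convenient than `NaN` downstream, one option is to fill the gaps afterwards. This is a sketch under that assumption, not something the original example does, and the shortened data and variable names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', np.nan, 'dog'])

# Index 1 does not exist for 'A', and the missing entry stays missing.
second = s.str.get(1)
# Replace both kinds of gap with an empty string.
second_filled = second.fillna('')

print(second_filled.tolist())  # ['', 'a', '', 'o']
```

> Note that after `fillna('')` an out-of-range index and an originally missing value can no longer be told apart.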
#### Indicator variables
> Suppose the codes mean: A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
```python=
full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C',
'B|D', 'B|C', 'B|C|D']})
full_monte
```

> `get_dummies()` quickly splits the column on a delimiter and converts it into dummy (indicator) variables:
```python=
full_monte['info'].str.get_dummies('|')
```
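> To see what the call produces, here is a self-contained sketch on the same `info` codes (the variable names are illustrative). Each distinct code becomes a column, and a row holds 1 where that code is present:

```python
import pandas as pd

info = pd.Series(['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'])

# Split every element on '|' and build one indicator column per code.
dummies = info.str.get_dummies('|')

print(dummies.columns.tolist())  # ['A', 'B', 'C', 'D']
print(dummies.loc[2].tolist())   # [1, 0, 1, 0]  -> 'A|C'
```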

## Concatenation
### Concatenating a single Series into a string
> `str.cat()` can join a `Series` into a single `str`:
```python=
s = pd.Series(['a', 'b', 'c', 'd'])
s.str.cat(sep=',')
# 'a,b,c,d'
```
> By default `str.cat()` ignores missing values; the `na_rep` parameter sets a replacement to use for them:
>
```python=
t = pd.Series(['a', 'b', np.nan, 'd'])
t.str.cat(sep=',')
# 'a,b,d'
t.str.cat(sep=',', na_rep='-')
# 'a,b,-,d'
```
### Concatenating a Series and something list-like into a Series
> When the first argument of `str.cat()` is a list-like object of the same length as the original `Series`, it returns the element-wise concatenation:
>
```python=
s.str.cat(['A', 'B', 'C', 'D'])
#0 aA
#1 bB
#2 cC
#3 dD
#dtype: object
```
> If the element on either side is missing, the concatenated result is `NaN`:
```python=
s.str.cat(t)
#0 aa
#1 bb
#2 NaN
#3 dd
#dtype: object
s.str.cat(t, na_rep='-')
#0 aa
#1 bb
#2 c-
#3 dd
#dtype: object
```
:mag: As the second call shows, `na_rep` can be used to handle this.
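> `sep` and `na_rep` can also be combined in the element-wise form. A minimal sketch, with separator and placeholder characters chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'])
t = pd.Series(['a', 'b', np.nan, 'd'])

# Join pairs with '-' and stand in '?' for the missing element.
combined = s.str.cat(t, sep='-', na_rep='?')

print(combined.tolist())  # ['a-a', 'b-b', 'c-?', 'd-d']
```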
### Extract first match in each subject (extract)
## Extension
### Other uses of `str.extract`
> A pattern with several capture groups yields a multi-column `DataFrame`:
```python=
s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])(\d)')
# 0 1
#0 a 1
#1 b 2
#2 NaN NaN
```
:waning_crescent_moon: In the last row (index 2) one group fails to match, so both columns are `NaN`.
> A `?` after a group marks it as optional:
```python=
s.str.extract(r'([ab])?(\d)')
# 0 1
#0 a 1
#1 b 2
#2 NaN 3
```
> Named groups can be used to name the columns:
```python=
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
# letter digit
#0 a 1
#1 b 2
#2 NaN NaN
```
> With a single capture group and `expand=False`, the result is a `pd.Series` rather than a one-column `DataFrame`:
>
```python=
s.str.extract(r'[ab](\d)', expand=False)
#0 1
#1 2
#2 NaN
#dtype: object
```
### More on `str.findall()`
> `str.findall()` returns only the text that matched:
```python=
s = pd.Series(['Lion', 'Monkey', 'Rabbit'])
s.str.findall('Monkey')
#0 []
#1 [Monkey]
#2 []
#dtype: object
s.str.findall('on')
#0 [on]
#1 [on]
#2 []
#dtype: object
```
> If the pattern matches several times within one element, a list of multiple strings is returned:
>
```python=
s.str.findall('b')
#0 []
#1 []
#2 [b, b]
#dtype: object
```
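> As a small cross-check (a sketch; the variable names are illustrative), the length of each `findall()` result should equal what `str.count()` reports for the same pattern:

```python
import pandas as pd

s = pd.Series(['Lion', 'Monkey', 'Rabbit'])

# findall() returns the matches themselves, one list per element.
matches = s.str.findall('b')
# The .str accessor also works on lists, so len() counts the matches.
n_from_findall = matches.str.len()
# count() returns the number of matches directly.
n_from_count = s.str.count('b')

print(n_from_findall.tolist())  # [0, 0, 2]
print(n_from_count.tolist())    # [0, 0, 2]
```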
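### `str.strip()`: removing whitespace or specified characters from both ends
> Before splitting or otherwise processing strings, it is best to make sure nothing unexpected surrounds the text.
>
> This is the vectorized version of the built-in [str.strip](https://docs.python.org/3/library/stdtypes.html#str.strip):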
```python=
s = pd.Series(['1. Ant. ', '2. Bee!\n', '3. Cat?\t', np.nan])
s
#0 1. Ant.
#1 2. Bee!\n
#2 3. Cat?\t
#3 NaN
#dtype: object
s.str.strip()
#0 1. Ant.
#1 2. Bee!
#2 3. Cat?
#3 NaN
#dtype: object
```
> Strip one side only, specifying the characters to remove:
```python=
s.str.lstrip('123.')
#0 Ant.
#1 Bee!\n
#2 Cat?\t
#3 NaN
#dtype: object
s.str.rstrip('.!? \n\t')
#0 1. Ant
#1 2. Bee
#2 3. Cat
#3 NaN
#dtype: object
```
> Strip both sides, specifying the characters to remove:
```python=
s.str.strip('123.!? \n\t')
#0 Ant
#1 Bee
#2 Cat
#3 NaN
#dtype: object
```
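> One detail worth spelling out: as with the built-in `str.strip`, the argument is treated as a set of characters, not as a prefix or suffix. A minimal sketch on made-up data:

```python
import pandas as pd

s = pd.Series(['xyhello-yx', 'yyworldx'])

# Any run of 'x' or 'y' at either end is removed, in whatever order it appears;
# stripping stops at the first character that is not in the set.
stripped = s.str.strip('xy')

print(stripped.tolist())  # ['hello-', 'world']
```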
### Working with `Index` (e.g. resetting column names)
> Besides `Series`, `str` also works on an `Index`:
```python=
idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
idx.str.strip()
# Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
```
> This makes it well suited to cleaning up column names:
```python=
df = pd.DataFrame(np.random.randn(3, 2),
columns=[' Column A ', ' Column B '], index=range(3))
df
# Column A Column B
#0 0.469112 -0.282863
#1 -1.509059 -1.135632
#2 1.212112 -0.173215
```
> For example, removing the whitespace around the column names:
```python=
df.columns.str.strip()
# Index(['Column A', 'Column B'], dtype='object')
```
> `str` methods can be chained to reset the column names in one step:
```python=
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
# column_a column_b
#0 0.469112 -0.282863
#1 -1.509059 -1.135632
#2 1.212112 -0.173215
```
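> If the same clean-up is needed for several frames, the chain can be wrapped in a small helper. This is only a sketch: `clean_columns` is a hypothetical name, not a pandas function, and it assumes all column names are strings.

```python
import pandas as pd

def clean_columns(df):
    """Return a copy of df with stripped, lower-cased, underscore-joined column names."""
    out = df.copy()
    # regex=False makes the space a literal character rather than a pattern.
    out.columns = (out.columns.str.strip()
                              .str.lower()
                              .str.replace(' ', '_', regex=False))
    return out

raw = pd.DataFrame([[1, 2]], columns=[' Column A ', ' Column B '])
cleaned = clean_columns(raw)

print(list(cleaned.columns))  # ['column_a', 'column_b']
```

> Working on a copy leaves the original frame's column names untouched.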
### Other uses of `str.replace()`
`str.replace()` has two main parameters:
- pat: *str or compiled regex*, the text to match
- repl: *str or callable*, the replacement text
> Reverse each matched word:
```python=
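repl = lambda m: m.group(0)[::-1]
# regex=True is required: a callable repl only works with a regex pattern,
# and str.replace defaults to regex=False since pandas 2.0
pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(r'[a-z]+', repl, regex=True)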
#0 oof 123
#1 rab zab
#2 NaN
#dtype: object
```
:notebook: "123" is not matched by the pattern, so it is not reversed.
> Extract the second named group and swap its case:
```python=
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
repl = lambda m: m.group('two').swapcase()
pd.Series(['One Two Three', 'Foo Bar Baz']).str.replace(pat, repl, regex=True)
#0 tWO
#1 bAR
#dtype: object
```
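> Because `pat` can be read either literally or as a regular expression, the `regex` flag changes the result for patterns that contain special characters. A minimal sketch on made-up data:

```python
import pandas as pd

s = pd.Series(['a.b', 'acb'])

# regex=False: '.' is a literal dot, so only the first string changes.
literal = s.str.replace('.', '-', regex=False)
# regex=True: '.' matches any character, so every character is replaced.
pattern = s.str.replace('.', '-', regex=True)

print(literal.tolist())  # ['a-b', 'acb']
print(pattern.tolist())  # ['---', '---']
```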
### Test whether two strings are equal
```python=
df1 = {
'State':['Arizona','Georgia','Newyork','Indiana','Florida'],
'State_1':['New Jersey','georgia','Newyork','Indiana','florida']}
df1 = pd.DataFrame(df1,columns=['State','State_1'])
print(df1)
```

> Check whether the two columns are exactly equal:
>
```python=
df1['is_equal']= (df1['State']==df1['State_1'])
print(df1)
```

> Compare while ignoring case and whitespace:
>
```python=
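# r'\s+' with regex=True removes every run of whitespace before comparing
df1['is_equal'] = (df1['State'].str.lower().str.replace(r'\s+', '', regex=True)
                   == df1['State_1'].str.lower().str.replace(r'\s+', '', regex=True))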
print(df1)
```
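> The normalization step can be factored out so both columns are treated identically. A sketch on hypothetical data that actually contains differences in whitespace and case; `normalize` and `df2` are illustrative names:

```python
import pandas as pd

def normalize(col):
    """Lower-case a string column and drop all whitespace."""
    return col.str.lower().str.replace(r'\s+', '', regex=True)

# Hypothetical data: the first two rows differ only in case or whitespace.
df2 = pd.DataFrame({'State': ['New York', 'Georgia ', 'Texas'],
                    'State_1': ['newyork', 'georgia', 'Utah']})

df2['is_equal'] = normalize(df2['State']) == normalize(df2['State_1'])

print(df2['is_equal'].tolist())  # [True, True, False]
```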
