# Data Transformation
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `numpy` `Python for Data Analysis`
>
Filtering, cleaning, and other transformations are another class of important data operations.
## Removing Duplicates
Handling duplicate rows
```python=
import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
```

> The `duplicated()` method returns a boolean Series indicating whether each row duplicates an earlier one
```python=
data.duplicated()
```

> The `drop_duplicates()` method drops the duplicated rows directly, keeping only the first occurrence
```python=
data.drop_duplicates()
```

:warning: By default, both `duplicated()` and `drop_duplicates()` consider all columns when checking for duplicates
> You can pass a subset of columns to control which columns determine whether a row counts as a duplicate
```python=
data['v1'] = range(7)
data.drop_duplicates(['k1'])
```

> Use the `keep='last'` argument to keep the last occurrence instead of the first
>
```python=
data.drop_duplicates(['k1', 'k2'], keep='last')
```
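The deduplication behavior above can be checked with a small self-contained sketch (the sample frame mirrors the one built earlier):

```python=
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Only the final row repeats an earlier ('k1', 'k2') combination
dupes = data.duplicated()

# keep='last' retains the final occurrence of each duplicate group
kept_last = data.drop_duplicates(['k1', 'k2'], keep='last')
```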

## Transforming Data Using a Function or Mapping
> Data transformations often involve mapping values from an `array`, `series`, or DF column. The example DF below has two columns: food and ounces
```python=
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
```

> Suppose you wanted to add a column indicating the ==type of animal== that each food came from.
>
> Express the food-to-animal relationship as a `dict`
```python=+
meat_to_animal = {'bacon': 'pig',
                  'pulled pork': 'pig',
                  'pastrami': 'cow',
                  'corned beef': 'cow',
                  'honey ham': 'pig',
                  'nova lox': 'salmon'}
```
> When used on a `series`, the `map` method accepts a function or a dict-like object containing a mapping
>
> First convert the food names to lowercase, then map the values through the `dict`
>
```python=
lowercased = data['food'].str.lower()
lowercased
```

```python=
data['animal'] = lowercased.map(meat_to_animal)
data
```

> A `lambda` (anonymous function) avoids the temporary variable
```python=
data['food'].map(lambda x: meat_to_animal[x.lower()])
```

:::success
### `map` <---> `apply`
The `map()` and `apply()` methods often behave very similarly.
For example:
```python=
import pandas as pd
x = pd.Series([1,3,5,7,9])
x.map(lambda x: x ** 2)
#0 1
#1 9
#2 25
#3 49
#4 81
#dtype: int64
x.apply(lambda x: x ** 2)
#0 1
#1 9
#2 25
#3 49
#4 81
#dtype: int64
```
>In fact, `apply()` runs a function on each value of the `series` one by one, whereas `map()` projects the values through a mapping, so it can also take a `dictionary` or a `series`
>
> When a `dict` is the mapping, each input value is looked up as a key and the output is the corresponding value
>
```python=
mydict = {1: 10,
          2: 20,
          3: 30,
          5: 50,
          7: 70}
x.map(mydict)
#0 10.0
#1 30.0
#2 50.0
#3 70.0
#4 NaN
#dtype: float64
```
:notebook: 9 has no corresponding key, so it maps to `NaN`
> When a `series` is the mapping, each input value is looked up in its index and the output is the value at that index
>
```python=
myseries = pd.Series([3, 6, 9, 12, 15], index=[1, 3, 5, 8, 9])
x.map(myseries)
#0 3.0
#1 6.0
#2 9.0
#3 NaN
#4 15.0
#dtype: float64
```
:notebook: 7 has no matching index entry, so it maps to `NaN`
By contrast, `apply()` accepts only a function, but that function can take extra arguments
```python=
def myfunc(x, z):
    if x % z == 0:
        return 100
    else:
        return x ** 3

x.apply(myfunc, args=(9,))
# 0      1
# 1     27
# 2    125
# 3    343
# 4    100
# dtype: int64
```
:::
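One more `map` option worth knowing: `na_action='ignore'` skips missing values instead of passing them into the function. A minimal sketch:

```python=
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
# NaN is propagated to the output rather than fed into the lambda
squared = s.map(lambda v: v ** 2, na_action='ignore')
```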
## Replacing Values
> Often you will want to replace particular values. There are several ways to do this: for example, `fillna()` to recode `NaN`, or `map()` to adjust some of the values. The `replace()` method, however, provides a more general and convenient approach:
```python=
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
```

:waning_crescent_moon: the -999 values in the `series` are sentinels for missing data
> Use `replace()` to recode them as `NaN`
>
```python=
data.replace(-999, np.nan)
```

> Pass a `list` to recode multiple values at once
>
```python=
data.replace([-999, -1000], np.nan)
```

> Often different values need different replacements; pass a second `list` with the corresponding outputs
>
```python=
data.replace([-999, -1000], [np.nan, 0])
```

> Or express the mapping directly as a `dict`:
>
```python=
data.replace({-999: np.nan, -1000: 0})
```

> Recode several different values to the same value at once
>
```python=
x = dict.fromkeys([-999, -1000], np.nan)
data.replace(x)
```

:::info
The `data.replace` method is distinct from `data.str.replace`, which performs string substitution element-wise.
:::
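A quick sketch of that distinction (the sample strings are made up for illustration):

```python=
import pandas as pd

s = pd.Series(['cat', 'catalog', 'dog'])
# Series.replace matches whole values only...
whole = s.replace('cat', 'feline')
# ...while Series.str.replace substitutes inside each string
partial = s.str.replace('cat', 'feline', regex=False)
```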
## Renaming Axis Indexes
> Axis labels (the explicit index) can be transformed with the same methods as `series` values
```python=
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
```
> `map()`
>
```python=
transform = lambda x: x[:4].upper()
data.index.map(transform)
# Index(['OHIO', 'COLO', 'NEW '], dtype='object')
```
> modifying the `DataFrame` in-place:
>
```python=
data.index = data.index.map(transform)
data
```

> The `rename()` method can modify row and column index labels in a single call
>
```python=
data.rename(index=str.title, columns=str.upper)
```

> Passing a `dict` to `rename()` lets you relabel only a subset of the index
>
```python=
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
```

> Setting the `inplace` argument to `True` modifies the data in place
>
```python=
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
```
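`rename()` can also take a single mapping function plus an `axis` keyword instead of the `index=`/`columns=` form; a minimal sketch on a throwaway frame:

```python=
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2, 2)),
                  index=['a', 'b'], columns=['one', 'two'])
# Apply str.upper to the column labels only; the index is untouched
renamed = df.rename(str.upper, axis='columns')
```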

## Discretization and Binning
Continuous data often needs to be split into discrete intervals (bins) for analysis
```python=
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
```
> divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older
>
> This can be done with pandas' `cut()` function
```python=
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
```

:waning_crescent_moon: The object pandas returns is a special ==Categorical object==.
> The output you see describes the bins computed by `pandas.cut`. You can treat it like an ==array of strings== indicating the bin name; internally it contains a <font color=red>categories array specifying the distinct category names</font> along with a labeling for the ages data in the codes attribute:
> The `Categorical` object carries an integer code for each category
```python=
cats.codes
# array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
# IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
#               closed='right',
#               dtype='interval[int64]')
```
```python=
pd.value_counts(cats)
```

:warning: In the interval notation above, `(` means the endpoint is excluded and `]` means it is included (inclusive). For example, `(18, 25]` covers ages 19 through 25
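Which side is closed can be flipped with `right=False`, making bins closed on the left and open on the right (same ages and bin edges as above):

```python=
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# Intervals become [18, 25), [25, 35), ... so 25 now falls in the second bin
cats_left = pd.cut(ages, bins, right=False)
```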
> You can also pass your own *bin names* by passing a `list` or `array` to the ==labels option==:
```python=
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
```

<br/>
> If you pass `pd.cut()` an integer number of bins instead of explicit bin edges (a `list`),
> it will compute ==equal-length bins== based on the minimum and maximum values in the data
>
```python=
data = np.random.rand(20) # uniformly distributed data
pd.cut(data, 4, precision=2)
```

:notebook: `precision=2` limits the displayed precision to two decimal places
> `qcut` bins the data based on ==sample quantiles==
```python=
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
```

```python=
pd.value_counts(cats)
```

> Similar to `cut`, you can pass your own quantiles (numbers **between 0 and 1**, ==inclusive==):
>
```python=
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
```
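Passing `labels=False` to `qcut` (or `cut`) returns plain integer bin codes instead of a Categorical; a sketch with a seeded generator (the seed is an arbitrary choice):

```python=
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
codes = pd.qcut(data, 4, labels=False)  # one code in 0..3 per observation
```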

## Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
```python=
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
```

> Find rows where a given column exceeds 3 in absolute value
>
```python=
col = data[2]
col[np.abs(col) > 3]
```

> Subset the DF to rows where any column exceeds 3 in absolute value
```python=
data[(np.abs(data) > 3).any(axis=1)]
```

> Cap the values to lower/upper bounds
>
```python=
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
```

:::warning
The statement `np.sign(data)` produces 1 and –1 values based on whether the values in data are positive or negative:
```python=
np.sign(data).head()
```

So values > 3 are recoded to 3; conversely, values < -3 become -3
:::
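The same capping can be expressed with the built-in `clip` method, which bounds values to an interval in one call (seeded here just for reproducibility):

```python=
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame(rng.standard_normal((1000, 4)))
# Equivalent in effect to data[np.abs(data) > 3] = np.sign(data) * 3
capped = data.clip(lower=-3, upper=3)
```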
## Permutation and Random Sampling
Permuting (randomly reordering) a `Series` or the rows in a `DataFrame` is easy to do using the `numpy.random.permutation` function.
> Calling `numpy.random.permutation()` with the length of the axis you want to permute (e.g. 5 rows) returns an array of integers describing a random new ordering of the indices
```python=
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
# array([3, 1, 4, 2, 0])
```
> Use `iloc` or the `take` method to reorder the DF
>
```python=
df
df.take(sampler)
```
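`take` also accepts `axis=1`, so the same permutation trick can shuffle columns instead of rows (a brief sketch):

```python=
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
col_sampler = np.random.permutation(4)  # random ordering of the 4 columns
shuffled = df.take(col_sampler, axis=1)
```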
> To select a random subset ==without replacement==, you can use the `sample()` method on `Series` and `DataFrame`
> (draws without replacement)
```python=
df.sample(n=3)
```

> Sample with replacement by setting `replace=True`
>
```python=
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
```

:notebook: the `random_state` argument sets the seed; `weights` sets each row's probability of being drawn
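Both options in a quick sketch (the weights here are arbitrary; a weight of 0 means the row is never drawn):

```python=
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])
# random_state makes the draw reproducible
a = choices.sample(n=3, random_state=1)
b = choices.sample(n=3, random_state=1)
# zero weights exclude the first two rows entirely
c = choices.sample(n=3, weights=[0, 0, 1, 1, 1], random_state=1)
```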
## Computing Indicator/Dummy Variables
Another transformation for statistical modeling is converting a categorical variable into a ==“dummy” or “indicator” matrix==
```python=
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key'])
```

> The `prefix` argument prepends a prefix to the indicator matrix's column names
```python=
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
```

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14:
> Sometimes a single column carries several categories per row:
>
```python=
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames)
movies[:10]
```

:waning_crescent_moon: the `genres` column holds multiple categories per entry
> 1. First, list all the unique genres
```python=
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
```

> 2. Build the indicator matrix
> One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
```python=
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
```
:waning_crescent_moon: a zero array with one row per record and one column per genre
> 3. Iterate over the rows, recoding each genre that appears in a record to 1
>
```python=
gen = movies.genres[0]
gen.split('|')
# ['Animation', "Children's", 'Comedy']
# get the column indices of these three genres in the indicator matrix
dummies.columns.get_indexer(gen.split('|'))
# array([0, 1, 2])
```
> Then, we can use `.iloc` to set values(1) based on these indices:
>
```python=
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
```
> Join the indicators back onto the original data
>
```python=
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
```
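For this delimiter-separated case, pandas also offers a vectorized shortcut: `Series.str.get_dummies` splits on a separator and builds the whole indicator matrix in one step (shown on a small made-up sample rather than the MovieLens file):

```python=
import pandas as pd

genres = pd.Series(["Animation|Children's|Comedy",
                    'Comedy|Romance',
                    'Drama'])
dummies = genres.str.get_dummies('|')  # one 0/1 column per distinct genre
```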

>When fitting statistical models it is sometimes useful to split a continuous variable into multiple dummy variables; this can be done by combining `get_dummies` with a discretization function like `cut`:
> For example, cut the values into fixed-width bins, then turn each bin into an indicator column
```python=
np.random.seed(12345)
values = np.random.rand(10)
values
# array([ 0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645,0.6532, 0.7489, 0.6536])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
```
