# Data Transformation
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `numpy` `Python for Data Analysis`
>
Filtering, cleaning, and other transformations are another class of important data operations.
## Removing Duplicates
Handling duplicate rows
```python=
import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
```

> The `duplicated()` method returns a boolean Series indicating whether each row duplicates an earlier one
```python=
data.duplicated()
```

> The `drop_duplicates()` method drops the duplicated rows directly, keeping only the first occurrence
```python=
data.drop_duplicates()
```

:warning: By default, both `duplicated()` and `drop_duplicates()` consider all columns when checking for duplicates
> You can pass a subset of columns to control which columns determine whether a row counts as a duplicate
```python=
data['v1'] = range(7)
data.drop_duplicates(['k1'])
```

> Use the `keep='last'` argument to keep the last occurrence instead of the first
>
```python=
data.drop_duplicates(['k1', 'k2'], keep='last')
```
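The deduplication behavior above can be checked with a small self-contained sketch (the sample frame mirrors the one built earlier):

```python=
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Only the final row repeats an earlier ('k1', 'k2') combination
dupes = data.duplicated()

# keep='last' retains the final occurrence of each duplicate group
kept_last = data.drop_duplicates(['k1', 'k2'], keep='last')
```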

## Transforming Data Using a Function or Mapping
> Data transformations often involve mapping values from an `array`, `series`, or DF column. The example DF below has two columns: food and ounces
```python=
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
```

> Suppose you wanted to add a column indicating the ==type of animal== that each food came from.
>
> Express the food-to-animal relationship as a `dict`
```python=+
meat_to_animal = {'bacon': 'pig',
                  'pulled pork': 'pig',
                  'pastrami': 'cow',
                  'corned beef': 'cow',
                  'honey ham': 'pig',
                  'nova lox': 'salmon'}
```
> When used on a `series`, the `map` method accepts a function or a dict-like object containing a mapping
>
> First convert the food names to lowercase, then map the values through the `dict`
>
```python=
lowercased = data['food'].str.lower()
lowercased
```

```python=
data['animal'] = lowercased.map(meat_to_animal)
data
```

> A `lambda` (anonymous function) avoids the temporary variable
```python=
data['food'].map(lambda x: meat_to_animal[x.lower()])
```

:::success
### `map` <---> `apply`
The `map()` and `apply()` methods often behave very similarly.
For example:
```python=
import pandas as pd
x = pd.Series([1,3,5,7,9])
x.map(lambda x: x ** 2)
#0 1
#1 9
#2 25
#3 49
#4 81
#dtype: int64
x.apply(lambda x: x ** 2)
#0 1
#1 9
#2 25
#3 49
#4 81
#dtype: int64
```
>In fact, `apply()` runs a function on each value of the `series` one by one, whereas `map()` projects the values through a mapping, so it can also take a `dictionary` or a `series`
>
> When a `dict` is the mapping, each input value is looked up as a key and the output is the corresponding value
>
```python=
mydict = {1: 10,
          2: 20,
          3: 30,
          5: 50,
          7: 70}
x.map(mydict)
#0 10.0
#1 30.0
#2 50.0
#3 70.0
#4 NaN
#dtype: float64
```
:notebook: 9 has no corresponding key, so it maps to `NaN`
> When a `series` is the mapping, each input value is looked up in its index and the output is the value at that index
>
```python=
myseries = pd.Series([3, 6, 9, 12, 15], index=[1, 3, 5, 8, 9])
x.map(myseries)
#0 3.0
#1 6.0
#2 9.0
#3 NaN
#4 15.0
#dtype: float64
```
:notebook: 7 has no matching index entry, so it maps to `NaN`
By contrast, `apply()` accepts only a function, but that function can take extra arguments
```python=
def myfunc(x, z):
    if x % z == 0:
        return 100
    else:
        return x ** 3

x.apply(myfunc, args=(9,))
# 0      1
# 1     27
# 2    125
# 3    343
# 4    100
# dtype: int64
```
:::
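One more `map` option worth knowing: `na_action='ignore'` skips missing values instead of passing them into the function. A minimal sketch:

```python=
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
# NaN is propagated to the output rather than fed into the lambda
squared = s.map(lambda v: v ** 2, na_action='ignore')
```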
## Replacing Values
> Often you will want to replace particular values. There are several ways to do this: for example, `fillna()` to recode `NaN`, or `map()` to adjust some of the values. The `replace()` method, however, provides a more general and convenient approach:
```python=
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
```

:waning_crescent_moon: the -999 values in the `series` are sentinels for missing data
> Use `replace()` to recode them as `NaN`
>
```python=
data.replace(-999, np.nan)
```

> Pass a `list` to recode multiple values at once
>
```python=
data.replace([-999, -1000], np.nan)
```

> Often different values need different replacements; pass a second `list` with the corresponding outputs
>
```python=
data.replace([-999, -1000], [np.nan, 0])
```

> Or express the mapping directly as a `dict`:
>
```python=
data.replace({-999: np.nan, -1000: 0})
```

> Recode several different values to the same value at once
>
```python=
x = dict.fromkeys([-999, -1000], np.nan)
data.replace(x)
```

:::info
The `data.replace` method is distinct from `data.str.replace`, which performs string substitution element-wise.
:::
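A quick sketch of that distinction (the sample strings are made up for illustration):

```python=
import pandas as pd

s = pd.Series(['cat', 'catalog', 'dog'])
# Series.replace matches whole values only...
whole = s.replace('cat', 'feline')
# ...while Series.str.replace substitutes inside each string
partial = s.str.replace('cat', 'feline', regex=False)
```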
## Renaming Axis Indexes
> Axis labels (the explicit index) can be transformed with the same methods as `series` values
```python=
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
```
> `map()`
>
```python=
transform = lambda x: x[:4].upper()
data.index.map(transform)
# Index(['OHIO', 'COLO', 'NEW '], dtype='object')
```
> modifying the `DataFrame` in-place:
>
```python=
data.index = data.index.map(transform)
data
```

> The `rename()` method can modify row and column index labels in a single call
>
```python=
data.rename(index=str.title, columns=str.upper)
```

> Passing a `dict` to `rename()` lets you relabel only a subset of the index
>
```python=
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
```

> Setting the `inplace` argument to `True` modifies the data in place
>
```python=
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
```
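`rename()` can also take a single mapping function plus an `axis` keyword instead of the `index=`/`columns=` form; a minimal sketch on a throwaway frame:

```python=
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2, 2)),
                  index=['a', 'b'], columns=['one', 'two'])
# Apply str.upper to the column labels only; the index is untouched
renamed = df.rename(str.upper, axis='columns')
```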

## Discretization and Binning
Continuous data often needs to be split into discrete intervals (bins) for analysis
```python=
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
```
> divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older
>
> This can be done with pandas' `cut()` function
```python=
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
```

:waning_crescent_moon: The object pandas returns is a special ==Categorical object==.
> The output you see describes the bins computed by `pandas.cut`. You can treat it like an ==array of strings== indicating the bin name; internally it contains a <font color=red>categories array specifying the distinct category names</font> along with a labeling for the ages data in the codes attribute:
> The `Categorical` object carries an integer code for each category
```python=
cats.codes
# array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
# IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
#               closed='right',
#               dtype='interval[int64]')
```
```python=
pd.value_counts(cats)
```

:warning: In the interval notation above, `(` means the endpoint is excluded and `]` means it is included (inclusive). For example, `(18, 25]` covers ages 19 through 25
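Which side is closed can be flipped with `right=False`, making bins closed on the left and open on the right (same ages and bin edges as above):

```python=
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# Intervals become [18, 25), [25, 35), ... so 25 now falls in the second bin
cats_left = pd.cut(ages, bins, right=False)
```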
> You can also pass your own *bin names* by passing a `list` or `array` to the ==labels option==:
```python=
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
```

<br/>
> If you pass `pd.cut()` an integer number of bins instead of explicit bin edges (a `list`),
> it will compute ==equal-length bins== based on the minimum and maximum values in the data
>
```python=
data = np.random.rand(20) # uniformly distributed data
pd.cut(data, 4, precision=2)
```

:notebook: `precision=2` limits the displayed precision to two decimal places
> `qcut` bins the data based on ==sample quantiles==
```python=
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
```

```python=
pd.value_counts(cats)
```

> Similar to `cut`, you can pass your own quantiles (numbers **between 0 and 1**, ==inclusive==):
>
```python=
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
```
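Passing `labels=False` to `qcut` (or `cut`) returns plain integer bin codes instead of a Categorical; a sketch with a seeded generator (the seed is an arbitrary choice):

```python=
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
codes = pd.qcut(data, 4, labels=False)  # one code in 0..3 per observation
```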

## Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
```python=
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
```

> Find rows where a given column exceeds 3 in absolute value
>
```python=
col = data[2]
col[np.abs(col) > 3]
```

> Subset the DF to rows where any column exceeds 3 in absolute value
```python=
data[(np.abs(data) > 3).any(axis=1)]
```

> Cap the values to lower/upper bounds
>
```python=
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
```

:::warning
The statement `np.sign(data)` produces 1 and –1 values based on whether the values in data are positive or negative:
```python=
np.sign(data).head()
```

So values > 3 are recoded to 3; conversely, values < -3 become -3
:::
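The same capping can be expressed with the built-in `clip` method, which bounds values to an interval in one call (seeded here just for reproducibility):

```python=
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame(rng.standard_normal((1000, 4)))
# Equivalent in effect to data[np.abs(data) > 3] = np.sign(data) * 3
capped = data.clip(lower=-3, upper=3)
```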
## Permutation and Random Sampling
Permuting (randomly reordering) a `Series` or the rows in a `DataFrame` is easy to do using the `numpy.random.permutation` function.
> Calling `numpy.random.permutation()` with the length of the axis you want to permute (e.g. 5 rows) returns an array of integers describing a random new ordering of the indices
```python=
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
# array([3, 1, 4, 2, 0])
```
> Use `iloc` or the `take` method to reorder the DF
>
```python=
df
df.take(sampler)
```
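`take` also accepts `axis=1`, so the same permutation trick can shuffle columns instead of rows (a brief sketch):

```python=
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
col_sampler = np.random.permutation(4)  # random ordering of the 4 columns
shuffled = df.take(col_sampler, axis=1)
```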
> To select a random subset ==without replacement==, you can use the `sample()` method on `Series` and `DataFrame`
> (draws without replacement)
```python=
df.sample(n=3)
```

> Sample with replacement by setting `replace=True`
>
```python=
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
```

:notebook: the `random_state` argument sets the seed; `weights` sets each row's probability of being drawn
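Both options in a quick sketch (the weights here are arbitrary; a weight of 0 means the row is never drawn):

```python=
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])
# random_state makes the draw reproducible
a = choices.sample(n=3, random_state=1)
b = choices.sample(n=3, random_state=1)
# zero weights exclude the first two rows entirely
c = choices.sample(n=3, weights=[0, 0, 1, 1, 1], random_state=1)
```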
## Computing Indicator/Dummy Variables
Another transformation for statistical modeling is converting a categorical variable into a ==“dummy” or “indicator” matrix==
```python=
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key'])
```

> The `prefix` argument prepends a prefix to the indicator matrix's column names
```python=
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
```

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14:
> Sometimes a single column carries several categories per row:
>
```python=
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames)
movies[:10]
```

:waning_crescent_moon: the `genres` column holds multiple categories per entry
> 1. First, list all the unique genres
```python=
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
```

> 2. Build the indicator matrix
> One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
```python=
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
```
:waning_crescent_moon: a zero array with one row per record and one column per genre
> 3. Iterate over the rows, recoding each genre that appears in a record to 1
>
```python=
gen = movies.genres[0]
gen.split('|')
# ['Animation', "Children's", 'Comedy']
# get the column indices of these three genres in the indicator matrix
dummies.columns.get_indexer(gen.split('|'))
# array([0, 1, 2])
```
> Then, we can use `.iloc` to set values(1) based on these indices:
>
```python=
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
```
> Join the indicators back onto the original data
>
```python=
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
```
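For this delimiter-separated case, pandas also offers a vectorized shortcut: `Series.str.get_dummies` splits on a separator and builds the whole indicator matrix in one step (shown on a small made-up sample rather than the MovieLens file):

```python=
import pandas as pd

genres = pd.Series(["Animation|Children's|Comedy",
                    'Comedy|Romance',
                    'Drama'])
dummies = genres.str.get_dummies('|')  # one 0/1 column per distinct genre
```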

>When fitting statistical models it is sometimes useful to split a continuous variable into multiple dummy variables; this can be done by combining `get_dummies` with a discretization function like `cut`:
> For example, cut the values into fixed-width bins, then turn each bin into an indicator column
```python=
np.random.seed(12345)
values = np.random.rand(10)
values
# array([ 0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645,0.6532, 0.7489, 0.6536])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
```
