
# Hierarchical Indexing (unfinished)

> Lee Tsung-Tang
> ###### tags: `python` `pandas` `index` `Python Data Science Handbook`

Adapted from [Python Data Science Handbook CH2](https://jakevdp.github.io/PythonDataScienceHandbook/)

[TOC]

> pandas can also handle higher-dimensional data.
> The handbook mentions `Panel` and `Panel4D` objects that natively handle three-dimensional and four-dimensional data (see Aside: [Panel Data](#Panel-Data)). These objects have since been removed from pandas; hierarchical indexing is the more common pattern in practice.

## A Multiply Indexed Series

### The bad way

> Use `tuple`s as keys to build a multiple index
> index: states from two different years

```python
import pandas as pd

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
#(California, 2000)    33871648
#(California, 2010)    37253956
#(New York, 2000)      18976457
#(New York, 2010)      19378102
#(Texas, 2000)         20851820
#(Texas, 2010)         25145561
#dtype: int64
```

> With this indexing scheme, you can straightforwardly ==index== or ==slice== the series based on this multiple index:
> Pass each index value as a `tuple`

```python
pop[('California', 2010):('Texas', 2000)]
#(California, 2010)    37253956
#(New York, 2000)      18976457
#(New York, 2010)      19378102
#(Texas, 2000)         20851820
#dtype: int64
```

> The inconvenient part: to select all values from 2010, you have to write something like this (which is also slow):

```python
# Filter on the second element of each tuple key
pop[[i for i in pop.index if i[1] == 2010]]
#(California, 2010)    37253956
#(New York, 2010)      19378102
#(Texas, 2010)         25145561
#dtype: int64
```

### The Better Way: Pandas MultiIndex

> The ***Pandas `MultiIndex` type*** gives us the type of operations we wish to have.
> A MultiIndex can likewise be created from tuples

```python
index = pd.MultiIndex.from_tuples(index)
index
#MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
#           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
```

> MultiIndex contains multiple levels
> The two `labels` lists give the integer code of each data point within each of the two levels
> The `MultiIndex(levels=..., labels=...)` outputs in this note come from an older pandas release; since pandas 0.24 `labels` is called `codes`, and newer releases print a MultiIndex as a list of tuples

> Re-index the `pop` Series with the MultiIndex

```python
pop = pop.reindex(index)
pop
#California  2000    33871648
#            2010    37253956
#New York    2000    18976457
#            2010    19378102
#Texas       2000    20851820
#            2010    25145561
#dtype: int64
```

> Once the MultiIndex is in place, selecting the data whose second index level equals 2010 only needs plain Pandas slicing:

```python
pop[:, 2010]
#California    37253956
#New York      19378102
#Texas         25145561
#dtype: int64
```

> The result is a singly indexed `Series`

### MultiIndex as extra dimension

> The `unstack()` method converts a multiply indexed `Series` directly into a conventionally indexed `DataFrame`

```python
pop_df = pop.unstack()
pop_df
#                2000      2010
#California  33871648  37253956
#New York    18976457  19378102
#Texas       20851820  25145561
```

> `stack()` does the opposite

```python
pop_df.stack()
#California  2000    33871648
#            2010    37253956
#New York    2000    18976457
#            2010    19378102
#Texas       2000    20851820
#            2010    25145561
#dtype: int64
```

> A multiple index exists to handle data whose unit of observation has several dimensions (for example state-years), which makes analysis more flexible

> Add a new column to a `DataFrame` that has a `MultiIndex`

```python
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
#                     total  under18
#California 2000   33871648  9267089
#           2010   37253956  9284094
#New York   2000   18976457  4687374
#           2010   19378102  4318033
#Texas      2000   20851820  5906301
#           2010   25145561  6879014
```

> In addition, ufuncs and other functionality ([Operating on Data in Pandas](#Operating-on-Data-in-Pandas)) all work on data with hierarchical indices
>
> Example: compute the fraction of the population under 18

```python
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
#                2000      2010
#California  0.273594  0.249211
#New York    0.247010  0.222831
#Texas       0.283251  0.273568
```

## Methods of MultiIndex Creation

> ==Method 1== Create a multiply indexed Series or DataFrame by passing a list of two or more index arrays to the constructor:

```python
import numpy as np

df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
#        data1     data2
#a 1  0.554233  0.356072
#  2  0.925244  0.219474
#b 1  0.441759  0.610054
#  2  0.171495  0.886688
```

> ==Method 2== Create it from a `dict` whose `keys` are `tuple`s holding the index values:

```python
# Keys are listed in sorted order so the result matches the output below
# regardless of whether pandas sorts dict keys or keeps insertion order
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561}
pd.Series(data)
#California  2000    33871648
#            2010    37253956
#New York    2000    18976457
#            2010    19378102
#Texas       2000    20851820
#            2010    25145561
#dtype: int64
```

Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we'll see a couple of these methods here.

### Explicit MultiIndex constructors

> Create the index with the class methods of `pd.MultiIndex`

==list of arrays==

```python
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

==list of tuples==

```python
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

==Cartesian product of two lists==

```python
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

> ==Internal encoding== Construct `pd.MultiIndex` directly by passing `levels` (a list of lists containing the available index values for each level) and `codes` (a list of lists of integer positions that reference those values; this argument was named `labels` before pandas 0.24)

```python
# `codes` replaces the `labels` keyword of older pandas, which current pandas no longer accepts
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```

:+1: Any of the objects above can be passed as the `index` argument when creating a `Series` or `DataFrame`, or applied to an existing `Series` or `DataFrame` with the `reindex` method

### MultiIndex level names

> Give the index levels names

```python
pop.index.names = ['state', 'year']
pop
#state       year
#California  2000    33871648
#            2010    37253956
#New York    2000    18976457
#            2010    19378102
#Texas       2000    20851820
#            2010    25145561
#dtype: int64
```

### MultiIndex for columns

> The columns of a `DataFrame` can have multiple levels as well

```python
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
```

![](https://i.imgur.com/5LdhNJ8.png)

> The data has four dimensions: the subject, the measurement type, the year, and the visit number.
>
> Use the index to pull out everything for one person
> With this in place we can, for example, index the top-level column by the person's name and get a full DataFrame containing just that person's information:

```python
health_data['Guido']
#type          HR  Temp
#year visit
#2013 1      32.0  36.7
#     2      50.0  35.0
#2014 1      39.0  37.8
#     2      48.0  37.3
```

## Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We'll first look at indexing multiply indexed Series, and then multiply-indexed DataFrames.
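
A minimal sketch of what "indices as added dimensions" can look like in practice, assuming the same population figures used earlier in this note. The block rebuilds the data so it runs on its own; the names `ca_2000`, `ca`, `ca_to_ny`, `y2000` and `df` are illustrative only, and `.loc` is used throughout as one reasonable way to select by label.

```python
import pandas as pd

# Rebuild the population Series from this note with a two-level index.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Full index: one label per level selects a single value.
ca_2000 = pop.loc['California', 2000]

# Partial index: only the outer level gives a Series keyed by year.
ca = pop.loc['California']

# Partial slice on the outer level (the index must be sorted for this).
ca_to_ny = pop.loc['California':'New York']

# Select on the inner level: every state, year 2000 only.
y2000 = pop.loc[:, 2000]

# The same ideas on a DataFrame: the row key is a tuple, the column a name.
df = pd.DataFrame({'total': pop,
                   'under18': [9267089, 9284094,
                               4687374, 4318033,
                               5906301, 6879014]})
ca_2000_total = df.loc[('California', 2000), 'total']
texas = df.loc['Texas']  # DataFrame indexed by the remaining level, year
```

Each selection fixes zero, one, or both index levels, much like choosing a row, a column, or a single cell in a two-dimensional table.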
