# Hierarchical Indexing (unfinished)
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `index` `Python Data Science Handbook`
Adapted from [Python Data Science Handbook CH2](https://jakevdp.github.io/PythonDataScienceHandbook/)
[TOC]
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
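> pandas can also handle higher-dimensional data.
> Older pandas releases provided `Panel` and `Panel4D` objects that natively handle three-dimensional and four-dimensional data (see Aside: [Panel Data](#Panel-Data)); both have since been removed from pandas, and hierarchical indexing is the usual way to represent such data.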
## A Multiply Indexed Series
### The bad way
> Build a multiple index out of `tuple`s
> Index: states combined with two different years
```python=
import pandas as pd

# Each index entry is a (state, year) tuple
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
#(California, 2000) 33871648
#(California, 2010) 37253956
#(New York, 2000) 18976457
#(New York, 2010) 19378102
#(Texas, 2000) 20851820
#(Texas, 2010) 25145561
#dtype: int64
```
> With this indexing scheme, you can straightforwardly ==index== or ==slice== the series based on this multiple index:
> Each `tuple` carries both index values
```python=
pop[('California', 2010):('Texas', 2000)]
#(California, 2010) 37253956
#(New York, 2000) 18976457
#(New York, 2010) 19378102
#(Texas, 2000) 20851820
#dtype: int64
```
> The inconvenient part: to select every value from 2010 you have to write something like the following (and it is slow as well):
```python=
pop[[i for i in pop.index if i[1] == 2010]]
#(California, 2010) 37253956
#(New York, 2010) 19378102
#(Texas, 2010) 25145561
#dtype: int64
```
### The Better Way: Pandas MultiIndex
> The ***Pandas `MultiIndex` type*** gives us the type of operations we wish to have.
> A `MultiIndex` can likewise be created from the tuples
```python=
index = pd.MultiIndex.from_tuples(index)
index
#MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
# labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
```
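> A `MultiIndex` contains multiple `levels` of indexing (here the state names and the years).
> The two `labels` arrays encode, for each data point, its position within each of the two levels (pandas 0.24 and later call these `codes`).
> Re-index the `pop` Series with the `MultiIndex`: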
```python=
pop = pop.reindex(index)
pop
#California 2000 33871648
# 2010 37253956
#New York 2000 18976457
# 2010 19378102
#Texas 2000 20851820
# 2010 25145561
#dtype: int64
```
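> The levels/labels encoding described above can be inspected directly. The following is a minimal sketch, assuming pandas 0.24 or later, where the attribute is called `codes` rather than `labels`; the names `mi`, `levels`, `codes`, and `rebuilt` are only illustrative.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
mi = pd.MultiIndex.from_tuples(tuples)

# Each level holds the unique values of one index dimension
levels = [level.tolist() for level in mi.levels]
# Each code array holds, per row, the position of that row's value in its level
# (assumption: pandas >= 0.24; older releases expose this as `mi.labels`)
codes = [code.tolist() for code in mi.codes]

# Combining levels and codes recovers the original tuples
rebuilt = [(levels[0][i], levels[1][j]) for i, j in zip(codes[0], codes[1])]

print(levels)
print(codes)
print(rebuilt == tuples)
```

> Storing each distinct value once and referring to it by position is what lets pandas treat the two parts of the key as separate dimensions.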
> Once the `MultiIndex` is in place, selecting the data whose second index equals 2010 only takes ordinary pandas slicing:
```python=
pop[:, 2010]
#California 37253956
#New York 19378102
#Texas 25145561
#dtype: int64
```
> The result is a singly indexed `Series` containing only the keys we asked for
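> The same selection can be written with `.loc` or with `Series.xs`, which some readers find more explicit than `pop[:, 2010]`. A small sketch under the assumption that the Series is built as in the text; `by_loc` and `by_xs` are illustrative names.

```python
import pandas as pd

pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]]),
)

# Label-based selection: every state, year 2010
by_loc = pop.loc[:, 2010]
# Cross-section on the second level (level=1); the selected level is dropped
by_xs = pop.xs(2010, level=1)

print(by_loc)
print(by_xs)
```

> Both forms return a Series indexed by state only, because the level that was fixed to a single value is dropped.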
### MultiIndex as extra dimension
> The `unstack()` method converts a multiply indexed `Series` directly into a conventionally indexed `DataFrame`
```python=
pop_df = pop.unstack()
pop_df
# 2000 2010
#California 33871648 37253956
#New York 18976457 19378102
#Texas 20851820 25145561
```
> `stack()` does exactly the opposite
```python=
pop_df.stack()
#California 2000 33871648
# 2010 37253956
#New York 2000 18976457
# 2010 19378102
#Texas 2000 20851820
# 2010 25145561
#dtype: int64
```
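> To check that `unstack()` and `stack()` really are inverses here, a short round-trip sketch may help. It also shows the `level` argument of `unstack()`, which chooses the index level that becomes the columns. The variable names are illustrative.

```python
import pandas as pd

pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]]),
)

wide = pop.unstack()             # rows: state, columns: year
back = wide.stack()              # long form again, one row per (state, year)
by_year = pop.unstack(level=0)   # rows: year, columns: state

print(wide.shape)
print(back.index.tolist() == pop.index.tolist())
print(by_year.loc[2010, 'Texas'])
```

> The round trip is lossless here because every (state, year) combination is present; with missing combinations `unstack()` would introduce `NaN` values.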
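> Multiple indexing exists to handle data whose observation unit spans several dimensions (for example state-years), which makes the analysis more flexible
> Add a new column to a `DataFrame` that has a `MultiIndex`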
```python=
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
# total under18
#California 2000 33871648 9267089
# 2010 37253956 9284094
#New York 2000 18976457 4687374
# 2010 19378102 4318033
#Texas 2000 20851820 5906301
# 2010 25145561 6879014
```
> In addition, ufuncs and other functionality ([Operating on Data in Pandas](#Operating-on-Data-in-Pandas)) work with hierarchically indexed data as well
>
> For example, compute the fraction of the population under 18:
```python=
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
# 2000 2010
#California 0.273594 0.249211
#New York 0.247010 0.222831
#Texas 0.283251 0.273568
```
## Methods of MultiIndex Creation
> ==Method 1== Construct a multiply indexed `Series` or `DataFrame` by passing a list of two or more index arrays:
```python=
import numpy as np

df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
# data1 data2
#a 1 0.554233 0.356072
# 2 0.925244 0.219474
#b 1 0.441759 0.610054
# 2 0.171495 0.886688
```
> ==Method 2== Construct it from a `dict` whose `keys` are `tuple`s holding the multiple index values:
```python=
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
#California 2000 33871648
# 2010 37253956
#New York 2000 18976457
# 2010 19378102
#Texas 2000 20851820
# 2010 25145561
#dtype: int64
```
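> A quick way to confirm that pandas recognised the tuple keys is to inspect the resulting index. This is a hedged sketch: in recent pandas versions the row order may follow the dict's insertion order rather than the sorted order shown above, so the example calls `sort_index()` before looking values up. The names `is_multi`, `n_levels`, and `s_sorted` are illustrative.

```python
import pandas as pd

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
s = pd.Series(data)

# Tuple keys are turned into a two-level MultiIndex automatically
is_multi = isinstance(s.index, pd.MultiIndex)
n_levels = s.index.nlevels

# Sort level by level so the rows appear in lexicographic order
s_sorted = s.sort_index()
texas_2010 = s_sorted.loc[('Texas', 2010)]

print(is_multi, n_levels)
print(s_sorted)
```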
Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we'll see a couple of these methods here.
### Explicit MultiIndex constructors
> Use the constructor `method`s of `pd.MultiIndex` to create the index
==list of arrays==
```python=
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```
==list of tuples==
```python=
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```
==Cartesian product of two lists==
```python=
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```
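> The three constructors above describe the same index in different ways. A minimal sketch that checks this with `MultiIndex.equals`; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three hold the same four (letter, number) pairs in the same order
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)
print(from_product.tolist())
```

> `from_product` is the most compact when every combination is present; `from_tuples` and `from_arrays` are the ones to reach for when only some combinations occur.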
> ==Native encoding== Build a `pd.MultiIndex` directly from `levels` (a list of lists containing the available index values for each level) and `labels` (a list of lists that reference these values by position; renamed `codes` in pandas 0.24 and later)
```python=
# pandas 0.24 and later name this argument `codes`; older releases called it `labels`
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
#MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```
:+1: Any of the objects above can be passed as the `index` argument when creating a `Series` or `DataFrame`, or passed to the `reindex` method of an existing `Series` or `DataFrame`
### MultiIndex level names
> Give the index levels names
```python=
pop.index.names = ['state', 'year']
pop
# state year
#California 2000 33871648
# 2010 37253956
#New York 2000 18976457
# 2010 19378102
#Texas 2000 20851820
# 2010 25145561
#dtype: int64
```
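> Once the levels are named, the names can be used wherever a level number would otherwise be required. The sketch below assumes the same `pop` Series as in the text; `pop_2010` and `wide` are illustrative names.

```python
import pandas as pd

pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]]),
)
pop.index.names = ['state', 'year']

# Select by level name instead of level position
pop_2010 = pop.xs(2010, level='year')
# Move the 'state' level to the columns; 'year' stays as the row index
wide = pop.unstack(level='state')

print(pop_2010)
print(wide)
```

> Referring to levels by name keeps the code readable and keeps it working if the order of the levels changes later.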
### MultiIndex for columns
> The columns of a `DataFrame` can have multiple levels as well
```python=
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
```

> The data has four dimensions: the subject, the measurement type, the year, and the visit number.
>
> With this in place we can, for example, index the top-level column by the person's name and get a full DataFrame containing just that person's information:
```python=
health_data['Guido']
# type HR Temp
#year visit
#2013 1 32.0 36.7
# 2 50.0 35.0
#2014 1 39.0 37.8
# 2 48.0 37.3
```
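> Column selection can also go one level deeper, or cut across subjects. The following sketch rebuilds a table of the same shape as `health_data`; the seeded generator (`np.random.default_rng(0)`) is an assumption made here only so the example is reproducible, and `guido_hr` and `all_hr` are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Mock data with the same structure as in the text
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
data[:, ::2] *= 10
data += 37
health_data = pd.DataFrame(data, index=index, columns=columns)

# One subject, one measurement type: a single column returned as a Series
guido_hr = health_data['Guido', 'HR']
# One measurement type across all subjects: cross-section along the columns
all_hr = health_data.xs('HR', axis=1, level='type')

print(guido_hr)
print(all_hr)
```

> `xs` with `axis=1` is handy when the value being fixed sits on an inner column level, where plain `health_data[...]` indexing cannot reach it without naming the outer level first.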
## Indexing and Slicing a MultiIndex
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We'll first look at indexing multiply indexed Series, and then multiply-indexed DataFrames.