pandas Foundations

tags: `Datacamp` `python` `panda` `data science` `Data Manipulation with Python`

Author: 何彥南
DataCamp course: pandas Foundations

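Notes:

df is short for a pandas DataFrame.
pd is the usual alias for the pandas package.
Treat the official pandas documentation as the authoritative reference.
Check your pandas version; some features shown here may not work in newer releases.
In the code, lines marked with # show the output.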

pandas Foundations
[CH1] Data ingestion & inspection
[CH2] Exploratory data analysis
[CH3] Time series in pandas

[CH1] Data ingestion & inspection

1. What is pandas?

It has the following characteristics:

pandas DataFrames

Here we use the AAPL.csv data for a basic introduction.

The most basic data structure in pandas is the DataFrame; put simply, it is a table.

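Indexes and columns | shape, columns, index

Before analyzing data, the most important step is to look at what the data looks like, and pandas provides a comprehensive set of tools for this.

We start with the index and the columns, the building blocks of a DataFrame.

index: the row labels; each label identifies one row of data, which runs from left to right.
column: runs from top to bottom; put simply, each column is one variable, representing one feature of every record.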

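type(): shows the object's type, here a pandas DataFrame.
df.shape: gives the (rows, columns) of df, that is, how many records and variables there are.
df.columns: lists the columns; the output is an Index object holding each column name.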

df.index: returns an Index object holding all the row labels.
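As a minimal sketch of these attributes, the block below builds a tiny DataFrame by hand. It is only a stand-in for AAPL.csv: the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for AAPL.csv; the values are made up.
df = pd.DataFrame(
    {"Open": [10.0, 10.5, 11.0], "Close": [10.4, 10.9, 10.8]},
    index=["2015-01-02", "2015-01-05", "2015-01-06"],
)

print(type(df))    # the pandas DataFrame class
print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # Index object holding the column names
print(df.index)    # Index object holding the row labels
```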

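Slicing | iloc[ ], head(), tail(), info()

In pandas, slicing data is easy.

df.iloc[row_index, column_index] (doc)
- We can retrieve the part of the data we want by giving the integer positions of the rows and columns.
- The first position is 0, the last is -1, and : means everything.
Here we demonstrate how to use iloc[] to grab the first five and the last five rows.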
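A small sketch of positional slicing, using a made-up ten-row DataFrame rather than the course data:

```python
import pandas as pd

# Made-up data: ten rows, two columns.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

first_five = df.iloc[:5, :]   # rows at positions 0..4, all columns
last_five = df.iloc[-5:, :]   # the last five rows, all columns
last_col = df.iloc[:, -1]     # every row of the last column (a Series)

print(first_five)
print(last_five)
```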

head() and tail() below achieve the same result.

df.head()

df.tail()

We can also use info() to see more about the data.

df.info()

Broadcasting | iloc[::3, -1]

Broadcasting means applying the same operation to many values at once.

Here we use ::3 inside iloc[], which selects every third row (positions divisible by 3, including 0).
Below we use this to assign np.nan to many values at once.
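The idea can be sketched on a small made-up frame; the column names here are placeholders, not the real AAPL data.

```python
import numpy as np
import pandas as pd

# Made-up data; the last column is the one we overwrite.
df = pd.DataFrame(
    {
        "Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
        "Adj Close": [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1],
    }
)

# Every third row (positions 0, 3, 6) of the last column gets the same value.
df.iloc[::3, -1] = np.nan

df.info()  # the non-null count of the last column drops accordingly
```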

With info() we can clearly see that more than two thousand NaN values were written into the ADJ Close column.

Series | values

Series and DataFrame are the two main data structures in pandas; when you select a single column of a DataFrame, the result is a Series.

df.values: converts a DataFrame into a NumPy array.
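A short sketch of both points, on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Low": [1.0, 2.0], "High": [3.0, 4.0]})  # made-up values

low = df["Low"]       # selecting one column gives a Series
arr = df.values       # the whole DataFrame as a 2-D NumPy array
low_arr = low.values  # a Series also exposes its data as a NumPy array

print(type(low), type(arr), arr.shape)
```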

2. Building DataFrames from scratch

DataFrames from CSV files | read_csv()

We begin by loading data.

pd.read_csv() (pandas doc)
Besides CSV files, pandas supports many other data formats.

DataFrames from dict | pd.DataFrame(), zip(), list(), dict()

Building a DataFrame from a dict.

zip(): pairs up the elements of several lists.

We then use dict() to turn the result into a dict and load it into a DataFrame.
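The steps above can be sketched as follows. The list contents and variable names are made up for illustration.

```python
import pandas as pd

# Made-up column data.
cities = ["Austin", "Dallas"]
signups = [7, 12]

list_labels = ["city", "signups"]  # column names
list_cols = [cities, signups]      # column values

zipped = list(zip(list_labels, list_cols))  # [(name, values), ...]
data = dict(zipped)                         # {name: values, ...}
users = pd.DataFrame(data)

print(users)
```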

Broadcasting | df[column], df.column

Besides df['column_name'], you can also use df.column_name.
df[['column_name','column_name']] selects several columns at once.
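A minimal sketch of the selection forms, plus a scalar assignment that is broadcast to a whole column. The frame and the "fees" column are made up for illustration.

```python
import pandas as pd

users = pd.DataFrame({"city": ["Austin", "Dallas"], "signups": [7, 12]})

by_key = users["city"]               # bracket access, works for any column name
by_attr = users.city                 # attribute access, only for valid identifiers
subset = users[["city", "signups"]]  # a list of names returns a DataFrame

users["fees"] = 0                    # the scalar is broadcast to every row
print(users)
```

Attribute access is convenient but fails for names with spaces or names that clash with DataFrame methods, so bracket access is the safer default.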

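Broadcasting with a dict | index=[ ], columns=[ ]

We can also take advantage of how a dict is loaded into a DataFrame to fill an entire column with the same value.

We can assign to df.columns and df.index directly to set the column labels and index labels.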
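A sketch of both ideas with made-up values: a scalar inside the dict is repeated for every row, and the labels are then replaced by assignment.

```python
import pandas as pd

heights = [59.0, 65.2, 62.9]              # made-up values
data = {"height": heights, "sex": "M"}    # the scalar "M" is repeated for every row
results = pd.DataFrame(data)

results.columns = ["height (in)", "sex"]  # new column labels
results.index = ["A", "B", "C"]           # new row labels

print(results)
```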

3. Importing & exporting data

Datasets from CSV files

Sample data: ISSN_D_tot.csv

Using header keyword | read_csv(header=None)

When the data has no header row, set header=None; the column names then become integers (0, 1, 2, …).

Using names keyword | read_csv(names=[ ])

The names parameter sets the column names from a list.
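A small sketch of header=None and names, reading from an in-memory string instead of ISSN_D_tot.csv. The rows and the column names are made up.

```python
import io
import pandas as pd

csv_text = "1700,1,10,-1\n1700,2,12,-1\n"  # made-up rows, no header line

# Without a header row, the columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names supplies the column labels instead (hypothetical names).
col_names = ["year", "month", "day", "sunspots"]
named = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

print(no_header.columns.tolist())
print(named.columns.tolist())
```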

Using na_values keyword | read_csv(na_values=' ')

The na_values parameter converts the specified values to NaN while the data is loaded.

You can also restrict the replacement to specific columns.
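A sketch of both forms on a made-up CSV string, assuming -1 is the placeholder for missing data: a single value applies to every column, while a dict limits the replacement to the named columns.

```python
import io
import pandas as pd

csv_text = "a,b\n-1,5\n3,-1\n"  # made-up data; -1 marks a missing value

# Treat -1 as missing in every column.
all_cols = pd.read_csv(io.StringIO(csv_text), na_values="-1")

# Treat -1 as missing only in column "a".
only_a = pd.read_csv(io.StringIO(csv_text), na_values={"a": ["-1"]})

print(all_cols)
print(only_a)
```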

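Using parse_dates keyword | read_csv(parse_dates=[ ]), index=df[column]

With the parse_dates parameter, the columns named in the list are combined and loaded as datetime values.

Note that the data must be parseable as dates.

Here we can see that the loaded column has dtype datetime64.

Assigning a column to df.index makes that column the index.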
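To my knowledge, recent pandas releases deprecate combining several columns through a nested parse_dates list, so the sketch below swaps in a different technique: it reads the columns as plain numbers and assembles the dates afterwards with pd.to_datetime. The data and the column names (year, month, day, sunspots) are made up.

```python
import io
import pandas as pd

csv_text = "year,month,day,sunspots\n1700,1,10,5\n1700,2,12,8\n"  # made-up rows
df = pd.read_csv(io.StringIO(csv_text))

# Assemble one datetime column from the year/month/day columns.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])
print(df["date"].dtype)  # a datetime64 dtype

# Use that column as the index.
df.index = df["date"]
print(df.index)
```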

Trimming redundant columns | df[[col1, col2]]

Passing a list selects several columns.

Writing files | to_csv()

Finally, saving data.

df.to_csv() pandas doc
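A minimal sketch of writing a DataFrame as CSV. It writes to an in-memory buffer so it runs without touching the disk; passing a file path instead (for example a hypothetical "out.csv") writes a file.

```python
import io
import pandas as pd

df = pd.DataFrame({"year": [1700, 1701], "sunspots": [5, 8]})  # made-up data

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same table.
round_trip = pd.read_csv(io.StringIO(csv_text))
```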

4. Plotting with pandas (using pandas and matplotlib)

AAPL stock data | plot()

Here we use the AAPL.csv data for a basic introduction.

Visualization:
- matplotlib.pyplot.plot() doc
- df.plot() doc

Plotting arrays (matplotlib) | plt.plot(df[column]), plt.show()

Here we use values to convert the data to an array, then plot it.

A Series can also be passed directly.

A DataFrame can be passed directly as well.

Plotting Series (pandas) | Series.plot()

pandas can also plot directly with series.plot().

DataFrames are supported too.
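The matplotlib and pandas plotting calls described in this part can be sketched side by side. The frame below is a made-up stand-in for the AAPL data, and the Agg backend is selected only so the sketch runs without a display; in an interactive session you would call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the AAPL data.
aapl = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2, 11.0], "volume": [100, 120, 90, 150]}
)

# matplotlib accepts an array, a Series, or a whole DataFrame.
plt.figure()
plt.plot(aapl["close"].values)  # NumPy array
plt.plot(aapl["close"])         # Series
lines = plt.plot(aapl)          # DataFrame: one line per column

# pandas objects can also plot themselves.
plt.figure()
series_ax = aapl["close"].plot()  # Series.plot()
frame_ax = aapl.plot()            # DataFrame.plot() draws on a new figure
```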

Fixing scales | plt.yscale()

Many other customizations are available.

Saving plots | plt.savefig(' ')

Saving the figure.

[CH2] Exploratory data analysis

1. Visual exploratory data analysis (various kinds of plots)

The iris data set

Here we use the iris.csv data to introduce different kinds of plots.

Data import

Loading the data.

Line plot | plot(x=,y=)

The pandas plot() method lets you name two columns as the x and y axes directly. Note that each point represents one record.

Scatter plot

Drawing a scatter plot.

Box plot

Drawing a box plot.

Histogram | plot(bins=,range=,normed=)

Drawing a histogram.

The plot.hist() form also works.

Available parameters:

bins (integer): number of intervals or bins
range (tuple): extrema of bins (minimum, maximum)
normed (boolean): whether to normalize to one (recent Matplotlib releases use density instead)
cumulative (boolean): compute Cumulative Distribution Function (CDF)
… more Matplotlib customizations
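A hedged sketch of these parameters on a made-up column (not the real iris data). It uses density rather than normed, on the assumption that a recent Matplotlib is installed, and selects the Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import pandas as pd

# Made-up measurements standing in for one iris column.
iris = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 6.0]}
)

# Normalized histogram with 4 bins between 4 and 8.
ax = iris.plot(kind="hist", bins=4, range=(4, 8), density=True)

# Same bins, but cumulative: the bars approximate the CDF.
ax_cdf = iris.plot.hist(bins=4, range=(4, 8), density=True, cumulative=True)
```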

Word of warning | plot(kind='hist'), plot.hist(), hist()

There are several other ways as well.

Three different DataFrame plot idioms
- iris.plot(kind='hist')
- iris.plot.hist()
- iris.hist()
Syntax/results differ!
Pandas API still evolving: check documentation!

2. Statistical exploratory data analysis (statistical analysis)

Summarizing with describe() | describe()

describe() shows basic summary statistics for every numeric column.

count: number of entries
mean: average of entries
std: standard deviation
min: minimum entry
25%: first quartile
50%: median or second quartile
75%: third quartile
max: maximum entry
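A quick sketch on a made-up frame shows the rows listed above; the non-numeric column is left out of the summary.

```python
import pandas as pd

# Made-up data with one numeric and one text column.
iris = pd.DataFrame(
    {
        "sepal_length": [5.0, 6.0, 7.0, 8.0],
        "species": ["setosa", "setosa", "virginica", "virginica"],
    }
)

summary = iris.describe()  # only the numeric column is summarized
print(summary)
```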

Counts | count()

count() gives the number of non-null entries.

Averages | mean()

mean() gives the average.

Standard deviations | std()

std() gives the standard deviation.

Medians | median()

median() gives the median.

Medians & 0.5 quantiles | quantile()

The median is the same as the 0.5 quantile.

quantile() can also return the quartiles or any other quantile.

Ranges | max(), min()

max() and min() give the largest and smallest values.
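The individual statistics from this part can be sketched on one made-up Series that contains a missing value:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, None])  # made-up values, one missing

n = s.count()                      # 4, the missing value is not counted
avg = s.mean()                     # 2.5
spread = s.std()                   # sample standard deviation
mid = s.median()                   # 2.5
q50 = s.quantile(0.5)              # 2.5, same as the median
quartiles = s.quantile([0.25, 0.75])
lo, hi = s.min(), s.max()          # 1.0 and 4.0

print(n, avg, spread, mid, q50, lo, hi)
print(quartiles)
```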

Box plots

Drawing a box-and-whisker plot.

Percentiles as quantiles

The output of describe() also includes percentiles and quartiles.

3. Separating populations

Describe species column

describe() can also be applied to a single column.

Unique & factors | unique()

unique() shows the distinct values in a column.

Here we learn that the species column contains three distinct categories.

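Filtering a DataFrame by a condition | df[column]==' '

After splitting the data into three DataFrames, unique() returns only one value for each.

The index shows where each row sat in the original table.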
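A minimal sketch of this filtering, on a made-up six-row frame that only imitates the shape of the iris data:

```python
import pandas as pd

# Made-up rows; only the species names follow the iris data set.
iris = pd.DataFrame(
    {
        "petal_length": [1.4, 4.7, 6.0, 1.3, 4.5, 5.9],
        "species": ["setosa", "versicolor", "virginica"] * 2,
    }
)

print(iris["species"].unique())  # three distinct values

# A boolean mask keeps only the rows where the condition is True.
setosa = iris[iris["species"] == "setosa"]
versicolor = iris[iris["species"] == "versicolor"]
virginica = iris[iris["species"] == "virginica"]

print(setosa["species"].unique())  # a single value
print(setosa.index)                # the original row positions are kept
```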

Visual EDA

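Here we visualize all the data together, which looks somewhat cluttered.

So we visualize each subset separately, which looks much better.

With describe(), we can also examine each species separately and compare the summary statistics of each feature.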

Computing errors

Viewing errors

[CH3] Time series in pandas

1. Indexing time series | using a time series as the index

Using pandas to read datetime objects

read_csv() function
- Can convert input strings to datetime
- The strings must be parseable as datetime
- If some values cannot be parsed, call read_csv() first and then convert with pd.to_datetime.
ISO 8601 format
- yyyy-mm-dd hh:mm:ss

Product sales CSV

Here we use Product sales.csv

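Parse dates | read_csv(parse_dates=,index_col=), loc[ ]

With read_csv() we can use parse_dates to read the date column directly and index_col to set it as the index.

Below we can see the result after date is set as the index.

Once date is the index, we can use loc[] to select specific rows directly.

loc[ ]: selects by label, that is, by name; labels are usually strings.
iloc[ ]: selects by integer position.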
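A sketch of this workflow, reading from an in-memory string instead of the real sales file. The rows, company names, and column names are made up.

```python
import io
import pandas as pd

# Made-up rows standing in for the sales file.
csv_text = (
    "Date,Company,Units\n"
    "2015-02-02 08:30:00,A,3\n"
    "2015-02-05 02:46:00,B,19\n"
    "2015-02-11 20:03:00,C,7\n"
)

# index_col picks the index column; parse_dates=True parses that index as dates.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date")
print(sales.index)

by_label = sales.loc["2015-02-05 02:46:00", "Units"]  # label-based
by_position = sales.iloc[0]                           # position-based, first row
```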

Partial datetime string selection | selecting datetime data

Data with a datetime index can be selected in the following ways.

Alternative formats:
- sales.loc['February 5, 2015']
- sales.loc['2015-Feb-5']
Whole month: sales.loc['2015-2']
Whole year: sales.loc['2015']

Selecting a whole month.

You can also select the data within a time range.
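These selections can be sketched on a small made-up frame with a sorted datetime index:

```python
import pandas as pd

# Made-up timestamps and values.
idx = pd.to_datetime(
    [
        "2015-01-20 09:00",
        "2015-02-05 11:00",
        "2015-02-16 12:00",
        "2015-03-01 08:00",
        "2016-01-04 10:00",
    ]
)
sales = pd.DataFrame({"Units": [1, 2, 3, 4, 5]}, index=idx)

one_day = sales.loc["2015-02-05"]             # every row on that day
one_month = sales.loc["2015-02"]              # every row in February 2015
one_year = sales.loc["2015"]                  # every row in 2015
span = sales.loc["2015-02-16":"2015-03-02"]   # a range; both ends are included

print(one_month)
print(span)
```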

Convert strings to datetime

Using pd.to_datetime() doc
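A short sketch of the conversion. The strings are made up, and the errors="coerce" option is shown as one way to handle values that cannot be parsed.

```python
import pandas as pd

date_strings = ["2015-02-11 20:00", "2015-02-11 21:00", "2015-02-11 22:00"]

# An explicit format makes the parsing unambiguous.
evening_2_11 = pd.to_datetime(date_strings, format="%Y-%m-%d %H:%M")
print(evening_2_11)

# Unparseable strings become NaT instead of raising an error.
with_bad = pd.to_datetime(["2015-02-11", "not a date"], errors="coerce")
print(with_bad)
```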

Reindexing DataFrame | reindex()

df.reindex() doc

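It changes the column index or the row index (default) of a DataFrame.
Below we use the earlier evening_2_11 as the new index.

reindex() can also fill missing values. ffill fills with the previous valid observation, and bfill fills with the next valid observation.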
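A sketch of reindex() with and without filling. The sales rows are made up, and evening_2_11 here is a hypothetical hourly index built just for this example.

```python
import pandas as pd

# Made-up sales at two irregular times.
idx = pd.to_datetime(["2015-02-11 20:03", "2015-02-11 22:50"])
sales = pd.DataFrame({"Units": [7, 4]}, index=idx)

# Hypothetical new index: one entry per hour of the evening.
evening_2_11 = pd.to_datetime(
    ["2015-02-11 20:00", "2015-02-11 21:00", "2015-02-11 22:00", "2015-02-11 23:00"]
)

plain = sales.reindex(evening_2_11)                    # no label matches: all NaN
ffilled = sales.reindex(evening_2_11, method="ffill")  # take the previous row
bfilled = sales.reindex(evening_2_11, method="bfill")  # take the next row

print(ffilled)
print(bfilled)
```

The first forward-filled row and the last back-filled row stay NaN because there is no earlier or later observation to copy from.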

2. Resampling time series data

Sales data

Here we use Product sales.csv

Resampling

Statistical methods over different time intervals
- mean(), sum(), count(), etc.
Down-sampling
- reduce datetime rows to slower frequency
Up-sampling
- increase datetime rows to faster frequency
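Both directions can be sketched on a made-up series of sales: daily totals and means for down-sampling, then a forward-filled 12-hour grid for up-sampling.

```python
import pandas as pd

# Made-up sales at irregular times over three days.
idx = pd.to_datetime(
    ["2015-02-02 08:00", "2015-02-02 20:00", "2015-02-03 09:00", "2015-02-04 10:00"]
)
units = pd.Series([3, 5, 2, 6], index=idx, name="Units")

# Down-sampling: aggregate the rows of each day into one value.
daily_sum = units.resample("D").sum()
daily_mean = units.resample("D").mean()

# Up-sampling: move to a 12-hour grid and fill the new rows forward.
half_day = daily_sum.resample("12h").ffill()

print(daily_sum)
print(half_day)
```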