# 數據分析學習筆記
:::success
[TOC]
:::
### Deal with missing data
:::info
```python=
# Replace NaN with the column mean ('x' is a placeholder column name)
avg_x = df['x'].mean()
df['x'].replace(np.nan, avg_x, inplace=True)
```
:::
### Data Standardization
:::info
```python=
# 1. Convert to the new unit (e.g. divide by 100)
df['a'] = df['a'] / 100
# 2. Rename the column to reflect the new unit
df.rename(columns={'a': 'new'}, inplace=True)
```
:::
### Data Normalization
:::info
```python=
# ('x' is a placeholder column name; the three methods are alternatives)
# 1. Min-Max: rescales values into [0, 1]
df['x'] = (df['x'] - df['x'].min()) / (df['x'].max() - df['x'].min())
# 2. Simple feature scaling: divides by the maximum value
df['x'] = df['x'] / df['x'].max()
# 3. Z-score: centers on the mean, in units of standard deviation
df['x'] = (df['x'] - df['x'].mean()) / df['x'].std()
```
:::
### Binning
:::info
```python=
# Bin salary into three groups: Low, Medium, High
bins = np.linspace(df['salary'].min(), df['salary'].max(), 4)
# 4 edges are needed because there are three intervals (Low, Medium, High)
label_column = ['Low', 'Medium', 'High']  # a list keeps the label order
df['salary'] = pd.cut(df['salary'], bins,
                      labels=label_column, include_lowest=True)
```
:::
### Get Dummy
:::info
```python=
# Convert Gender=Male/Female into numeric indicator columns
# so it can be used in statistical analysis
dummy1 = pd.get_dummies(df['gender'])
# Then concatenate the dummy columns onto the target dataframe
df = pd.concat([df, dummy1], axis=1)
```
:::
:::danger
<font color ="red">*Categorical data is non-numeric data (e.g. IDName, Gender, Vehicle_Type, etc.)</font>
:::
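:::info
As a quick check for the note above, `pandas.DataFrame.select_dtypes` can split a frame into categorical and numeric columns. A minimal sketch, assuming a hypothetical dataframe with mixed dtypes:
```python=
import pandas as pd

# Hypothetical data mixing categorical and numeric columns
df = pd.DataFrame({
    'IDName': ['a1', 'a2', 'a3'],
    'Gender': ['Male', 'Female', 'Male'],
    'Price': [100, 200, 150],
})

# Object-dtype columns are the categorical ones here
categorical_cols = df.select_dtypes(include='object').columns.tolist()
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(categorical_cols)  # ['IDName', 'Gender']
print(numeric_cols)      # ['Price']
```
:::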
### Basic plots and their uses
:::info
```python=
# Boxplot: shows the distribution of values at each level
# (one of the two variables (x, y) can be categorical data)
sns.boxplot(x='x', y='y', data=df)
# Scatterplot: shows whether the effect of x on y is positive or negative
plt.scatter('x', 'y', data=df)
plt.title('')
plt.xlabel('')
plt.ylabel('')
```
:::
:::danger
<font color ="red">*Use a boxplot when one of the variables is a categorical variable!!! Only when both are numeric can you use regplot!!!</font>
:::
## EDA(Exploratory Data Analysis)
### 1. Descriptive Statistic
:::info
```python=
df.describe()
df[].value_counts()
*df[].value_counts().to_frame()
boxplot
scatterplot
:::
### 2. Groupby
:::info
```python=
# Step 1: group by the categorical columns (a, b),
# then aggregate the remaining numeric column (c)
df_abc = df_abc.groupby(['a', 'b'], as_index=False)['c'].mean()
# Step 2: reshape into a pivot table
df_pivot = df_abc.pivot(index='a', columns='b')
# Step 3: heatmap
plt.pcolor(df_pivot, cmap='RdBu')  # red & blue colormap
plt.colorbar()
plt.show()
```
:::
### 3. Correlation
:::info
What is correlation?
If two variables are interdependent, they are defined as correlated. Correlation does not imply causation.
Ex:
Lung Cancer <-> Smoking
Rain <-> Umbrella
We cannot conclude that people carry umbrellas *because* it rains; we can only state that the two are associated.
:::
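:::info
The idea above can be sketched on toy data: `numpy.corrcoef` reports a strong association between (hypothetical, made-up) rainfall and umbrella-sales numbers, but the coefficient by itself says nothing about causation.
```python=
import numpy as np

# Hypothetical measurements: rainfall (mm) and umbrellas sold that day
rain_mm = np.array([0, 5, 10, 20, 30])
umbrellas_sold = np.array([2, 10, 18, 35, 55])

# Off-diagonal entry of the correlation matrix is the coefficient
r = np.corrcoef(rain_mm, umbrellas_sold)[0, 1]
print(r)  # close to 1: strongly correlated, but not proof of causation
```
:::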
#### Positive Linear Relationship
:::info
```python=
sns.regplot(x='x', y='y', data=df)
plt.ylim(0,)  # if 'y' is Price, start the axis at zero
# The fitted line slopes upward, like '/'
```
:::
#### Negative Linear Relationship
:::info
```python=
sns.regplot(x='x', y='y', data=df)
plt.ylim(0,)  # if 'y' is Price, start the axis at zero
# The fitted line slopes downward, like '\'
```
:::
#### Pearson Correlation
:::info
1. Correlation Coefficient
2. P-value
Both measure the strength of the relationship between two features.
\
Correlation Coefficient
If the coefficient is close to 1: large positive correlation
If the coefficient is close to -1: large negative correlation
If the coefficient is close to 0: no relation
\
P-value
Indicates how much certainty we have in the correlation coefficient:
1. P < 0.001: strong certainty in the coefficient
2. 0.001 < P < 0.05: moderate certainty
3. 0.05 < P < 0.1: weak certainty
4. P > 0.1: no certainty
\
***A Pearson correlation is strong when:
1. The correlation coefficient is close to 1 or -1
2. P-value < 0.001
```python=
# Compute the correlation coefficient and p-value
import scipy.stats as st
pearson_coef, p_value = st.pearsonr(df['a'], df['b'])
```
:::
### 4. Anova
:::info
1. ANOVA finds how the groups within one categorical variable relate to each other
2. ANOVA results are judged with the F-test and the P-value
\
F-test: the ratio of the variation between the group means to the variation within the groups
```python=
import scipy.stats as st
df_anova = df[['Cartype', 'Price']]
group_anova = df_anova.groupby('Cartype')
result = st.f_oneway(group_anova.get_group('Honda')['Price'],
                     group_anova.get_group('Toyota')['Price'])
```
If the F-score is small, the group distributions of 'y' largely overlap, indicating a weak correlation.
If the F-score is large, the group distributions of 'y' are far apart, indicating a strong correlation.
:::
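:::info
The small-vs-large F-score contrast can be sketched with `scipy.stats.f_oneway` on synthetic groups (the group names and numbers below are made up for illustration):
```python=
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two groups drawn from the same distribution (mean 10, sd 2)
similar_a = rng.normal(10, 2, 50)
similar_b = rng.normal(10, 2, 50)
# One group with a very different mean (30)
different_c = rng.normal(30, 2, 50)

# Overlapping means -> small F-score
f_small, _ = stats.f_oneway(similar_a, similar_b)
# Far-apart means -> large F-score
f_large, _ = stats.f_oneway(similar_a, different_c)
print(f_small < f_large)  # True
```
:::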
:::danger
<font color ="red">***Pearson correlation measures the correlation between two variables
\
***ANOVA measures the relationship among the categories within one variable</font>
:::