# Data Analysis Study Notes

:::success
[TOC]
:::

### Deal with missing data

:::info
```python=
# Replace NaN values with the column mean
avg_x = df['x'].mean()
df['x'].replace(np.nan, avg_x, inplace=True)
```
:::

### Data Standardization

:::info
```python=
# 1. Convert the values to a new scale
df['a'] = df['a'] / 100
# 2. Rename the column to reflect the new unit
df.rename(columns={'a': 'new'}, inplace=True)
```
:::

### Data Normalization

:::info
```python=
# 1. Min-Max scaling
df['x'] = (df['x'] - df['x'].min()) / (df['x'].max() - df['x'].min())
# 2. Simple feature scaling
df['x'] = df['x'] / df['x'].max()
# 3. Z-score scaling
df['x'] = (df['x'] - df['x'].mean()) / df['x'].std()
```
:::

### Binning

:::info
```python=
# Split salary into three groups: Low, Medium, High
# 4 bin edges are needed to form 3 intervals (Low, Medium, High)
bins = np.linspace(df['salary'].min(), df['salary'].max(), 4)
label_column = ['Low', 'Medium', 'High']
df['salary'] = pd.cut(df['salary'], bins,
                      labels=label_column, include_lowest=True)
```
:::

### Get Dummy

:::info
```python=
# Convert Gender = Male/Female into numeric columns so it can be used
# in statistical analysis
# First build the dummy columns from the categorical column
dummy1 = pd.get_dummies(df['gender'])
# Then concatenate them onto the target dataframe
df = pd.concat([df, dummy1], axis=1)
```
:::

:::danger
<font color ="red">*Categorical data is non-numeric data (e.g. IDName, Gender, Vehicle_Type, etc.)</font>
:::

### Basic plots

:::info
```python=
# Boxplot: show the distribution of values within each category
sns.boxplot(x='x', y='y', data=df)

# Scatterplot: show whether the effect of x on y is positive/negative
# One of the two variables (x, y) may be categorical
plt.scatter(df['x'], df['y'])
plt.title('')
plt.xlabel('')
plt.ylabel('')
```
:::

:::danger
<font color ="red">*Use a boxplot when one of the variables is categorical!!! Only when both are numeric can you use regplot!!!</font>
:::

## EDA (Exploratory Data Analysis)

### 1. Descriptive Statistics

:::info
```python=
df.describe()
df['x'].value_counts()
df['x'].value_counts().to_frame()
# Also useful here: boxplot, scatterplot
```
:::

### 2. Groupby

:::info
```python=
# Step 1: put the categorical columns (a, b) first,
#         and compute on the remaining numeric column (c)
df_grouped = df_abc.groupby(['a', 'b'], as_index=False)['c'].mean()
# Step 2: convert to a pivot table
df_pivot = df_grouped.pivot(index='a', columns='b', values='c')
# Step 3: heatmap
plt.pcolor(df_pivot, cmap='RdBu')  # red & blue
plt.colorbar()
plt.show()
```
:::

### 3. Correlation

:::info
What is correlation? If two variables are interdependent, they are defined as correlated. Correlation does not imply causation.

Ex: Lung Cancer <-> Smoking, Rain <-> Umbrella
We cannot conclude that people carry umbrellas *because* it rains; we can only state that the two are related.
:::

#### Positive Linear Relationship

:::info
```python=
sns.regplot(x='x', y='y', data=df)
plt.ylim(bottom=0)  # 'y' = Price, start the axis at zero
# The fitted line slopes like '/'
```
:::

#### Negative Linear Relationship

:::info
```python=
sns.regplot(x='x', y='y', data=df)
plt.ylim(bottom=0)  # 'y' = Price, start the axis at zero
# The fitted line slopes like '\'
```
:::

#### Pearson Correlation

:::info
1. Correlation Coefficient
2. P-value

Both measure the strength of the relationship between two features.

Correlation coefficient:
- close to 1: large positive correlation
- close to -1: large negative correlation
- close to 0: no relationship

P-value — how much the correlation coefficient can be trusted:
1. P < 0.001: strong certainty in the correlation coefficient
2. 0.001 < P < 0.05: moderate certainty
3. 0.05 < P < 0.1: weak certainty
4. P > 0.1: no certainty

***A Pearson correlation is strong when:
1. the correlation coefficient is close to 1 or -1, and
2. the P-value is < 0.001

```python=
# Compute the correlation coefficient and the p-value
import scipy.stats as st
pearson_coef, p_value = st.pearsonr(df['a'], df['b'])
```
:::

### 4. ANOVA

:::info
1. ANOVA measures how the groups within a single categorical variable differ on a numeric variable.
2. The result is evaluated with the F-test score and the P-value.

F-test: the ratio of the variation between the group means to the variation within the groups.

```python=
df_anova = df[['Cartype', 'Price']]
group_anova = df_anova.groupby('Cartype')
result = st.f_oneway(group_anova.get_group('Honda')['Price'],
                     group_anova.get_group('Toyota')['Price'])
```

If the F-score is small, the group distributions largely overlap, which indicates a weak correlation.
If the F-score is large, the group means are far apart, which indicates a strong correlation.
:::

:::danger
<font color ="red">***Pearson correlation measures the correlation between two variables.
***ANOVA measures the correlation across each category within one variable.</font>
:::
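The Pearson correlation and ANOVA calls above can be exercised end to end on a small toy DataFrame. The `Cartype`/`Price`/`Horsepower` values below are made-up illustration data, not from any real dataset:

```python=
import pandas as pd
import scipy.stats as st

# Toy data: two car makes with clearly different price levels (made-up numbers)
df = pd.DataFrame({
    'Cartype': ['Honda'] * 5 + ['Toyota'] * 5,
    'Price':   [20000, 21000, 19500, 20500, 22000,
                30000, 31000, 29500, 30500, 32000],
    'Horsepower': [110, 115, 108, 112, 120,
                   150, 155, 148, 152, 160],
})

# Pearson correlation between two numeric variables
pearson_coef, p_value = st.pearsonr(df['Horsepower'], df['Price'])
print(pearson_coef, p_value)  # coefficient close to 1 -> strong positive correlation

# ANOVA across the categories of one categorical variable
group_anova = df.groupby('Cartype')
f_score, anova_p = st.f_oneway(group_anova.get_group('Honda')['Price'],
                               group_anova.get_group('Toyota')['Price'])
print(f_score, anova_p)  # large F-score -> group means are far apart
```

Here both checks point the same way: horsepower and price rise together (large positive Pearson coefficient with a small p-value), and the Honda/Toyota price means are far apart relative to the spread within each make (large F-score with a small p-value).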