# **WEEK 3**

# **Day 2/8/2021 PANDAS**

Data analysis

1. About Pandas: before building an ML model --> you must understand the data

**Data analysis consists of 5 steps:**

1. Define the question (for some questions there will be no data)
2. Collect data
3. Clean data (the most important and most time-consuming step)
4. EDA (exploratory data analysis)
5. Report

Data pipeline:

![](https://i.imgur.com/gAaNIMU.png)

Google Data Studio

![](https://i.imgur.com/leCHyQj.png)

Pandas has 3 kinds of data objects:

![](https://i.imgur.com/IMKwD57.png)

- DataFrame: holds many rows and many columns
- Series: a single row or a single column extracted from a DataFrame. (A Series has only one data type.)
- Index (the part in bold, not a column): works like an address

**Import libraries:**

```
import pandas as pd

df = pd.read_csv(....)  # read a CSV file
```

**What some common functions do:**

1. Set a new index

```
df = df.set_index('Country Name')
```

2. Show the first 5 rows, the last 5 rows, 5 random rows

```
df.head()
df.tail()
df.sample(5)
```

3. Show the shape of the DataFrame --> returns the total number of rows and the total number of columns

```
df.shape
```

4. Show info --> displays information about the table (columns, non-null counts, data types):

```
df.info()
```

![](https://i.imgur.com/Jh4IBAK.png)

5. Overview of the numerical columns

```
df.describe()
```

![](https://i.imgur.com/2jMBSZG.png)

6. Choose a column (selection in Pandas)

```
df['Income Group']
```

7. Choose one row

```
df.loc['Vietnam']
```

.loc --> works like a dictionary: rows are looked up by their index label.

8. Select a whole row by integer position (integer loc)

```
df.iloc[3]
```

.iloc --> works like a list: rows and columns are addressed by their integer position. A single row comes back as a Series; when the columns have mixed types, its dtype becomes object.

9. The index is not counted as one of the columns. To access the index, use the following attributes:

```
df.index    # row index
df.columns  # column index
```

10. 
Selection in Series --> you can select by index label directly, without .loc or .iloc.

Unlike a Series, a Python list does not support element-wise arithmetic (+, -, *, /).

**Filtering data**

Selection <=> []

![](https://i.imgur.com/IVDIpq9.png)

```
df[(df['Birth rate'] > 20) & (df['Internet users'] < 50)]
df[(df['Birth rate'] > 20) | (df['Internet users'] < 50)]
```

Note: and <=> & ; or <=> | ; each condition must be wrapped in parentheses ().

- Python runs statements in order, so each line/statement works with the results (variables) produced by the lines before it.

**Sort**

1. Sort values

```
df = df.sort_values('Internet users', ascending=False)
```

--> values are ordered from high to low because of ascending=False

2. Sort index

```
df.sort_index()
```

--------------------------------------

STEPS TO MOUNT GOOGLE DRIVE IN COLAB

```
from google.colab import drive
drive.mount('/content/gdrive')
```

```
import pandas as pd
```

```
df = pd.read_csv('/content/gdrive/MyDrive/demographic_data.csv')
```

------------------------------------------

# **Day 3/8/2021** PANDAS (CONTINUED)

1. groupby --> groups together the rows that share the same value in a column

```
# Average tip per day
# --> which category column to group by --> which value to compute on --> method
df.groupby('day')['tip'].mean()
```

You can group by several columns:

```
df.groupby(['day', 'sex'])['tip'].mean()
```

Note:

* Square brackets []: --> used for selection, and for a list of more than 1 column
* Parentheses (): --> function and method calls: groupby(), mean(), sum(), info(), sort_values()
* Curly braces {}: --> dictionary

2. .agg(['mean', 'sum']) --> computes several aggregations in one go

```
ls_gp = ['tip', 'total_bill']
df.groupby('day')[ls_gp].agg(['mean', 'sum'])
```

![](https://i.imgur.com/vbJggZz.png)

Or use a dictionary to assign a method to each value column:

```
df.groupby('day')[['tip', 'total_bill']].agg({'tip': 'sum', 'total_bill': 'mean'})
```

- A groupby aggregation returns a Series (one value column, one method) or a DataFrame (several columns or methods), indexed by the group keys; it is not a Python list.

3. 
Pivot table --> builds a summary table of the data

```
pd.pivot_table(data=df, index='sex', columns='day', values='total_bill', aggfunc='mean', margins=True)
```

![](https://i.imgur.com/XK7SQA0.png)

- A pivot table with several value columns:

```
pd.pivot_table(data=df, index='day', columns='sex', values=['tip', 'total_bill'], aggfunc='mean')
```

![](https://i.imgur.com/ed3psX5.png)

With margins=True, Pandas automatically adds a total row and a total column (labelled All, computed with the same aggfunc).

--> In practice, group by only a few columns so that the table stays easy to read.

4. Function

```
# The function receives each row as its parameter
def check_tip(row):
    if row['tip'] > 3:
        return 'High'
    else:
        return 'Low'
```

5. apply --> applies a function to a DataFrame or to the values of a Series

Use apply when you want to go through the rows one by one and compute something for each of them. (Avoid explicit for loops because they are slow => use apply instead.)

```
# Write into the table (creates a new column)
df['TipType'] = df.apply(check_tip, axis=1)
```

* axis=1 --> apply to each row
* axis=0 --> apply to each column

![](https://i.imgur.com/ftBKb8q.png) , ![](https://i.imgur.com/wdVzisV.png)

* return --> hands the value back to the caller, so it can be stored in a variable and reused later.
* print --> only displays the value on screen; it cannot be retrieved afterwards.

--------------------------------------------------

# **Day 4/8/2021 CLEAN DATA**

A data analysis project has 5 steps:

1. Define question
2. Collect data
3. Clean data --> important
4. EDA
5. Report

**CLEAN DATA**

- Do we need all of the data?
- Are the data types correct?
- Are there any missing values?
- Are there any duplicates?
- Is any of the data illogical?
- Does the data contain values outside the range we expect?

**Example steps:**

1. Make a copy of the table before working on it: `df2 = df.copy()`
2. Delete a column: `df2.drop(columns=['Pclass'])`
3. 
Delete rows: `df2.drop(index=[1, 2, 3], inplace=True)`

```
# Use one of the two options, not both
df2.drop(columns=['Cabin', 'Embarked'], inplace=True)  # Option 1
df2 = df2.drop(columns=['Cabin', 'Embarked'])          # Option 2
```

Note:

* inplace=True: modifies the DataFrame directly, no reassignment needed
* inplace=False (the default) --> returns a new DataFrame, so the result must be assigned to a variable

4. Rename columns: `df2 = df2.rename()`

```
# Specific columns: {old name: new name}
df2 = df2.rename(columns={'Name': 'PassengerName', 'SibSp': 'NumberSibling'})
```

To rename all the columns:

```
# Rename every column
# df2.columns returns the column names; assigning a list replaces them
df2.columns = ['A', 'B', 'C', 'D', 'E', 'F', '7', '8', '9', '10', '11', '12']
```

--> the list must contain exactly as many names as there are columns

5. Check data types: `df.info()`

To change a data type:

```
df['Age'] = df['Age'].astype('float')
```

However, if the column contains null values, some conversions fail (for example to int) --> handle the null values first.

6. Handle missing values --> for example, fields the user did not fill in

Before handling them, ask whether the column is actually used:

* No --> drop the column
* Yes --> handle it as follows:
    * Missing data < 10% --> drop the rows, or fill in with some value
    * Check whether other columns are related to the missing value
    * Use ML to predict it

**Detect missing values:**

Given a DataFrame `data` (created with pd.DataFrame):

* `data.isna()` --> returns True wherever there is a NaN value
* `data.isna().sum()` --> the number of NaN values in each column
* `data[data['column1'].isna()]` --> shows the rows that have a NaN value in that column
* `data.dropna(axis=0)` --> drops NaN values (0: drop rows, 1: drop columns) ===> remember to assign the result back to a variable, otherwise data is unchanged
* `data.fillna(100)` --> fills the gaps with any value (a string also works) --> the column then becomes object ==> do not fill a numeric column with a string, because you can no longer compute on it

==> Before deleting, compute the share of missing values to decide whether deleting is reasonable.
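The detection and handling calls listed above can be tried on a tiny frame. This is only a minimal sketch: the mini dataset is hypothetical (the column names are borrowed from the Titanic examples in these notes), and filling Age with the median is just one possible choice of fill value, not a recommendation from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with gaps (not the real Titanic data)
data = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 8.05, np.nan, 51.86],
})

# Number of NaN values in each column
missing_per_column = data.isna().sum()

# Rows where Age is missing
rows_missing_age = data[data["Age"].isna()]

# Drop every row that contains at least one NaN (returns a new DataFrame)
dropped = data.dropna(axis=0)

# Fill only the Age column, here with its median (an assumed fill strategy)
filled = data.fillna({"Age": data["Age"].median()})

print(missing_per_column)
print(dropped.shape, filled["Age"].tolist())
```

Neither dropna nor fillna changes `data` itself here; both return new objects, which is why the results are assigned to new variables.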
Example: share of rows with a missing Age relative to the original dataset:

```
# Ratio of rows with a NaN Age to the original dataset
df[df['Age'].isna()].shape[0] / df.shape[0]
```

**Duplicated data**

* `df.duplicated()` --> returns True or False for each row; `df.duplicated().sum()` counts the duplicated rows. It can also be used on a single column:
* `df['name'].duplicated().sum()`
* Show the duplicated / non-duplicated rows:

```
# Create a temporary table for study (.concat() appends the first 5 rows of df to df again)
temp = pd.concat([df, df.head()], axis=0)
temp

# ~ means "not": show the rows that are not duplicated
temp[~temp.duplicated()]
```

**Mislabeled and corrupted data**

There is no general formula for data that is logically wrong ==> each column has to be checked, using two groups of syntax:

![](https://i.imgur.com/OH6d7pg.png)

* For continuous columns: `df.describe()`
* For categorical columns: `df['Pclass'].unique()`

**Descriptive statistics, detect and handle outliers**

```
# Compute a descriptive statistics review
df.describe()
```

![](https://i.imgur.com/tIAMgTH.png)

![](https://i.imgur.com/NfIGMM8.png)

The values returned by .describe() are descriptive statistics:

* Descriptive statistics: the branch of statistics concerned with describing the data (as opposed to inferential statistics). It measures:
    - Central tendency: the point around which the data concentrates
    - Spread (dispersion): how far the data stretches, that is, how scattered it is

**Measures of central tendency:**

* Mean: the average value
* Median: the middle value
* Mode: the value that occurs most often

**Spread**

![](https://i.imgur.com/HNFnmOh.png)

Quantiles: describe how spread out the data is

![](https://i.imgur.com/fB9ikay.png)

![](https://i.imgur.com/T9p678F.png)

Identifying outliers takes 3 steps:

![](https://i.imgur.com/KQfr7Cf.png)

Example:

```
df['Fare'].describe()

# Calculate the quartiles Q1 and Q3
q1 = df['Fare'].quantile(0.25)  # Q1
q3 = df['Fare'].quantile(0.75)  # Q3

# Calculate the interquartile range
iqr = q3 - q1

# Calculate the whiskers
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

# Normal
df[(lower < df['Fare']) & (df['Fare'] < upper)]

# Abnormal -- outliers (for Fare, in practice the passengers who paid a lot)
abnormal = df[(df['Fare'] < lower) | (df['Fare'] > upper)]
abnormal.groupby('Sex')['Survived'].mean()
```

---------------------------------------

# **Day 5/8/2021 DATA VISUALIZATION**

* If the data shows little variation: describe the details first, then the overall picture.
* If the data shows clear differences: describe the overall picture first, then the details.

==> Pay attention to the colours used in the chart.

**Visualizing with Seaborn**

```
import pandas as pd
import seaborn as sns
```

1. sns.countplot

https://seaborn.pydata.org/generated/seaborn.countplot.html

```
# How many countries are there in each Income Group
sns.countplot(data=df, x='Income Group')
```

![](https://i.imgur.com/mzHEpEu.png)

2. sns.barplot

https://seaborn.pydata.org/generated/seaborn.barplot.html

```
# Get the data: average Internet users per Income Group
plot_data = df.groupby('Income Group')['Internet users'].mean().reset_index()
```

```
sns.barplot(data=plot_data, x='Income Group', y='Internet users')
```

```
# Option 2: Change in the seaborn syntax
sns.barplot(data=plot_data, x='Income Group', y='Internet users', color='green',
            order=['Low income', 'Lower middle income', 'Upper middle income', 'High income'])
```

![](https://i.imgur.com/vAz153u.png)

3. sns.lineplot

https://seaborn.pydata.org/generated/seaborn.lineplot.html

```
# Line chart to compare the average Internet users rate between Income Groups
# Step 1: Get the plot data (select the column first so only numeric data is averaged)
plot_data = df.groupby('Income Group')['Internet users'].mean().reset_index()

# Step 2: Plot
sns.lineplot(data=plot_data, x='Income Group', y='Internet users')
```

![](https://i.imgur.com/tFX44Lo.png)

4. sns.scatterplot

https://seaborn.pydata.org/generated/seaborn.scatterplot.html

```
# Visualize the correlation between Internet rate and Birth rate
# Here we can break down a dimension using the parameter hue="column_name"
sns.scatterplot(data=df, x='Internet users', y='Birth rate', hue='Income Group')
```

![](https://i.imgur.com/y55OeNz.png)

5. 
sns.histplot

https://seaborn.pydata.org/generated/seaborn.histplot.html

```
# Visualize the distribution of Birth rate
sns.histplot(data=df, x='Birth rate')
```

![](https://i.imgur.com/ZiMtQMr.png)

6. sns.boxplot

https://seaborn.pydata.org/generated/seaborn.boxplot.html

```
# Visualize the distribution of Birth rate
sns.boxplot(data=df, x='Birth rate')
```

![](https://i.imgur.com/RnxE2p4.png)

7. sns.pairplot

```
sns.pairplot(df)
```

![](https://i.imgur.com/43xujTt.png)

---------------------------------------------

**Notes on choosing a chart:**

For numeric data, use:

* Histogram
* Boxplot
* Countplot (can also be used with categories)

==> For charts about proportions, use a pie chart only when there are few categories and one category clearly dominates the others.

==> With too many categories, use a bar plot.

==> Use a line chart when one of the two columns is time or continuous.

==> If one column is numeric and the other is a category --> use a bar chart.

==> A boxplot is useful for comparing groups, for example how salaries differ between departments.

![](https://i.imgur.com/K6VzG9W.jpg)

---------------------------------------------

# **Day 6/8/2021 MATPLOTLIB**

This is a data visualization library for Python. Seaborn is built on top of Matplotlib.

* Some terminology

![](https://i.imgur.com/uubW0tr.jpg)

![](https://i.imgur.com/ilTs3go.png)

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

1. FIGURE AND SUBPLOTS

👉 To create figure and subplots:

plt.figure() ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html

plt.subplot() ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html

* plt.figure(figsize=(width, height)) --> creates a figure; the first plot drawn on it automatically gets a single subplot
* Syntax:

```
plt.figure(figsize=(20, 6))  # width, height in inches (max 30)
sns.countplot(data=df, x='day')
plt.show()
```

2. 
MULTIPLE SUBPLOTS IN ONE FIGURE

```
plt.figure(figsize=(20, 6))  # width, height in inches (max 30)

plt.subplot(121)  # 1 row, 2 columns, first subplot
sns.countplot(data=df, x='day')

plt.subplot(122)  # 1 row, 2 columns, second subplot
sns.countplot(data=df, x='day')

plt.show()
```

**Adding titles**

Now that we have all the graphs we need, let's give them names.

👉 To add a title:

plt.suptitle() ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.suptitle.html

plt.title() ▸ https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.title.html

```
plt.figure(figsize=(10, 7))
plt.suptitle('Bills per day')  # title of the whole figure (placeholder text)

# Plot on first subplot
plt.subplot(121)
sns.countplot(data=df, x='day')
plt.title('First subplot')  # title of this subplot (placeholder text)

# Plot on second subplot
plt.subplot(122)
sns.countplot(data=df, x='day')
plt.title('Second subplot')  # title of this subplot (placeholder text)

plt.show()
```

3. FORMATTING AXIS & STYLING

* plt.xlabel / plt.ylabel --> set the labels of the x and y axes

```
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.xlabel("Days of Week")
plt.ylabel("Number of bill")
plt.show()
```

* plt.xticks / plt.yticks --> set the tick positions on the x and y axes

```
# Working with a continuous axis:
plt.figure(figsize=(5, 5))
sns.histplot(data=df, x='total_bill')
plt.xticks(ticks=[0, 5, 6, 7, 8, 10, 15, 20])  # choose which tick positions to display
plt.show()
```

or

```
# Working with a categorical axis:
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'])
# Use plt.xticks or plt.yticks to change the ticks
plt.xticks(ticks=[0, 2, 3], labels=['Thursday', 'Saturday', 'Sunday'])
plt.show()
```

* plt.xlim / plt.ylim --> change the range of an axis

```
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'])
# Use plt.ylim to limit or expand the range of the y axis
plt.ylim(0, 100)
plt.show()
```

* plt.twinx() --> adds a second y axis when drawing 2 charts on the same figure

```
# Average tip per day, used by the line chart
plot_data = df.groupby('day')['tip'].mean().reset_index()

plt.figure(figsize=(8, 8))
sns.countplot(data=df, x='day')
plt.ylabel('Number of bills')

plt.twinx()  # second y axis sharing the same x axis
sns.lineplot(data=plot_data, x='day', y='tip')
plt.ylabel('Average Tip')
plt.show()
```

* Styling

```
# Set up the theme
plt.style.use('default')
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.show()
```

* ADDING TEXT

plt.text( ) ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html

Some important parameters:

* x: x location
* y: y location
* s: the message, as a string
* horizontalalignment or ha: "center", "right", or "left"
* verticalalignment or va: 'center', 'top', 'bottom', 'baseline', 'center_baseline'

Furthermore, any other formatting parameter can be passed as a keyword argument or in a dictionary (fontdict). Please refer to the link.

```
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.text(x=0, y=80, s="This is a string", ha='center', color='pink', fontsize='large')
plt.show()
```

**PANDAS BUILT-IN PLOT FUNCTION .plot( )**

![](https://i.imgur.com/9R4ID1Q.png)

![](https://i.imgur.com/Zko60Da.png)

![](https://i.imgur.com/n4zfQH7.png)

![](https://i.imgur.com/TTYR9Vd.png)

![](https://i.imgur.com/1ju8oux.png)

![](https://i.imgur.com/x2Y6E90.png)
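The screenshots above show .plot( ) only as images, so here is a minimal sketch of how the built-in plotting could be called. The mini dataset is hypothetical (a stand-in for the tips data used in these notes), and the chart kinds, title and labels are assumptions chosen for illustration, not necessarily what the screenshots contain.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mini version of the tips data
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sun"],
    "total_bill": [10.0, 20.0, 15.0, 30.0, 25.0, 18.0],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 3.5],
})

# Series.plot: bar chart of the average tip per day
avg_tip = df.groupby("day")["tip"].mean()
ax = avg_tip.plot(kind="bar", figsize=(5, 5), title="Average tip per day")

# .plot returns a Matplotlib Axes, so the usual formatting methods still apply
ax.set_xlabel("Day")
ax.set_ylabel("Average tip")

# DataFrame.plot: scatter plot of two numeric columns (drawn on a new figure)
ax2 = df.plot(kind="scatter", x="total_bill", y="tip")
```

In a notebook the charts render when the cell finishes; in a plain script, call plt.show() afterwards. Because .plot( ) is built on Matplotlib, the plt.xlabel, plt.xticks and plt.text calls from the previous section work on these charts too.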