Virgil - Data Cleaning - S11 Determine Important Features

--- title: Virgil - Data Cleaning - S11 Determine Important Features tags: Virgil, LearnWorld, DataCleaning --- <a target="_blank" href="https://colab.research.google.com/drive/1GEtYEtHp7wIywIdCv1ffJq704hRoMDf_"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a> ``` import pandas as pd ``` ## ❊ Overview ``` df = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/titanic.csv') ``` ``` df.sample(10) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>619</th> <td>620</td> <td>0</td> <td>2</td> <td>Gavey, Mr. Lawrence</td> <td>male</td> <td>26.00</td> <td>0</td> <td>0</td> <td>31028</td> <td>10.5000</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>514</th> <td>515</td> <td>0</td> <td>3</td> <td>Coleff, Mr. Satio</td> <td>male</td> <td>24.00</td> <td>0</td> <td>0</td> <td>349209</td> <td>7.4958</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>496</th> <td>497</td> <td>1</td> <td>1</td> <td>Eustis, Miss. Elizabeth Mussey</td> <td>female</td> <td>54.00</td> <td>1</td> <td>0</td> <td>36947</td> <td>78.2667</td> <td>D20</td> <td>C</td> </tr> <tr> <th>490</th> <td>491</td> <td>0</td> <td>3</td> <td>Hagland, Mr. Konrad Mathias Reiersen</td> <td>male</td> <td>NaN</td> <td>1</td> <td>0</td> <td>65304</td> <td>19.9667</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>198</th> <td>199</td> <td>1</td> <td>3</td> <td>Madigan, Miss. Margaret "Maggie"</td> <td>female</td> <td>NaN</td> <td>0</td> <td>0</td> <td>370370</td> <td>7.7500</td> <td>NaN</td> <td>Q</td> </tr> <tr> <th>50</th> <td>51</td> <td>0</td> <td>3</td> <td>Panula, Master. Juha Niilo</td> <td>male</td> <td>7.00</td> <td>4</td> <td>1</td> <td>3101295</td> <td>39.6875</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>388</th> <td>389</td> <td>0</td> <td>3</td> <td>Sadlier, Mr. Matthew</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>367655</td> <td>7.7292</td> <td>NaN</td> <td>Q</td> </tr> <tr> <th>524</th> <td>525</td> <td>0</td> <td>3</td> <td>Kassem, Mr. Fared</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>2700</td> <td>7.2292</td> <td>NaN</td> <td>C</td> </tr> <tr> <th>644</th> <td>645</td> <td>1</td> <td>3</td> <td>Baclini, Miss. Eugenie</td> <td>female</td> <td>0.75</td> <td>2</td> <td>1</td> <td>2666</td> <td>19.2583</td> <td>NaN</td> <td>C</td> </tr> <tr> <th>703</th> <td>704</td> <td>0</td> <td>3</td> <td>Gallagher, Mr. Martin</td> <td>male</td> <td>25.00</td> <td>0</td> <td>0</td> <td>36864</td> <td>7.7417</td> <td>NaN</td> <td>Q</td> </tr> </tbody> </table> </div> ``` # Show dataset information df.info() ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB ## 1. Remove unwanted data Not all the provided data is neccessary for the analysis, or in other words, able to answer the defined questions. It can be either that we only need certain number of columns certain number of rows. For example, sale data from 2009 to 2012 only, not the entire time. One of the early stage of data cleaning is to filter out those irrelavant data. ``` # You might want to have a copy of the original data. df2 = df.copy() ``` ``` df2.head() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>NaN</td> <td>S</td> </tr> </tbody> </table> </div> The implementation on code is easy with `drop`. ``` df2.drop(columns=['Pclass']) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>Allen, Mr. William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>886</th> <td>887</td> <td>0</td> <td>Montvila, Rev. Juozas</td> <td>male</td> <td>27.0</td> <td>0</td> <td>0</td> <td>211536</td> <td>13.0000</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>887</th> <td>888</td> <td>1</td> <td>Graham, Miss. Margaret Edith</td> <td>female</td> <td>19.0</td> <td>0</td> <td>0</td> <td>112053</td> <td>30.0000</td> <td>B42</td> <td>S</td> </tr> <tr> <th>888</th> <td>889</td> <td>0</td> <td>Johnston, Miss. Catherine Helen "Carrie"</td> <td>female</td> <td>NaN</td> <td>1</td> <td>2</td> <td>W./C. 6607</td> <td>23.4500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>889</th> <td>890</td> <td>1</td> <td>Behr, Mr. Karl Howell</td> <td>male</td> <td>26.0</td> <td>0</td> <td>0</td> <td>111369</td> <td>30.0000</td> <td>C148</td> <td>C</td> </tr> <tr> <th>890</th> <td>891</td> <td>0</td> <td>Dooley, Mr. Patrick</td> <td>male</td> <td>32.0</td> <td>0</td> <td>0</td> <td>370376</td> <td>7.7500</td> <td>NaN</td> <td>Q</td> </tr> </tbody> </table> <p>891 rows × 11 columns</p> </div> ``` df2.head() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>NaN</td> <td>S</td> </tr> </tbody> </table> </div> ``` # Drop columns df2.drop(columns=['Cabin', 'Embarked'], inplace=True) # Cach 1 df2 = df2.drop(columns=['Cabin', 'Embarked']) #Cach 2 # Drop rows df2.drop(index=[0, 2, 3], inplace=True) ``` ``` df2 ``` Change the name of specific column ▸ {old name: new name} ``` df2 = df2.rename(columns={'Name':'PassengerName', 'SibSp':'NumberSibling'}) df2 ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>PassengerName</th> <th>Sex</th> <th>Age</th> <th>NumberSibling</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>886</th> <td>887</td> <td>0</td> <td>2</td> <td>Montvila, Rev. Juozas</td> <td>male</td> <td>27.0</td> <td>0</td> <td>0</td> <td>211536</td> <td>13.0000</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>887</th> <td>888</td> <td>1</td> <td>1</td> <td>Graham, Miss. Margaret Edith</td> <td>female</td> <td>19.0</td> <td>0</td> <td>0</td> <td>112053</td> <td>30.0000</td> <td>B42</td> <td>S</td> </tr> <tr> <th>888</th> <td>889</td> <td>0</td> <td>3</td> <td>Johnston, Miss. Catherine Helen "Carrie"</td> <td>female</td> <td>NaN</td> <td>1</td> <td>2</td> <td>W./C. 6607</td> <td>23.4500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>889</th> <td>890</td> <td>1</td> <td>1</td> <td>Behr, Mr. Karl Howell</td> <td>male</td> <td>26.0</td> <td>0</td> <td>0</td> <td>111369</td> <td>30.0000</td> <td>C148</td> <td>C</td> </tr> <tr> <th>890</th> <td>891</td> <td>0</td> <td>3</td> <td>Dooley, Mr. Patrick</td> <td>male</td> <td>32.0</td> <td>0</td> <td>0</td> <td>370376</td> <td>7.7500</td> <td>NaN</td> <td>Q</td> </tr> </tbody> </table> <p>891 rows × 12 columns</p> </div> Change all the columns' names. Names must be in the same order with the columns. ``` df2.columns = ['A', 'B', 'C', 'D', 'E', 'F', '7', '8', '9', '10', '11', '12'] ``` ``` df2.head() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>A</th> <th>B</th> <th>C</th> <th>D</th> <th>E</th> <th>F</th> <th>7</th> <th>8</th> <th>9</th> <th>10</th> <th>11</th> <th>12</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>NaN</td> <td>S</td> </tr> </tbody> </table> </div> ## 2. Check and change datatype ``` # Check datatype df.info() ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB ``` # Using astype to change the datatype df['Age'] = df['Age'].astype('int') ``` ``` df.info() ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB