Virgil - Data Cleaning - S21 Missing Values Duplication Mislabel

---
title: Virgil - Data Cleaning - S21 Missing Values Duplication Mislabel
tags: Virgil, LearnWorld, DataCleaning
---

<a target="_blank" href="https://colab.research.google.com/drive/1iXzKsWGmWvXSiwrG6SODFqVWJq5i1rO4"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

```
import pandas as pd
```

## 3. Working with Missing Values

**▸ Sources of Missing Values**

Before we dive into code, it's important to understand where missing data comes from. Here are some typical reasons why data is missing:

- A user forgot to fill in a field.
- Data was lost while being transferred manually from a legacy database.
- There was a programming error.
- Users chose not to fill out a field because of their beliefs about how the results would be used or interpreted.

As you can see, some of these sources are simple random mistakes. Other times, there can be a deeper reason why data is missing.

**▸ Methods to work with Missing Values**

* Delete the cases with missing observations. This is the simplest solution, and it is acceptable when it causes the loss of only a relatively small number of cases.
* Fill in the missing value with the mean, mode, median, or another constant value.
* Use the rest of the data to predict the missing values, for example with regression or KNN.
* Missing observation correlation. Consider (xi, yi) pairs with some observations missing. The means and SDs of x and y can still be used in the estimate even when one member of a pair is missing.

In Pandas, there are several useful methods for detecting, removing, and replacing null values:

* `isna()` or `isnull()`: Generate a boolean mask indicating missing values
* `notna()` or `notnull()`: Opposite of `isna()`
* `dropna()`: Return a filtered version of the data
* `fillna()`: Return a copy of the data with missing values filled or imputed

**▸ Detecting null value**: Pandas data structures have two useful methods for detecting null data: `isnull()` and `notnull()`. Either one will return a Boolean mask over the data.
For example:

```
data = pd.DataFrame({'Column_1': [1, None, 2],
                     'Column_2': [2, 3, 5],
                     'Column_3': [None, 4, 6]})
data
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1.0</td> <td>2</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>NaN</td> <td>3</td> <td>4.0</td> </tr> <tr> <th>2</th> <td>2.0</td> <td>5</td> <td>6.0</td> </tr> </tbody> </table> </div>

```
data.isna()
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>False</td> <td>False</td> <td>True</td> </tr> <tr> <th>1</th> <td>True</td> <td>False</td> <td>False</td> </tr> <tr> <th>2</th> <td>False</td> <td>False</td> <td>False</td> </tr> </tbody> </table> </div>

```
# In Python, True = 1 and False = 0
True + False + True
```

    2

```
# Because True = 1 and False = 0, summing the mask counts the NaN values per column
data.isna().sum()
```

    Column_1    1
    Column_2    0
    Column_3    1
    dtype: int64

```
# Show the rows that have a NaN value in Column_1
data[data['Column_1'].isna()]
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>1</th> <td>NaN</td> <td>3</td> <td>4.0</td> </tr> </tbody> </table> </div>

**▸ Dropping null value**: In addition to the masking used before, there are two convenience methods: `dropna()` (which removes NA values) and `fillna()` (which fills in NA values). The result is straightforward, but note that both methods return a new object and leave the original DataFrame unchanged. To keep the change, assign the result back (for example `data = data.dropna()`) or pass `inplace=True`.

```
data
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1.0</td> <td>2</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>NaN</td> <td>3</td> <td>4.0</td> </tr> <tr> <th>2</th> <td>2.0</td> <td>5</td> <td>6.0</td> </tr> </tbody> </table> </div>

```
# Drop the rows that contain any null value
data.dropna(axis=0)
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>2</th> <td>2.0</td> <td>5</td> <td>6.0</td> </tr> </tbody> </table> </div>

```
data
```

We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
Depending on the application, you might want one or the other, so `dropna()` gives a number of options for a DataFrame. By default, `dropna()` will drop all rows in which any null value is present. ``` # Drop the columns data.dropna(axis=1) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_2</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>2</td> </tr> <tr> <th>1</th> <td>3</td> </tr> <tr> <th>2</th> <td>5</td> </tr> </tbody> </table> </div> ``` data ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1.0</td> <td>2</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>NaN</td> <td>3</td> <td>4.0</td> </tr> <tr> <th>2</th> <td>2.0</td> <td>5</td> <td>6.0</td> </tr> </tbody> </table> </div> **▸ Filling null values**: Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the `isnull()` method as a mask, but because it is such a common operation Pandas provides the `fillna()` method, which returns a copy of the array with the null values replaced. 
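As a minimal sketch of the filling options described above, the snippet below rebuilds the small three-column frame used in this section and fills it in three ways: with a constant, with each column's mean, and by linear interpolation. The variable names `filled_zero`, `filled_mean`, and `filled_interp` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Same toy frame as in the examples above
data = pd.DataFrame({'Column_1': [1, None, 2],
                     'Column_2': [2, 3, 5],
                     'Column_3': [None, 4, 6]})

# 1) Constant fill: every NaN becomes 0
filled_zero = data.fillna(0)

# 2) Mean imputation: each NaN is replaced by the mean of its own column
filled_mean = data.fillna(data.mean())

# 3) Linear interpolation between neighbouring valid values in each column.
#    With the default settings a leading NaN (Column_3, row 0) has no earlier
#    value to interpolate from, so it stays NaN.
filled_interp = data.interpolate()

# fillna() and interpolate() return new objects; `data` itself is unchanged
print(data.isna().sum().sum())           # NaN count in the original frame
print(filled_mean.isna().sum().sum())    # NaN count after mean imputation
print(filled_interp.isna().sum().sum())  # NaN count after interpolation
```

Which strategy is appropriate depends on the column: a constant is simple but can distort statistics, while the mean preserves the column average at the cost of shrinking its variance.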
``` data ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1.0</td> <td>2</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>NaN</td> <td>3</td> <td>4.0</td> </tr> <tr> <th>2</th> <td>2.0</td> <td>5</td> <td>6.0</td> </tr> </tbody> </table> </div> ``` # Fill all missing values with the constant 0 data['Column_1'].fillna(0) ``` 0 1.0 1 0.0 2 2.0 Name: Column_1, dtype: float64 ``` data ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Column_1</th> <th>Column_2</th> <th>Column_3</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>2</td> <td>Unknown</td> </tr> <tr> <th>1</th> <td>Unknown</td> <td>3</td> <td>4</td> </tr> <tr> <th>2</th> <td>2</td> <td>5</td> <td>6</td> </tr> </tbody> </table> </div> ``` data.info() ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 3 entries, 0 to 2 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Column_1 3 non-null object 1 Column_2 3 non-null int64 2 Column_3 3 non-null object dtypes: int64(1), object(2) memory usage: 200.0+ bytes **Example on the dataset** ``` df = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/titanic.csv') ``` ``` df.sample(5) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: 
right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>301</th> <td>302</td> <td>1</td> <td>3</td> <td>McCoy, Mr. Bernard</td> <td>male</td> <td>NaN</td> <td>2</td> <td>0</td> <td>367226</td> <td>23.2500</td> <td>NaN</td> <td>Q</td> </tr> <tr> <th>554</th> <td>555</td> <td>1</td> <td>3</td> <td>Ohman, Miss. Velin</td> <td>female</td> <td>22.0</td> <td>0</td> <td>0</td> <td>347085</td> <td>7.7750</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>330</th> <td>331</td> <td>1</td> <td>3</td> <td>McCoy, Miss. Agnes</td> <td>female</td> <td>NaN</td> <td>2</td> <td>0</td> <td>367226</td> <td>23.2500</td> <td>NaN</td> <td>Q</td> </tr> <tr> <th>869</th> <td>870</td> <td>1</td> <td>3</td> <td>Johnson, Master. Harold Theodor</td> <td>male</td> <td>4.0</td> <td>1</td> <td>1</td> <td>347742</td> <td>11.1333</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>227</th> <td>228</td> <td>0</td> <td>3</td> <td>Lovell, Mr. 
John Hall ("Henry")</td> <td>male</td> <td>20.5</td> <td>0</td> <td>0</td> <td>A/5 21173</td> <td>7.2500</td> <td>NaN</td> <td>S</td> </tr> </tbody> </table> </div> ``` # Check for null value df.isna().sum() ``` PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64 ``` df['Age'].isna() ``` ``` # Display the rows where Age is null df[df['Age'].isna()] ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>5</th> <td>6</td> <td>0</td> <td>3</td> <td>Moran, Mr. James</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>330877</td> <td>8.4583</td> <td>NaN</td> <td>Q</td> </tr> <tr> <th>17</th> <td>18</td> <td>1</td> <td>2</td> <td>Williams, Mr. Charles Eugene</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>244373</td> <td>13.0000</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>19</th> <td>20</td> <td>1</td> <td>3</td> <td>Masselmani, Mrs. Fatima</td> <td>female</td> <td>NaN</td> <td>0</td> <td>0</td> <td>2649</td> <td>7.2250</td> <td>NaN</td> <td>C</td> </tr> <tr> <th>26</th> <td>27</td> <td>0</td> <td>3</td> <td>Emir, Mr. Farred Chehab</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>2631</td> <td>7.2250</td> <td>NaN</td> <td>C</td> </tr> <tr> <th>28</th> <td>29</td> <td>1</td> <td>3</td> <td>O'Dwyer, Miss. 
Ellen "Nellie"</td> <td>female</td> <td>NaN</td> <td>0</td> <td>0</td> <td>330959</td> <td>7.8792</td> <td>NaN</td> <td>Q</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>859</th> <td>860</td> <td>0</td> <td>3</td> <td>Razi, Mr. Raihed</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>2629</td> <td>7.2292</td> <td>NaN</td> <td>C</td> </tr> <tr> <th>863</th> <td>864</td> <td>0</td> <td>3</td> <td>Sage, Miss. Dorothy Edith "Dolly"</td> <td>female</td> <td>NaN</td> <td>8</td> <td>2</td> <td>CA. 2343</td> <td>69.5500</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>868</th> <td>869</td> <td>0</td> <td>3</td> <td>van Melkebeke, Mr. Philemon</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>345777</td> <td>9.5000</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>878</th> <td>879</td> <td>0</td> <td>3</td> <td>Laleff, Mr. Kristo</td> <td>male</td> <td>NaN</td> <td>0</td> <td>0</td> <td>349217</td> <td>7.8958</td> <td>NaN</td> <td>S</td> </tr> <tr> <th>888</th> <td>889</td> <td>0</td> <td>3</td> <td>Johnston, Miss. Catherine Helen "Carrie"</td> <td>female</td> <td>NaN</td> <td>1</td> <td>2</td> <td>W./C. 6607</td> <td>23.4500</td> <td>NaN</td> <td>S</td> </tr> </tbody> </table> <p>177 rows × 12 columns</p> </div> ``` # Ratio of the number of NaN values compared to the data df[df['Age'].isna()].shape[0] / df.shape[0] ``` 0.19865319865319866 ``` # Drop the data? df['Age'].dropna() ``` ``` df['Age'].mean() ``` 29.69911764705882 ``` # Fill with the average Age df['Age'] = df['Age'].fillna(df['Age'].mean()) ``` ``` df.isna().sum() ``` PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64 ``` # Fill null values with "Unknown" df['Cabin'] = df['Cabin'].fillna('Unknown') ``` ``` # Check for null value again df.isna().sum() ``` ## 4. 
Handling duplicated data ``` df.head() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. 
William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>Unknown</td> <td>S</td> </tr> </tbody> </table> </div> ``` # Check for the whole duplicated rows df.duplicated().sum() ``` 0 ``` # Check for duplication in "Name" only df['Name'].duplicated().sum() ``` 0 ``` # Create a temporary table for study temp = pd.concat([df, df.head()], axis=0) temp ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. 
William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. 
William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>Unknown</td> <td>S</td> </tr> </tbody> </table> <p>896 rows × 12 columns</p> </div> ``` temp.duplicated().sum() ``` 5 ``` # Display the duplication temp[temp.duplicated()] ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.0</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.0</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.0</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.0</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. 
William Henry</td> <td>male</td> <td>35.0</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>Unknown</td> <td>S</td> </tr> </tbody> </table> </div>

```
# Select the non-duplicated rows: ~ means "not"
temp[~temp.duplicated()]
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Name</th> <th>Sex</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Ticket</th> <th>Fare</th> <th>Cabin</th> <th>Embarked</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>0</td> <td>3</td> <td>Braund, Mr. Owen Harris</td> <td>male</td> <td>22.000000</td> <td>1</td> <td>0</td> <td>A/5 21171</td> <td>7.2500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>1</th> <td>2</td> <td>1</td> <td>1</td> <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td> <td>female</td> <td>38.000000</td> <td>1</td> <td>0</td> <td>PC 17599</td> <td>71.2833</td> <td>C85</td> <td>C</td> </tr> <tr> <th>2</th> <td>3</td> <td>1</td> <td>3</td> <td>Heikkinen, Miss. Laina</td> <td>female</td> <td>26.000000</td> <td>0</td> <td>0</td> <td>STON/O2. 3101282</td> <td>7.9250</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>3</th> <td>4</td> <td>1</td> <td>1</td> <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td> <td>female</td> <td>35.000000</td> <td>1</td> <td>0</td> <td>113803</td> <td>53.1000</td> <td>C123</td> <td>S</td> </tr> <tr> <th>4</th> <td>5</td> <td>0</td> <td>3</td> <td>Allen, Mr. 
William Henry</td> <td>male</td> <td>35.000000</td> <td>0</td> <td>0</td> <td>373450</td> <td>8.0500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>886</th> <td>887</td> <td>0</td> <td>2</td> <td>Montvila, Rev. Juozas</td> <td>male</td> <td>27.000000</td> <td>0</td> <td>0</td> <td>211536</td> <td>13.0000</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>887</th> <td>888</td> <td>1</td> <td>1</td> <td>Graham, Miss. Margaret Edith</td> <td>female</td> <td>19.000000</td> <td>0</td> <td>0</td> <td>112053</td> <td>30.0000</td> <td>B42</td> <td>S</td> </tr> <tr> <th>888</th> <td>889</td> <td>0</td> <td>3</td> <td>Johnston, Miss. Catherine Helen "Carrie"</td> <td>female</td> <td>29.699118</td> <td>1</td> <td>2</td> <td>W./C. 6607</td> <td>23.4500</td> <td>Unknown</td> <td>S</td> </tr> <tr> <th>889</th> <td>890</td> <td>1</td> <td>1</td> <td>Behr, Mr. Karl Howell</td> <td>male</td> <td>26.000000</td> <td>0</td> <td>0</td> <td>111369</td> <td>30.0000</td> <td>C148</td> <td>C</td> </tr> <tr> <th>890</th> <td>891</td> <td>0</td> <td>3</td> <td>Dooley, Mr. Patrick</td> <td>male</td> <td>32.000000</td> <td>0</td> <td>0</td> <td>370376</td> <td>7.7500</td> <td>Unknown</td> <td>Q</td> </tr> </tbody> </table> <p>891 rows × 12 columns</p> </div>

## 5. Handle mislabeled and corrupted data

To do this, we need to go through each column and see what is really there. Some useful tools for this scanning process are:

* `.describe()`: returns a statistical overview.
* `.unique()`, `.nunique()`, `.value_counts()`: review categorical columns.
* `.apply()`: scans through the rows and fixes them with a predefined function.
* `.str.replace()`: string method that replaces unwanted values in string columns.
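The tools listed above can be sketched on a small example. The frame `toy`, its columns, and the helper `clean_sex` are hypothetical and were made up for illustration; they do not come from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data with inconsistent labels and a currency symbol
toy = pd.DataFrame({
    'sex': ['male', 'Male ', 'female', 'F', 'female'],
    'fare': ['$7.25', '$71.28', '$7.92', '$53.10', '$8.05'],
})

# Review a categorical column: distinct labels and how often each occurs
print(toy['sex'].unique())
print(toy['sex'].nunique())
print(toy['sex'].value_counts())

def clean_sex(label):
    """Normalise one label: trim spaces, lowercase, expand abbreviations."""
    label = label.strip().lower()
    return {'f': 'female', 'm': 'male'}.get(label, label)

# .apply(): run the predefined function on every value of the column
toy['sex'] = toy['sex'].apply(clean_sex)

# .str.replace(): remove the unwanted '$', then convert the text to numbers
toy['fare'] = toy['fare'].str.replace('$', '', regex=False).astype(float)

# .describe(): statistical overview of the now-numeric column
print(toy.describe())
```

Passing `regex=False` treats `'$'` as a literal character rather than a regular-expression anchor.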
```
# Continuous (numeric) columns
df.describe()
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>PassengerId</th> <th>Survived</th> <th>Pclass</th> <th>Age</th> <th>SibSp</th> <th>Parch</th> <th>Fare</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>891.000000</td> <td>891.000000</td> <td>891.000000</td> <td>891.000000</td> <td>891.000000</td> <td>891.000000</td> <td>891.000000</td> </tr> <tr> <th>mean</th> <td>446.000000</td> <td>0.383838</td> <td>2.308642</td> <td>29.699118</td> <td>0.523008</td> <td>0.381594</td> <td>32.204208</td> </tr> <tr> <th>std</th> <td>257.353842</td> <td>0.486592</td> <td>0.836071</td> <td>13.002015</td> <td>1.102743</td> <td>0.806057</td> <td>49.693429</td> </tr> <tr> <th>min</th> <td>1.000000</td> <td>0.000000</td> <td>1.000000</td> <td>0.420000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> </tr> <tr> <th>25%</th> <td>223.500000</td> <td>0.000000</td> <td>2.000000</td> <td>22.000000</td> <td>0.000000</td> <td>0.000000</td> <td>7.910400</td> </tr> <tr> <th>50%</th> <td>446.000000</td> <td>0.000000</td> <td>3.000000</td> <td>29.699118</td> <td>0.000000</td> <td>0.000000</td> <td>14.454200</td> </tr> <tr> <th>75%</th> <td>668.500000</td> <td>1.000000</td> <td>3.000000</td> <td>35.000000</td> <td>1.000000</td> <td>0.000000</td> <td>31.000000</td> </tr> <tr> <th>max</th> <td>891.000000</td> <td>1.000000</td> <td>3.000000</td> <td>80.000000</td> <td>8.000000</td> <td>6.000000</td> <td>512.329200</td> </tr> </tbody> </table> </div>

```
# Categorical columns
df['Pclass'].unique()
```

    array([3, 1, 2])
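
Once the distinct values of a column are known, one way to flag mislabeled entries is to compare the column against the set of values you expect. The sketch below assumes the valid classes are 1, 2, and 3, as in the `unique()` output above. The frame `sample` and its bad value `33` are hypothetical, and mapping `33` to `3` is an assumed correction used only to show the mechanics.

```python
import pandas as pd

# Hypothetical sample; 33 stands in for a mistyped class label
sample = pd.DataFrame({'Pclass': [3, 1, 2, 33, 1]})

# Assumed set of valid labels, taken from the unique() output above
valid_classes = [1, 2, 3]

# Boolean mask of the rows whose label is NOT in the valid set
bad_mask = ~sample['Pclass'].isin(valid_classes)
print(sample[bad_mask])

# One possible fix: map a known typo to the value it was meant to be
sample['Pclass'] = sample['Pclass'].replace({33: 3})

# After the fix, every label belongs to the valid set
print(sample['Pclass'].isin(valid_classes).all())
```

When the intended value cannot be inferred, setting the entry to missing and handling it with the techniques from section 3 is usually safer than guessing.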