---
title: Virgil - Data Cleaning - S31 Outliers
tags: Virgil, LearnWorld, DataCleaning
---

<a target="_blank" href="https://colab.research.google.com/drive/14mOln7RdPT43xn3vVf3ZtK699TVxJjzT"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

```
import pandas as pd
```

## 6. Detect outliers

```
df = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/titanic.csv')
```

**❊ SPREAD**:

- Range, Variance, Standard Deviation
- Quartiles, Interquartile Range

One measure that describes how spread out the data is uses the **quartiles**, which divide the data into 4 equal parts.

<img src='https://cdn.scribbr.com/wp-content/uploads/2020/09/iqr_quartiles.png'>

```
df['Fare'].describe()
```

    count    891.000000
    mean      32.204208
    std       49.693429
    min        0.000000
    25%        7.910400
    50%       14.454200
    75%       31.000000
    max      512.329200
    Name: Fare, dtype: float64

**Interquartile Range**

<img src='https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png' width=700>

To identify the outliers, we follow 3 steps:

- ***Step 1***: Calculate the interquartile range: IQR = Q3 - Q1
- ***Step 2***: Calculate the lower and upper whiskers: upper_whisker = Q3 + 1.5*IQR, lower_whisker = Q1 - 1.5*IQR
- ***Step 3***: Identify the outliers: data below the lower whisker or above the upper whisker is considered an outlier.

```
# Visualize with a boxplot
df['Fare'].plot(kind='box', figsize=(10, 10));
```

```
# Calculate the quartiles Q1 and Q3
q1 = df['Fare'].quantile(0.25)
q3 = df['Fare'].quantile(0.75)

# Calculate the interquartile range
iqr = q3 - q1

# Calculate the whiskers
upper = q3 + 1.5*iqr
lower = q1 - 1.5*iqr
```

```
lower
```

    -26.724

🥸 **DISCUSSION: Given these `upper` and `lower` values, which syntax correctly filters the data that lies between `lower` and `upper`?**

A/ ```df[lower < df['Fare'] < upper]```

B/ ```df[lower < df['Fare'] & df['Fare'] < upper]```

C/ ```df[(lower < df['Fare']) & (df['Fare'] < upper)]```

D/ ```df[(df['Fare'] < lower) & (df['Fare'] > upper)]```

```
# Filter the data: abnormal data lies outside the [lower, upper] range
abnormal = df[(lower > df['Fare']) | (df['Fare'] > upper)]
```

```
abnormal['Survived'].mean()
```

    0.6810344827586207

```
abnormal.groupby('Sex')['Survived'].mean()
```

    Sex
    female    0.928571
    male      0.304348
    Name: Survived, dtype: float64

# SUMMARY:

1. Remove unwanted data:

```python
.info()    # Check general information
.describe()    # Statistical overview of the column
.drop(columns=['A', 'B', 'C'], inplace=True)    # Drop columns
.astype('int64')    # Cast the column to a different datatype
.rename(columns={'Name':'PassengerName', 'SibSp':'NumberSibling'}, inplace=True)    # Rename columns

# Advanced
pd.to_datetime(column)    # Cast the column to the datetime datatype
```

2. Handling missing values:

```python
.isna()    # Detect NaN values
.isnull()    # Alias of .isna()
df['column'].isna().sum()    # Count how many NaN values are in the column
df[df['column'].isna()]    # Display the rows containing NaN values
.fillna(0)    # Fill NaN with 0
```

3. Handling duplication:

```python
.duplicated()    # Detect duplication
df['column'].duplicated().sum()    # Count how many duplicates are in the column
df[df['column'].duplicated()]    # Select duplicated rows
df[~df['column'].duplicated()]    # Select non-duplicated rows
```

4. Handling mislabeled and corrupted data:
- Scan through each column; use `.info()`, `.describe()`, `.value_counts()`, `.unique()`, `.nunique()`, etc. to get an overview of the column.
- Use `.apply()`, `.str.replace()`, etc. to fix the errors, or simply drop them (see the label-cleaning sketch after this summary).
5. Filter outliers using statistics (see the IQR-filter sketch after this summary):
- Step 1: Calculate the interquartile range: IQR = Q3 - Q1
- Step 2: Calculate the whiskers: upper_whisker = Q3 + 1.5*IQR, lower_whisker = Q1 - 1.5*IQR
- Step 3: Everything below the lower whisker or above the upper whisker is considered an outlier.
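
A minimal label-cleaning sketch for item 4 above. The `Sex` column and the bad labels below are hypothetical examples chosen for illustration, not values from the Titanic dataset:

```python
import pandas as pd

# Hypothetical mislabeled column (not the actual Titanic data)
df = pd.DataFrame({'Sex': ['male', ' Male', 'M', 'female', 'femalee']})

# Overview the distinct labels first
print(df['Sex'].value_counts())

# Normalize whitespace and case, then map the known typos to the correct labels
df['Sex'] = df['Sex'].str.strip().str.lower()
df['Sex'] = df['Sex'].replace({'m': 'male', 'femalee': 'female'})

print(df['Sex'].value_counts())    # now only 'male' and 'female' remain
```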
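
And an IQR-filter sketch consolidating the three steps of item 5 into one reusable function. The name `filter_outliers_iqr` and the `k` parameter are my own additions; `k=1.5` is the conventional multiplier used in this lesson:

```python
import pandas as pd

def filter_outliers_iqr(df, column, k=1.5):
    """Split df into (normal, abnormal) rows using the IQR rule on `column`."""
    q1 = df[column].quantile(0.25)    # Step 1: quartiles
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1                     # Step 1: interquartile range
    lower = q1 - k * iqr              # Step 2: lower whisker
    upper = q3 + k * iqr              # Step 2: upper whisker
    is_outlier = (df[column] < lower) | (df[column] > upper)    # Step 3
    return df[~is_outlier], df[is_outlier]

df = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/titanic.csv')
normal, abnormal = filter_outliers_iqr(df, 'Fare')
print(abnormal['Survived'].mean())    # same computation as in section 6
```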