---
title: Virgil - Data Cleaning - S31 Outliers
tags: Virgil, LearnWorld, DataCleaning
---
<a target="_blank" href="https://colab.research.google.com/drive/14mOln7RdPT43xn3vVf3ZtK699TVxJjzT"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
```
import pandas as pd
```
## 6. Detect outliers
```
df = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/titanic.csv')
```
**❊ SPREAD**:
- Range, Variance, Standard Deviation
- Quartiles, Interquartile Range

One measure that describes how spread out the data is comes from the **Quartiles**: the three quartiles (Q1, Q2, Q3) divide the sorted data into four equal parts.
<img src='https://cdn.scribbr.com/wp-content/uploads/2020/09/iqr_quartiles.png'>
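A quick sketch of what the quartiles look like on a toy series (the data here is illustrative, not from the Titanic dataset):

```python
import pandas as pd

# Toy data: the quartiles split the sorted values into four equal parts
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

q1 = s.quantile(0.25)  # 25th percentile (Q1)
q2 = s.quantile(0.50)  # 50th percentile (Q2, the median)
q3 = s.quantile(0.75)  # 75th percentile (Q3)
print(q1, q2, q3)      # 2.75 4.5 6.25
```

Note that pandas interpolates linearly between data points by default, which is why the quartiles fall between the observed values.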
```
df['Fare'].describe()
```
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: Fare, dtype: float64
**Interquartile Range**
<img src='https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png' width=700>
To identify the outliers, we follow 3 steps:
- ***Step 1***: Calculate the Interquartile Range
IQR = Q3 - Q1
- ***Step 2***: Calculate the lower whisker and upper whisker
upper-whisker = Q3 + 1.5*IQR
lower-whisker = Q1 - 1.5*IQR
- ***Step 3***: Identify the outliers. Data below the lower whisker or above the upper whisker is considered an outlier.
```
# Visualize with boxplot
df['Fare'].plot(kind='box', figsize=(10, 10));
```
```
# Calculate the quartiles (Q1 and Q3)
q1 = df['Fare'].quantile(0.25)
q3 = df['Fare'].quantile(0.75)
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the whiskers
upper = q3 + 1.5*iqr
lower = q1 - 1.5*iqr
```
```
lower
```
-26.724
🥸 **DISCUSSION: Given these upper and lower values, which is the right syntax to filter the data that falls between lower and upper?**
A/ ```df[lower < df['Fare'] < upper]```
B/ ```df[lower < df['Fare'] & df['Fare'] < upper]```
C/ ```df[(lower < df['Fare']) & (df['Fare'] < upper)]```
D/ ```df[(df['Fare'] < lower) & (df['Fare'] > upper)]```
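To see why only the fully parenthesized form works, try the options on a toy frame (the fares and bounds below are illustrative stand-ins for the values computed above):

```python
import pandas as pd

# Toy frame; lower/upper mimic the whiskers computed above
df = pd.DataFrame({'Fare': [5.0, 20.0, 300.0]})
lower, upper = -26.724, 65.634

# Option A: Python rewrites `lower < s < upper` as `(lower < s) and (s < upper)`,
# and `and` calls bool() on a Series, which is ambiguous -> ValueError
try:
    df[lower < df['Fare'] < upper]
    chained_ok = True
except ValueError:
    chained_ok = False

# Option C: parenthesized masks combined element-wise with & work as intended
inside = df[(lower < df['Fare']) & (df['Fare'] < upper)]
print(chained_ok, inside['Fare'].tolist())  # False [5.0, 20.0]
```

Option B fails for a similar reason: `&` binds tighter than `<`, so the masks must be parenthesized before combining them.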
```
# Filter the outliers: abnormal data lies below lower or above upper
abnormal = df[(lower > df['Fare']) | (df['Fare'] > upper)]
```
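As an aside, pandas also provides `Series.between`, which builds the same boolean mask more compactly. A minimal sketch on toy fares (standing in for `df['Fare']` above):

```python
import pandas as pd

# Toy fares standing in for df['Fare'] from the cells above
fares = pd.Series([0.0, 7.9, 14.5, 31.0, 512.3])

q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Series.between builds the same mask as the two parenthesized comparisons;
# negating it with ~ selects the outliers
abnormal = fares[~fares.between(lower, upper)]
print(abnormal.tolist())  # [512.3]
```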
```
abnormal['Survived'].mean()
```
0.6810344827586207
```
abnormal.groupby('Sex')['Survived'].mean()
```
Sex
female 0.928571
male 0.304348
Name: Survived, dtype: float64
# SUMMARY:
1. Remove unwanted data:
```python
.info() # Check general information
.describe() # Statistical overview of the numeric columns
.drop(columns=['A', 'B', 'C'], inplace=True) # Drop columns
.astype('int64') # Cast the column to a different datatype
.rename(columns={'Name':'PassengerName', 'SibSp':'NumberSibling'}, inplace=True) # Rename columns
# Advanced
pd.to_datetime(column) # Cast the column to datetime datatype
```
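A minimal end-to-end sketch of these calls on a toy frame (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy frame: a droppable column, a string-typed numeric column, a date string
df = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'SibSp': ['1', '0'],
    'Junk': [None, None],
    'When': ['2020-01-01', '2020-01-02'],
})

df.drop(columns=['Junk'], inplace=True)                      # drop an unwanted column
df['SibSp'] = df['SibSp'].astype('int64')                    # cast digit strings to integers
df.rename(columns={'SibSp': 'NumberSibling'}, inplace=True)  # rename for clarity
df['When'] = pd.to_datetime(df['When'])                      # cast to datetime
print(df.dtypes)
```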
2. Handling missing values:
```python
.isna() # Detect NaN values
.isnull() # Alias of .isna()
df['column'].isna().sum() # Count the NaN values in the column
df[df['column'].isna()] # Display the rows that contain NaN values
.fillna(0) # Fill NaN with 0
```
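These calls chain together naturally; a short sketch on a toy column with missing values:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan]})

n_missing = df['Age'].isna().sum()  # count the NaN values
filled = df['Age'].fillna(0)        # replace NaN with 0
print(n_missing, filled.tolist())   # 2 [22.0, 0.0, 35.0, 0.0]
```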
3. Handling duplication:
```python
.duplicated() # Detect duplicated values
df['column'].duplicated().sum() # Count the duplicates in the column
df[df['column'].duplicated()] # Select duplicated rows
df[~df['column'].duplicated()] # Select non-duplicated rows
```
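A quick illustration of how `.duplicated()` flags only the second and later occurrences (toy data):

```python
import pandas as pd

# Toy column with repeated values
df = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2', 'C3', 'C3']})

dup_count = df['Ticket'].duplicated().sum()   # only repeats are flagged
unique_rows = df[~df['Ticket'].duplicated()]  # keep the first occurrence of each
print(dup_count, unique_rows['Ticket'].tolist())  # 2 ['A1', 'B2', 'C3']
```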
4. Handling mislabeled and corrupted data:
- Scan through each column; use `.info()`, `.describe()`, `.value_counts()`, `.unique()`, `.nunique()`, etc. to overview the column.
- Use `.apply()`, `.str.replace()`, etc. to fix the errors, or simply drop them.
5. Filter outliers using statistics.
- Step 1: Calculate the IQR: IQR = Q3 - Q1
- Step 2: Calculate whisker:
upper_whisker = Q3 + 1.5*IQR
lower_whisker = Q1 - 1.5*IQR
- Step 3: Everything below the lower whisker or above the upper whisker is considered an outlier.
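The three steps can be wrapped in a small reusable helper (the function name is ours, not from the lesson; the toy fares are illustrative):

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return a boolean mask marking IQR outliers in a numeric Series."""
    q1, q3 = s.quantile([0.25, 0.75])  # Step 1: quartiles and IQR
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr             # Step 2: whiskers
    upper = q3 + 1.5 * iqr
    return (s < lower) | (s > upper)   # Step 3: flag values outside the whiskers

fares = pd.Series([0.0, 7.9, 14.5, 31.0, 512.3])
mask = iqr_outliers(fares)
print(fares[mask].tolist())  # [512.3]
```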