# Data Wrangling Techniques
1. Dealing Missing Values
2. Dealing Data Formatting
3. Data Normalization
4. Binning
5. Turning Categorical Data to Quantitative Data
## 1. Dealing Missing Values
Steps:
1. Identify missing data
2. Deal with missing data
3. Correct data format
#### Converting missing values to NaN(not a number)
- df.replace()
```python
df.replace("?", np.nan, inplace=True)
```
#### Evaluating for missing values
- df.isna() or df.isnull()
- df.notna() or df.notnull()
```python
missing_data = df.isna()
```
- **Quicker method:**
```python
df.isna().sum() # return int
```
or
```python
df.isna().any() # return bool
```
#### Counting missing values
```python
missing_data = df.isna()
for column in missing_data.columns.values.tolist():
print(column)
print(missing_data[column].value_counts())
print("")
```
#### Deal with missing data
1. Drop data
- Drop the whole row
- Drop the whole column
```python
df.dropna(subset=[column], axis=0, inplace=True)
```
2. Replace data
- Replace by mean
```python
average = df[column].astype(float).mean()
df[column].replace(np.nan, average, inplace=True)
```
- Replace by frequency
```python
df[column].value_counts().idxmax()
```
## 2. Correct data format
- .dtypes to check data types for all columns
- .astype() to convert data type
## 3. Data Normalization
- Simple method
```python
df[column] = df[column] / df[column].max()
```
- Using Range
```python
df[column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())
```
- Z-score
```python
df[column] = (df[column] - df[column].mean()) / df[column].std()
```
## 4. Binning
Steps:
1. Divide into equal bins using np.linspace()
2. Set group names
3. Use pd.cut() function
```python
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ['Low', 'Medium', 'High']
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)
```
## 5. Categorical to Quantitative Data
* Using Indicator variable or Dummy variable
* pd.get_dummies(df[column])
```python
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
# merge data frame "df" and "dummy_variable_1"
df = pd.concat([df, dummy_variable_1], axis=1)
# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)
```