# Data Cleaning Laundry List
Points to check when cleaning a dataset. Removing problematic data is a useful approach, but only when the removal is justified; a side-by-side comparison with and without the removed data shows its effect. General tools include `pd.melt` (reshape wide to long), `df.pivot` (reshape long to wide, undoing a melt), and `pd.concat` (concatenate datasets).
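As a minimal sketch of the three general tools, the snippet below reshapes a small table and then appends more rows. The table, its column names (`city`, `2020`, `2021`) and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["A", "B"],
    "2020": [1.0, 3.0],
    "2021": [2.0, 4.0],
})

# melt: wide -> long, one row per (city, year) pair.
long = pd.melt(wide, id_vars="city", var_name="year", value_name="value")

# pivot: long -> wide again, undoing the melt.
restored = long.pivot(index="city", columns="year", values="value").reset_index()
restored.columns.name = None  # drop the leftover "year" label on the columns

# concat: stack a second dataset with the same columns underneath the first.
more = pd.DataFrame({"city": ["C"], "2020": [5.0], "2021": [6.0]})
combined = pd.concat([wide, more], ignore_index=True)
```

`ignore_index=True` renumbers the rows of the combined table, which avoids repeated index labels after concatenation.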
## Sanity check
1. What are the relevant features?
1. Does my data prove or disprove a claim?
## Feature selection
- Consider normalizing the features
- Irrelevant features (subjective)
- Features with only one fixed value
    - Zero variance should show up during normalization, e.g. as a division by zero when standardizing
    - Check with `df.nunique()` and remove the column with `df.drop`
- Continuous features with few distinct values
    - Variance close to 0
    - Can be removed with sklearn `VarianceThreshold`; plot the number of remaining features as the threshold goes up
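The checks above can be sketched as follows. The data is synthetic and the column names and threshold values are assumptions chosen for illustration; the thresholds in particular depend on the scale of the features, which is one reason to consider normalizing first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: one constant column, one near-constant column,
# and two columns with clearly non-zero variance.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "constant": np.ones(100),
    "near_constant": np.r_[np.zeros(99), 1.0],
    "small": rng.normal(0.0, 0.5, size=100),
    "large": rng.normal(0.0, 10.0, size=100),
})

# Columns with a single distinct value carry no information: drop them.
n_unique = df.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
df = df.drop(columns=constant_cols)

# Count how many features survive as the variance threshold goes up.
thresholds = [0.0, 0.05, 1.0]  # assumed values, not a recommendation
n_kept = []
for t in thresholds:
    selector = VarianceThreshold(threshold=t).fit(df)
    n_kept.append(int(selector.get_support().sum()))
```

Plotting `n_kept` against `thresholds` gives the curve mentioned in the list. Note that `VarianceThreshold` raises an error when no feature exceeds the threshold, so a sweep should stop before that point.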
## Encoding
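- One-hot and dummy encoding are two ways to encode categorical data
- If the categories are ordered, one-hot or dummy encoding discards the order, so it may be better not to encode this way.
- Be careful with the intercept term when encoding: a model with an intercept needs only $k-1$ binary variables for the $k$ levels of a variable, because $k$ binary variables plus the intercept are collinear.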
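A minimal sketch of the difference between the two encodings, using `pd.get_dummies` on a made-up `color` column. The last lines show one possible way to keep the order of an ordered variable, integer codes from an ordered `pd.Categorical`; this is an assumption about what "do not encode" could look like in practice, not the only option.

```python
import pandas as pd

# Hypothetical categorical feature with k = 3 levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: k binary columns, one per level.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: k - 1 binary columns. The dropped level ("blue", the first
# in sorted order) becomes the baseline that the intercept absorbs.
dummy = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Ordered categories: integer codes keep the order low < medium < high.
sizes = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)
size_codes = sizes.codes.tolist()
```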
## Outliers
- Visualize with boxplot or scatterplot
- Check z-score
- Removing is an option if sensible
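The z-score check can be sketched as below. The values and the cutoff of 2.5 are assumptions for illustration. With only 10 points, the largest possible population z-score is $\sqrt{n-1} = 3$, so the common cutoff of 3 would never flag anything in a sample this small.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 50.0])

# z-score: distance from the mean in units of (population) standard deviation.
z = (values - values.mean()) / values.std(ddof=0)

threshold = 2.5  # assumed cutoff; pick one that suits the sample size
outliers = values[z.abs() > threshold]
cleaned = values[z.abs() <= threshold]
```

Because the mean and standard deviation are themselves pulled by the outlier, a visual check with a boxplot or scatterplot remains useful alongside the z-score.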
## Broken data
- Data points with missing features
    - Could fill (impute) with the mean (for unskewed data) or the median (robust to outliers and skew)
    - Impute with `df.fillna`, drop with `df.dropna`
- Duplicate features
- Duplicate data points
    - Remove them, as they may appear in both the train and test sets
    - Check with `df.duplicated()` and remove with `df.drop_duplicates(inplace=True)`
- Inconsistencies
    - Missing-value markers: `NA`, `N/A`, `Not applicable`
    - String spelling errors
    - String upper/lower case
    - String spacing/dashes
    - String punctuation or symbols
    - Date format
    - Numerical quantity units
    - Numerical format (decimals, leading zeros)
    - Convert types (e.g. string/float) with `df.astype`
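Several of the items above can be combined into one small cleaning pass. This is a sketch on a made-up table (`city` and `price` are hypothetical column names), and the order of the steps is one reasonable choice rather than a fixed recipe: strings are normalized before deduplication so that rows differing only in case or spacing are compared on their cleaned form.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with inconsistent strings, a duplicate row,
# numbers stored as text, and a missing-value marker.
raw = pd.DataFrame({
    "city": [" New-York", "new york", "Boston", "Boston", "Chicago"],
    "price": ["010.50", "12", "7.25", "7.25", "N/A"],
})

# Map the different spellings of "missing" to a real NaN.
df = raw.replace(["NA", "N/A", "Not applicable"], np.nan)

# Normalize strings: trim spaces, lower-case, replace dashes with spaces.
df["city"] = df["city"].str.strip().str.lower().str.replace("-", " ", regex=False)

# Convert numeric strings to floats; leading zeros disappear here.
df["price"] = df["price"].astype(float)

# Drop exact duplicate rows (here the second "boston" row).
df = df.drop_duplicates().reset_index(drop=True)

# Impute the remaining missing price with the median, which is robust to skew.
df["price"] = df["price"].fillna(df["price"].median())
```

Rows that cannot be imputed sensibly could instead be dropped with `df.dropna(subset=[...])`.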
# Sources
https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/