# Data Cleaning Laundry List

Pay attention to the following. Removing problematic data is a useful approach, but do it only when it is sensible. Running side-by-side comparisons, with and without the removed data, shows the effect of the removal.

General tools include `pd.melt` (reshape features from wide to long), `df.pivot` (undo a melt), and `pd.concat` (concatenate datasets).

## Sanity check

1. What are the relevant features?
1. Does my data prove or disprove a claim?

## Feature selection

- Consider normalizing the features
- Irrelevant features (subjective)
- Features with only fixed values
  - 0 variance should show up during normalization
  - Check with `df.nunique` and remove the column with `df.drop`
- Features with few (continuous) values
  - Close to 0 variance
  - Can be removed with sklearn `VarianceThreshold`; plot the number of remaining features as the threshold goes up

## Encoding

- One-hot and dummy encoding are ways to encode categorical data
- If the levels have a natural order, perhaps do not encode them this way.
- Be careful of the intercept term when encoding: with an intercept in the model, a variable with $k$ levels may need only $k-1$ binary variables rather than $k$.

## Outliers

- Visualize with a boxplot or scatterplot
- Check the z-score
- Removing is an option if sensible

## Broken data

- Data points with missing features
  - Could fill (impute) with the mean (unskewed data) or the median (robust to outliers/skew)
  - Impute with `df.fillna`, drop with `df.dropna`
- Duplicate features
- Duplicate data points
  - Remove them, as they may appear in both train & test sets
  - Check with `df.duplicated()` and remove with `df.drop_duplicates(inplace=True)`
- Inconsistencies
  - NA, N/A, Not applicable
  - String spelling errors
  - String upper/lower case
  - String spacing/dashes
  - String punctuation or symbols
  - Date format
  - Numerical quantity units
  - Numerical format (decimals, leading zeros)
  - Convert type (e.g. string/float) with `df.astype`

# Sources

https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/
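
# Example

Several items from the checklist above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not a definitive pipeline: the toy DataFrame, its column names (`city`, `size`, `unit`, `score`), the helper names `clean` and `zscore_outliers`, the list of NA spellings, and the z-score cutoff of 3 are all hypothetical choices made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
raw = pd.DataFrame({
    "city": ["Oslo", " oslo", "Bergen", "Bergen", "N/A", "Oslo"],
    "size": ["10", "12", "11", "11", "13", "95"],   # numbers stored as strings
    "unit": ["m2"] * 6,                             # feature with one fixed value
    "score": [1.0, 2.0, np.nan, np.nan, 4.0, 3.0],  # missing values
})

# Assumed spellings of "not available"; extend for your own data.
NA_TOKENS = ["na", "n/a", "not applicable"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few checklist items in order (hypothetical helper)."""
    out = df.copy()

    # Inconsistencies: unify spacing and case, then turn NA spellings into NaN.
    city = out["city"].str.strip().str.lower()
    city = city.mask(city.isin(NA_TOKENS))
    out["city"] = city.str.title()

    # Convert type: strings to floats.
    out["size"] = out["size"].astype(float)

    # Features with only fixed values: at most one distinct non-missing value.
    constant = [col for col in out.columns if out[col].nunique() <= 1]
    out = out.drop(columns=constant)

    # Duplicate data points: keep the first occurrence of each row.
    out = out.drop_duplicates()

    # Missing values: impute with the median, which is robust to skew.
    out["score"] = out["score"].fillna(out["score"].median())
    return out


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask of points whose |z-score| exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold


cleaned = clean(raw)

# Dummy encoding with k-1 indicator columns for a k-level variable.
encoded = pd.get_dummies(cleaned, columns=["city"], drop_first=True)

# Outlier check on a separate, longer hypothetical series.
sizes = pd.Series([10, 12, 11, 13, 9, 10, 11, 12, 10, 11, 12, 95], dtype=float)
flags = zscore_outliers(sizes)

print(cleaned)
print(encoded.columns.tolist())
print(sizes[flags])
```

`drop_first=True` keeps $k-1$ indicator columns per variable, which is one way to handle the intercept concern noted under Encoding. The z-score here uses the sample standard deviation (the pandas default); with ten rows or fewer no single point can exceed a cutoff of 3, so the check is only meaningful on a reasonably long column, and flagged points should be inspected before any removal.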