# Feature Engineering Techniques

###### tags: `Machine Learning`, `Feature Engineering`

## 1. Missing Data Imputation

- Complete Case Analysis (remove the row)
- Mean/Median
- Arbitrary value (fill with a specific number outside the range of that feature)
- Mode
- Missing Category (fill with the value "Missing")
- Random sampling from the feature's probability distribution
- Missing Indicator (create a new column and flag missing entries with 1)
- KNN
- MICE

#### Code Simple method

```python=
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# 'LotFrontage' gets its own constant imputer below, so it is not
# listed among the mean-imputed columns (it would be transformed twice).
features_numeric = ['BsmtUnfSF', 'MasVnrArea']
features_categoric = ['BsmtQual', 'FireplaceQu', 'MSZoning', 'Street', 'Alley']

preprocessor = ColumnTransformer(transformers=[
    ('numeric_imputer', SimpleImputer(strategy='mean'), features_numeric),
    ('categoric_imputer', SimpleImputer(strategy='most_frequent'), features_categoric),
    ('imputer_LotFrontage', SimpleImputer(strategy='constant', fill_value=999), ['LotFrontage']),
])
```

#### Code KNN

```python=
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5, weights='distance', add_indicator=False)),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

param_grid = {
    'imputer__n_neighbors': [3, 5, 10],
    'imputer__weights': ['uniform', 'distance'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')
grid_search.fit(X_train, y_train)
```

#### Code MICE

```python=
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer(
    estimator=BayesianRidge(),     # the estimator to predict the NA; alternatives:
                                   # KNeighborsRegressor(n_neighbors=5) or
                                   # DecisionTreeRegressor(max_features='sqrt', random_state=0)
    initial_strategy='mean',       # how NA will be imputed in step 1
    max_iter=10,                   # number of cycles
    imputation_order='ascending',  # the order in which to impute the variables
    n_nearest_features=None,       # whether to limit the number of predictors
    skip_complete=True,            # whether to ignore variables without NA
    random_state=0,
)
imputer.fit(X_train)
train_t = imputer.transform(X_train)
```

## 2. Category Encoding

- One hot
- Top one hot (one hot only for the most frequent categories)
- Integer Encoding
- Count/Frequency Encoding
- Mean Encoding
- Ratio Encoding
- Weight of Evidence
- Rare Encoding

A code sketch for several of these encoders is given below.

## 3. Variable Transformation

-> Transform data toward a normal distribution (code sketch below).

- Logarithm: `np.log(X)`
- Reciprocal: `1/X`
- Square root: `X**(1/2)`
- Exponential/Power: `X**k`
- Box-Cox
- Yeo-Johnson

## 4. Discretisation

-> The process of transforming continuous variables into discrete variables (code sketch below).

- Equal-width
- Equal-frequency
- K-means
- Decision Trees

## 5. Outlier Handling

- **Trimming**: remove outliers from the data
- **Treat outliers as missing data**: set them to NA, then impute as in section 1
- **Discretisation**: place outliers in the border bins
- **Censoring**: cap values beyond a maximum/minimum threshold (capping)

A code sketch of these options is given below.

## 6. Feature Scaling

- Standardisation (mean 0, variance 1): $x_i = \frac{x_i - \mu}{\sigma}$
- Mean Normalisation [-1, 1]: $x_i = \frac{x_i - \mu}{x_{max} - x_{min}}$
- MinMaxScaling [0, 1]: $x_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$
- MaxAbsScaling [-1, 1]: $x_i = \frac{x_i}{\max(|x|)}$
- RobustScaling (handles outliers): $x_i = \frac{x_i - \text{median}}{IQR}$
- Scaling to unit norm: $x_i = \frac{x_i}{\|x\|}$, where $\|x\|$ is the L1 or L2 norm of the sample vector

A code sketch of these scalers is given below.
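#### Code Category Encoding

A minimal sketch of a few of the encoders from section 2, using only pandas and scikit-learn. The frame `df`, its columns, and the 0.2 rarity threshold are made-up examples, not part of the original note.

```python=
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame: one categorical feature and a binary target
df = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Target': [1, 0, 1, 0, 0, 1],
})

# One hot encoding
one_hot = pd.get_dummies(df['Neighborhood'], prefix='Neighborhood')

# Integer encoding
df['Neighborhood_int'] = OrdinalEncoder().fit_transform(df[['Neighborhood']]).ravel()

# Count/Frequency encoding: replace each category by its relative frequency
freq = df['Neighborhood'].value_counts(normalize=True)
df['Neighborhood_freq'] = df['Neighborhood'].map(freq)

# Mean (target) encoding: replace each category by the mean of the target
target_mean = df.groupby('Neighborhood')['Target'].mean()
df['Neighborhood_mean'] = df['Neighborhood'].map(target_mean)

# Rare encoding: group categories rarer than the threshold under 'Rare'
rare = freq[freq < 0.2].index
df['Neighborhood_rare'] = df['Neighborhood'].where(
    ~df['Neighborhood'].isin(rare), 'Rare')
```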
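#### Code Variable Transformation

A sketch of the transformations from section 3. It assumes `X` is a strictly positive numeric array or DataFrame (log, reciprocal and Box-Cox are undefined otherwise); the exponent 0.3 is an arbitrary example of `k`.

```python=
import numpy as np
from sklearn.preprocessing import PowerTransformer

X_log = np.log(X)      # logarithm
X_rec = 1 / X          # reciprocal
X_sqrt = X ** (1 / 2)  # square root
X_pow = X ** 0.3       # exponential/power transform with exponent k

# Box-Cox requires strictly positive data; Yeo-Johnson accepts any real values
X_boxcox = PowerTransformer(method='box-cox').fit_transform(X)
X_yeojohnson = PowerTransformer(method='yeo-johnson').fit_transform(X)
```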
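#### Code Discretisation

A sketch of the four binning approaches from section 4 via `KBinsDiscretizer` and a shallow decision tree. It reuses `X_train`/`y_train` and the `LotFrontage` column from the imputation snippets, assuming missing values have already been imputed; the bin count of 5 and `max_depth=3` are arbitrary choices.

```python=
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

# Equal-width, equal-frequency and k-means binning
equal_width = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
equal_freq = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
kmeans_bins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
X_binned = equal_width.fit_transform(X_train[['LotFrontage']])

# Decision-tree discretisation: the tree's predictions take one value per
# leaf, so they act as a discretised version of the feature
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_train[['LotFrontage']], y_train)
X_tree_binned = tree.predict(X_train[['LotFrontage']])
```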
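#### Code Outlier Handling

A sketch of the options from section 5 on a single column, assuming `X_train` is a pandas DataFrame containing the `LotFrontage` column used earlier. The 1.5 × IQR fences are the usual boxplot convention, not a fixed rule.

```python=
import numpy as np

q1 = X_train['LotFrontage'].quantile(0.25)
q3 = X_train['LotFrontage'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inlier = X_train['LotFrontage'].between(lower, upper)

# Trimming: drop the rows that fall outside the fences
X_trimmed = X_train[inlier]

# Censoring (capping): clip values to the fences instead of dropping rows
X_train['LotFrontage_capped'] = X_train['LotFrontage'].clip(lower, upper)

# Treat outliers as missing data: set them to NaN, then impute as in section 1
X_train['LotFrontage_na'] = X_train['LotFrontage'].where(inlier, np.nan)
```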
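#### Code Feature Scaling

A sketch of the scalers from section 6 via scikit-learn, again reusing `X_train` (assumed numeric and fully imputed). Mean normalisation has no dedicated scikit-learn class, so it is written out by hand for a DataFrame.

```python=
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

X_standard = StandardScaler().fit_transform(X_train)   # mean 0, variance 1
X_minmax = MinMaxScaler().fit_transform(X_train)       # [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X_train)       # [-1, 1]
X_robust = RobustScaler().fit_transform(X_train)       # (x - median) / IQR
X_unit = Normalizer(norm='l2').fit_transform(X_train)  # unit norm per row

# Mean normalisation, done by hand on a DataFrame
X_meannorm = (X_train - X_train.mean()) / (X_train.max() - X_train.min())
```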