# Feature Engineering Techniques
###### tags: `Machine Learning`, `Feature Engineering`
## 1. Missing Data Imputation
- Complete Case Analysis (remove rows with missing values)
- Mean/Median
- Arbitrary value (fill with a specific value outside the feature's normal range)
- Mode
- Missing Category (fill with the value "Missing")
- Random sampling from the feature's empirical distribution
- Missing Indicator (add a new column flagging missing entries as 1)
- KNN
- MICE
#### Code: Simple methods
```python=
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# LotFrontage is kept out of features_numeric so it is handled only
# by the constant imputer below, not imputed twice
features_numeric = ['BsmtUnfSF', 'MasVnrArea']
features_categoric = ['BsmtQual', 'FireplaceQu', 'MSZoning',
                      'Street', 'Alley']
preprocessor = ColumnTransformer(transformers=[
    ('numeric_imputer', SimpleImputer(strategy='mean'), features_numeric),
    ('categoric_imputer', SimpleImputer(strategy='most_frequent'), features_categoric),
    ('imputer_LotFrontage', SimpleImputer(
        strategy='constant', fill_value=999), ['LotFrontage']),
])
```
#### Code KNN
```python=
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(steps=[
    # impute each NA from the 5 nearest rows, weighted by distance
    ('imputer', KNNImputer(
        n_neighbors=5,
        weights='distance',
        add_indicator=False)),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])
param_grid = {
    'imputer__n_neighbors': [3, 5, 10],
    'imputer__weights': ['uniform', 'distance'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}
grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')
grid_search.fit(X_train, y_train)
```
#### Code MICE
```python=
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer(
    estimator=BayesianRidge(),     # model that predicts the NAs; alternatives:
                                   # KNeighborsRegressor(n_neighbors=5),
                                   # DecisionTreeRegressor(max_features='sqrt', random_state=0)
    initial_strategy='mean',       # how NAs are imputed in the first pass
    max_iter=10,                   # number of imputation cycles
    imputation_order='ascending',  # the order in which to impute the variables
    n_nearest_features=None,       # whether to limit the number of predictors
    skip_complete=True,            # whether to ignore variables without NAs
    random_state=0,
)
imputer.fit(X_train)
train_t = imputer.transform(X_train)
```
## 2. Category Encoding
- One hot
- One hot of top categories only
- Integer Encoding
- Count/Frequency Encoding
- Mean Encoding
- Ratio Encoding
- Weight of Evidence
- Rare Encoding (group infrequent labels into a single "Rare" category)
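A minimal sketch of three of these encodings in pandas, using a hypothetical `city` feature and `price` target; for mean encoding in particular, compute the mapping on the training set only to avoid target leakage:
```python=
import pandas as pd

# hypothetical data: a categorical feature and a numeric target
df = pd.DataFrame({'city': ['HN', 'HCM', 'HN', 'DN', 'HCM', 'HN'],
                   'price': [10, 20, 12, 8, 22, 11]})

# One hot: one binary column per category
onehot = pd.get_dummies(df['city'], prefix='city')

# Count/Frequency encoding: replace each category by its relative frequency
freq_map = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq_map)

# Mean encoding: replace each category by the mean of the target
mean_map = df.groupby('city')['price'].mean()
df['city_mean'] = df['city'].map(mean_map)
```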
## 3. Variable Transformation
-> Transform a variable so its distribution becomes closer to normal
- Logarithm: np.log(X)
- Reciprocal: 1/X
- Square root: X**(1/2)
- Power (exponent): X**k
- Box-Cox
- Yeo-Johnson
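A short sketch of the last two with scikit-learn's PowerTransformer (the skewed random data is assumed only for illustration):
```python=
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.random.lognormal(size=(100, 1))  # skewed, strictly positive data

X_log = np.log(X)      # logarithm
X_sqrt = X ** (1 / 2)  # square root

# Box-Cox requires strictly positive values; Yeo-Johnson also handles <= 0
X_bc = PowerTransformer(method='box-cox').fit_transform(X)
X_yj = PowerTransformer(method='yeo-johnson').fit_transform(X)
```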
## 4. Discretisation
-> Process of transforming continuous variables into discrete variables.
- Equal-width
- Equal-frequency
- K-means
- Decision Trees
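The first three methods map directly to the `strategy` parameter of scikit-learn's KBinsDiscretizer; a minimal sketch (random data assumed):
```python=
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.normal(size=(100, 1))

# strategy='uniform' -> equal-width, 'quantile' -> equal-frequency,
# 'kmeans' -> bin edges from 1D k-means clusters
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_binned = disc.fit_transform(X)
```
For tree-based discretisation, one common approach is to fit a shallow DecisionTreeRegressor on the single feature and use its leaf predictions as bin labels.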
## 5. Outlier Handling
- **Trimming**: remove outliers from data
- **Treat outliers as missing data**
- **Discretisation**: place outliers in the boundary bins
- **Censoring (capping)**: cap values at chosen minimum/maximum thresholds
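A minimal sketch of trimming, NA-conversion, and censoring using the common 1.5 × IQR fences (a random pandas Series is assumed):
```python=
import numpy as np
import pandas as pd

s = pd.Series(np.random.normal(size=1000))

# 1.5 * IQR fences, a common rule for flagging outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

s_trimmed = s[(s >= lower) & (s <= upper)]      # trimming: drop outliers
s_as_na = s.where((s >= lower) & (s <= upper))  # outliers -> NA, then impute
s_capped = s.clip(lower=lower, upper=upper)     # censoring: cap at the fences
```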
## 6. Feature Scaling
- Standardisation (mean 0, variance 1): $x_i = \frac{x_i - \mu}{\sigma}$
- Mean Normalisation [-1, 1]: $x_i = \frac{x_i - \mu}{x_{max} - x_{min}}$
- MinMaxScaling [0,1]: $x_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$
- MaxAbsScaling [-1, 1]: $x_i = \frac{x_i}{x_{|max|}}$
- RobustScaling: $x_i = \frac{x_i - \text{median}}{IQR}$ (robust to outliers)
- Scaling to Unit norm: $x_i = \frac{x_i}{\|x\|}$, where $\|x\|$ is the L1 or L2 norm of the observation's feature vector
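Most of these map to scikit-learn scalers (Mean Normalisation has no direct counterpart); a minimal sketch with random data:
```python=
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   MaxAbsScaler, RobustScaler, Normalizer)

X = np.random.normal(loc=5, scale=2, size=(100, 3))

X_std = StandardScaler().fit_transform(X)   # mean 0, variance 1
X_mm = MinMaxScaler().fit_transform(X)      # range [0, 1]
X_ma = MaxAbsScaler().fit_transform(X)      # range [-1, 1]
X_rb = RobustScaler().fit_transform(X)      # (x - median) / IQR
X_unit = Normalizer(norm='l2').fit_transform(X)  # each row scaled to unit L2 norm
```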