# Isolation Forest
* __Links Ref__
* [paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf)
* [visualization](https://towardsdatascience.com/anomaly-detection-with-isolation-forest-visualization-23cd75c281e2)
* [function trial on colab](https://colab.research.google.com/drive/1ogOu5CI-VbKy17OHypq2Jx_rWrhvu7mm?usp=sharing)
* __Implementation Documentation__:
* [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) : Enough for most data analysis related tasks
```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=100,
    max_samples="auto",
    contamination="auto"
)
model.fit(data)  # data: a numeric pandas DataFrame
# compute both outputs before adding new columns, so predict() sees
# the same feature set the model was fitted on
scores = model.decision_function(data)  # roughly in [-0.5, 0.5]; lower = more anomalous
preds = model.predict(data)             # 1 = normal, -1 = anomaly
data['scores'] = scores
data['anomaly_score'] = preds
```
* [spark](https://github.com/titicaca/spark-iforest) : Has support for Hadoop, good for big data
* __What is Isolation Forest (iForest)?__
Isolation Forest is an unsupervised machine learning algorithm built on randomized decision trees. It isolates outliers by randomly selecting a feature from the given set of features and then randomly selecting a split value between the minimum and maximum of that feature. Because anomalous points are few and different, this random partitioning produces noticeably shorter paths in the trees for them, which distinguishes them from the rest of the data.
* __How to Detect Anomalies?__
* Anomalies are the points least similar to the rest of the data, so random splits isolate them after fewer partitions (shorter tree paths).
* The algorithm flags the most quickly isolated points, up to the specified contamination percentage.
* __Notice__: every point is eventually isolated; what matters is how fast.
* __Contamination:__ The fraction of the data you expect to be outliers.
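A toy sketch of how contamination controls the flagged fraction (synthetic data and variable names are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 200 normal points clustered near the origin, 10 obvious outliers far away
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X = np.vstack([normal, outliers])

# contamination=0.05 tells the model to flag roughly 5% of points as anomalies
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = model.fit_predict(X)  # 1 = normal, -1 = anomaly

n_flagged = int((labels == -1).sum())  # roughly 0.05 * 210 points
```

Since every point is eventually isolated, `contamination` is what decides where the cutoff between "isolated fast" and "isolated slowly" lands.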
* __I/O Details__
* Label Encoder:
* iForest can handle any dimension of data.
* Data must be **numeric**. All strings must be encoded into numbers; it is good practice to keep the mapping dictionary on hand so results can be traced back to the original labels.
* The data encoding does not have to be ordinal. **ONE-HOT ENCODING IS NOT REQUIRED**.
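A minimal sketch of the encode-and-keep-a-dictionary practice, using `pandas.factorize` (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "lima"],
    "value": [1.0, 2.5, 0.7, 3.1],
})

# factorize assigns one integer per unique string (ordinal encoding is fine
# for iForest; one-hot is not required) and returns the unique labels,
# from which we keep a decoding dictionary
codes, uniques = pd.factorize(df["city"])
df["city"] = codes
mapping = dict(enumerate(uniques))  # {0: 'paris', 1: 'tokyo', 2: 'lima'}
```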
* Handling Time Stamps
* iForest does **not understand time or seasonality** on its own.
* Split each timestamp into separate numeric columns: hour, day of week, day of month, month, etc.
* These components give the algorithm the temporal context of each observation, e.g. a value that is normal at noon but anomalous at 3 a.m.
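A short sketch of splitting a timestamp column into numeric components with the pandas `.dt` accessor (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2021-03-01 08:30", "2021-03-02 23:15"]),
    "value": [1.2, 9.9],
})

# expand the timestamp into numeric columns the forest can split on
df["hour"] = df["ts"].dt.hour
df["day"] = df["ts"].dt.day
df["month"] = df["ts"].dt.month
df["weekday"] = df["ts"].dt.weekday  # Monday = 0

# feed only numeric columns to the model
features = df.drop(columns=["ts"])
```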
* Output
* iForest (in sklearn) will output `1` for normal values and `-1` anomaly values.
* People often prefer `1` for anomalies and `0` for normal points; a simple mapping converts the labels.
```python
df['anomaly_score'] = df['anomaly_score'].map({1: 0, -1: 1})
```
* Troubleshooting
* Make sure the contamination percentage matches the actual outlier rate of your data.
* Alternatively, plot the distribution of `df['scores']` and pick an anomaly threshold manually.
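A sketch of the manual-threshold approach: instead of trusting `contamination`, inspect the `decision_function` scores and choose a cutoff yourself (the 2nd-percentile cutoff and the synthetic data here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(200, 2)), rng.uniform(6, 8, size=(5, 2))])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = model.decision_function(X)  # lower = more anomalous

# in practice, plot a histogram of `scores` and pick the cutoff by eye;
# here a percentile stands in for that manual choice
threshold = np.percentile(scores, 2)
is_anomaly = scores < threshold
```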
* __My Conclusion:__
iForest is an efficient and effective algorithm when there is enough domain knowledge to set its hyperparameters, and it is easy to use when the data is predominantly numeric.
###### tags: `Notes`