### Preprocessing
Performing a train-test split is a crucial step in the data analysis process, as it helps to ensure that the results of the analysis are robust and generalizable to new data. In a train-test split, the original dataset is divided into two parts: a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate its performance.
**Why perform a train-test split first?**<br />
It is important to perform a train-test split at the first step of the analysis to prevent any data leakage. Data leakage occurs when information from the testing set is used to train the model, which can lead to overfitting and artificially high performance metrics. By splitting the data before any data cleaning or preprocessing is performed, we ensure that the training and testing sets are independent of each other and that the results are representative of the true performance of the model.
As the final dataset will be reduced to 20,000 rows (one per unique booking ID), it becomes feasible to perform cross-validation to evaluate our models later, which is a much more robust method for evaluating the performance of a model than a simple hold-out set. The splitting will therefore be performed as follows:
**How will we do it?**<br />
To start with, the dataset will be split into two parts: a training set and a testing set. The split will be an 80-20 split, with 20% of the data being set aside for testing purposes. This initial split serves as a rough evaluation of the model's performance on new data.
Later in the analysis, the training set will be further split into two parts: a training set and a validation set. The validation set will consist of 20% of the dataset (or 25% of the training data), mimicking the same size as the testing data. The validation split will be used to determine the best parameters and models, such that we can perform a final evaluation on our best model.
There are 20,000 unique booking IDs, but many rows of data for each booking ID. When splitting the dataset into a training set and a testing set, it is important to ensure that the split is done in a way that ensures that each booking ID exists either in the training set or the testing set, but not both.
This is important because if a booking ID appears in both sets, information about that booking leaks from the testing set into training, which can lead to overfitting and artificially high performance metrics. By ensuring that each booking ID exists in exactly one of the two sets, we prevent this leakage and keep the results representative of performance on genuinely unseen bookings.
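As a minimal sketch of how such a group-aware 80-20 split could be done, we can use scikit-learn's `GroupShuffleSplit`; here `df` is assumed to be a pandas DataFrame holding the raw readings with a `bookingID` column (the variable names are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

# 80-20 split where every bookingID lands in exactly one of the two sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["bookingID"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]
```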
### Preprocessing: Manual Univariate Outlier Removal
Manual univariate outlier removal is performed to remove data that does not make sense. This includes data points with -1 Speed and rides that last longer than a few hours.
The existence of such data points can affect the results of the analysis, especially in the case of the -1 Speed data. This is because the -1 Speed is very close to 0, and there are many valid data points in the dataset with a speed of 0. Outlier detection methods may not detect the -1 Speed as an outlier because it is so close to the valid data points.
By manually filtering these data points, we can ensure that the results of the analysis are robust and representative of the true performance of the model. It is important to perform manual univariate outlier removal when the dataset contains data points that do not make sense and could potentially affect the results of the analysis.
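As an illustration, the filtering could look like the snippet below; the `Speed` and `second` column names and the three-hour cut-off are assumptions, not values fixed by the discussion above:

```python
MAX_RIDE_SECONDS = 3 * 60 * 60  # treat rides longer than a few hours as invalid

# Drop the physically impossible -1 Speed readings and overly long rides.
train_df = train_df[train_df["Speed"] >= 0]
train_df = train_df[train_df["second"] <= MAX_RIDE_SECONDS]
```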
#### Ensuring that the booking IDs are split properly
A set intersection is performed between the booking IDs in the training data and the booking IDs in the testing data to determine if there are any booking IDs that exist in both.
This is a mathematical operation that returns the common elements of two sets; here it tells us whether any booking ID in the training data is also present in the testing data. If the intersection is non-empty, the split was not done correctly and must be fixed before proceeding, since shared booking IDs would cause the data leakage described above.
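A minimal sanity check, reusing the `train_df`/`test_df` names from the split sketch above:

```python
train_ids = set(train_df["bookingID"].unique())
test_ids = set(test_df["bookingID"].unique())

# An empty intersection confirms that no booking ID leaks across the split.
leaked = train_ids & test_ids
assert not leaked, f"{len(leaked)} booking IDs appear in both splits"
```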
### Outlier Detection
Outlier detection, also known as anomaly detection, is a crucial task in the field of data analysis and machine learning. Outliers are observations in a dataset that lie far away from other points and do not fit the pattern of the majority of the data. Outliers can be caused by various factors such as measurement error, experimental error, or even fraud. Identifying outliers is essential in many real-world applications, such as credit card fraud detection, medical diagnosis, and quality control.
There are several methods and models that can be used to detect outliers. Here, we use three common models used in outlier detection: Isolation Forest, Local Outlier Factor (LOF), and One-Class Support Vector Machine (One-Class SVM).
Isolation Forest is a popular unsupervised machine learning algorithm for outlier detection. It works by isolating observations using decision trees. The idea behind Isolation Forest is that outliers will require fewer splits to isolate than inliers.
Local Outlier Factor (LOF) is another unsupervised algorithm for outlier detection. It measures the local deviation of a given data point with respect to its neighbors. The LOF algorithm assigns a score to each observation, with a higher score indicating a higher degree of abnormality.
One-Class Support Vector Machine (One-Class SVM) is an unsupervised machine learning algorithm used for outlier detection. It works by learning a boundary around the bulk of the data and then flagging observations that fall significantly outside that boundary. One-Class SVM can handle high-dimensional data, making it a useful tool for detecting outliers in complex datasets.
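For reference, the three detectors can be instantiated as follows; the contamination and kernel settings shown here are illustrative defaults rather than the tuned values used in the analysis:

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

detectors = {
    "isolation_forest": IsolationForest(contamination=0.01, random_state=42),
    "local_outlier_factor": LocalOutlierFactor(n_neighbors=20, contamination=0.01),
    "one_class_svm": OneClassSVM(nu=0.01, kernel="rbf"),
}
# Each detector's fit_predict returns +1 for inliers and -1 for outliers.
```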
#### Custom Outlier Remover
A custom outlier wrapper is developed to remove outliers from the data. The wrapper takes an outlier detector estimator from scikit-learn and drops any rows that the estimator marks as outliers; scikit-learn's detectors only flag outliers, they do not remove them from the data.
The problem with the scikit-learn pipeline is that it does not allow modification of the X and y values, which is necessary in order to drop the outlier rows. To overcome this limitation, imbalanced-learn's pipeline is used instead. It builds on the scikit-learn pipeline but allows X and y to be transformed together through the FunctionSampler class.
The custom outlier wrapper therefore combines a scikit-learn outlier detector with imbalanced-learn's pipeline to remove outliers from the data. This is an important pre-processing step, as it ensures the data is free of outliers that could affect the results of the analysis.
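A minimal sketch of the idea (the actual wrapper may be more elaborate): the removal logic is placed inside imbalanced-learn's `FunctionSampler`, which is allowed to drop rows from both X and y inside a pipeline. The default `IsolationForest` detector here is an assumption.

```python
from imblearn import FunctionSampler
from sklearn.ensemble import IsolationForest

def remove_outliers(X, y, detector=None):
    """Keep only the rows that the detector marks as inliers (+1)."""
    detector = detector if detector is not None else IsolationForest(random_state=42)
    mask = detector.fit_predict(X) == 1
    return X[mask], y[mask]

# validate=False lets the sampler pass DataFrames through unchanged.
outlier_remover = FunctionSampler(func=remove_outliers, validate=False)
```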
#### Democracy Detect
Lastly, we introduce <code>DemocracyDetect</code>, an ensemble of the outlier detection methods above. It operates as a 'democratic' vote: a data point is considered an outlier only if a majority of the outlier detectors agree, effectively combining the results of the individual methods into a final decision.
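A sketch of the voting logic (the real <code>DemocracyDetect</code> class may differ in its interface):

```python
import numpy as np

def democracy_detect(X, detectors):
    """Combine +1/-1 votes from several detectors by majority rule."""
    votes = np.array([det.fit_predict(X) for det in detectors])  # shape (n_detectors, n_rows)
    outlier_votes = (votes == -1).sum(axis=0)
    return np.where(outlier_votes > len(detectors) / 2, -1, 1)
```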
#### NaNMarker
By default, outlier detection methods such as IsolationForest cannot handle NaN values on their own. However, as mentioned previously, we want to remove outliers before moving on to the imputation stage, because their presence may affect the weights inside the imputer. How do we deal with NaN values then? We proceed as follows (a minimal sketch follows the list):
1. Identify and create a boolean mask that represents the position of all the null values
2. Store this boolean mask
3. Fill all the null values using the median of their respective columns. We use the median because the imputed value then sits exactly at the 50th percentile and so should not sway the output of the outlier detector; we fill per column because the columns are on different scales.
4. Perform outlier detection and remove the outlier values.
5. After removal, use the NaN mask to put the NaN values back in.
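A minimal pandas sketch of these five steps, assuming X contains only numeric feature columns (the detector and the exact interface of the real NaNMarker are assumptions):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def detect_with_nan_mask(X: pd.DataFrame) -> pd.DataFrame:
    nan_mask = X.isna()                               # steps 1-2: build and keep the NaN mask
    X_filled = X.fillna(X.median())                   # step 3: per-column median fill
    inliers = IsolationForest(random_state=42).fit_predict(X_filled) == 1
    X_kept = X_filled[inliers]                        # step 4: drop detected outliers
    return X_kept.mask(nan_mask.loc[X_kept.index])    # step 5: restore the original NaNs
```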
### Imputation
Data imputation is the process of replacing missing values in a dataset with an estimation based on existing values. It is commonly used in data analysis to prevent the loss of important information due to incomplete data. It is essential to understand the different methods of imputation available and when to use each one in order to produce reliable and accurate data. Below are the two imputers that we will be using in our analysis.
**Front Fill Imputer**
FrontFillImputer is a method of imputation that replaces missing values with the value from a previous observation. Since the dataset in this assignment is essentially a time series, front fill is a simple yet potentially very powerful imputation method. Below, we define our own custom front fill imputer, as it is not supported in scikit-learn by default.
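A minimal sketch of what such a transformer could look like (the actual implementation in the notebook may differ, e.g. by filling within each booking rather than across the whole frame):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FrontFillImputer(BaseEstimator, TransformerMixin):
    """Forward-fill missing values; back-fill as a fallback for leading NaNs."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        return X.ffill().bfill()
```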
**Rolling Mean Imputer**
RollingAverageImputer replaces missing values with the average of existing values around them. In other words, it performs a sliding-window operation: for a missing value, it takes the average of the last three values. This acts as a more robust, conservative version of the front fill imputer, as it has a larger 'look-back' effect.
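Again as a hedged sketch, such an imputer might look like the following, filling each gap with the mean of the preceding (up to) three observations; leading gaps would still need a fallback:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RollingAverageImputer(BaseEstimator, TransformerMixin):
    """Replace NaNs with the mean of the previous `window` observations."""

    def __init__(self, window: int = 3):
        self.window = window

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        # Mean of the previous rows only (shift excludes the current, missing row).
        look_back = X.rolling(window=self.window, min_periods=1).mean().shift(1)
        return X.fillna(look_back)
```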
### Feature Engineering
The features that we engineer are similar to those generated in the CA1 assignment. They include aggregations such as min, max, and mean, as well as date information extracted from the bookingID.
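As an illustration, the aggregation step could be expressed as below, reusing the `train_df` name from the split sketch; the exact feature set (and the date extraction from the bookingID) follows the CA1 approach and is omitted here:

```python
# One row per bookingID, with min/max/mean aggregates of every numeric reading.
numeric_cols = train_df.select_dtypes("number").columns.drop("bookingID", errors="ignore")
agg_features = train_df.groupby("bookingID")[numeric_cols].agg(["min", "max", "mean"])
agg_features.columns = ["_".join(col) for col in agg_features.columns]
```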
### Oversampling
Oversampling is an important technique in handling unbalanced datasets, where the number of observations in one class is significantly lower than the number of observations in the other class(es). In such cases, traditional machine learning algorithms can be biased towards the majority class and may not perform well on the minority class.
This involves duplicating the minority class observations in the dataset to balance the class distribution. This is done in order to ensure that the machine learning algorithm is trained on a balanced dataset, and has equal representation of each class. By doing so, the algorithm is able to learn the characteristics of the minority class, which may have otherwise been neglected due to the unbalanced class distribution.
**SMOTE**
The basic idea behind SMOTE is to create synthetic samples of the minority class by interpolating between existing samples. This is done by finding the k nearest neighbours for each minority class sample, and then generating a new synthetic sample at a random point between the original sample and one of its neighbours. The process is repeated for all minority class samples.
**RandomOverSampler**
RandomOverSampler simply duplicates randomly selected examples from the minority class in the training dataset.
**ADASYN**
ADASYN is an algorithm that generates synthetic data. Its main advantages are that it does not simply copy existing minority samples and that it generates more synthetic data for 'harder to learn' examples, i.e. minority samples surrounded by many majority-class neighbours.
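For reference, the three samplers share the same `fit_resample(X, y)` interface in imbalanced-learn (the settings shown are defaults, not tuned values):

```python
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

oversamplers = {
    "smote": SMOTE(k_neighbors=5, random_state=42),
    "random_oversampler": RandomOverSampler(random_state=42),
    "adasyn": ADASYN(random_state=42),
}
# Example: X_res, y_res = oversamplers["smote"].fit_resample(X_train, y_train)
```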
### Feature Selection
Feature selection is an important step in any machine learning project as it helps to reduce the curse of dimensionality, improve the interpretability of the model, and increase the performance of the model by reducing overfitting. The curse of dimensionality refers to the phenomenon where the performance of a machine learning algorithm decreases as the number of features increases. This is because a high-dimensional feature space increases the complexity of the model, and reduces the ability of the model to generalize to new data.
There are several methods for feature selection, and in this notebook, we will be discussing three of the most common ones: **Recursive Feature Elimination (RFE)**, **Sequential Feature Selector (SFS)** using f1-score as the scoring metric, and **SelectFromModel**.
**RFE** is a backward selection technique that recursively removes the feature with the lowest importance until the desired number of features is reached. The assumption is that the least important features can safely be eliminated first, leaving the most important features in the final set.
**SFS** is a forward selection technique that involves adding one feature at a time to the feature set, based on the performance of the model using a specified scoring metric. In this notebook, we will be using the f1-score as the scoring metric.
**SelectFromModel** is a method that uses a pre-trained machine learning model to select the most important features. This method works by training the model on the entire feature space, and then selecting the features that have the highest importance based on the coefficients of the model. This method is useful when the feature space is large, and the goal is to reduce the dimensionality of the feature space while maintaining the most important information.
**We do not need to write any custom wrappers here, as these classes work smoothly with imbalanced-learn's pipeline.**
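As an illustration, the three selectors can be configured as follows; the base estimator and the number of features to keep are assumptions rather than the tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SequentialFeatureSelector

base = RandomForestClassifier(n_estimators=100, random_state=42)

rfe = RFE(estimator=base, n_features_to_select=20)                  # backward elimination
sfs = SequentialFeatureSelector(base, n_features_to_select=20,
                                direction="forward", scoring="f1")  # forward selection on F1
sfm = SelectFromModel(base, threshold="median")                     # importance threshold
```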
### Models
Below are the models we experiment with:
- ExtraTreesClassifier
- RandomForestClassifier
- GradientBoostingClassifier
- HistGradientBoostingClassifier
- XGBClassifier
- CatBoostClassifier
- AdaBoostClassifier
- StackingClassifier
We selected these models due to the strong performance of both tree-based models and gradient boosting methods.
Tree based models are a type of decision tree algorithm that utilizes a tree structure to make predictions. The algorithm splits the data into smaller subsets and creates decision rules to predict the outcome of each subset. Tree based models are easy to interpret, flexible, and can handle non-linear relationships between variables.
Gradient boosting methods, on the other hand, are an ensemble learning technique that combines multiple weak learners to form a strong model. These methods work by iteratively adding new trees to the model, where each tree tries to correct the mistakes made by the previous trees. Gradient boosting methods are known for their strong performance, especially in cases where the data has a high degree of non-linearity.
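For completeness, the candidate models can be declared as below (default settings shown; hyperparameters are tuned later, and the stacking composition here is purely illustrative):

```python
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, HistGradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

candidates = {
    "extra_trees": ExtraTreesClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "hist_gradient_boosting": HistGradientBoostingClassifier(random_state=42),
    "xgboost": XGBClassifier(random_state=42),
    "catboost": CatBoostClassifier(random_state=42, verbose=0),
    "adaboost": AdaBoostClassifier(random_state=42),
}
candidates["stacking"] = StackingClassifier(
    estimators=[("extra_trees", candidates["extra_trees"]),
                ("xgboost", candidates["xgboost"])],
    final_estimator=RandomForestClassifier(random_state=42),
)
```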
### Pipeline Framework
We decided to aim at creating a main pipeline for a data science process. This pipeline will utilize a combination of several powerful libraries in Python for data manipulation and machine learning. The libraries to be used are:
- imbalanced-learn (imblearn): its pipeline allows us to transform both X and y with ease (especially for outlier removal)
- scikit-learn (sklearn): a widely used Python library for machine learning
- dask-ml: a library for scalable machine learning in Python, built on top of dask
- dask: a flexible library for parallel computing in Python
The API styles of the libraries have similarities, but there are also differences due to nuances in the libraries. For example, there are differences between the pipeline in Imbalanced-Learn and the pipeline in scikit-learn. Specifically, the Imbalanced-Learn pipeline allows for the transformation of both X and y, while the scikit-learn pipeline only allows for the transformation of X. Additionally, there are differences between a dask dataframe and a pandas dataframe.
In order to achieve a smooth integration between the various technologies utilized in this project, such as dask arrays, dask dataframes, dask-ml preprocessing, the Imbalanced-Learn pipeline, Imbalanced-Learn oversampling, and scikit-learn models, it was necessary to write custom imputation classes and wrappers from scratch. This was done with the goal of supporting parallel processing, which is crucial for efficient and effective data processing in a data science project.
The writing of these custom imputation classes and wrappers required a significant amount of effort, as each of the technologies utilized in this project has its own unique strengths and challenges. By developing custom wrappers, we were able to build bridges between the different technologies and ensure that they could all work together harmoniously. This, in turn, allowed us to create a main pipeline that leverages the strengths of each library and makes the data science process as streamlined and efficient as possible.
It is worth noting that despite the challenges in mixing these technologies, the end result of having a seamless and integrated main pipeline can bring many benefits. It enables data scientists to perform complex data processing tasks with ease and reduces the risk of errors occurring due to manual intervention. Additionally, having a main pipeline that utilizes parallel processing can dramatically speed up the data science process, making it possible to work with large datasets in a timely manner.
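A condensed sketch of how the pieces fit together with imbalanced-learn's `Pipeline`, which accepts both samplers (outlier removal, oversampling) and ordinary transformers/estimators; the step order and estimators are illustrative, and `remove_outliers` is the helper sketched in the outlier-removal section:

```python
from imblearn import FunctionSampler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(steps=[
    ("outliers", FunctionSampler(func=remove_outliers, validate=False)),
    ("scale", StandardScaler()),
    ("oversample", SMOTE(random_state=42)),
    ("select", SelectFromModel(RandomForestClassifier(random_state=42))),
    ("model", RandomForestClassifier(random_state=42)),
])
# pipeline.fit(X_train, y_train) applies the sampler steps only during fitting.
```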
### Modeling and Optimization
First, we use randomized search cross-validation (CV) to narrow down the selection of models. Randomized search CV samples hyperparameters from specified distributions to find a good combination quickly, allowing us to identify hyperparameters that are likely to perform well. Note that the entire search runs through our pipeline, which prevents any data leakage.
Next, we use Bayesian optimization to further tune the hyperparameters of the final estimator. Bayesian optimization is a probabilistic model-based optimization method that is more efficient than grid search or random search, especially for high-dimensional parameter spaces. It uses a Bayesian surrogate model to predict the performance of different hyperparameter combinations and makes a more informed decision about which hyperparameters to try next. This leads to more accurate hyperparameter tuning and better model performance.
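An illustrative version of the two-stage search, reusing the `pipeline` object from the sketch above and scikit-optimize's `BayesSearchCV`; the parameter names assume the final step is called `model` and the ranges are placeholders:

```python
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Stage 1: cheap randomized search over the full pipeline to shortlist models.
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions={"model__n_estimators": [100, 300, 500],
                         "model__max_depth": [None, 10, 20]},
    n_iter=10, scoring="f1", cv=5, random_state=42,
)

# Stage 2: Bayesian optimization of the shortlisted final estimator.
bayes_search = BayesSearchCV(
    pipeline,
    search_spaces={"model__n_estimators": Integer(100, 1000),
                   "model__max_features": Real(0.1, 1.0)},
    n_iter=30, scoring="f1", cv=5, random_state=42,
)
```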
### Evaluation
The F1 score is a commonly used evaluation metric in machine learning, especially in imbalanced datasets. It is a balanced combination of precision and recall, and is calculated as the harmonic mean of these two metrics.
Precision measures the proportion of true positive predictions out of all positive predictions, whereas recall measures the proportion of true positive predictions out of all actual positive instances. The F1 score is a better measure of performance in imbalanced datasets because it takes into account both the precision and recall of the model, whereas accuracy can be misleading in such scenarios.
For example, if there are very few instances of the positive class, a model that always predicts negative will have a high accuracy but low precision and recall. The F1 score is particularly useful in this case, as it will be low, indicating that the model is not performing well. In our case, since we are focused on the positive class, it makes sense to use the F1 score of the positive class (the minority class) as our evaluation metric.
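Concretely, F1 = 2 · precision · recall / (precision + recall), and the final evaluation reduces to a call like the following, assuming `pipeline` is the fitted pipeline from above and `X_test`/`y_test` are the held-out features and labels:

```python
from sklearn.metrics import classification_report, f1_score

y_pred = pipeline.predict(X_test)
print("F1 (positive class):", f1_score(y_test, y_pred, pos_label=1))
print(classification_report(y_test, y_pred))
```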
### Conclusion
Learning points:
- We utilized custom transformers and wrappers to make the pipeline smooth, allowing us to handle the unbalanced nature of the dataset, while also making the most of the processing power of dask and sklearn. This is an important point, as it demonstrates how to create data transformations and modeling processes that allow us to effectively handle large datasets and complex modeling problems.
- The second learning point is the use of dask for handling large dataframes. Dask allowed us to perform dataframe operations quickly, without the need for a high-performance computing setup. This is a useful skill for those looking to handle large datasets without having access to specialized hardware.
- By utilizing dask_ml, we were able to take advantage of the parallel processing power of dask to speed up the training and evaluation of our models. This is a particularly important point, as it demonstrates how to effectively scale up your modeling pipeline to handle larger datasets, while still maintaining the accuracy of your results.
- We took a careful and rigorous approach to feature selection and model tuning. By utilizing a combination of Recursive Feature Elimination, Sequential Feature Selector, and SelectFromModel, we were able to identify the most important features in the dataset.
- Utilizing randomized search CV and Bayesian optimization, we were able to determine the optimal hyperparameters for our models. This careful and thoughtful approach to model selection and tuning is critical for ensuring the robustness and generalizability of the results of our analysis.