
# Countering drifts
# Identification of data drift
## How can we track/detect/measure/identify data drifts?
## How to detect model drift
## Main source
### Example tools
### Hypothesis testing
### Monitoring data statistics with TensorFlow Data Validation
***
**Hypothesis testing provides a rigorous and automatable way of comparing feature distributions**.
**Regardless of how** the drift occurs, it’s critical to identify these shifts quickly to maintain model accuracy and reduce business impact.
**Simply retraining the model on all data might not work**.
**The most basic way to tell whether two data samples come from the same distribution is to calculate some simple statistics on both samples and compare them**. This can be done easily with TensorFlow Data Validation; see the corresponding section in [Michał Oleszak's article](https://towardsdatascience.com/dont-let-your-model-s-quality-drift-away-53d2f7899c09).
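A minimal sketch of this approach with TensorFlow Data Validation (the tiny pandas DataFrames and column names are made up for illustration):

```python
import pandas as pd
import tensorflow_data_validation as tfdv

# Toy training and serving samples (illustrative only).
train_df = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40e3, 52e3, 61e3, 58e3]})
serving_df = pd.DataFrame({"age": [26, 35, 70, 64], "income": [41e3, 55e3, 30e3, 28e3]})

# Compute summary statistics for both samples.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)

# Compare the two sets of statistics side by side (renders in a notebook).
tfdv.visualize_statistics(
    lhs_statistics=train_stats,
    rhs_statistics=serving_stats,
    lhs_name="TRAIN",
    rhs_name="SERVING",
)

# Infer a schema from the training statistics and check the serving
# statistics against it; anomalies flag features that deviate from it.
schema = tfdv.infer_schema(train_stats)
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```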
- **Evidently**: Evidently AI's open-source tool helps evaluate and monitor models in production. If you are not using Azure ML and are looking for a non-commercial tool that is simple to use, Evidently is a good place to start.
- **Fiddler AI Monitoring**: [fiddler.ai](https://www.fiddler.ai/) offers a suite of tools that help make AI explainable, operate ML models in production, and monitor them; data and model drift detection is one of its features.
- **Microsoft Azure ML** provides an automated way to identify data drift, integrated into the Azure ML workspace (this feature is currently in public preview). Azure ML uses statistical methods to identify drift. See [here](https://towardsdatascience.com/why-data-drift-detection-is-important-and-how-do-you-automate-it-in-5-simple-steps-96d611095d93) and [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets?tabs=python).
- **Model-based methods** use a custom model to identify the drift, e.g. an unsupervised model that assesses drift on unlabelled data by measuring the similarity between a given point, or group of points, and a reference baseline. While ML-based techniques can give a more accurate picture of the drift, explaining what the technique is doing, and forming an intuition for it, can be a challenge (a sketch of one such approach appears further below). **TODO for later:** have a look at the paper [An overview of unsupervised drift detection methods](https://wires.onlinelibrary.wiley.com/doi/full/10.1002/widm.1381).
- **Sequential analysis methods**, like **DDM** (drift detection method) and EDDM (early DDM), rely on the error rate to detect drift.
- **Time distribution-based methods** use **statistical distance** measures to quantify drift between probability distributions. Popular distance metrics and nonparametric tests for measuring the difference between two populations are the Population Stability Index, **Kullback-Leibler divergence**, Jensen-Shannon divergence, the [**Kolmogorov-Smirnov test**](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), and the Wasserstein metric (Earth Mover's Distance); several of these appear in the SciPy sketch below. For more details, see my other note on [Statistical distances, distribution similarity, divergences, discrepancy](https://hackmd.io/Mal0o8_RTrekJbqztaVqVA).
- And finally, if the drift has been significant, it might be necessary to **retune the model’s hyperparameters** to adapt to the new, after-drift world.
- Another approach is to **assign weights to the training examples** such that the model pays more attention to the after-drift data. However, it may be that these recent data are not representative of the problem being modeled.
- For instance, the after-drift period might miss some events that happened before it, unrelated to the drift. In this scenario, putting more weight on after-drift examples will prevent the model from learning useful patterns.
- To implement a custom detector, one can use libraries like river (**PageHinkley** in `river.drift`) or scikit-multiflow (**ADWIN** in `skmultiflow.drift_detection`). Here is a [basic example for detecting concept drift with the ADaptive WINdowing (ADWIN) algorithm](https://deepchecks.com/how-to-detect-concept-drift-with-machine-learning-monitoring/); a minimal sketch also follows this list.
- If the two data samples we wish to compare (e.g. the training data sample and the serving data sample for a given feature) are **normally distributed**, we could run a **$t$-test**. *The t-test can be used, for example, to determine if the means of two sets of data are significantly different from each other*.
- If the data samples are **not normally distributed**, we can still compare them using non-parametric tests. E.g., the **Kolmogorov-Smirnov test** directly checks whether the two samples were drawn from the same distribution by measuring the distance between their empirical distribution functions (it only works for continuous features). Both tests appear in the SciPy sketch below.
- If you have **labeled data**, model drift can be identified with performance monitoring and **supervised** learning methods. We recommend starting with **standard metrics** like **accuracy**, precision, False Positive Rate, and **Area Under the Curve (AUC)**. You may also choose to apply your own custom supervised methods to run a more sophisticated analysis. **TODO for later:** have a look at the paper [A survey on concept drift adaptation](https://dl.acm.org/doi/10.1145/2523813).
- If you have **unlabelled data**, the first analysis you should run is some sort of **assessment of your data’s distribution**. Your training dataset was a sample from a particular moment in time, so *it’s critical to compare the distribution of the training set with the new data to understand what shift has occurred*.
- In practice, the right solution is usually problem- and domain-dependent, and it might involve **ensembling models using old and new data**, or even adding entirely new data sources.
- Some models **shift abruptly**: for example, the COVID-19 pandemic caused abrupt changes in consumer behavior and buying patterns;
- other models might have **gradual drift**;
- or even **seasonal/cyclic drift**.
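As mentioned in the list above, here is a minimal sketch of drift detection with the ADWIN detector from scikit-multiflow, run on a simulated stream whose mean jumps abruptly (the `delta` value and the synthetic data are only illustrative):

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)  # sensitivity parameter (illustrative choice)

# Simulated stream: the mean of the signal jumps at index 1000,
# mimicking an abrupt drift.
rng = np.random.default_rng(42)
stream = np.concatenate([rng.normal(0.0, 1.0, 1000),
                         rng.normal(3.0, 1.0, 1000)])

for i, value in enumerate(stream):
    adwin.add_element(value)      # feed one observation at a time
    if adwin.detected_change():   # ADWIN compares sub-windows internally
        print(f"Drift detected at index {i}")
```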
Although easy, visual inspection and comparison of simple statistics between training and serving data involves subjective judgment and is hard to automate. A more rigorous approach is to rely on **statistical hypothesis testing**.
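Below is a minimal sketch of such checks for a single numeric feature using SciPy; the synthetic samples, the significance level, and the histogram binning are only illustrative:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
train_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference (training) feature
serving_sample = rng.normal(loc=0.3, scale=1.2, size=5_000)  # current (serving) feature

# t-test: are the means significantly different? (assumes roughly normal data)
t_stat, t_pval = stats.ttest_ind(train_sample, serving_sample, equal_var=False)

# Kolmogorov-Smirnov test: were both samples drawn from the same distribution?
ks_stat, ks_pval = stats.ks_2samp(train_sample, serving_sample)

# Wasserstein (Earth Mover's) distance between the two samples.
w_dist = stats.wasserstein_distance(train_sample, serving_sample)

# Jensen-Shannon distance between histograms of the two samples
# (the binning is an arbitrary choice made here for illustration).
bins = np.histogram_bin_edges(np.concatenate([train_sample, serving_sample]), bins=50)
p, _ = np.histogram(train_sample, bins=bins, density=True)
q, _ = np.histogram(serving_sample, bins=bins, density=True)
js_dist = jensenshannon(p, q)

print(f"t-test p={t_pval:.4f}, KS p={ks_pval:.4f}, "
      f"Wasserstein={w_dist:.3f}, Jensen-Shannon={js_dist:.3f}")

# A simple automated rule: flag the feature as drifted if the KS test
# rejects the null hypothesis at a chosen significance level.
ALPHA = 0.01
if ks_pval < ALPHA:
    print("Drift detected for this feature")
```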
Difference between labeled and unlabelled data:
How does one decide whether two data samples come from the same probability distribution?
To identify data drift, it is important to build a repeatable process: define thresholds on the acceptable amount of drift and configure proactive alerting so that appropriate action is taken. To spot **data drift**, we need to compare the distributions of the input features between training and serving data, feature by feature. Data drift can be identified using the techniques discussed above.
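One simple illustration of the model-based approach mentioned above is a "domain classifier" (not necessarily the specific method referenced there): train a model to distinguish reference (training) rows from current (serving) rows; if it does much better than chance, the two samples are distinguishable, which suggests drift. A minimal sketch, where the classifier, synthetic data, and threshold are only illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(2_000, 5))    # reference data
X_serving = rng.normal(0.2, 1.0, size=(2_000, 5))  # current data (slightly shifted)

# Label reference rows 0 and serving rows 1, then ask a classifier to tell them apart.
X = np.vstack([X_train, X_serving])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_serving))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(y, proba)

# AUC close to 0.5 means the samples are indistinguishable; a clearly
# higher AUC suggests the feature distributions have drifted.
print(f"Domain-classifier AUC: {auc:.3f}")
if auc > 0.6:  # threshold is an arbitrary illustration
    print("Drift suspected")
```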
In summary, drift detection is a key step in the ML lifecycle and should not be an afterthought. It should be part of your plan for deploying the model to production, it should be **automated**, and careful thought must be given to choosing the **drift methodology**, the **thresholds** to apply, and the **actions** to be taken when a drift is detected.
Model drift can occur on different cadences, as noted above: abrupt, gradual, or seasonal/cyclic.
To detect **concept drift**, one should look at the conditional distribution of the true and predicted target given the features. Since serving data typically arrives continuously, this monitoring and comparison of distributions needs to take place regularly and be automated as much as possible.
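Once labels eventually arrive for served predictions, one simple, automatable proxy for concept drift is to compare the model's error rate on a recent window against its error rate at training/validation time. A minimal sketch, where the reference accuracy, window size, threshold, and toy data are only illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score

REFERENCE_ACCURACY = 0.92  # accuracy measured on the validation set at training time
WINDOW = 500               # number of recent labelled predictions to evaluate
MAX_DROP = 0.05            # tolerated degradation before raising an alert

def check_for_concept_drift(y_true_recent, y_pred_recent):
    """Compare accuracy on the most recent window with the reference accuracy."""
    recent_acc = accuracy_score(y_true_recent[-WINDOW:], y_pred_recent[-WINDOW:])
    if recent_acc < REFERENCE_ACCURACY - MAX_DROP:
        print(f"Possible concept drift: accuracy {recent_acc:.3f} "
              f"vs reference {REFERENCE_ACCURACY:.3f}")
    return recent_acc

# Toy usage with random labels and ~80%-accurate predictions:
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 1_000)
y_pred = np.where(rng.random(1_000) < 0.8, y_true, 1 - y_true)
check_for_concept_drift(y_true, y_pred)
```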
What to do once we have detected a drift in our system? The easiest thing to do is to simply **retrain** the model’s parameters on the most recent data, including the data collected after the drift. However, if the drift has been detected early, there might not be a lot of after-drift data. If that’s the case, such a simple retraining won’t help much.
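As a concrete example of the reweighting idea from the list above, many scikit-learn estimators accept a `sample_weight` argument in `fit`, which can be used to emphasize post-drift examples when retraining. A minimal sketch, where the model, toy data, and weight factor are only illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Toy data: pre-drift and post-drift examples stacked together.
X_pre = rng.normal(0.0, 1.0, size=(1_000, 3))
y_pre = (X_pre[:, 0] > 0).astype(int)
X_post = rng.normal(0.5, 1.0, size=(200, 3))  # shifted inputs after the drift
y_post = (X_post[:, 0] + X_post[:, 1] > 0.5).astype(int)

X = np.vstack([X_pre, X_post])
y = np.concatenate([y_pre, y_post])

# Give post-drift examples more weight (the factor 5.0 is arbitrary).
sample_weight = np.concatenate([np.ones(len(y_pre)), 5.0 * np.ones(len(y_post))])

model = LogisticRegression()
model.fit(X, y, sample_weight=sample_weight)
```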
[In this note](https://hackmd.io/9yQ3bfFoRI64AEFGWlYOMg), we have defined the two major issues menacing ML models in production: **data drift** and **concept drift**. But how does one go about detecting them? One needs to **monitor model inputs**, **outputs**, and actuals **in both training and serving data**.
https://towardsdatascience.com/dont-let-your-model-s-quality-drift-away-53d2f7899c09