MLOps Series: Introduction to MLOps, Data Drift, Concept Drift, and How to Handle Them in ML Production
Contents
- Introducing MLOps
- Components of MLOps
- Introduction to Data Drift and Concept Drift
- Types of Drifts
- Methods to detect drifts
- Steps to take when there is an occurrence of drift
- Ways to handle Drift in Production
In this series on MLOps, I will be discussing various concepts surrounding MLOps. This article introduces MLOps and gives a broad overview of data drift and concept drift and how to detect and handle them in production.
Introducing MLOps
MLOps is an abbreviation for Machine Learning Operations. MLOps is a core discipline of machine learning engineering that focuses on streamlining the process of deploying machine learning models and subsequently maintaining and monitoring them. MLOps is a valuable practice for developing and improving the quality of machine learning and AI solutions. By integrating continuous integration and deployment (CI/CD) procedures with adequate monitoring, validation, and governance of ML models, data scientists and machine learning engineers can collaborate and accelerate model development and production using an MLOps strategy.
The need for MLOps is evident because productionizing machine learning solutions is challenging. The machine learning lifecycle comprises many complicated components, including data collection, data preparation, model training, model tuning, model deployment, model monitoring, explainability, and much more. It also necessitates cross-team communication and hand-offs, from Data Engineering to Data Science to ML Engineering. High operational rigor is required to keep all of these processes synchronized and running in unison. MLOps aids experimentation, iteration, and continual improvement across the machine learning lifecycle.
Some of the advantages of MLOps are:
a) Efficiency: MLOps enables data teams to develop models faster, deliver higher-quality ML models, and deploy them to production more quickly.
b) Scalability: MLOps also offers massive scalability and management, allowing thousands of models to be overseen, controlled, managed, and monitored for continuous integration, continuous delivery, and continuous deployment. In particular, MLOps makes ML pipelines reproducible, enabling closer collaboration across data teams, reducing friction with DevOps and IT, and accelerating release velocity.
c) Reduced risk: Machine learning models frequently require regulatory monitoring and drift-checking, and MLOps offers better transparency and faster response to such requests, and higher compliance with an organization’s or industry’s rules.
Components of MLOps
Exploratory data analysis (EDA) — Create repeatable, editable, and shareable datasets, tables, and visualizations to iteratively explore, share, and prepare data for the machine learning lifecycle.
Data Preparation and Feature Engineering — Transform, consolidate, and de-duplicate data iteratively to develop enhanced features. Most importantly, to make features accessible and shareable across data teams.
Model training and tuning — To train and enhance model performance, use popular open-source tools like scikit-learn, TensorFlow, and PyTorch. Use automated machine learning techniques like AutoML to execute trial runs and generate reviewable and deployable code as a more straightforward option.
Model review and governance — Track model lineage and model versions, and manage model artifacts and transitions throughout their lifecycle. With an open-source MLOps platform like MLflow, you can discover, share, and collaborate across ML models.
Model inference and serving — Control the frequency of model refresh, inference request timings, and other testing and QA-specifics. To automate the pre-production workflow, use CI/CD technologies.
Model deployment and monitoring — Automate permissions and cluster creation to make registered models production-ready, and enable REST API model endpoints.
Automated model retraining — Create alerts and automation to take corrective action when model drift occurs due to differences between training and inference data.
Let us simplify the ML lifecycle to understand why MLOps is important. The ML lifecycle can be simplified into four phases: project scoping, collecting data, modeling, and deploying the model. We often think that these phases follow a waterfall approach, executing one after another in sequence, but in reality the whole ML lifecycle is an iterative process.
The iterative nature of the ML lifecycle can be demonstrated clearly with the help of the above diagram. To explain this, let us consider an example of a system that takes an image of a product (handbag, watch, or shoes), predicts the brand of the product, and estimates its price.
1. Project Scope: The system must be able to take a product image as an input and should classify the brand and estimate the price of the product. We understand that this is a classification plus regression problem. After clearly defining the scope, we will proceed to the next step.
2. Data Collection: We will collect product images of handbags, watches, and shoes across various brands and also scrape the prices of those products. We will also focus on preparing a clean dataset to work with in this phase.
3. Modeling: We will proceed to write code for various Convolutional Neural Networks to classify the product’s brand and for image regression to estimate its price. Let us assume DenseNet201 performs best among all models. We will then validate the results and check how the model performs across various subsets of our data, and we find that the model does not perform well on Louis Vuitton bags. Instead of proceeding to deployment, we jump back to data collection, gather more Louis Vuitton bag images, and re-check the model’s performance. After this, if we are satisfied with the model’s performance, we proceed to deploy it.
4. Deployment: After deployment, the model works fine for some time, but during monitoring, the accuracy drops. This may happen due to data drift (we will discuss data drift and concept drift in detail later in this article). In that case, we jump back to the modeling or data collection phase because the model is no longer performing well.
This example clarifies that the ML lifecycle is an iterative process rather than a straightforward sequential execution of the four phases.
Introduction to Data Drift and Concept Drift
Let us consider a scenario where you have developed a model that predicts the price and brand of a product from its image, and it has excellent performance. For simplicity, the performance metric is accuracy for classification and RMSE for regression, and your model has 89% accuracy with an RMSE of $400. However, a few months after deployment, your model is not doing well: the accuracy has dropped to 70%, and the RMSE has increased significantly. Does this indicate that our model is wrong? Not necessarily; this situation is typically caused by data drift.
Data evolves in most big data analysis applications and must be evaluated and processed in near real time. Patterns and relationships in such data frequently change over time, so models created to analyze such data soon become obsolete. This phenomenon is known as drift in machine learning. When the input data changes in general, this is referred to as data drift, whereas changes in the relationship with the target variable are referred to as concept drift. Both of these drifts contribute to model deterioration but must be handled independently.
We anticipate that our model will function in production similarly to how it did on training data. However, if the distribution of production data differs from that of the training data, model decay may occur. Model decay is the deterioration of a model’s predictive power. When the distribution of data in production differs from that of the training data, this is referred to as data drift. The model would still perform well on data similar to the “old” data on which it was trained. Model decay occurs when:
1. The training data was inadequately sampled.
2. The underlying business situation has changed.
Types of Drifts
Class drift: Class drift happens when the posterior class probabilities P(Y | X) change. It is also known as actual concept drift or prior probability shift. For example, patients’ health issues may change over time, resulting in a change in the set of health attributes associated with the target variable. Class drift is further subdivided into two categories based on the extent of the drift. Drift scope, also known as drift severity, is the fraction of X’s domain for which P(Y | X) changes. Drift scope influences how easy it is to identify changes in the stream and how much of a model has to be updated. Subconcept drift, also known as intersecting drift, occurs when the drift scope is restricted to a subspace of X. For example, a data stream dealing with virus records may have a “virus” class, among others. If a new form of the virus emerges, the conditional probabilities of the virus class occurring will change, but only for those contexts that relate to the new form of the virus. Meanwhile, cases involving other forms of the virus may continue to behave as they did before. Full-concept drift, also known as severe drift, occurs when the posterior class distribution changes for all categories of objects.
Covariate drift: In the literature, covariate drift, also known as virtual concept drift, happens when the distribution of non-class properties, P(X), varies with time. Consider a company that predicts client behavior based on socioeconomic data. The demographics of the client base may change over time, resulting in a change in the likelihood of each demographic element.
Novel class appearance: A novel class appearance is a subset of concept drift in which a new class emerges. Consider a company that predicts which choice a person will pick on a web page. A new class is established when a new option is added to the web page.
Drift magnitude: Drift magnitude can be minor or major, and it determines how the subsequent steps are handled. If there is a minor drift, it is likely suitable to keep an accurate model of the old data distribution and simply modify it as data concerning the new concept is collected. In contrast, if there is a sudden and significant shift, it may be appropriate to discard the prior model and begin again with data describing the new concept as it becomes available.
Drift frequency: The frequency with which concept drifts occur during a specific period is called drift frequency. A high frequency suggests that new concept drifts begin within a short time of each other, whereas a low frequency indicates that drifts occur at extended intervals.
Drift duration: When a stream with concept X abruptly transforms into concept X+y, this is abrupt drift. A market crash is a real-world example of abrupt drift: almost instantaneously, stock prices in a stock market stream will change and follow a different pattern than they did previously. Blip drift is a special case of abrupt drift coupled with a very short concept duration. In blip drift, the blip concept replaces the dominant concept for a very short period, for example during a Black Friday sale. There is also extended drift. A recession is a real-world example of extended drift: unlike a market crash, stock values at the start of a recession change gradually over time. Eventually, movements in all stock prices follow a different pattern than before the crisis began.
Drift transition: If the stream drifts gradually from concept X to concept X+y over a set period, we call it gradual drift. If, after each step, the distance between concept X and concept X+y decreases, so that there is a steady progression towards the new concept, we call it incremental drift. A probabilistic drift happens when two alternating concepts coexist, with one initially predominating and the other gradually taking control. When a sensor network node is replaced, the stream experiences probabilistic drift. A node cannot be switched instantly; the replacement node must be checked to verify it is operationally sound. As a result, the two nodes will be switched on and off as the tests are run. In terms of the sensor network stream, samples from two separate concepts, one from the malfunctioning node and one from the new node, arrive from that position. The data from the new node will become increasingly likely to appear in the stream until only the concept from the new node remains.
Drift recurrence: If a particular pre-existing concept repeatedly influences the current concept, it is called drift recurrence. A phone app is a real-world example of recurring drift. When a person uses a certain app at home, they may use it differently from when they use it at work. The recurring concepts here are at-home use and at-work use, and these app-use concepts would recur whenever the user arrived at home or at work. Cyclical drift is a form of recurring drift that occurs when two or more concepts recur in a specific order. Cyclical drifts can have fixed or varying durations. When a cycle takes a constant length of time to complete, this is referred to as fixed frequency cyclical concept drift; for example, the cycle of the seasons completes in 365.24 days. Fixed concept duration cyclical drift happens when each period of stability lasts a certain length of time. Cyclical drift with fixed drift duration happens when each episode of drift lasts a defined length of time. Fixed concept onset cyclical drift happens when the start of a stable period occurs at the same moment in each cycle.
Methods to Detect Drift
A) Statistical Approaches
These approaches use a variety of statistical indicators on your datasets to determine whether the distribution of the training data differs from the distribution of the production data.
1. Page-Hinkley method
The mean of each feature is one of the first and easiest metrics we can examine. If the mean progressively drifts in a specific direction over months, data drift is most likely at work. This drift detection approach computes the mean of the observed values and updates it as new data arrives. A drift is flagged when the cumulative deviation of the observed values from their running mean exceeds a threshold value λ.
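Below is a minimal, hand-rolled sketch of this idea in Python (not a library implementation); the tolerance `delta`, the threshold `lam`, and the synthetic stream are illustrative assumptions.

```python
import numpy as np

def page_hinkley(values, delta=0.005, lam=50.0):
    """Sketch of the Page-Hinkley test for an upward shift in the mean.
    Returns the index where drift is flagged, or None if no drift is found."""
    mean = 0.0       # running mean of the stream
    cum = 0.0        # cumulative deviation of observations from the running mean
    cum_min = 0.0    # smallest cumulative deviation seen so far
    for t, x in enumerate(values, start=1):
        mean += (x - mean) / t        # incremental mean update
        cum += x - mean - delta       # accumulate deviation (delta = tolerated change)
        cum_min = min(cum_min, cum)
        if cum - cum_min > lam:       # PH statistic exceeds the threshold lambda
            return t
    return None

# Toy stream whose mean jumps from 0 to 3 halfway through.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])
print(page_hinkley(stream))  # flags drift shortly after index 500
```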
2. Kolmogorov-Smirnov Test
We all know about the Student’s t-test, which tells us how likely it is that two samples come from the same distribution. The Student’s t-test gives us a p-value, based on which we either reject the null hypothesis or fail to reject it. A disadvantage of the Student’s t-test is that it assumes the samples are normally distributed, which is not always the case with real-world data. The Kolmogorov-Smirnov test (KS test) is a bit more involved and can discover patterns that a Student’s t-test cannot. The KS test is a nonparametric test of the equality of one-dimensional probability distributions. It can be used to compare a sample against a reference probability distribution (the one-sample KS test) or to compare two samples against each other (the two-sample KS test).
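As a sketch, the two-sample KS test can be run with SciPy’s `ks_2samp`; the synthetic feature arrays and the 0.05 significance level below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature as seen in training
prod_feature = rng.normal(loc=0.5, scale=1.2, size=5_000)   # same feature in production, shifted

statistic, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.05:  # illustrative significance level
    print(f"Drift suspected: KS statistic={statistic:.3f}, p-value={p_value:.2e}")
else:
    print("No significant drift detected")
```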
3. Population Stability Index (PSI)
PSI is a single-number measure of how much a population has moved over time or between two separate samples of a population. It accomplishes this by bucketing the two distributions and comparing the percentages of items in each bucket, yielding a single value that can be used to determine how different the populations are. The following are frequent interpretations of the PSI result:
a) PSI less than 0.1 indicates that there has been no significant population shift.
b) PSI between 0.1 and 0.2 indicates a modest population shift.
c) PSI greater than 0.2 indicates a significant population shift.
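PSI is simple to compute by hand; here is one possible sketch using quantile buckets of the baseline sample (the bin count and the clipping floor are arbitrary choices, and bucketing schemes vary in practice).

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample (expected)
    and a newer sample (actual), using quantile buckets of the baseline."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip the new sample into the baseline range so extreme values land in the outer buckets.
    actual_clipped = np.clip(actual, edges[0], edges[-1])
    actual_pct = np.histogram(actual_clipped, bins=edges)[0] / len(actual)
    # Floor the percentages to avoid log(0) and division by zero.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
current = rng.normal(0.4, 1.1, 10_000)
print(round(psi(baseline, current), 3))  # values above 0.2 suggest a significant shift
```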
4. Kullback-Leibler (KL) divergence
The Kullback-Leibler divergence (abbreviated KL divergence) represents how far an approximating distribution Q is from the true distribution P; note that it is not symmetric, so it is not a true distance metric. Consider two probability distributions P and Q on some space X. The Kullback-Leibler divergence is defined as D_KL(P || Q) = Σ_{x ∈ X} P(x) log( P(x) / Q(x) ).
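As a minimal sketch, the discrete form of this definition can be computed directly with NumPy; the two three-bucket distributions below are made-up examples.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                     # terms with P(x) = 0 contribute nothing
    q = np.clip(q, 1e-12, None)      # guard against division by zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.10, 0.40, 0.50])  # "true" distribution P
q = np.array([0.80, 0.15, 0.05])  # approximating distribution Q
print(kl_divergence(p, q))        # large value: Q approximates P poorly
print(kl_divergence(p, p))        # 0.0: identical distributions do not diverge
```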
5. Jensen-Shannon divergence
The Jensen-Shannon divergence, abbreviated as JS divergence, is another approach to quantifying the difference (or similarity) between two probability distributions. It uses the KL divergence to get a symmetrical normalized score. This indicates that the divergence of P from Q is the same as the divergence of Q from P.
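A sketch using SciPy: note that `scipy.spatial.distance.jensenshannon` returns the JS distance (the square root of the divergence), so it is squared below; the example distributions are the same made-up vectors as above.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.10, 0.40, 0.50])
q = np.array([0.80, 0.15, 0.05])

js_distance = jensenshannon(p, q, base=2)   # JS distance with log base 2
print(js_distance ** 2)                     # JS divergence, bounded between 0 and 1 for base 2
print(np.isclose(jensenshannon(p, q), jensenshannon(q, p)))  # True: symmetric in P and Q
```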
6. Wasserstein Distance
The Wasserstein distance measures the difference between two probability distributions. It is also known as the Earth Mover’s distance (EM distance), since it can be viewed as the least amount of work required to move and reshape a mound of dirt in the shape of one probability distribution into the shape of the other. Unlike the Kullback-Leibler divergence, the Wasserstein metric is a true probability metric that considers both the probability of and the distance between different outcome events. Unlike other measures such as the KL divergence, the Wasserstein distance provides a meaningful and smooth representation of the distance between distributions. Because of these characteristics, the Wasserstein distance is well-suited to domains where the underlying closeness of outcomes matters more than exactly matching likelihoods.
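For one-dimensional samples, SciPy provides `wasserstein_distance`; the two synthetic samples below are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)
train_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)  # baseline feature values
prod_sample = rng.normal(loc=0.7, scale=1.0, size=5_000)   # production feature values, shifted

# For two 1-D samples this is the Earth Mover's distance between their
# empirical distributions; here it is roughly the 0.7 shift in the mean.
print(wasserstein_distance(train_sample, prod_sample))
```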
B) Model-Based Approach
A machine learning approach may also be used to identify data drift between two populations. We label the data used to build the current production model as 0 and the incoming real-time data as 1. We then train a classifier to separate the two and assess the results. If the classifier achieves high accuracy, it can readily distinguish between the two data sets, so we may conclude that a data drift has occurred and that the model must be revised. On the other hand, if the classifier’s accuracy is about 0.5, it is no better than a random guess. This indicates that no substantial data shift has happened, and we may continue using the model.
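Here is one possible sketch of this idea with scikit-learn, using synthetic data and a random forest as the illustrative classifier; any classifier and any drift criterion on the resulting AUC could be substituted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
train_data = rng.normal(0.0, 1.0, size=(2_000, 5))  # data the production model was built on -> label 0
prod_data = rng.normal(0.5, 1.0, size=(2_000, 5))   # incoming real-time data -> label 1

X = np.vstack([train_data, prod_data])
y = np.concatenate([np.zeros(len(train_data)), np.ones(len(prod_data))])

# If the classifier separates the two samples well, their distributions differ.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"Domain-classifier AUC: {auc:.2f}")  # ~0.5 means no drift; close to 1.0 means drift
```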
C) Adaptive Sliding Window
The Adaptive Windowing (ADWIN) technique uses a sliding window approach to identify concept drift. The window size is not fixed; it grows while the incoming data appears stationary and shrinks when a change is detected. ADWIN repeatedly splits the current window into two sub-windows, and when the absolute difference between the means of the two sub-windows exceeds a threshold derived from a user-defined confidence parameter, a drift alarm is raised and the older sub-window is discarded. This strategy only works with univariate data.
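The full ADWIN algorithm examines many possible cut points and derives its threshold from a confidence parameter, but the core sub-window comparison can be sketched as follows (the fixed split and the threshold value are simplifying assumptions, not the actual ADWIN procedure).

```python
import numpy as np

def subwindow_drift(window, threshold=0.5):
    """Simplified sub-window check: split the current window in half and
    flag drift if the two halves have clearly different means."""
    half = len(window) // 2
    older, newer = window[:half], window[half:]
    return abs(np.mean(older) - np.mean(newer)) > threshold

rng = np.random.default_rng(1)
stable = rng.normal(0, 1, 200)                                            # stationary data
shifted = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])  # mean jumps mid-window
print(subwindow_drift(stable))   # False
print(subwindow_drift(shifted))  # True
```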
Steps to take when there is an occurrence of drift
1. Check Data Quality
A drift can be caused by bad labels, data-entry errors, or schema changes, so data quality and integrity should be monitored. Missing data and other data-integrity aspects should be tracked as well. Checking the data quality helps us make sure that the drift is real.
2. Investigate:
While investigating, we simply try to answer the question: where does the shift come from? The goal is to understand what is happening and interpret the drift by exploring the data changes that might be causing it.
3. Retrain the model:
One approach is to retrain the model with new data. This helps the model learn the patterns in the newer data distribution and, in turn, increases the performance of the model.
4. Rebuild the model
Instead of retraining the previous model, one could also rebuild the model with new training data.
5. Pause the model and Fallback
One of the simplest approaches is to stop the model. When drift occurs, we can stop using the model and wait until we have addressed the drift; meanwhile, we can also switch to a different model. If the model is performing poorly on only one segment of the data, we can pause predictions for that segment while every other subset of the data continues to work fine.
Ways to handle Drift in Production
An ideal concept drift handling system should be capable of the following:
a) Rapidly adapt to concept drift,
b) Be resistant to noise while distinguishing it from concept drift, and
c) Notice and handle severe drift in model performance.
1. Incremental learning
Models in machine learning are frequently trained in a batch setting, where the learner is optimized on a batch of data at once. Consequently, a static model is produced that assumes a fixed relationship between the independent and target variables. As a result, this type of model may go a long time without being retrained and thus without learning patterns from new data. In practice, many applications rely on real-time streaming data feeds, where the model processes one sample at a time and can thus be updated on the fly. Incremental learning models are continuously updated as new data is received, so the model constantly adapts to changes in the data distribution.
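As a sketch, scikit-learn models that support `partial_fit`, such as `SGDClassifier`, can be updated one mini-batch at a time; the drifting synthetic stream below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit call

for step in range(100):
    # Simulate a slowly drifting stream: the class-0 cluster moves over time.
    shift = step * 0.02
    X_batch = np.vstack([rng.normal(0.0 + shift, 1.0, (32, 3)),
                         rng.normal(3.0, 1.0, (32, 3))])
    y_batch = np.array([0] * 32 + [1] * 32)
    model.partial_fit(X_batch, y_batch, classes=classes)  # incremental update on this batch

print(model.score(X_batch, y_batch))  # accuracy on the most recent batch
```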
2. Periodic Model re-training
Retrain the model regularly. Retraining can be triggered in various ways, such as when the model’s performance falls below a certain threshold or when the average confidence score between two windows of data shows significant drift.
3. Working with a subset sample of data
We first select a subset of data and come up with a sample representative of the population. When we run the model on this data, we monitor where the model performs poorly. Once we have figured out where the model falls short, we take all the data points from those regions and retrain on them.
4. Ensemble learning
Ensemble learning with model weighting is a technique in which multiple models are combined, and the result is a weighted average of each model’s output. For each new batch of data, a new classifier can be trained and integrated using a dynamically weighted majority voting technique.
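A minimal sketch of weighted majority voting follows: the per-model predictions and the weights (standing in for each model's recent accuracy) are made-up numbers, and how the weights are updated over time is left out.

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine per-model class predictions with a weighted majority vote.
    predictions: (n_models, n_samples) integer class labels.
    weights: one voting weight per model, e.g. its accuracy on recent data."""
    n_classes = int(predictions.max()) + 1
    votes = np.zeros((n_classes, predictions.shape[1]))
    for preds, w in zip(predictions, weights):
        for cls in range(n_classes):
            votes[cls] += w * (preds == cls)  # each model adds its weight to its chosen class
    return votes.argmax(axis=0)

# Three models voting on five samples; newer / more accurate models count more.
preds = np.array([[0, 1, 1, 0, 1],
                  [0, 0, 1, 0, 1],
                  [1, 1, 0, 0, 0]])
weights = np.array([0.9, 0.8, 0.4])
print(weighted_vote(preds, weights))  # [0 1 1 0 1]
```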
Thank you!
In conclusion, we saw the importance of MLOps, the components of MLOps, and how they interact with each other. We discussed various types of data and concept drift. Although it is challenging to detect drifts, it is essential to set up monitoring systems that catch them as fast as possible. I hope this article gave you a good overview of these topics. As Andrew Ng says, the work up to the first deployment is only about 50% of the total work, which is why MLOps is an integral part of productionizing machine learning systems.