Data Distribution Shifts
Types of Data Distribution Shifts
Read more in these books:
- Designing Machine Learning Systems (Chip Huyen, 2022).
- Dataset Shift in Machine Learning (Joaquin Quiñonero-Candela, Masashi Sugiyama, 2008, MIT Press).
Below are the prompt I used and a reasonably good explanation from ChatGPT. I deliberately asked it to keep the explanation in plain language.
You are a professional Data Scientist and ML Engineer.
Your task is to explain, in language accessible to other people, how covariate shift, label shift, and concept drift differ.
Let's imagine we have a scenario where we are trying to predict if an email is spam or not using a machine learning model.
- Covariate Shift: Imagine that the emails you're receiving start to change over time. Maybe initially, most spam emails included certain keywords like "free" or "winner," but now spammers have gotten more sophisticated, and their emails look more like regular correspondence. The characteristics (or features) of these emails have shifted. This is what we refer to as covariate shift. In more technical terms, covariate shift occurs when the distribution of the input data (features) changes, but the conditional distribution of the labels given the input data remains the same. For your spam filter, this means the spam emails have changed in nature, but the definition of what constitutes spam has not.
- Label Shift: On the other hand, label shift is when the distribution of the labels (the thing you're trying to predict) changes. What if suddenly more people started using email for marketing, so that a higher proportion of your incoming emails were spam? The overall proportion of spam to non-spam in your inbox has changed, which is label shift. Technically, label shift means that the distribution of the output labels changes, even though the conditional distribution of the input data given the labels remains constant. So if an email is labeled as spam, the kind of content it contains does not change, but there are simply more or fewer spam emails than before.
- Concept Drift: Concept drift applies when the relationship between the input data and the output labels changes. Here it is not just that spam emails look different; the very definition of what counts as spam has changed. For example, new laws might redefine what kinds of emails are considered spam. Now your model not only has to adapt to new kinds of spam emails, but it also has to understand a new definition of spam. Technically, concept drift occurs when the conditional distribution of the labels given the inputs changes: the same email content can now correspond to a different label. The underlying concept of what your model is trying to predict is no longer stable; it's drifting. Your model might start performing poorly because it was trained on an old concept of spam, which no longer applies. The simulation sketch after this list makes the three cases concrete.
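To make the three definitions concrete, below is a minimal NumPy sketch of the same spam scenario (my own illustration, not code from the books above). The single "promo keyword count" feature, the Poisson/logistic distributions, and all numbers are made up; the point is only that each block changes exactly the one distribution its shift type is about: P(x) for covariate shift, P(y) for label shift, and P(y|x) for concept drift.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_spam_given_x(x):
    # Fixed labeling rule P(y=1 | x): more promo keywords -> more likely spam.
    return 1.0 / (1.0 + np.exp(-(x - 2.0)))

# Reference ("training") data: x = keyword count ~ Poisson(2), y ~ Bernoulli(P(y=1|x)).
x_train = rng.poisson(2.0, 50_000)
y_train = rng.binomial(1, p_spam_given_x(x_train))

# Covariate shift: only P(x) changes (emails now contain more keywords on average);
# labels are still produced by the *same* rule P(y|x).
x_cov = rng.poisson(4.0, 50_000)
y_cov = rng.binomial(1, p_spam_given_x(x_cov))

# Label shift: only P(y) changes. Resample so spam makes up 70% of the inbox,
# while P(x|y) (what spam and ham emails look like) is untouched.
spam_idx = np.flatnonzero(y_train == 1)
ham_idx = np.flatnonzero(y_train == 0)
take_spam = rng.choice(spam_idx, 35_000, replace=True)
take_ham = rng.choice(ham_idx, 15_000, replace=True)
x_lab = np.concatenate([x_train[take_spam], x_train[take_ham]])
y_lab = np.concatenate([y_train[take_spam], y_train[take_ham]])

# Concept drift: P(y|x) itself changes. The same content is judged differently,
# e.g. a stricter rule now treats even one keyword as suspicious.
def new_p_spam_given_x(x):
    return 1.0 / (1.0 + np.exp(-2.0 * (x - 1.0)))

x_drift = x_train.copy()
y_drift = rng.binomial(1, new_p_spam_given_x(x_drift))

print(f"mean keyword count  : {x_train.mean():.2f} -> {x_cov.mean():.2f}  (covariate shift: P(x) moved)")
print(f"overall spam rate   : {y_train.mean():.2f} -> {y_lab.mean():.2f}  (label shift: P(y) moved)")
print(f"P(spam | 2 keywords): {y_train[x_train == 2].mean():.2f} -> "
      f"{y_drift[x_drift == 2].mean():.2f}  (concept drift: P(y|x) moved)")
```

Running it shows the average keyword count moving while the labeling rule stays fixed (covariate shift), the spam rate jumping while the look of spam and ham stays fixed (label shift), and the probability of spam for the very same email content changing (concept drift).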
To make these concepts more accessible:
- Covariate shift is like suddenly having to play soccer with a ball that behaves differently, but the rules of the game haven't changed.
- Label shift is when the rules of the game remain the same, but the composition of the teams has changed (e.g., more aggressive players have joined).
- Concept drift is akin to playing soccer and finding out that not only has the ball changed, but some of the rules of the game have been modified as well.
In real-world applications, it's important to monitor your machine learning models for these shifts and drifts, and retrain them with new data reflecting the current situation to maintain their accuracy and usefulness.
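One simple way to put that monitoring into practice is to compare the distribution of each input feature in a recent production window against the data the model was trained on. The sketch below is an illustration with synthetic data, not a recipe from the sources: it uses SciPy's two-sample Kolmogorov-Smirnov test on a single numeric feature, and the window sizes and the 0.01 threshold are arbitrary choices.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Stand-ins for a real pipeline: one numeric feature (say, promo keywords per
# email) as seen at training time vs. in the latest production window.
train_feature = rng.poisson(2.0, 10_000)
prod_feature = rng.poisson(3.5, 2_000)   # spammers changed their style

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Possible covariate shift (KS={stat:.3f}, p={p_value:.2g}): "
          f"review recent data and consider retraining.")
else:
    print("No significant change detected in this feature.")
```

A similar check on the rate of predicted spam over time is a cheap proxy for label shift, while concept drift usually only becomes visible once fresh labels arrive and you can track the model's actual error.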