{%hackmd theme-dark %}

# Joni Dambre

## Fundamental concepts

### **Over- and underfitting**: What are they? How do you detect them? What causes them? What is the relation with the model, with the training approach, with the data, with specific properties of the data? Can you have both?

Overfitting occurs when a machine learning model is too complex and fits the training data too well, but fails to generalize to new, unseen data. This can happen when the model has too many parameters relative to the size of the training dataset, or when the model is trained for too long. An overfitted model has low bias and high variance: it makes very accurate predictions on the training data, but performs poorly on new data. Overfitting can be detected by evaluating the model's performance on the training data and comparing it to the performance on the validation (or test) data; see the sketch after this answer. If the model performs well on the training data but poorly on the test data, it is likely overfitted. Typical signs are a low training error combined with a high validation error (i.e. a low validation score).

Underfitting, the opposite of overfitting, occurs when a model "has not learned enough" from the training data, resulting in high bias and low variance. It can be caused by a too-simple model, too strong regularization, or too few informative features. Underfitting can also be detected by comparing the performance on the training data with that on the test data: if the model performs poorly on both sets, it is likely underfitted. Typical signs are a high training error together with a high validation error (low validation score).

Yes, it is possible to have both, namely when you have both high bias and high variance. If the training data is "bad", a model can never be good, however complex you make it, but it can still overfit on that training data if you make it complex. Examples of "bad" data are: (1) features that are not informative (e.g. predicting sales based on pure noise) and (2) data that is not representative: data bias (e.g. object detection on birds in South America while the model is trained on birds from another continent).

The relation with the model:
- Overfitting: typically caused by a too complex model.
- Underfitting: typically caused by a too simple model.

The relation with the training approach:
- Overfitting: for example, if the model is trained for too long, it may begin to memorize the training data rather than learning the underlying pattern.
- Underfitting: if the model is not trained for a sufficient amount of time, or if the learning rate is too high, it may not be able to learn the underlying pattern in the data.

The relation with the data:
- Overfitting: the data set used for training strongly determines the final model; the model fits the training set very accurately but not real-world examples.
- Underfitting: the data set used for training is not representative enough of the real-world situation, or does not contain enough information.

The relation with specific properties of the data:
- Overfitting: noise and outliers can make it more difficult for the model to generalize to new data.
- Underfitting: noise and outliers can make it more difficult for the model to learn the underlying pattern in the data.
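A minimal sketch of this detection procedure (illustrative only: the synthetic sine data, the polynomial-degree "complexity knob" and the use of scikit-learn are my own choices, not from the notes):

```python
# Detecting over-/underfitting by comparing training and validation error
# while model complexity grows (here: polynomial degree).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy ground truth

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):                                  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    err_train = mean_squared_error(y_train, model.predict(X_train))
    err_val = mean_squared_error(y_val, model.predict(X_val))
    # Underfitting: both errors high. Overfitting: low train error, much higher validation error.
    print(f"degree={degree:2d}  train MSE={err_train:.3f}  val MSE={err_val:.3f}")
```

If the degree-15 model shows a much lower training error than validation error, that is the overfitting signature described above; if both errors stay high (degree 1), the model underfits.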
### **Bias** and **variance** of a model: What are they? How do you detect them? What causes them? What is the relation with the model, with the training approach, with the data, with specific properties of the data?

Bias is the expected deviation, across models trained on different subsets of the data, between the average model prediction on unseen data and the ground truth. It expresses how well the estimator is able to approximate the ground truth. Too high a bias can be caused by a too simple model (underfitting), data that does not contain the necessary information, or training data that is not representative.

Variance is the variability, across models trained on different subsets of the data, of the model's predictions on unseen data. It expresses the training procedure's sensitivity to the variability of the training data. Too high a variance can be caused by a too complex model (overfitting), too little data relative to the model's complexity, or too much unimportant variability, e.g. label noise and many outliers.

We can only *estimate* the bias and variance of a model:
- because the labels in the validation and test sets may contain noise (so they are only an approximation of the ground truth), and
- because of the finite size of the validation and test sets!

As we increase complexity (with the same data samples),
- bias decreases (a better fit to the ground truth) and
- variance increases (the fit varies more with the data).

This can be seen in the bias-variance decomposition.
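For reference, a standard way to write this decomposition for the squared-error case (the notation is mine, not from the notes: $f$ is the ground truth, $\hat{f}_D$ the model trained on dataset $D$, and $\sigma^2$ the irreducible label-noise variance):

$$
\mathbb{E}_{D,\,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Increasing model complexity typically shrinks the bias² term while inflating the variance term, which is exactly the trade-off described above.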
### **Bias in the data**: What is it? How does it affect your model? Relation between over/underfitting and bias/variance

Bias in the data occurs (1) when the training data is not representative of the real-world situation, e.g. face recognition trained in China vs. the US, or (2) when the distribution across training labels is biased toward certain human interpretations/opinions (in supervised learning), e.g. historical bias in recruiting. Data bias can be considered a violation of the 'identically distributed' part of the i.i.d. assumption.

In general, overfitting occurs when a model is too complex and fits the training data too well, but fails to generalize to new data. This corresponds to high variance in the model, which occurs when the model is sensitive to the specific patterns in the training data. Underfitting occurs when a model is too simple and is unable to capture the underlying pattern in the data. This corresponds to high bias in the model, which occurs when the model is not flexible enough to capture the pattern in the data.

Bias in the data can contribute to both overfitting and underfitting. If the data is biased, the model may learn to reproduce that bias; it then looks good on the (equally biased) validation data but generalizes poorly, similar to overfitting. On the other hand, if the data is too biased, it may be difficult for the model to learn the underlying pattern, leading to underfitting.

Usually: if you keep the amount of data and the model type fixed but tune the model complexity, bias and variance trade off against each other.

### The **i.i.d.** assumption(s): What are they? Why are they important? How do you ensure them? What are possible deviations? How do these affect your model and its performance? Make sure you truly understand the differences between "independent" and "identically distributed"!

The independently and identically distributed (i.i.d.) assumption states that the observations are sampled independently from the 'world' and that the measured features follow the same underlying distribution as those in the world. Independence means there are no subgroups of samples that are correlated; if there are (e.g. because of hidden latent variables), we need to take this into account.

The i.i.d. assumption is important because it allows us to make statistical inferences about the population based on the sample. If the assumption holds, we can use the sample to draw conclusions about the population. If it does not hold, our conclusions may not be valid.

To ensure the i.i.d. assumption, it is important to randomly sample the data from the population. This helps to ensure that the sample is representative of the population and that the data points are independent of each other.

Several deviations from the i.i.d. assumption can affect the performance of a model. For example, if the data is not independent (e.g. if there is some correlation between the data points), this can affect the model's ability to generalize to new data. If the data is not identically distributed (e.g. if the training and test sets have different distributions), this can also affect the model's performance. In general, deviations from the i.i.d. assumption can lead to a model that is over- or underfit to the data.

### **Data leakage**: What is it? How can it occur? How does it affect your model and its performance? How do you prevent it?

Data leakage means that information from the validation or test set leaks into the training process. It can occur when the validation/test data correlate with the training data (a violation of the independence assumption) or when model or preprocessing parameters are tuned on the validation and/or test set. It leads to overly optimistic results on the validation and test data but disappointing results in 'in the wild' evaluation. It can be prevented by splitting the data independently into train/validate/test sets and tuning only on the training set, preferably without overfitting.

### Train-validate-test: Why 3 sets? Specific considerations when creating them?

The training set is used to fit the model, so it must be large enough for the model to learn effectively. The validation set is used to tune the hyperparameters of the model, so it must be large enough to accurately evaluate the performance of the model as the hyperparameters are varied. The test set is used to evaluate the final performance of the model, so it must be large enough to accurately estimate the performance of the model on new data.

When creating the three sets, it is important to consider the size of each set and the proportions of the different classes in each set. It is also important to ensure that the sets are representative of the overall dataset, and to avoid data leakage between the sets (see the sketch below).
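A minimal sketch of such a split (illustrative only: the random data, the 60/20/20 proportions and the use of `StandardScaler` are my own assumptions, not from the notes). It stratifies on the labels and fits the preprocessing on the training set only, so no information from the validation or test set leaks into training:

```python
# Stratified train/validate/test split with leakage-free preprocessing.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)   # binary labels

# First split off the test set, then split the remainder into train/validate.
# Stratification keeps the class proportions comparable across the three sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)  # 60/20/20 overall

# Anything computed from the data (here: scaling statistics) is fitted on the
# training set only and merely *applied* to the validation and test sets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = map(scaler.transform, (X_train, X_val, X_test))
```

Splitting off the test set first, and only touching it once for the final evaluation, keeps the test estimate honest; all tuning decisions are made on the validation set.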
### "The gaps" (train-validate & validate-test): What are they? What can cause them? What are mitigations?

"The gaps" refer to the distance between the two learning curves for train-validate or validate-test performance.

The train-validate gap can be caused by:
1. Overfitting: the model is too complex for the given data set, is not properly regularised, or uses too many features; tuning ranges should be investigated with validation curves.
2. Violation of the identically distributed assumption: a stratified and/or grouped split may be necessary; look at the individual fold validation scores, as some groups may be harder than others.
3. Other causes are possible; use domain knowledge and your brains to solve them.

Ideally, the train-validate gap should shrink with more data.

The validate-test gap can be caused by:
1. Leakage between the training and validation sets: the validation split may not be independent, but leakage in the preprocessing step is also possible; everything that is computed from the data MUST be recomputed for each fold.
2. Overfitting on the validation set when tuning many parameters on small datasets with a lot of variability: use more folds or tune fewer parameters.
3. Violation of the identically distributed assumption, and other causes: use domain knowledge and your brains to solve them.

Some mitigations:
- Use a larger training set: a larger training set can help the model to learn more effectively, reducing the gap between the training set and the validation set.
- Use regularization: regularization techniques, such as L1 and L2 regularization, can help to prevent overfitting and reduce the gap between the training set and the validation set.
- Use cross-validation: cross-validation can help to mitigate the gap between the training set and the validation set by using multiple validation sets and averaging the performance across all of them.
- Use a validation set that is representative of the test set: by ensuring that the validation set is representative of the test set, you can reduce the gap between the validation set and the test set.
- Use data augmentation: data augmentation can help to increase the size of the training set and improve the model's ability to generalize to new data, reducing the gap between the training set and the validation set.

If gaps remain, it is often safer to choose a simpler model: higher bias and lower variance can avoid unpleasant surprises in the real world.

# Tom Dhaene