Cross-validation is a term that everyone who works with machine learning techniques will come across at some point. This blog post gives a quick overview of cross-validation.
Assume for the moment that your goal is to model some data in order to classify or predict new data points.
For example, you regularly receive emails, some of which are spam. Using your historical emails, each of which has been labeled as spam or not, you could fit a model that determines from the text of a new email whether it is spam.
To do so, you can employ a variety of modeling techniques, such as a support vector machine, a random forest, or a logistic regression model (binary or multinomial).
You will want to evaluate how a classification or prediction model actually performs on fresh data points; in other words, how well it generalizes to new data.
For that, you typically split your available data at random into training data and validation data. You fit your model on the training data.
You then use the fitted model to create predictions or classifications for the validation data.
By comparing the actual outcomes in the validation data (for example, the labels indicating whether or not an email is spam) with the model's predictions or classifications, you can determine how well the model predicts or classifies new data points.
Put differently, the validation data lets you estimate how well a model generalizes to out-of-sample predictions or classifications.
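As a minimal sketch of this split-and-compare procedure, the following uses hypothetical random labels and a trivial majority-class predictor as a stand-in for a real model; only the mechanics of splitting and comparing are the point here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 1 = spam, 0 = not spam.
y = rng.integers(0, 2, size=100)

# Random 80/20 split into training and validation indices.
idx = rng.permutation(len(y))
train_idx, val_idx = idx[:80], idx[80:]
y_train, y_val = y[train_idx], y[val_idx]

# Stand-in "model": always predict the training data's majority class.
majority = int(y_train.mean() >= 0.5)
predictions = np.full(len(y_val), majority)

# Compare the model's predictions with the actual validation labels.
accuracy = (predictions == y_val).mean()
print(f"validation accuracy: {accuracy:.2f}")
```

With a real model, the only change is that fitting uses the training features and labels, and predictions are made from the validation features.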
Splitting your data into training and validation sets also lets you compare different models, choose a model's parameters, and perform model selection.
For a random forest, for instance, you can use validation to choose how many trees to combine, how deep the trees may grow, and how many features to sample at random in each tree.
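A hypothetical tuning loop along these lines, using scikit-learn's random forest on synthetic data (the parameter grid and data here are illustrative, not recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

best_score, best_params = -1.0, None
for n_trees in (10, 50):          # how many trees to combine
    for depth in (2, None):       # how deep the trees may grow
        model = RandomForestClassifier(
            n_estimators=n_trees,
            max_depth=depth,
            max_features="sqrt",  # features sampled at random per split
            random_state=1,
        ).fit(X_train, y_train)
        score = model.score(X_val, y_val)  # accuracy on validation data
        if score > best_score:
            best_score, best_params = score, (n_trees, depth)

print("best (n_estimators, max_depth):", best_params)
```

Each candidate setting is fitted on the training data only, and the validation accuracy decides between them.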
Keep in mind that the portion of the data used for validation is also called the holdout set.
Instead of two subsets, the data is frequently divided into three: training data, validation data, and test data.
The validation data is used to fine-tune your model, for example to choose the random forest's parameters, or, more generally, to select a model from among several candidates.
The test set, which was used neither to fit the models nor to compare parameter settings, then tells you how well the final model predicts or classifies fresh data points.
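The three roles can be sketched with a random three-way split; the 60/20/20 proportions below are a common but arbitrary choice, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
idx = rng.permutation(n)

# Hypothetical 60/20/20 split into training, validation, and test sets.
train_idx = idx[:60]    # fit the candidate models
val_idx = idx[60:80]    # tune parameters / choose between models
test_idx = idx[80:]     # final, untouched estimate of generalization
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 60 20 20
```

The key property is that the three index sets are disjoint, so the test set stays untouched until the very end.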
For a more thorough grasp of the theoretical underpinnings of model validation and of how a model's generalization error is calculated, we recommend the following materials:
Principles and Theory for Data Mining and Machine Learning by Bertrand Clarke, Ernest Fokoue, and Hao Helen Zhang (especially Chapter 1.3.2)
Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David (especially Chapter 11, which explains how validation estimates the true risk of an algorithm)
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (especially Chapter 7)
So far, we have only said that we can validate a model using the validation data.
How do we actually do that?
We choose a loss function and evaluate it on the validation data: we plug the observed labels or values in the validation data and the model's predictions into the loss function, which measures the chosen distance between them.
Which loss function is appropriate depends on the data at hand. A well-known loss function is the squared error loss (also called the L2 loss); for details, see The Elements of Statistical Learning, Chapter 2.4.
The squared error loss sums the squared differences between the actual values and their predictions.
Another well-known loss function is the L1 loss, which sums the absolute differences between data and predictions.
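Both losses are one line each in numpy; the values below are made-up observations and predictions, purely for illustration:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # observed values (hypothetical)
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # model predictions

# Squared error (L2) loss: sum of squared differences.
l2_loss = np.sum((y_true - y_pred) ** 2)

# L1 loss: sum of absolute differences.
l1_loss = np.sum(np.abs(y_true - y_pred))

print(l2_loss, l1_loss)  # prints: 1.5 2.0
```

Note how the L2 loss penalizes the larger errors (0.5, 0.5, 0.0, 1.0) more than proportionally, while the L1 loss simply adds them up.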
A word of warning: be careful not to run an analysis on the full data set, such as choosing auxiliary variables for a prediction model, and only then split the data into training and validation sets.
This pitfall is illustrated in detail in The Elements of Statistical Learning, Chapter 7.10.2.
Knowledge acquired from the entire data set can leak into the validation data and lead to wrong conclusions, such as inaccurate estimates of a model's generalization error.
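A small demonstration of the leak, in the spirit of the example in The Elements of Statistical Learning: the features below are pure noise and the labels are random, so no classifier should beat 50% accuracy, yet selecting the "best" features on the full data before splitting can make a simple nearest-centroid rule (a stand-in classifier) look better than chance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2000
X = rng.normal(size=(n, p))      # pure noise features
y = rng.integers(0, 2, size=n)   # random labels: nothing to learn

# WRONG: pick the features most associated with y using ALL the data...
signs = np.where(y[:, None] == 1, 1.0, -1.0)
assoc = np.abs((X * signs).mean(axis=0))  # crude association score
top = np.argsort(assoc)[-10:]

# ...and only then split and "validate" a nearest-centroid rule on them.
train, val = np.arange(0, 35), np.arange(35, 50)
mu0 = X[np.ix_(train[y[train] == 0], top)].mean(axis=0)
mu1 = X[np.ix_(train[y[train] == 1], top)].mean(axis=0)
d0 = ((X[np.ix_(val, top)] - mu0) ** 2).sum(axis=1)
d1 = ((X[np.ix_(val, top)] - mu1) ** 2).sum(axis=1)
acc = ((d1 < d0).astype(int) == y[val]).mean()
print(f"apparent accuracy on pure noise: {acc:.2f}")
```

The fix is to perform the feature selection inside the training data only, after the split.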
In a perfect scenario, we have enough data for a smooth random split into training and validation data, or into training, validation, and test data.
In real applications, however, we frequently do not have many data points at our disposal. We know that a model usually becomes more accurate as it is fed more data;
see, for example, this article on sample sizes for machine learning algorithms. Consequently, we have a motivation to train the model on as much data as possible.
On the other hand, the more data we use for validation, the more accurate (lower variance) our estimates of model performance based on the validation data are.
Cross-validation lets us ease this trade-off between wanting as much data as possible for training and as much as possible for validation.
In cross-validation, we repeatedly split the data into training and validation sets at random, and then combine the results of the many splits into one performance measure.
The final model testing is still done on a separate test set; cross-validation normally replaces only the single split into training and validation data.
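A minimal k-fold sketch of this idea in numpy, again with synthetic data and a deliberately simple stand-in classifier (a threshold on the first feature, placed midway between the training-set class means):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)  # synthetic labels, separable on feature 0

k = 5
idx = rng.permutation(len(y))
folds = np.array_split(idx, k)  # k roughly equal validation folds

scores = []
for i in range(k):
    val_idx = folds[i]  # each fold serves once as validation data
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Stand-in classifier "fitted" on the training fold: threshold at
    # the midpoint between the two class means of the first feature.
    x_tr, y_tr = X[train_idx, 0], y[train_idx]
    t = (x_tr[y_tr == 0].mean() + x_tr[y_tr == 1].mean()) / 2

    pred = (X[val_idx, 0] > t).astype(int)
    scores.append((pred == y[val_idx]).mean())

# Combine the k validation results into one measure.
print(f"mean cross-validated accuracy: {np.mean(scores):.2f}")
```

Every data point is used for validation exactly once and for training k - 1 times, which is how cross-validation squeezes both roles out of a small data set.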