Overfitting And Underfitting In Machine Learning

In this article, you will learn what overfitting and underfitting are, and how to prevent a model from overfitting or underfitting.

Ritesh Ranjan
Towards Data Science


While training models on a dataset, the two most common problems people face are overfitting and underfitting. Overfitting is the main cause of poor performance in machine learning models. If you have run into overfitting, don’t worry: in this article, we will work through a running example that shows how to prevent a model from overfitting. Before that, let’s first understand what overfitting and underfitting are.

Our main objective in machine learning is to estimate the underlying distribution of the training data well enough that the resulting model generalizes, i.e., it also predicts well on the test data.

Overfitting:

When a model learns the patterns and the noise in the training data to such an extent that it hurts its performance on new data, we call it overfitting. The model fits the training data so well that it interprets noise as patterns in the data.

The problem of overfitting mainly occurs with non-linear models, i.e., models whose decision boundary is non-linear. An example of a linear decision boundary is a line or a hyperplane, as in logistic regression. In the overfitting diagram above, you can see that the decision boundary is non-linear; this type of boundary is produced by non-linear models such as decision trees.

Non-linear models also have hyperparameters with which we can prevent overfitting. We will see this later in this article.

Underfitting:

When the model neither learns from the training dataset nor generalizes well on the test dataset, it is termed underfitting. This problem is not much of a headache, because it is very easy to detect from the performance metrics: if the performance is not good, try other models and you will usually get better results. Hence, underfitting is not discussed as often as overfitting.

Good Fit:

Now that we have seen what overfitting and underfitting are, let’s see what a good fit means.

The sweet spot between underfitting and overfitting is a good fit.

In the real world, getting a perfectly fitted model is very difficult, and you will not get one in one go. First, you build a first-cut solution and put it in production; then you retrain the model on the data you gather over time.

How to tackle overfitting

In this section, I will take a real-world case study to demonstrate how to prevent a model from overfitting. We will use the Amazon Fine Food Reviews dataset and train a decision tree on it. I will also provide a GitHub link for further reference. I am assuming you all know what a decision tree is. We will be using the sklearn library.

You can prevent a model from overfitting by using techniques like K-fold cross-validation and hyperparameter tuning. Generally, people use K-fold cross-validation to do hyperparameter tuning. I will show how to do this with a decision tree as the example. First, let me explain what K-fold cross-validation is.

K-fold cross-validation: In this technique, we first split the dataset into a training part (say 80% of the data) and a test part (the remaining 20%). During training, the training part is further split so that one portion is used for fitting the model and another for cross-validation. In decision trees, the main hyperparameters are the depth of the tree and the minimum number of data points a node must contain before it is split. Let us see how K-fold cross-validation works, using 5-fold cross-validation as the example. The training set is divided into 5 parts; 4 parts are used for training and the remaining part for validation, and this is repeated 5 times so that each part serves as the validation set once. The average of the metric over the 5 folds is the final metric for that hyperparameter setting. Within one such round, the depth of the tree and the split value are fixed, e.g., a depth of 100 and a split value of 200. Let the training dataset D be divided into D1, D2, D3, D4, and D5. The calculation of the validation metric for a fixed value of the hyperparameters is shown below.

Cross-Validation calculation.
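
The original figure is not reproduced here, so as a stand-in, here is a minimal sketch of that calculation using sklearn’s KFold and a DecisionTreeClassifier. The names X_train and y_train, and the fixed hyperparameter values, are illustrative assumptions rather than the exact code from the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed: X_train, y_train hold the featurized 80% training split (arrays or a sparse matrix).
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

for train_idx, val_idx in kf.split(X_train):
    # 4 parts (train_idx) are used for training, 1 part (val_idx) for validation.
    model = DecisionTreeClassifier(max_depth=100, min_samples_split=200)
    model.fit(X_train[train_idx], y_train[train_idx])
    preds = model.predict(X_train[val_idx])
    fold_accuracies.append(accuracy_score(y_train[val_idx], preds))

# Average over the 5 folds: the CV accuracy for this fixed hyperparameter setting.
cv_accuracy = np.mean(fold_accuracies)
print("5-fold CV accuracy:", cv_accuracy)
```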

This helps us monitor training, because during training we validate the model on data it has not seen. Let us take accuracy as the metric for now, and let A1 be the training accuracy after training. If the training accuracy and the test accuracy are close, the model has not overfit. If the training result is very good but the test result is poor, the model has overfit. If both the training accuracy and the test accuracy are low, the model has underfit. If the model is underfitting or overfitting, we change the value of the hyperparameters and retrain the model until we get a good fit.
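
As a rough sketch of this diagnosis (assuming a fitted model and featurized splits X_train, y_train, X_test, y_test; the thresholds below are illustrative, not from the original article):

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# Illustrative thresholds only; how large a gap you tolerate is a judgment call.
if train_acc - test_acc > 0.10:
    print("Training accuracy much higher than test accuracy -> likely overfitting")
elif train_acc < 0.70 and test_acc < 0.70:
    print("Both accuracies low -> likely underfitting")
else:
    print("Training and test accuracies are close -> reasonable fit")
```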

Hyperparameter tuning: In this, we take a range of values for each hyperparameter and monitor the cross-validation accuracy on all the possible combinations of hyperparameters. We pick the combination that gives the best accuracy (here we have taken accuracy as the metric), train the model with those hyperparameters, and then test it. The following shows how the cross-validation accuracy is calculated for each hyperparameter combination.

Cross-validation calculation example.
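
Since that figure is also not reproduced here, the loop below is a minimal sketch of the same idea, assuming the same X_train and y_train as above and an illustrative grid of depths and split values; sklearn’s cross_val_score does the 5-fold averaging for us.

```python
from itertools import product

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid; the real search space depends on the dataset.
depths = [5, 10, 50, 100]
min_splits = [2, 50, 200]

best_params, best_cv_acc = None, 0.0
for depth, split in product(depths, min_splits):
    model = DecisionTreeClassifier(max_depth=depth, min_samples_split=split)
    # Mean 5-fold cross-validation accuracy for this combination.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    if cv_acc > best_cv_acc:
        best_params, best_cv_acc = (depth, split), cv_acc

print("Best (max_depth, min_samples_split):", best_params)
print("Best cross-validation accuracy:", best_cv_acc)
```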

We can do hyperparameter tuning using GridSearchCV or RandomizedSearchCV, both provided by the sklearn library. GridSearchCV checks the cross-validation accuracy on all the possible combinations, whereas RandomizedSearchCV checks a randomly selected subset of the combinations. Below is the code to do hyperparameter tuning. To view the complete code, click here.

Code to do hyperparameter tuning.
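
The tuning code appears as an image in the original post; as a stand-in, here is a minimal sketch of what decision-tree tuning with GridSearchCV and RandomizedSearchCV could look like, with an illustrative parameter grid and the same assumed X_train and y_train as above.

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative search space for the two hyperparameters discussed above.
param_grid = {
    "max_depth": [5, 10, 50, 100, None],
    "min_samples_split": [2, 50, 200, 500],
}

# GridSearchCV tries every combination in the grid with 5-fold cross-validation.
grid = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)

# RandomizedSearchCV samples a fixed number of combinations instead of trying all of them.
random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_grid,
    n_iter=10,
    scoring="accuracy",
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)
print("Best parameters (random search):", random_search.best_params_)
```

Either search exposes a best_estimator_, which can then be evaluated once on the held-out test set.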

I hope this clears up what overfitting and underfitting are and how to tackle them.

Having difficulty understanding why we use cross-entropy? Go through this blog to get a clear intuition.

If you have difficulty understanding calibration, then go through this blog.

References:

1. https://en.wikipedia.org/wiki/Overfitting

2. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

3. https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/


Machine Learning Enthusiast. Writer in Towards Data Science, Analytics Vidhya, and AI In Plain English. LinkedIn: https://www.linkedin.com/in/riteshranjan11055/