Regularization: A brief overview.

So, you've finally wrapped your head around the basics of data gathering, preprocessing, model building, and evaluation. Then you notice your model is overfitting. 😐

So, what's going on? 🤔

Overfitting occurs when your model captures the noise in your training dataset, so it performs very well on the training data but generalizes poorly to new data.

NB: Noise refers to data points that don't represent the true properties of your data.

How do I fix it? 🤔

A good and common way to reduce overfitting is to regularize the model.

NB: "to regularize" also means and can be replaced with "to constrain"

There are three common techniques for regularizing linear models: Ridge Regression, Lasso Regression, and Elastic Net.

Now, let's get into the details. 😉

  • Ridge Regression

    Also known as Tikhonov regularization, this is a regularized (constrained) version of Linear Regression: a regularization term is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model's weights as small as possible. The hyperparameter α controls how strongly the weights are penalized: with α = 0, Ridge is just plain Linear Regression, while a very large α drives all the weights very close to zero and the result is a flat line through the data's mean.
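Concretely, here is one common way to write the Ridge cost function (a sketch; the exact scaling constants vary between textbooks and libraries):

J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^2

Here, n is the number of features, and the bias term θ0 is conventionally left out of the sum.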

NOTE: Always scale your data with StandardScaler before using Ridge regression, as it is sensitive to the scale of the input features. This is true of most regularized models.

Below is a code snippet of how to implement ridge regression using Stochastic Gradient Descent:

from sklearn.linear_model import SGDRegressor

# X and y are assumed to already hold your (scaled) training features and targets
sgd_reg = SGDRegressor(penalty="l2")
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

>> array([1.47012588])

The penalty hyperparameter sets the regularization term to use. In this case, we specify "l2", which is simply Ridge Regression.
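To respect the scaling note from earlier, here is a minimal sketch (assuming X and y are already loaded, as in the snippet above) that chains StandardScaler and the SGD regressor in a single Pipeline, so scaling is applied consistently when fitting and predicting:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Scale the features, then fit an l2-penalized (Ridge-style) regressor
scaled_ridge = make_pipeline(StandardScaler(), SGDRegressor(penalty="l2"))
scaled_ridge.fit(X, y.ravel())
scaled_ridge.predict([[1.5]])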

  • Lasso Regression

    Least Absolute Shrinkage and Selection Operator Regression (Lasso Regression for short) is the second regularized version of Linear Regression on our list. Just like Ridge Regression, it adds a regularization term to the cost function, but it uses the l1 norm of the weight vector rather than the (squared) l2 norm.
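Written out in the same style as before (again, a sketch up to scaling constants), the Lasso cost function is:

J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} |\theta_i|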

An important characteristic of Lasso Regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).

Below is a code snippet to implement Lasso Regression:

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

>> array([1.53788174])
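To actually see the feature-elimination behavior mentioned above, you can fit Lasso on data with several features and inspect the learned coefficients. Below is a small sketch with made-up data in which only the first feature matters:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
X_multi = rng.randn(100, 5)                          # 5 features, but...
y_multi = 3 * X_multi[:, 0] + 0.1 * rng.randn(100)   # ...only the first one matters

lasso = Lasso(alpha=0.1)
lasso.fit(X_multi, y_multi)
print(lasso.coef_)  # weights of the unimportant features are driven to (or very near) 0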

Finally, we look at Elastic Net.

  • Elastic Net

    Elastic Net is a middle ground between Ridge Regression and Lasso Regression. Its regularization term is a simple mix of the Ridge and Lasso regularization terms, and you can control the mix with a ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.
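One common way to write the Elastic Net cost function (roughly the formulation scikit-learn follows, with r corresponding to the l1_ratio parameter) is:

J(\theta) = \mathrm{MSE}(\theta) + r\,\alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2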

Below is a code snippet to implement Elastic Net (l1_ratio corresponds to the mix ratio, r):

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

>> array([1.54333232])

Conclusion

With all this, you might start to wonder: when do I use which technique? 🤷‍♂️

It is almost always preferable to have at least a little bit of regularization. So generally, you should avoid plain Linear Regression. Ridge is a good default, but if you suspect only a few features in your dataset are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features’ weights down to zero.

In general, Elastic Net is preferred over Lasso because Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Happy New Year 🥳, and
Happy Hacking 🤓