Why doesn’t gradient descent converge well with unscaled features?

Lavanya Gupta · Published in Analytics Vidhya · Mar 1, 2021 · 3 min read

Ever felt curious about this well-known axiom: “Always scale your features”? Well, read on to get a quick graphical and intuitive explanation!

Motivation

I am sure all of us have seen this popular axiom in machine learning: Always scale your features before training!
While most of us know its practical importance, not many of us are aware of the underlying mathematical reasons.

In this super short blog, I explain what happens behind the scenes with our favorite Gradient Descent algorithm when it is fed features with very different magnitudes.

Understanding with an example

Let’s say we are trying to predict the life expectancy of a person (in years) using 2 numeric predictor variables/features: x1 and x2, where x1 is the age of the person and x2 is his/her salary. Clearly, x1 << x2.

This is a regression problem where we aim to learn the weights theta1 and theta2 for x1 and x2 respectively by minimizing the cost function — Mean Squared Error (MSE).
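For concreteness, here is the setup written out (a minimal sketch; the bias term θ0, the sample count m, and the learning rate α are my notation and are not spelled out in the post):

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2,
\qquad
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2
```

Gradient descent then updates each weight using its partial derivative:

```latex
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}
        = \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)}
```

Note that the gradient for theta_j is weighted by the feature value x_j itself, so a feature in the tens of thousands (salary) produces far larger gradients for its weight than a feature in the tens (age). This asymmetry is what distorts the cost surface in the plots below.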

If we plot theta1, theta2, and cost:

Left: Cost function with scaled features. Right: Cost function with unscaled features (elongated in the direction of the smaller-magnitude feature)

Left figure: With feature scaling
The cost function’s contours are perfect circles (in 2D); the surface itself is a symmetric bowl (in 3D). Gradient descent reaches the minimum (the center) easily, in relatively few steps.

Right figure: Without feature scaling
The cost function’s contours become ellipses, stretched/elongated along the axis of the weight that belongs to the smaller-magnitude feature (theta1 for age, in our example).

Extended direction ➡ smaller gradient ➡ more steps required to reach the minimum

Since x1 (age) is much smaller in magnitude than x2 (salary), it takes a much larger change in theta1 to produce the same change in the cost. At the same time, the gradient along theta1 is small, so each update barely moves it: gradient descent has to cover a much longer distance along the theta1 axis (the horizontal direction in the figure), and it does so in tiny increments.

We would ideally like to:
- move quickly in directions with small gradients, and
- move slowly in directions with large gradients.

However, vanilla gradient descent uses a single learning rate for every direction, so it cannot do both at once. Scaling the features makes the gradients comparable in all directions, so one learning rate works well everywhere, as the sketch below shows.
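To put numbers on this, here is a minimal, self-contained Python sketch (not from the original post; the synthetic age/salary data, learning rates, tolerance, and iteration cap are all illustrative assumptions). It runs plain batch gradient descent on the same kind of two-feature regression, once on raw features and once on standardized features, and reports how many steps each needs:

```python
import numpy as np

def steps_to_converge(X, y, lr, tol=1e-3, max_iters=500_000):
    """Plain batch gradient descent on the MSE cost; report how many steps it
    takes for the gradient norm to fall below tol."""
    theta = np.zeros(X.shape[1])
    for step in range(1, max_iters + 1):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient of (1/2m) * ||X @ theta - y||^2
        if np.linalg.norm(grad) < tol:
            return f"{step} steps"
        theta -= lr * grad
    return f"no convergence within {max_iters} steps"

rng = np.random.default_rng(0)
m = 200
age = rng.uniform(20, 80, m)                    # x1: magnitude in the tens
salary = rng.uniform(20_000, 150_000, m)        # x2: magnitude in the tens of thousands
y = 85 - 0.3 * age + 1e-4 * salary + rng.normal(0, 1, m)   # made-up life expectancy
y = y - y.mean()                                # center the target so we can skip the bias term

X_raw = np.column_stack([age, salary])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # standardized features

# With raw features, anything much larger than ~1e-10 makes the updates diverge,
# because the huge salary column dominates the curvature of the cost surface.
print("unscaled:", steps_to_converge(X_raw, y, lr=1e-10))
print("scaled:  ", steps_to_converge(X_std, y, lr=0.1))
```

On this toy data, the unscaled run fails to converge within half a million steps even though the learning rate is kept tiny to avoid divergence, while the standardized run converges in roughly a hundred steps.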

⭐️ Hence, feature scaling is required for gradient descent to converge easily.

Note that not all ML algorithms require feature scaling. For example, it is not mandatory to scale the features before training a decision tree or random forest model. Their splitting criteria (Gini impurity, entropy, etc.) are not distance-based and are not optimized with gradient descent, so these models are immune to the scale of the features.
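As a practical illustration (assuming scikit-learn, which the post does not mention), the usual pattern is to bundle the scaler with any scale-sensitive, gradient-descent-based model in a pipeline, while tree-based models can be trained on the raw features:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor

# Gradient-descent-based model: standardize the features first.
sgd_model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000))

# Tree-based model: splits compare feature values to thresholds, so scaling is unnecessary.
rf_model = RandomForestRegressor(n_estimators=100)

# Both are then trained the same way, e.g. sgd_model.fit(X_train, y_train);
# the pipeline fits the scaler on the training data and reapplies it at prediction time.
```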

Summary

Without feature scaling, gradient descent requires many more steps to reach the minimum. In other words, it takes much longer to converge, which increases model training time.
To avoid this, it is always advisable to scale your features when working with distance-based cost functions or algorithms (e.g., MSE-based regression, k-means, SVMs).

If you have made it this far and found this article useful, then please hit 👏 . It goes a long way in keeping me motivated, thank you!
