A logical and sequential roadmap to understanding the advanced concepts in training deep neural networks.


We will break our discussion into 4 logical parts that build upon each other. For the best reading experience, please go through them sequentially:

1. What is Vanishing Gradient? Why is it a problem? Why does it happen?
2. What is Batch Normalization? How does it help with Vanishing Gradient?
3. How does ReLU help with Vanishing Gradient?
4. Batch Normalization for Internal Covariate Shift

Vanishing Gradient

1.1 What is vanishing gradient?

First, let’s understand what vanishing means:


Learn to correctly interpret the coefficients of Logistic Regression and in the process naturally derive its cost function — the Log Loss!



Models like Logistic Regression often win over their more complex counterparts when explainability and interpretability are crucial to the solution. Unfortunately, Logistic Regression coefficients are not as easy to interpret as ordinary Linear Regression coefficients.

Imagine choosing Logistic Regression solely for its explainability, yet presenting wrong descriptions to the business stakeholders. Ouch, definitely not a pleasant scenario!

In this blog, I have described how we can derive the interpretation of logistic regression coefficients naturally…
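As a rough illustration of the kind of interpretation involved, the sketch below uses the standard log-odds reading of a logistic regression coefficient: a one-unit increase in a feature multiplies the odds by `exp(coef)`. The parameter values and function names are hypothetical and chosen only for the example; the article's own derivation may proceed differently.

```python
import math

def predict_proba(x, intercept, coef):
    """Logistic regression probability for a single feature x."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical fitted parameters, chosen only for illustration.
intercept, coef = -1.5, 0.8

odds_at_2 = odds(predict_proba(2.0, intercept, coef))
odds_at_3 = odds(predict_proba(3.0, intercept, coef))

# A one-unit increase in x multiplies the odds by exp(coef).
odds_ratio = odds_at_3 / odds_at_2
print(odds_ratio, math.exp(coef))
```

Note that the coefficient acts on the odds, not on the probability itself, which is why it cannot be read the way a Linear Regression coefficient is.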

Ever felt curious about this well-known axiom: “Always scale your features”? Well, read on to get a quick graphical and intuitive explanation!


I am sure all of us have seen this popular axiom in machine learning: Always scale your features before training!
While most of us know its practical importance, not many of us are aware of the underlying mathematical reasons.

In this super short blog, I have explained what happens behind the scenes with our favorite Gradient Descent algorithm when it is fed features with very different magnitudes.
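The effect can be sketched with a small experiment, assuming a least-squares linear regression on synthetic data with one feature on a 0 to 1 scale and another on a 0 to 1000 scale. All names, data, and learning rates here are assumptions made for the illustration, with each learning rate picked by hand to keep that run stable.

```python
import numpy as np

def gradient_descent(X, y, lr, steps):
    """Plain batch gradient descent for least squares; returns the final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
n = 200
small = rng.uniform(0, 1, n)       # feature on a 0..1 scale
large = rng.uniform(0, 1000, n)    # feature on a 0..1000 scale
X_raw = np.column_stack([small, large])
y = 3.0 * small + 0.002 * large + rng.normal(0, 0.01, n)

# Center both versions so the only difference is the feature scale.
X_centered = X_raw - X_raw.mean(axis=0)
X_scaled = X_centered / X_raw.std(axis=0)
y_centered = y - y.mean()

# The large feature forces a tiny learning rate on the unscaled data,
# so the weight of the small feature barely moves in the same number of steps.
loss_raw = gradient_descent(X_centered, y_centered, lr=5e-6, steps=500)
loss_scaled = gradient_descent(X_scaled, y_centered, lr=0.1, steps=500)
print(loss_raw, loss_scaled)
```

With the same number of steps, the run on scaled features should end at a much lower loss, because a single learning rate then suits both weights.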

Understanding with an example

Let’s say we are trying to predict the life expectancy…

Ace your ML interview by quickly understanding which real-world use cases demand higher precision, which ones demand higher recall, and why.

Why should you read this article?

All machine learning interviews expect you to understand the practical application of precision-recall tradeoff in real-world use cases, beyond just the definitions and formulas.

I have tried to capture this essence by defining a 🔑 “secret key” that you can exploit to ace your next ML interview and impress your interviewer by providing articulate justifications!


💡 Precision measures, out of all the examples predicted as positive, how many were actually correct.

💡 Recall measures that out of…
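A minimal sketch of both metrics on made-up binary labels follows. Since the recall definition is cut off above, the code assumes the standard one: out of all actual positives, how many were detected. The function name and the data are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 2 of 3 predictions correct, 2 of 4 positives found
```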

Focal Loss explained in simple words to understand what it is, why it is required, and how it is useful, in both an intuitive and a mathematical formulation.


Binary Cross Entropy Loss

Most object detection models use the Cross-Entropy Loss function for their learning. The idea is to have a loss function that drives the model to predict a high probability for a positive example and a low probability for a negative example, so that a standard threshold of, say, 0.5 can easily separate the two classes. …
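To make the behavior described above concrete, here is a small sketch of binary cross-entropy in plain Python. The function name, the `eps` clipping value, and the example probabilities are assumptions for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; p_pred holds predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One positive and one negative example.
confident_correct = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_correct, confident_wrong)
```

The loss is small when the positive example gets a high probability and the negative one a low probability, and it grows sharply when the predictions are confidently wrong.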

Empower your deep learning models by harnessing some immensely powerful image processing algorithms.



Many deep learning courses start with an introduction to the basic image processing techniques (like resizing, cropping, color-to-grayscale, rotation, etc.) but only provide a cursory glance at these concepts.
In this journey, we often discount the importance of some immensely powerful image processing algorithms that can do wonders for our final model predictions.

For folks who are familiar with the general ML paradigm, you know how vital data cleaning and feature engineering are for the success of any model. …

In the model training phase, a model learns its parameters. But there are also some secret knobs, called hyperparameters, that the model cannot learn on its own; these are left to us to tune. Tuning hyperparameters can significantly improve model performance. Unfortunately, there is no definite procedure to calculate these hyperparameter values. This is why hyperparameter tuning is often regarded as more of an art than a science.

In this article, I discuss the 3 most popular hyperparameter tuning algorithms — Grid search, Random search, and Bayesian optimization.
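The first two of these can be sketched in a few lines. The example below assumes a hypothetical `validation_score` function that stands in for training a model and scoring it on validation data; the hyperparameter names and ranges are also made up. Bayesian optimization is left out because it needs a surrogate model.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    """Hypothetical stand-in for 'train a model, return its validation score'."""
    return -((learning_rate - 0.1) ** 2) * 100 - ((depth - 5) ** 2) * 0.1

def grid_search(lr_values, depth_values):
    """Try every combination on a fixed grid and keep the best one."""
    candidates = itertools.product(lr_values, depth_values)
    return max(candidates, key=lambda c: validation_score(*c))

def random_search(n_trials, seed=0):
    """Sample combinations at random from the ranges and keep the best one."""
    rng = random.Random(seed)
    candidates = [
        (rng.uniform(0.001, 0.5), rng.randint(1, 10)) for _ in range(n_trials)
    ]
    return max(candidates, key=lambda c: validation_score(*c))

best_grid = grid_search([0.01, 0.1, 0.3], [3, 5, 7])
best_random = random_search(n_trials=20)
print(best_grid, best_random)
```

Grid search can only return values that are on the grid, while random search can land on values between grid points at the cost of not covering the space systematically.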

Tune your model’s secret knobs called Hyperparameters

What is Hyperparameter Tuning?

Model training is a process through which a model learns its parameters. Besides this…

With so many rapid advances taking place in Natural Language Processing (NLP), it can sometimes be overwhelming to objectively understand how the different models differ.

It is important to understand not only how these models differ from each other, but also how one model overcomes the shortcomings of another.

Below I have drawn out a comparison between two very popular models — Word2Vec and BERT.

1. Context

Word2Vec models generate embeddings that are context-independent, i.e., there is just one vector (numeric) representation for each word. …
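What context-independent means can be shown with a toy lookup table. The vectors below are invented for illustration and are not real Word2Vec output; the point is only that a static embedding is a lookup keyed by the word alone.

```python
# Hypothetical toy embedding table; real Word2Vec vectors are learned from a corpus.
embeddings = {
    "river": [0.9, 0.1],
    "bank": [0.5, 0.5],
    "money": [0.1, 0.9],
    "the": [0.0, 0.0],
    "in": [0.0, 0.1],
}

def embed(sentence):
    """Map each word to its single stored vector, ignoring surrounding words."""
    return {word: embeddings[word] for word in sentence.lower().split()}

bank_near_river = embed("the river bank")["bank"]
bank_near_money = embed("money in the bank")["bank"]

# Same word, different contexts, identical vector.
print(bank_near_river == bank_near_money)  # True
```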

Lavanya Gupta

AWS ML Specialist | Instructor & Mentor for Data Science
