Focal Loss — What, Why, and How?

Lavanya Gupta · Published in The Startup · Jan 28, 2021

Focal Loss explained in simple words: what it is, why it is required, and how it is useful, in both an intuitive and a mathematical formulation.

Binary Cross Entropy Loss

Most object detector models use the Cross-Entropy Loss function for their learning. The idea is to train the model to predict a high probability for a positive example and a low probability for a negative example, so that with a standard threshold of, say, 0.5, we can easily differentiate between the two classes. I am going to start by explaining the Binary Cross Entropy Loss (for 2 classes) and later generalize it to the standard Cross Entropy Loss (for N classes).

Binary Cross Entropy Loss

Let’s understand the above image. On the x-axis is the predicted probability for the true class, and on the y-axis is the corresponding loss. I have broken down the Binary Cross Entropy Loss into 2 parts:

  1. loss = -log(p) when the true label Y = 1
    Point A: If the predicted probability p is low (closer to 0), the loss is high, i.e. the model is penalized heavily.
    Point B: If the predicted probability p is high (closer to 1), the loss is low, i.e. the model is not penalized heavily.
  2. loss = -log(1 - p) when the true label Y = 0
    Point C: If the predicted probability p is low (closer to 0), the loss is low, i.e. the model is not penalized heavily.
    Point D: If the predicted probability p is high (closer to 1), the loss is high, i.e. the model is penalized heavily.

Note that the model’s predicted probability p is always w.r.t. the sample belonging to the Y=1 class. So, a high value such as p=0.8 means the model predicts, with probability 0.8, that the sample belongs to class Y=1.

Combining the above two losses, we get:
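In equation form (the standard binary cross-entropy, written here for a single sample with true label Y ∈ {0, 1} and predicted probability p of the positive class):

$$\text{BCE}(p, Y) = -\big[\, Y \log(p) + (1 - Y)\log(1 - p) \,\big]$$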

Generalizing from 2 to N classes, the cross-entropy loss is also written as:
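In the usual notation, with y_i the one-hot true label and p_i the predicted probability for class i:

$$\text{CE}(\mathbf{p}, \mathbf{y}) = -\sum_{i=1}^{N} y_i \log(p_i)$$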

Cross Entropy

The above equations make sense because the cross-entropy penalizes the model based on how far our prediction is from the true label. But one must note that the cross-entropy loss gives equal importance (or weightage) to all classes: for a given predicted probability p of the true class, the loss value is the same regardless of which class the sample belongs to. In other words, the cross-entropy loss is class-agnostic and depends only on the predicted probability p.

Now the question arises — Why is giving equal importance to classes an issue? Let’s see.

Focal Loss for Object Detection

Penalizing the model equally for all classes is a drawback, especially when there is skewness or imbalance in the occurrence of classes in the image. A typical candidate image for object detection comprises many background regions (Y=0) but only a few foreground regions (Y=1), i.e. regions containing our object(s) of interest. This leads to a class-imbalance problem.

The Class-Imbalance Problem

The class imbalance problem affects an object detection model in the following 2 ways:

a) No learning due to easy negatives

If you build a neural network and train it for a bit, it will quickly learn to classify the negatives at a basic level. From this point on, most of the training examples will not do much to improve the model’s performance, because the model is already doing a decent job overall. This makes the training inefficient: most locations in the image are easy negatives (i.e. they can be easily classified by the detector as background) and hence contribute no useful learning signal.

b) Cumulative effect of many easy negatives

We have seen that images contain a high number of negatives (background regions). When the small losses from such easy negatives are summed over many images, they overwhelm the total loss, giving rise to degenerate models. Easily classified negatives comprise the majority of the loss and dominate the gradient. This shifts the model’s focus towards getting only the negative examples classified correctly; in other words, the weights are updated in such a way that the performance on only the negatives becomes better and better.
For example, a high-confidence prediction of the background class (Y=0), say p = 0.05, contributes -log(1-p) = -log(1-0.05) = -log(0.95) ≈ 0.05 to the loss. And a high-confidence prediction of the foreground class (Y=1), say p = 0.95, also contributes -log(p) = -log(0.95) ≈ 0.05 to the loss.
Both contribute equally to the loss, but since the background samples are far more numerous, the loss originating from them dominates the overall loss.
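As a rough sketch of this cumulative effect (the counts and probabilities below are illustrative assumptions, not numbers from any real detector):

```python
import math

# Assumed numbers: 1000 easy background regions and 10 foreground regions,
# all predicted confidently (close to their true labels).
n_neg, p_neg = 1000, 0.05   # predicted probability of Y=1 for background regions
n_pos, p_pos = 10, 0.95     # predicted probability of Y=1 for foreground regions

loss_neg = n_neg * -math.log(1 - p_neg)   # each easy negative contributes ~0.05
loss_pos = n_pos * -math.log(p_pos)       # each confident positive contributes ~0.05

print(loss_neg, loss_pos)                 # ~51.3 vs ~0.51
print(loss_neg / (loss_neg + loss_pos))   # ~0.99 -> the negatives dominate the total loss
```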

A common solution is to perform some kind of hard negative mining that samples hard examples during training, to employ more elaborate sampling and re-weighting schemes, or to add more instances of the less dominant class to the training set through data augmentation.
Is there any other way of solving the class imbalance problem without modifying the dataset?

This necessitated a new, improved loss function. The shortcoming was addressed by the Facebook AI Research (FAIR) group in 2017, who introduced the Focal Loss along with their one-stage detector called RetinaNet!

What problem does Focal Loss solve?

You can now guess that we are looking for a new improved loss function that solves a 2-fold problem:
1. balance between easy and hard examples
2. balance between positive and negative examples

In particular, the loss function should be able to down-weight the easy examples and focus on the hard examples.

Problem 1: Easy and hard examples

Objective — Focus less on easy examples and more on hard examples

Focal Loss for Y = 1 class

We introduce a new parameter, the modulating factor (γ), to create the improved loss function. This can be intuitively understood from the image above. When γ=0, the curve is the standard cross-entropy loss, and the range of predicted probabilities p where the loss is low is limited to ~[0.6, 1] (blue curve). This is the area of well-classified easy examples with high probability. In other words, even examples that are easily classified (p >> 0.5) incur a loss with non-trivial magnitude. Remember that we are looking for a way to reduce the loss contribution from such examples.

Now as we increase the value of γ (up to 5), we slowly extend the range of predicted probabilities where the loss value is low to ~[0.3, 1] (green curve). This means we are “extending” or “relaxing” our criteria for well-classified examples by increasing γ.

We can interpret the difference in the shape of the loss curves by observing that a model trained with the standard cross-entropy loss, γ = 0 (blue curve), will keep trying to push the scores of even the well-classified examples further and further until the predicted probability p is perfectly 1. A model trained with the focal loss, on the other hand, will not care too much about the well-classified examples (i.e. will not focus on getting a perfect probability p = 1) and will instead work more on improving the hard examples (i.e. those whose predicted probability p is much less than 1). The idea is that if a sample is already well-classified, we can significantly decrease or down-weight its contribution to the loss. This lets the model concentrate on the losses arising from the very low predicted probabilities, i.e. the hard examples.

We can tune the modulating parameter γ using cross-validation. Note that with γ=0, the Focal Loss is equivalent to Cross-Entropy Loss (blue curve).

The modulating factor reduces the loss contribution from easy examples and extends the range of probabilities that contribute to a low loss value.

Solution 1: Focal loss for balancing easy and hard examples using modulating parameter γ
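Written out, this is the form given in the RetinaNet paper (stated here without the α weighting introduced below), where p_t = p when Y = 1 and p_t = 1 - p when Y = 0:

$$\text{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$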

Problem 2: Positive and negative examples

Objective — Balance between the class instances

By incorporating γ we were able to differentiate between the easy and hard examples. But we have still not fixed the class imbalance problem of positive and negative examples.

To do so, we add a weighting parameter (α), which is usually set from the inverse class frequency. α_t is the weighting term whose value is α for the positive (foreground) class and 1-α for the negative (background) class.

Solution 2: Focal loss for balancing positive and negative examples using the weighting parameter α
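In the same notation, this is the α-balanced cross-entropy from the paper:

$$\text{CE}(p_t) = -\alpha_t \log(p_t)$$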

Overall Focal Loss

Combining the above two focal loss versions, we get:
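$$\text{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma} \log(p_t)$$

This is the α-balanced focal loss used in RetinaNet, with the recommended defaults γ = 2 and α = 0.25.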

Hence, the Focal Loss function is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. It is ultimately a weighted Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error.
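To make this concrete, here is a minimal sketch of a binary focal loss in NumPy (my own illustrative implementation, not the reference code from the paper; the function name and defaults are simply chosen to match the values discussed in this article):

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-sample focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p : predicted probability of the positive class (Y=1), shape (N,)
    y : true labels in {0, 1}, shape (N,)
    """
    p = np.clip(p, eps, 1.0 - eps)            # avoid log(0)
    p_t = np.where(y == 1, p, 1.0 - p)        # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# With gamma=0 the modulating factor disappears and, up to the alpha weighting,
# this reduces to the plain binary cross-entropy discussed earlier.
p = np.array([0.95, 0.05, 0.30, 0.70])
y = np.array([1,    0,    1,    0   ])
print(binary_focal_loss(p, y))   # easy examples (first two) get tiny losses
```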

Focal Loss in action (for 2 classes)

I have captured the 4 possible scenarios in the table below for a fixed recommended value of γ = 2 and α = 0.25.

We see that the CE/FL ratio for an easy classification is large (400 and 150), while for a hard classification it is small (merely 5 and 2). This means that the loss from easy examples is scaled down by a large factor of 400 or 150, whereas the loss from hard examples is scaled down only by a small factor of 5 or 2. This validates the premise that focal loss significantly down-weights the easy examples, which in turn assigns more importance to the hard examples.

Note that it makes more sense to use α=0.75, since the positive samples are usually in the minority. However, as the calculations above show, the contribution of α sometimes does not affect the loss very significantly.
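The exact probabilities behind the table are not reproduced here, but the scaling factor is easy to check for assumed values of p_t (the probability of the true class): since the log term cancels, the CE/FL ratio is simply 1 / (α_t (1 - p_t)^γ).

```python
import math

def ce(p_t):
    return -math.log(p_t)

def fl(p_t, alpha_t, gamma=2.0):
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Assumed example probabilities, with gamma = 2 and alpha = 0.25:
for p_t, alpha_t in [(0.90, 0.25), (0.10, 0.25)]:
    print(p_t, ce(p_t) / fl(p_t, alpha_t))  # ~400x for the easy case, ~5x for the hard one
```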

Effectiveness of Focal Loss

Let us take an extreme example to strengthen our understanding. Suppose we have 1,000,000 samples whose true class is predicted with probability p=0.99 and 10 samples whose true class is predicted with probability p=0.01 (note that p here is the probability assigned to the true class). The 1,000,000 confidently classified samples happen to all be negatives, and the 10 samples with p=0.01 happen to all be positives.

In this scenario, if we use the standard cross-entropy loss (computed with log base 10, which matches the numbers below), the loss from the negative examples is 1000000×0.0043648054=4364 and the loss from the positive examples is 10×2=20. Hence, the loss contribution from the positive examples is 20/(4364+20)=0.0046. Almost negligible.

If we use the focal loss instead (with γ=2 and α=0.25), the loss from the negative examples is 1000000×0.0043648054×0.000075=0.3274 and the loss from the positive examples is 10×2×0.245025=4.901. The loss contribution from the positive examples is 4.901/(4.901+0.3274)=0.9374! It now dominates the total loss!

This extreme example demonstrates that minority-class samples are much less likely to be ignored during training.
(This example has been shamelessly copied from Lei Mao’s blog.)
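A quick sketch to reproduce these numbers (log base 10, γ = 2, α = 0.25):

```python
import math

n_neg, p_neg = 1_000_000, 0.99   # probability assigned to the true (negative) class
n_pos, p_pos = 10, 0.01          # probability assigned to the true (positive) class
gamma, alpha = 2.0, 0.25

# Standard cross-entropy (log base 10, as in the example above)
ce_neg = n_neg * -math.log10(p_neg)                   # ~4364
ce_pos = n_pos * -math.log10(p_pos)                   # 20
print(ce_pos / (ce_pos + ce_neg))                     # ~0.0046

# Focal loss: scale each term by alpha_t * (1 - p_t)**gamma
fl_neg = ce_neg * (1 - alpha) * (1 - p_neg) ** gamma  # ~0.327
fl_pos = ce_pos * alpha * (1 - p_pos) ** gamma        # ~4.901
print(fl_pos / (fl_pos + fl_neg))                     # ~0.937
```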

Conclusion

The focal loss is designed to address the class imbalance by down-weighting the easy examples such that their contribution to the total loss is small even if their number is large. It focuses on training a subset of hard examples.

Good to know

Although we have discussed Focal Loss in the context of Object Detection, I want to share that Focal Loss can be utilized in classification problems as well (instead of the standard cross-entropy loss) where there is an imbalance in the dataset classes.
