Classifying unknown gestures with modified loss function

Kornel Dylski
September 28, 2020

Image classifiers are a common concept in deep learning projects. While the theory is well researched and some state-of-the-art achievements in this area are truly impressive, there are still challenges in real-life applications.

Challenges with gestures application #

We’ve recently had the opportunity to create a real-time classifier for webcam images. We created a neural network that could be trained to recognize gestures shown in front of the camera.

Some of the initial challenges in the project included an insufficient number of images for training and the lack of an explicit indication of the unknown category.

The former is a typical issue in deep learning projects: the training dataset isn’t large enough to cover every possibility. The lack of diverse data can be handled by data augmentation (described e.g. here), which involves generating variations of the photos that look as if they were shot from different perspectives, with different exposure and lighting.

The latter issue was trickier but no less critical. The classifier records continuously, but because the participant may not be showing any gesture, the predictions should return nothing most of the time. We thus needed to create a classifier that would not only recognize a gesture, but also be able to acknowledge that none is shown.

Following some research, we found that the most common solution to the problem was only a compromise: adding an extra category and training the classifier with random pictures from outside of the basic scope. This would require adding an endless number of pictures. We wanted to avoid that, as it does not sufficiently cover the concept of an unknown category.

We did some experiments and modifications that can cover the unknown category, which will be discussed in this article. You can also check out the notebook, which presents the experiments.

Why handling the unknown is important #

For a basic network which does not consider unknown as a category, all unidentified pictures will have to fall into one of the existing known categories (almost randomly). This is really bad, because the network could make confident predictions and still be wrong.

Here are two confusion matrices which show validation results of the same network. The first is for pictures taken from known categories only. The second is for pictures taken from unknown categories as well. This shows how deceptive it can be to validate only with categories from your training dataset.

Confusion matrix for the pictures taken from known categories

Confusion matrix for the pictures taken from known and unknown categories

Classifier loss function #

To address this issue, you should first know how the classifier understands the concept of a category and how the loss function makes it learn. If you are familiar with this you can skip to paragraph six.

To train the classifier, we first gather data arranged into categories. We pass this data through a neural network. Then the loss function is applied to the output and the corresponding category.

Categories are represented by numbers, so that, for example, the number 5 stands for the tiger.

The loss function is crucial. It’s usually a pure function that takes the prediction (network output) and target (corresponding category) vectors and returns the loss value. The loss value tells the network how far its output was from the correct category.

Standard approach with categorical cross-entropy #

Understanding the formula is not required to understand the concept (!)

For classification, the basic loss function is cross-entropy (CE). You can work it out later, but for now you only have to know that it compares the target vector tᵢ with the prediction vector sᵢ and returns a value that represents the difference. The function fits this case perfectly.

Cross-entropy returns a value that indicates the difference between the vectors
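For readers who do want the formula, the standard definition of categorical cross-entropy is shown below. It uses the same tᵢ (one-hot target) and sᵢ (softmax output) as above; the symbol C for the number of categories is introduced here only for illustration.

```latex
\mathrm{CE}(t, s) = -\sum_{i=1}^{C} t_i \log(s_i)
```

Because t is one-hot, only the term for the correct category survives, so the loss reduces to the negative logarithm of the probability the network assigned to that category.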

But first, both the prediction and the target have to be converted into vectors to fit into this function. To do so, the softmax function is applied to the prediction and one-hot encoding is applied to the target.

One-hot encoding simply takes a category represented by an integer and returns a vector in which all values are zero except the one at the index corresponding to this number, which is one (see the example below).

One-hot encoding converts the fourth category into a vector made of zeros, with a one at the fourth index
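One-hot encoding can be tried out directly in PyTorch. The sketch below is only an illustration: the number of categories (6) is an assumption, and because indices start at zero, the fourth category has index 3.

```python
import torch
import torch.nn.functional as F

# Assumed setup for illustration: 6 categories, numbered 0..5.
num_categories = 6

# The fourth category has index 3 because counting starts at zero.
target = torch.tensor([3])

# F.one_hot returns integers, so convert to float for use in a loss function.
one_hot = F.one_hot(target, num_classes=num_categories).float()
print(one_hot)  # tensor([[0., 0., 0., 1., 0., 0.]])
```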

Softmax is a function that converts a vector of arbitrary numbers into a vector of values between zero and one, so that the sum of all values is one. Thanks to this, each value of the converted vector can be treated as the probability that the result is a specific category.

Same here, knowing the formula is not required to understand the concept

The vector is squeezed into the range between 0 and 1 in such a way that the sum of the values equals 1

Please note the side effect: the largest value changes disproportionately to the others, taking the largest portion. As a result, the function can indicate only one category.
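The side effect is easy to see on a tiny example. The following is a minimal plain-Python sketch of softmax; the helper name softmax and the input values are chosen here for illustration only.

```python
import math

def softmax(values):
    """Exponentiate each value, then divide by the total so the results sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw network outputs for three categories.
raw = [1.0, 2.0, 5.0]
probs = softmax(raw)

print([round(p, 3) for p in probs])  # [0.017, 0.047, 0.936]
```

The value 5.0 is 62.5% of the sum of the raw values, yet it receives roughly 94% of the probability, which is why softmax effectively points at a single category.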

Change categorical to binary cross-entropy #

Our loss function, categorical cross-entropy, indicates which category is the most probable. But the network should be equally able to indicate multiple categories or none of them.

To allow this, you have to replace softmax with the sigmoid function and apply cross-entropy individually to each value, making the probabilities independent. Instead of determining which category is most probable, we determine, for each category, how probable it is that it occurred. This is called binary cross-entropy (BCE).

Each vector value yields its own loss, and the average is returned
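The per-value computation can be written out by hand and compared with the library function. This is a sketch with made-up logits and a made-up multi-label target, not code from the notebook.

```python
import torch
import torch.nn.functional as F

# Made-up raw outputs for one picture and four categories.
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0]])
# Made-up target: categories 0 and 2 are present at the same time.
target = torch.tensor([[1.0, 0.0, 1.0, 0.0]])

# Sigmoid squashes every value independently, so the values need not sum to 1.
probs = torch.sigmoid(logits)

# Cross-entropy applied to each value on its own.
per_value = -(target * probs.log() + (1 - target) * (1 - probs).log())

# The average of the per-value losses is the BCE loss.
manual_bce = per_value.mean()
library_bce = F.binary_cross_entropy(probs, target)

print(per_value)
print(manual_bce.item(), library_bce.item())
```

Both numbers printed on the last line should agree, since F.binary_cross_entropy averages the per-value losses by default.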

Add unknown category #

We divide categories from a dataset into three types.

Note the difference. A dataset category is a set of images tagged with the same label. A network category is a number that the network recognizes as a label.

First, known categories. These are the regular categories that the network recognizes; each picture from them should be predicted by the network as its own category.

Second, unknown categories. We’ve chosen a few additional categories which do not need to be classified separately. All pictures from them are labeled unknown and should be recognized by the network as unknown.

Third, unseen categories. They are similar to unknown categories and should also be recognized as such by the network. The images of these categories, however, are not used in training, only in validation. The network will have to recognize them as unknown even though it has not seen any analogous pictures before.

Categories types

To start handling pictures from unknown categories, we add just one category at the beginning of the category list and call it “na”.

The “na” category will be used to handle unknown pictures
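The split into three types, together with the single “na” label, can be sketched as a plain mapping from dataset categories to network labels. All category names below are invented for illustration and are not the ones used in the experiments.

```python
# All category names below are made up for illustration.
known = ["thumbs_up", "peace", "fist"]   # each gets its own network label
unknown = ["waving", "pointing"]         # used in training, labeled "na"
unseen = ["clapping"]                    # used only in validation, labeled "na"

NA_LABEL = 0  # the single "na" network category sits at index 0

# Known categories take the labels 1, 2, 3, ...
label_of = {name: i + 1 for i, name in enumerate(known)}
# Unknown and unseen dataset categories all collapse into "na".
label_of.update({name: NA_LABEL for name in unknown + unseen})

train_categories = known + unknown            # unseen is deliberately left out
valid_categories = known + unknown + unseen

print(label_of)
```

The point of keeping unseen categories out of train_categories is that validation then measures how the network treats pictures unlike anything it was trained on.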

The naive approach (which, in fact, may prove efficient if you don’t expect exceptional pictures) would be to convert this label to a one-hot vector and, as usual, pass it to the loss function. Pictures from unknown categories will then be learned, and similar pictures will be successfully marked as “na”.

The results are much better with this category, as most pictures from the unknown and unseen categories fall into “na”. But as shown below (first row), unseen pictures spread across other categories as well.

Confusion matrix for binary cross-entropy

Modify loss function to handle unknown category #

We’ve modified the loss function to consider unknown (and unseen) pictures a bit differently. Intuition suggests that the network should learn that, for an unknown picture, no category is probable at all. The network prediction for these pictures should thus be a vector of zeros.

During training, targets for unknown categories should also be changed to vectors of zeros (instead of vectors of zeros with the value 1 at the beginning).

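import torch.nn.functional as F

# input: raw network outputs, shape (batch, n_categories)
# target: integer category labels, shape (batch,)

# apply sigmoid and one-hot
input = input.sigmoid()
target = F.one_hot(target, input.shape[1]).float()

# set the "na" column (index 0) to zero, so "na" targets become all-zero vectors
target[:, 0] = 0

# finally compute the BCE loss
loss = F.binary_cross_entropy(input, target)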
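To see the effect of zeroing the “na” column, the steps above can be wrapped in a function and run on a toy batch. This is a minimal sketch: the name bce_na_loss, the na_idx parameter and the toy values are assumptions made for illustration and may differ from the notebook.

```python
import torch
import torch.nn.functional as F

def bce_na_loss(logits, target, na_idx=0):
    """BCE in which pictures labeled "na" get an all-zero target vector."""
    probs = logits.sigmoid()
    one_hot = F.one_hot(target, logits.shape[1]).float()
    one_hot[:, na_idx] = 0  # rows labeled "na" become all zeros
    return F.binary_cross_entropy(probs, one_hot)

# Toy batch: the first picture is "na" (label 0), the second is category 2.
target = torch.tensor([0, 2])

# The network gives low scores to every category for the "na" picture.
good_logits = torch.tensor([[-4.0, -4.0, -4.0], [-4.0, -4.0, 4.0]])
# The network confidently assigns the "na" picture to category 1.
bad_logits = torch.tensor([[-4.0, 4.0, -4.0], [-4.0, -4.0, 4.0]])

good_loss = bce_na_loss(good_logits, target)
bad_loss = bce_na_loss(bad_logits, target)
print(good_loss.item(), bad_loss.item())
```

The first loss is much smaller than the second: an all-low prediction is rewarded for an “na” picture, while a confident guess at a known category is penalized.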

Confusion matrix for modified binary cross-entropy

The results have improved. Unseen pictures now tend to be predicted as unknown more often.

With this modification the accuracy for known categories is a bit lower, but the accuracy for unseen categories has vastly improved, along with the overall accuracy (see the results in the notebook).

While the results still depend heavily on the number of images seen by the network, it is a step in the right direction.

Do not forget about metrics #

You should also modify metrics and other functions that interpret the network output. Metrics have to know that a vector with values close to zero indicates an unknown category.

Here is an example of a modified accuracy function. All vectors whose highest value is below the threshold are considered to fall into the “na” category.

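def accuracy(input, target, thresh=0.4, na_idx=0):
  # input is assumed to hold per-category probabilities (after sigmoid),
  # target holds integer category labels
  # highest probability and its index for each picture
  valm, argm = input.max(-1)

  # results below threshold are considered as "na" category
  argm[valm < thresh] = na_idx

  return (argm==target).float().mean()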
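A quick way to check the metric is to run it on a few hand-made probability vectors. The values below are invented for illustration, and the inputs are assumed to be sigmoid outputs rather than raw network outputs.

```python
import torch

def accuracy(input, target, thresh=0.4, na_idx=0):
    valm, argm = input.max(-1)
    # results below threshold are considered as "na" category
    argm[valm < thresh] = na_idx
    return (argm == target).float().mean()

# Made-up sigmoid outputs for three pictures and three network categories.
probs = torch.tensor([
    [0.05, 0.90, 0.10],   # clearly category 1
    [0.02, 0.10, 0.30],   # nothing reaches 0.4, so this counts as "na" (0)
    [0.01, 0.45, 0.20],   # category 1, just above the default threshold
])
target = torch.tensor([1, 0, 2])

acc_default = accuracy(probs, target)            # 2 of 3 correct
acc_low = accuracy(probs, target, thresh=0.2)    # 1 of 3 correct
print(acc_default.item(), acc_low.item())
```

With the lower threshold the second picture is no longer treated as “na”, so the result depends noticeably on the chosen threshold; it is worth tuning on validation data.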

Look at experiments #

The notebook contains the implementation, training and validation results. Everything was implemented in PyTorch with the fastai2 framework.

We performed three experiments to compare results.

  1. simple CE classification
  2. BCE classification with additional “na” category
  3. BCENa (modified) classification with additional “na” category

For the experiments, we used the Imagenette dataset, a subset of ImageNet that contains only 10 categories.

The code and notebook are available on my GitHub. You are welcome to try the code in your projects.

Any feedback, comments and questions are welcome.
