Transfer learning in practice. Image classification for hotel images with the fast.ai library

Kornel Dylski
February 21, 2019

For some time now, we have been focusing on expanding our skills in AI and Machine Learning, as these are undoubtedly some of the hottest topics in software development. After some delving into deep learning, we decided to create a simple, yet useful application.

Image classification is a well-known task in deep learning, but there is still plenty of room for new projects. Many projects at nexocode are related to the travel and leisure industry, so we looked for a need in this particular domain that could be addressed with ML algorithms. We noticed that booking engine sites handle massive amounts of hotel images, some of low quality or of little interest to travelers, but the main issue is that the content of each picture is not described. There is no automatic way to classify the images or to recognize the ones that are unrelated to the offer. It’s not uncommon to have to do the classification manually, and this is where Machine Learning and Artificial Intelligence can turn data into a competitive advantage. We decided to classify photos related to the hotel industry, which can have many future uses.

To achieve the goal, we used fast.ai, a library that raises the level of abstraction over PyTorch. It is relatively new, but it already supports good practices and keeps up with advancements in deep learning. It is provided by Jeremy Howard together with his (excellent) course, and it is very suitable for prototyping neural networks. It allowed us to focus on decisions rather than coding.

To perform any machine learning, we need to collect and prepare data, and for image classification, we need a dataset of already classified images. In our case, these were hotel-related images. After some research, I found several ways to collect them.

First, there are always plenty of free-to-use datasets. Kaggle is a great source, as are the MIT collections and many others. For our task, the indoor scene recognition dataset fits pretty well. It needed some preprocessing, and it had many more categories than necessary. When it comes to training data, it is always easier to have more than required, as you can exclude what you don’t need. The second option is simply to use Google Images. It has an open API, and a convenient library for it is available on GitHub. The first hundred results for each keyword (category) are usually quite accurate. If you need more, you may have to take a brief look at them and exclude the inaccurate images. There is also a third way: paying for crowdsourced labeling, e.g., on Amazon Mechanical Turk. In our case, we used the first two options.

I want to mention here that preparing the categories and the whole dataset is quite an important task. In our case, the categories should be mutually exclusive. It’s easy to choose two categories that could both contain the same image. What is the difference between a closet and a pantry? We know that a closet can be a pantry. But the network doesn’t know.

Then we have to split the dataset into two parts: a training part, on which we teach the model, and a validation part, used to measure its performance. A 7 to 3 ratio is appropriate. Here is a piece of advice: set a random seed before you start. If you don’t and you resume training later, your split will be different, and images from the old training set can end up in the new validation set. The model then looks better than it is, and overfitting goes unnoticed (if you don’t have a third dataset to check against, you won’t see it). Overfitting is when your network “memorizes” the training images instead of learning to generalize from them.
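A minimal sketch of a reproducible 7 to 3 split, using only the Python standard library. The helper `split_dataset`, the seed value and the file names are hypothetical and chosen for illustration; the article does not show how the split was actually made.

```python
import random

def split_dataset(paths, valid_fraction=0.3, seed=42):
    """Shuffle a copy of `paths` with a fixed seed and split it into (train, valid)."""
    ordered = sorted(paths)        # sort first so the input order cannot change the split
    rng = random.Random(seed)      # private generator, global random state is untouched
    rng.shuffle(ordered)
    n_valid = round(len(ordered) * valid_fraction)
    return ordered[n_valid:], ordered[:n_valid]

# Hypothetical file names, for illustration only.
image_paths = [f"img_{i:03d}.jpg" for i in range(10)]
train_paths, valid_paths = split_dataset(image_paths)
```

Because the seed is fixed and the paths are sorted before shuffling, calling the function again in a later session yields the same two lists, so no validation image can leak into training.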

We should have a roughly similar number of images per category, and we should remember to normalize the images (the library can handle it).
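As a rough illustration of what normalization means here, the sketch below standardizes a single RGB pixel channel by channel. It assumes the widely used ImageNet channel statistics; the article does not say which statistics were applied, and `normalize_pixel` is a hypothetical helper rather than a library function.

```python
# Commonly quoted per-channel ImageNet statistics (assumed, not taken from the article).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))

black = normalize_pixel((0, 0, 0))
average = normalize_pixel(tuple(m * 255 for m in IMAGENET_MEAN))  # close to (0, 0, 0)
```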

ResNet is a neural network model from 2015, now with many variants and improvements, but the basic version is still “good enough” for almost every common case. A whole new article would be needed to describe this model properly, but briefly, it is a model that allows adding more layers and going deeper without degrading the results (it mitigates the vanishing gradient problem).

Now the critical functionality: transfer learning. When we create a neural network from scratch, all weights are initialized with zeros or some random values, and adjusting them for image recognition takes much more time than we have. To save time, we use a network that has already been trained on images with standard categories. Only a few top layers of weights are responsible for choosing the proper class. The layers below recognize elements at a lower level of abstraction, e.g. gradients, curves, lines, pixels, etc. So we only have to retrain the top layers from scratch and slightly adjust the others.
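One way to picture “retrain the top layers, slightly adjust the others” is to give each group of layers its own learning rate, smallest for the earliest layers. The sketch below only computes such rates. The geometric spacing, the helper `per_group_lrs` and the factor of 100 between the first and last group are assumptions made for illustration, not something the article specifies.

```python
def per_group_lrs(head_lr, n_groups, lowest_fraction=0.01):
    """Geometrically spaced learning rates, smallest for the earliest layer group."""
    if n_groups == 1:
        return [head_lr]
    step = lowest_fraction ** (1.0 / (n_groups - 1))
    # Group 0 (earliest layers) gets head_lr * lowest_fraction, the last group gets head_lr.
    return [head_lr * step ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = per_group_lrs(head_lr=1e-2, n_groups=3)   # roughly [1e-4, 1e-3, 1e-2]
```

With rates like these, the pretrained early layers barely move while the new classification head learns at full speed.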

Before training, we have to choose some parameter values. The most important is the learning rate (lr). To achieve faster convergence (and maybe better accuracy), the learning rate should change during training. The library encourages us to set the learning rate in accordance with the following shape.

Learning rate change during training

We only have to choose the top value, and the shape will be applied automatically.
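The exact curve is not spelled out in the text, so the following is only a sketch of a schedule with that general shape: a warm-up to the chosen top value followed by a decay. The cosine ramps, the 30% warm-up and the starting divisor of 25 are assumptions for illustration, and `one_cycle_lr` is a hypothetical helper.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, warmup_fraction=0.3, start_div=25.0):
    """Learning rate at `step` (0-based) for a warm-up-then-decay schedule."""
    start_lr = max_lr / start_div
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        progress = step / warmup_steps
        # Cosine ramp from start_lr up towards max_lr.
        return start_lr + (max_lr - start_lr) * (1 - math.cos(math.pi * progress)) / 2
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    # Cosine decay from max_lr down to zero at the last step.
    return max_lr * (1 + math.cos(math.pi * progress)) / 2

schedule = [one_cycle_lr(s, 100, max_lr=1e-2) for s in range(100)]
```

The only value that has to be picked by hand is `max_lr`; everything else follows from the shape.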

Here is what a simple run looks like:

Our model is ResNet-50, pretrained on ImageNet. We are training it for 10 epochs. But before that, to find a proper learning rate (lr), we need to run:

Learning rate finder function

The graph shows how the loss changes for different learning rates. We choose a value where the loss is still decreasing. Here it is ~10^-2.
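Reading the value off the plot can also be approximated in code. The sketch below applies one common heuristic, which is an assumption here and not necessarily what the library does: pick the learning rate at which the loss falls fastest per decade. The sweep values are made up for illustration, and `suggest_lr` is a hypothetical helper.

```python
import math

def suggest_lr(lrs, losses):
    """Return the learning rate at which the loss falls fastest per unit of log10(lr)."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:      # more negative means a steeper descent
            best_lr, best_slope = lrs[i], slope
    return best_lr                  # None if the loss never decreases

# Made-up sweep, for illustration only: the loss falls, flattens, then diverges.
sweep_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
sweep_losses = [2.9, 2.8, 2.3, 1.4, 1.2, 4.0]
chosen = suggest_lr(sweep_lrs, sweep_losses)
```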

This is how the results look after training for 10 epochs with a learning rate of 10^-2:

Training results

This is a plot of the losses for the training set and the validation set. The network performs much worse on the validation set. It overfits.

To prevent overfitting, we have some quick solutions such as dropout, weight decay regularization, and data augmentation.

Dropout is a technique in which a random subset of units (their activations) is switched off at each training step, which forces the remaining ones to be more versatile. The network tends to generalize better, which reduces overfitting. It is also available in the library, but we won’t use it now.
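A small sketch of the idea in plain Python, using the common “inverted dropout” formulation in which unit outputs are zeroed at random and the survivors are rescaled so the expected value stays the same. This is a generic illustration, not the library’s implementation, and the `dropout` helper is hypothetical.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)    # dropout is a no-op at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)              # fixed seed so the example is repeatable
hidden = [1.0] * 8
dropped = dropout(hidden, p=0.5, rng=rng)
```

A different random subset is switched off on every call, so no single unit can be relied upon during training.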

Weight decay is a regularization technique in which the training algorithm adds a penalty that grows with the magnitude of the weights, so they are discouraged from drifting far from zero in either direction. The intent is to keep the weights small and centered around zero.
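In its simplest form this amounts to one extra term in the update rule. The sketch below shows a plain SGD step with an L2-style decay term; the function name and the hyperparameter values are illustrative assumptions.

```python
def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One SGD update in which each weight is also pulled towards zero by weight_decay * w."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [2.0, -3.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]        # pretend the data loss gives no signal at all
shrunk = sgd_step(weights, zero_grads)
```

Even with zero gradients from the data, every weight moves a little towards zero, which is exactly the penalty at work.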

Data augmentation is a technique in which images are randomly transformed, e.g. rotated, zoomed or skewed, so the arrangement of pixels changes, but the meaning and the labels stay the same. It is also available in the library.
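To make this concrete, here is a toy sketch that treats an image as a list of pixel rows and applies two simple random transforms, a horizontal flip and a crop at a random offset. These stand in for the rotations, zooms and skews mentioned above; the `augment` helper and the 4x4 “image” are purely illustrative.

```python
import random

def augment(image, rng, crop=1):
    """Randomly mirror `image` (a list of pixel rows) and crop it at a random offset."""
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]      # horizontal flip
    height, width = len(image), len(image[0])
    top = rng.randint(0, crop)                    # random offset acts like a small shift
    left = rng.randint(0, crop)
    return [row[left:left + width - crop] for row in image[top:top + height - crop]]

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 "image"
label = "lobby"                                             # the label is left untouched
variants = [augment(image, rng) for _ in range(5)]
```

Each variant rearranges the pixels slightly, while `label` is reused unchanged for all of them.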

And finally, we can always increase the size of the dataset or change the number of epochs.

Training improved results

We can see some improvement, but overall the network has a higher loss and lower accuracy for both sets. Both weight decay and dropout, if too strong, push the network’s performance down. For now, there is no other way to tune them than empirically…

Training final results

Now, it looks much better. In the end, it’s overfitting a bit, but overall, both sets are improving.

Results are:

metric      value
train loss  0.703716
valid loss  1.000399
accuracy    70.4%

Further training doesn’t improve much.

Further training

We tried to tune it in different ways, but it still doesn’t get anywhere near 96%, the level we can reach when we train the network to recognize cats vs. dogs. Where does the difference come from?

A confusion matrix can show us how our network classifies the different categories.

Confusion matrix

The network classifies most categories correctly (on the diagonal), but there are some shortcomings.

99 - front_hotel as outdoor
81 - outdoor as front_hotel
49 - reception as lobby
44 - bar as restaurant

Intuitively, some bars can easily be confused with restaurants, and a reception with a lobby. Front hotel and outdoor are definitely badly chosen categories. After excluding the unnecessary classes, we can achieve up to 76% accuracy. That still leaves about 20 percentage points missing compared with the 96% mentioned above.
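Counts like the ones listed above can be tallied from pairs of actual and predicted labels. The sketch below does this with the standard library only. The class names are borrowed from the article, but the tiny label lists are made up for illustration and `most_confused` is a hypothetical helper.

```python
from collections import Counter

def most_confused(true_labels, predicted_labels, top=3):
    """Return the most frequent (actual, predicted, count) mistakes, largest first."""
    pairs = Counter(zip(true_labels, predicted_labels))
    errors = [(actual, pred, n) for (actual, pred), n in pairs.items() if actual != pred]
    return sorted(errors, key=lambda e: (-e[2], e[0], e[1]))[:top]

# Tiny made-up example: the counts here are not the article's results.
actual    = ["bar", "bar", "bar", "lobby", "reception", "reception", "restaurant"]
predicted = ["restaurant", "restaurant", "bar", "lobby", "lobby", "reception", "restaurant"]
mistakes = most_confused(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```

Sorting the off-diagonal entries this way makes overlapping categories stand out immediately, which is how pairs such as bar and restaurant can be spotted.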

Let’s look at the worst misclassification:

croquettes labeled as 'lobby'

It’s not a lobby, but neither is it a restaurant, as it was originally labeled. We could correct some wrong labels or remove unrelated images, and we would gain another 2–5%.

But what should the network actually do with a picture of croquettes? We can’t just add another category labelled “unknown”, because an “unknown dog” is very different from an “unknown jetpack.”

This is a wide field for further research, and I will describe it in the next article.

Check out the live version of our application here

Hotel image recognition lab
