Lousy Dataset? Tips for Training Your Neural Network

Kornel Dylski - May 20, 2019 - updated on July 21, 2024

Machine learning is often described as a way to make computers perform a task by showing them input data together with the expected results, instead of giving them explicit instructions. For image recognition, we provide the ML model with images together with their labels, so that the model eventually learns to recognize new images on its own. Machine learning models try to approximate the information contained in the data they receive so that they can use it later. When a dataset is vast enough, a model can learn most of the underlying distribution, and new samples are easily approximated. A poor dataset, on the other hand, results in a lousy approximation. Unfortunately, gathering and preparing data is usually the most expensive and time-consuming task, and sometimes you simply cannot afford to get more data.

Recently we were creating a neural network for image recognition. The training dataset was so limited that we knew a standard approach was doomed to failure. We didn’t have tons of data to train the NN, so to succeed we had to try something different. In this article, I describe the reasoning process behind our efforts. I would rather focus on general procedures than on specific libraries. I strongly encourage you to dig deeper into each approach if you find it useful.

When we can search the internet for data

The first technique, called transfer learning, comes from the idea that it is perfectly OK to recycle and reuse a model that has already been trained on a large, general dataset instead of training a model from scratch. For general cases, we can usually find a similar dataset and a model that is already trained on it. Because a model with randomly initialized parameters has a much smaller chance of approximating the solution, it is almost always better to start training from a pretrained model. I encourage you to search through model zoos and state-of-the-art results to draw on the experience of others.

Regardless of whether you found a pretrained network that suits your needs, you have to collect training data for your specific task. You should spend time with the search engine, even if you are able and ready to collect data manually. Maybe someone has already done this for you.

When collecting pictures, it’s easy to forget that the model cannot find a solution if the information it needs is not present in the pictures. Assume you want to classify mammals based on their pictures. The dataset needs to be carefully planned if you want the network to be well trained. You need additional photos to make the results less dependent on the style and source of the data, for example a sketch of an elephant or a picture of a cat at night. Additional categories may be useful if you expect the model to see objects other than mammals. Otherwise, the network has to assign an unrecognized object to one of the mammal categories. Next, do not forget about exceptions like dolphins. The model could easily be misled by the fact that their pictures almost always contain water. Perhaps the model should see a few dolphins in unusual environments to learn that it is the shape of the animal that makes it a dolphin, not the water around it. Also, a picture of a dolphin’s tail fin can look very different from a picture of its head. When it comes to pictures of objects, people or nature, almost everything has already been photographed, so you will quickly find many examples in Google Images to fill your dataset: dolphins from each side and many more. The same idea of trying to cover the entire distribution also applies to other, less obvious subjects.

Figure: a dolphin and a dog, both classified as a dolphin

When we have exhausted the internet data sources

Data augmentation is another useful process. It keeps the model from learning pictures by heart when it sees the same picture repeatedly. To make the model more aware of what’s in the picture, each picture is presented with random transformations. These transformations may include rotation, scaling and inverting the image, as well as adding noise and distortion.

In fact, we can use any transformation as long as the resulting image still unambiguously represents the same object. Data augmentation is a simple technique to ensure that the model does not pay too much attention to the specific location of the pixels.

Figure: a squirrel image with six random transformations applied
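As a rough illustration of such random transformations, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows with values in [0, 1]. The function name `random_augment` and the `noise_std` parameter are made up for this example, and only three transformations (mirroring, rotation by multiples of 90 degrees, and noise) are shown. A real project would more likely rely on an image or array library.

```python
import random


def random_augment(image, rng, noise_std=0.05):
    """Return a randomly transformed copy of a grayscale image (list of rows)."""
    out = [row[:] for row in image]

    # Mirror the image horizontally half of the time.
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]

    # Rotate clockwise by a random multiple of 90 degrees (0, 90, 180 or 270).
    for _ in range(rng.randrange(4)):
        out = [list(column) for column in zip(*out[::-1])]

    # Add Gaussian noise to every pixel and clip back to the [0, 1] range.
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, noise_std))) for px in row]
        for row in out
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    picture = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.1]]
    for _ in range(3):
        print(random_augment(picture, rng))
```

Each call produces a different variant of the same picture, so the network never sees exactly the same pixels twice during training.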

A step worth mentioning before data augmentation is scaling. When you use a pretrained network, you are usually tied to its input shape and size. Every picture has to be rescaled to fit the network, and if its proportions do not match, it also has to be cropped or padded. Cropping can lose information from the picture, so padding is the preferred option. What can fill the extra pixels? It may be a solid color, blurred edges or a mirror reflection of the picture’s sides. For instance, a black background works well as padding for space photos, while reflection does the job for landscape photos.

Figure: an elephant picture with different paddings; reflection seems to be the most natural here
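The two simplest padding options, a solid color and a mirror reflection, can be sketched as follows. This is only an illustration under stated assumptions: the image is a grayscale list of rows, the target is a square whose side equals the longer edge, and the names `reflect_index` and `pad_to_square` are invented for this example. Rescaling to the network's input size would still follow as a separate step.

```python
def reflect_index(i, n):
    """Map any integer index onto 0..n-1 by mirroring at the borders."""
    period = 2 * n
    i %= period
    return i if i < n else period - 1 - i


def pad_to_square(image, mode="reflect", fill=0.0):
    """Pad a grayscale image (list of rows) to a square without cropping.

    mode="reflect" mirrors the picture at its borders,
    any other mode fills the extra pixels with the solid value `fill`.
    """
    height, width = len(image), len(image[0])
    size = max(height, width)
    # Keep the original picture centred inside the square.
    top, left = (size - height) // 2, (size - width) // 2

    padded = []
    for y in range(size):
        row = []
        for x in range(size):
            src_y, src_x = y - top, x - left
            inside = 0 <= src_y < height and 0 <= src_x < width
            if inside or mode == "reflect":
                row.append(image[reflect_index(src_y, height)][reflect_index(src_x, width)])
            else:
                row.append(fill)
        padded.append(row)
    return padded


if __name__ == "__main__":
    tall = [[1, 2], [3, 4], [5, 6], [7, 8]]
    for row in pad_to_square(tall):
        print(row)
```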

Once the pictures are prepared, you can perform data augmentation. One technique that is not very common yet effective is cutout regularisation. The idea is to cut out a randomly placed rectangle from the picture. The network has to figure out what is missing, much as humans guess which puzzle piece is missing. An example using a GAN shows that a network can even restore the missing part. This may remind you of dropout, a commonly used regularisation technique in which some neurons are randomly switched off during training to prevent overfitting and improve results.

Figure: a lion picture with and without cutout (cutout regularisation)
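A minimal sketch of cutout, assuming a grayscale image stored as a list of rows. The function name `cutout`, the square shape of the removed patch and the `fill` value are choices made for this example only; the patch is kept fully inside the picture for simplicity.

```python
import random


def cutout(image, rng, size=2, fill=0.0):
    """Return a copy of the image with one randomly placed patch overwritten."""
    height, width = len(image), len(image[0])
    # The patch never exceeds the image, so it always fits inside it.
    patch_h, patch_w = min(size, height), min(size, width)
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)

    out = [row[:] for row in image]
    for y in range(top, top + patch_h):
        for x in range(left, left + patch_w):
            out[y][x] = fill
    return out


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    picture = [[1.0] * 6 for _ in range(4)]
    for row in cutout(picture, rng, size=2):
        print(row)
```

Because the patch lands in a different place each time, the network cannot rely on any single region of the picture.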

Another effective technique is mixup. In simple words, two images are blended together in some proportion, and the neural network is trained to recognize that proportion. For a picture that is a 50/50 blend of a dog image and a cat image, the network has to be told that the solution is a half-dog, half-cat picture.
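The blending can be sketched in a few lines. The example below assumes two grayscale images of the same size and one-hot labels stored as lists; the function names `mixup` and `sample_lambda` are invented for this illustration. Drawing the proportion from a Beta(alpha, alpha) distribution follows the mixup paper, and the default `alpha=0.2` is just an example value.

```python
import random


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with proportion `lam`."""
    mixed_image = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    # The label is blended with the same proportion as the pixels.
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label


def sample_lambda(rng, alpha=0.2):
    """Draw the mixing proportion from a Beta(alpha, alpha) distribution."""
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    dog, dog_label = [[1.0, 0.0]], [1.0, 0.0]
    cat, cat_label = [[0.0, 1.0]], [0.0, 1.0]
    print(mixup(dog, dog_label, cat, cat_label, lam=0.5))
    print(sample_lambda(random.Random(0)))
```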

Among the more sophisticated techniques used for augmentation are Generative Adversarial Networks. A GAN is a kind of network composed of two subnetworks. The first one (the generator) keeps trying to generate images from random noise. The second one (the discriminator) alternately processes images from the generator and real images, and judges which image is real and which one is generated. Both subnetworks are trained until the generator fools the discriminator and the generated images no longer stand out from the real ones. Generated images can then be used to enlarge the dataset. You can read more about it in the paper Low-Shot Learning from Imaginary Data, where this approach was used.

However, there are many cases when we are the first in the field, and there are no pretrained models yet. If there are also no datasets available, data must be collected manually, and it is not easy to speed up the process. For such a sparse dataset, data augmentation may not be enough, and creativity has to come into play.

When our dataset is still poor

If the current dataset is still not large enough to improve training, you may want to start labeling images. Images can be labeled manually by you or by others, for example through a poll. However, both options cost money and time. A branch of machine learning called semi-supervised learning can help here. Basically, a network works out its answer layer by layer. The last layer outputs the solution, while the layer before it outputs something that represents the features of the image.

Using methods such as t-SNE, you can place each picture on a plot according to its features. Pictures with similar features are plotted close to each other, and you can expect them to represent the same category. Thanks to this, you only have to categorize manually the pictures whose features are far from the others.

Then you can retrain the network using more labeled pictures. Labels generated this way are somewhat uncertain. Assuming that the network already recognizes trivial examples with a low error, you should accept generated labels only for pictures with a similarly low error.
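The idea of labeling by proximity can be sketched without any plotting. Note that the example below does not run t-SNE; it swaps the visual inspection of a plot for a direct distance check in feature space, which is a simplification. The function name `split_by_neighbour_distance`, the `radius` threshold and the toy feature vectors are all assumptions made for this illustration.

```python
import math


def split_by_neighbour_distance(features, labeled, radius):
    """Propagate labels to close neighbours and flag outliers for manual review.

    features: list of feature vectors, one per picture
    labeled:  dict mapping picture index -> known label
    radius:   maximum distance at which a known label is trusted
    """
    propagated, needs_review = {}, []
    for index, vector in enumerate(features):
        if index in labeled:
            continue
        # Find the closest picture that already has a label.
        nearest = min(labeled, key=lambda j: math.dist(vector, features[j]))
        if math.dist(vector, features[nearest]) <= radius:
            propagated[index] = labeled[nearest]
        else:
            needs_review.append(index)
    return propagated, needs_review


if __name__ == "__main__":
    features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
    known = {0: "cat", 2: "dog"}
    print(split_by_neighbour_distance(features, known, radius=0.5))
```

Only the pictures returned in the second list need a human look; the rest inherit the label of their nearest labeled neighbour.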
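Accepting only confident machine-generated labels can be sketched as a simple filter. The example assumes that the network outputs one list of class probabilities per unlabeled picture; the function name `select_pseudo_labels` and the `threshold` of 0.95 are illustrative choices, not values from our project.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Return (picture index, class index) pairs for confident predictions only."""
    accepted = []
    for index, probs in enumerate(probabilities):
        # The predicted class is the one with the highest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            accepted.append((index, label))
    return accepted


if __name__ == "__main__":
    predictions = [[0.98, 0.02], [0.60, 0.40], [0.01, 0.99]]
    print(select_pseudo_labels(predictions))
```

Pictures that do not pass the threshold stay unlabeled and can be revisited after the next round of training.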

DIY - make your own dataset

When there is no way to gather more pictures and typical transformations are not enough, you may consider creating a fully custom generator. If you think about it, creating a sufficient generator should be possible.

Implementing a generator is fairly straightforward. Take as an example a network for recognizing objects packed into a box. The arrangement of the objects or the shape of the box does not play a significant role in recognition. Every object and every box can be cut out separately. Once you have a set of separate objects and boxes, you need to write and tune a generator that, for each new picture, takes one box as the background and places a few of the objects on it. You should think about how to place them naturally and how objects should overlap each other. Thanks to this, an infinite number of pictures can be created by combining the boxes with objects. For a dataset with tens rather than thousands of pictures, it is a blessing.

In some products, a more advanced generator is desired. It is very interesting to employ a rendering engine that can reproduce a real object. Creating a realistic picture requires a lot of talent and precision. Usually, you will be limited to basic models that can be created and rendered in a relatively short time. Fortunately, with domain adaptation, poor reproductions can be used as well. A GAN can be trained to erase the differences between a source domain (rendered pictures) and a target domain (real pictures). This time the generator subnetwork has to deceive the discriminator until the differences between pictures from the two domains that represent the same object cease to affect the classification.
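A toy version of such a generator might look like the sketch below. It assumes that boxes and objects are small grayscale images stored as lists of rows, that `None` marks a transparent pixel in an object, and that every object fits inside the box. The names `generate_sample`, `min_objects` and `max_objects` are hypothetical, and placement is purely random here, whereas a tuned generator would place objects more naturally.

```python
import random


def generate_sample(box, objects, rng, min_objects=1, max_objects=3):
    """Compose one synthetic picture: a box with a few objects pasted on top.

    box:     background image (list of rows)
    objects: dict mapping object name -> sprite (None means transparent)
    Returns the picture and the names of the objects placed on it.
    """
    canvas = [row[:] for row in box]
    height, width = len(canvas), len(canvas[0])

    count = rng.randint(min_objects, min(max_objects, len(objects)))
    chosen = rng.sample(sorted(objects), count)

    for name in chosen:
        sprite = objects[name]
        top = rng.randint(0, height - len(sprite))
        left = rng.randint(0, width - len(sprite[0]))
        # Later objects overwrite earlier ones, which gives simple overlapping.
        for dy, row in enumerate(sprite):
            for dx, px in enumerate(row):
                if px is not None:
                    canvas[top + dy][left + dx] = px
    return canvas, chosen


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only to make the demo repeatable
    box = [[0] * 6 for _ in range(6)]
    objects = {"bolt": [[1, 1], [1, 1]], "nut": [[2, None], [2, 2]]}
    picture, labels = generate_sample(box, objects, rng)
    for row in picture:
        print(row)
    print(labels)
```

Since the generator knows which objects it placed, every synthetic picture comes with its labels for free.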

Conclusion

A truth nobody can deny is that a big enough dataset is crucial for successful machine learning. However, because the necessary data is often lacking, many parallel topics are being developed to reduce the effort required to provide it. I can’t wait to see further progress in that area. And of course, if you, reading this article, have something to add, please share!

About the author

Kornel Dylski

Software Engineer

Kornel is a frontend engineer with several years of experience building robust web applications. Apart from web solutions, he participates in machine learning projects. He has always been interested in physics, which led him to explore artificial intelligence and programming languages such as Python.
His focus is on solving technical problems and providing data-driven solutions to clients' needs. He has a creative spirit and loves to make people laugh or smile while working together on complex issues.

This article is a part of the Zero Legacy series.
