Data is the new oil. It’s a commodity that every company needs more of - and they need it now. “How much training data do you need to build your AI models?” is one of the most common questions in the field. The answer depends on what you’re trying to do. AI models usually require vast amounts of data to train, but some datasets are so limited in size that it can be hard to know where to start or what to do next.
In this blog post, we’ll explore how much data is needed for different types of AI use cases and show you some tips on how you can get more data and expand your limited datasets using data augmentation techniques.
Data in Machine Learning vs. Conventional Programming
Machine learning has already conquered various sectors thanks to its great problem-solving potential. Its algorithms let machines imitate intelligent human behavior, which is an asset in data science, image processing, natural language processing, and many other fields. Of course, applied machine learning cannot replace conventional programming. Still, it’s second to none for prediction and classification problems where it is not straightforward to define the rules that would make up the program. While traditional algorithms use hand-coded logic to come up with answers, ML algorithms use existing examples to build their own logic. This way, they learn from the given examples, producing more accurate results over time.
Setting the rigorous, rule-based approach aside makes it possible to solve complex problems efficiently, but it doesn’t come without a price. A machine learning algorithm doesn’t require rules, but it does need a lot of input data – and gathering enough may be a challenge. At this point, you may be wondering what “a lot” really means. It’s no mystery that ML processes are data-intensive, but how does that translate to your particular project? Whether you’re considering implementing AI in a future project or have already done it but are not satisfied with the result, the information below may come in handy.
But before delving into the issue of the dataset’s size, let’s break down the way the ML algorithms are trained.
A Couple of Words About Training Data
As their name suggests, ML algorithms learn, find patterns, and develop a gradual understanding of the problem as a result of exposure to data. Once you’ve gathered as much data as you can at a given point, you should divide it into training and testing sets. The general tendency is to split it in the proportion of 80% training data to 20% testing data. You’ll obviously need most of it for training, but don’t use all of it for that purpose. Testing data will allow you to verify the accuracy of the created model. You cannot validate a model using the data it already knows – only an unseen dataset will expose its potential weaknesses. During training you feed the model examples tagged with the desired output, and in the testing phase you verify whether its predictions are correct.
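To make the 80/20 split concrete, here is a minimal sketch using scikit-learn. The dataset is synthetic and the logistic regression model is just a placeholder; the point is that the 20% held-out portion is never shown to the model during training.

```python
# Minimal 80/20 train/test split sketch (synthetic data, placeholder model).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the samples so the model is evaluated on data it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on unseen test data:", accuracy_score(y_test, model.predict(X_test)))
```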
The training data may take different forms depending on the problem you’re trying to solve – it can be numerical data, images, text, or audio. It’s essential to do some cleaning or preprocessing by removing duplicated data and fixing structural errors. You may try to remove irrelevant data as well, but remember that in some cases (for instance, in stock market trend forecasting or other prediction-based processes), it’s hard to judge that relevance up front. In the end, your model will decide for you.
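A short pandas sketch of the kind of cleaning described above. The column names and values are hypothetical, invented only to show duplicate removal and fixing structural errors such as inconsistent casing and missing records.

```python
import pandas as pd

# Hypothetical raw dataset with duplicates and structural inconsistencies.
raw = pd.DataFrame({
    "ticker": ["AAPL", "aapl ", "MSFT", "GOOG"],
    "price":  [189.7, 189.7, 411.2, 170.1],
    "volume": [1000, 1000, 2500, None],
})

# Fix structural errors: normalize casing/whitespace so "aapl " and "AAPL" match.
raw["ticker"] = raw["ticker"].str.strip().str.upper()

# Remove duplicated rows and drop records with missing values.
clean = raw.drop_duplicates().dropna(subset=["volume"])
print(clean)
```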
How Do You Determine the Size of a Data Set Needed to Train AI?
How much data will you need? The answer depends on the type of task your algorithm is supposed to fulfill, the method you use to achieve AI, and the expected performance. In general, traditional machine learning algorithms don’t need as much data as deep learning models. A thousand samples per category is considered a minimum for the simplest machine learning algorithms, but in most cases it won’t be enough to solve the problem.
The more complex the problem, the more training data you should have. The number of data samples should be proportional to the number of model parameters. According to the so-called rule of 10, often used in dataset size estimation, you should have around 10 times more data samples than parameters. Of course, this rule is just a suggestion, and it may not apply to all projects (some deep learning algorithms perform well at a 1:1 ratio). Still, it’s useful when you’re trying to estimate the minimum size of your dataset. Note, however, that some variables, like the signal-to-noise ratio, may radically change this demand.
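The rule of 10 is just multiplication, but a tiny back-of-the-envelope sketch makes the scale obvious. The parameter counts below are illustrative, not taken from any specific model.

```python
# Back-of-the-envelope "rule of 10": roughly 10 samples per trainable parameter.
def rule_of_ten(n_parameters: int, ratio: int = 10) -> int:
    """Return the suggested minimum number of training samples."""
    return n_parameters * ratio

# A small logistic regression with 20 features has ~21 parameters (weights + bias).
print(rule_of_ten(21))        # -> 210 samples
# A modest neural network with 100k parameters.
print(rule_of_ten(100_000))   # -> 1,000,000 samples
```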
It’s worth remembering that quality should stay high regardless of the problem’s complexity. You may have heard that expanding the dataset should always be your primary goal, but in our experience it’s not worth putting quantity over quality. Both are equally important.
Supervised Methods vs. Unsupervised Methods and Data
Supervised learning and unsupervised learning are two distinct approaches to training ML algorithms. The first uses labeled data to come up with the most accurate results, while the other works with unlabeled datasets to learn patterns with no human support. Unsupervised learning is perfect in situations where you need to identify patterns in complex training data and labeling is almost impossible or too costly – for example in cybersecurity, or when direct in-text labels would be needed for natural language processing tasks. Supervised learning is much more common across industries since it brings more accurate results, even though data labeling makes it more costly and time-consuming.
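A small scikit-learn sketch of the contrast, using a synthetic dataset purely for illustration: the supervised model needs the label vector y, while the unsupervised one works on the features alone.

```python
# Same feature matrix, two paradigms: supervised learning needs labels,
# unsupervised learning finds structure without them.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: the labels y guide the training.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: KMeans discovers a grouping without any labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(clf.predict(X[:5]), clusters[:5])
```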
As you can see, in this case, your choice will not really determine the data quantity but its type and the labor input. The quantity is instead determined by how you create your machine learning model.
How Much Data Is Enough for Deep Learning?
As a subtype of machine learning that mimics the structure of the human brain, deep learning is much more capable of solving very complex problems, even when the data isn’t structured. That’s because a neural network can pick out the relevant features on its own, reducing the required human engagement. Everything comes at a cost, though.
Training neural networks takes much more time than training regular ML models, because the mechanism in which they process information is much more complex. A neural network uses artificial neurons to perform simple transformations which, stacked at scale, make up a complex problem-solving process. As a result, it needs incomparably more training data. And more data means more computational power to process it, which generates considerable costs.
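To see why even modest networks are data-hungry, here is a hedged PyTorch sketch that simply counts trainable parameters in a small, arbitrary fully connected network; combined with the rule of 10 mentioned earlier, the numbers grow quickly.

```python
# Count trainable parameters of a small, arbitrary fully connected network.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # ~218,000 parameters; the rule of 10 would suggest ~2.2M samples
```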
But coming back to the point – neural networks may require different sizes of datasets depending on the problem you’re trying to solve. For instance, the projects emulating human behavior in a sophisticated way (like advanced chatbots or robots) need millions of data points to give satisfactory results. But those that perform identification tasks – like image classification – should be fine with a few tens of thousands of data samples if their quality is good. That’s the standard number for commercial projects. On the other hand, simple ML algorithms serving, for instance, research purposes, will be fine with 10% of what you need for standard DL.
Too Much Data May Be Problematic as Well
It’s much more common to struggle with a lack of data than with an excess of it - but it can happen. In general, it’s practically impossible to have too much quality data. However, as the training dataset grows, it becomes much harder to maintain that quality, which may affect your results negatively. Moreover, although increasing the amount of data improves accuracy, it may stop paying off at some point. That’s the case, for example, with predictive models, which stop responding to more data after reaching a saturation point.
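One way to spot that saturation point is to plot a learning curve: accuracy flattens out as the training set grows. A minimal scikit-learn sketch, using synthetic data and a placeholder model just to show the mechanics:

```python
# Learning curve sketch: watch cross-validated accuracy as the training set grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} samples -> cross-validated accuracy {score:.3f}")
```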
Chasing quantity at all costs is an old-fashioned approach that is losing its staunch supporters as machine learning rapidly evolves. And it’s not always the best choice from the business perspective. As we’ve already mentioned, gathering and storing big data is costly, so it’s worth rethinking whether you really need it at the moment. Training the model on excessive amounts of data will also drive up the final cost of the infrastructure used.
If you keep receiving incorrect or inaccurate output in the testing phase, you may have fed your model too little data. That’s not the best news, since gathering data for training and testing purposes is often the most costly and time-consuming part of the AI implementation process. What can you do in such a situation to avoid additional costs? There are a few options you may consider. Once you make sure that the issue is not low quality (duplicated data, missing records, etc.) but quantity, you can turn to one of the following methods.
#1 Using Open-Source Data
Searching for open-source data is the first method we would recommend since it’s the least labor-intensive and completely free. Digging through the internet, you can find basically any data you could need, but remember to check the existing datasets first – it will save your team a lot of manual work. Recycling an already trained model is a common practice that saves time and money, so why not do the same with the data itself?
Of course, the chances of finding available data for fintech, manufacturing, or medical projects are much lower due to privacy issues. But for computer vision tasks like object detection and recognition, text language identification, or semantic analysis, you should be able to find plenty of sources. It’s worth remembering, though, that it’s essential to verify the license before using a particular dataset. While finding data for research will not be a problem, things get a little more complicated with commercial use.
Where to search for open-source datasets for machine learning projects? Some popular sources include Kaggle, Azure, AWS, and, most recognized of all, Google Datasets. Their repositories include both open-source and paid datasets.
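As an illustration of how little code reusing an open dataset can take, here is a hedged sketch that pulls a well-known public dataset from OpenML via scikit-learn. OpenML is just one example repository (not one of the sources listed above), and you should still check the license of whatever dataset you pick.

```python
# Fetch an existing open dataset instead of collecting your own.
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
print(mnist.data.shape, mnist.target.shape)  # (70000, 784) (70000,)
```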
#2 Data Augmentation
The internet is a goldmine for data, but it can disappoint you, especially if the problem you’re trying to solve with your classification model is niche. Then it’s time to roll up your sleeves and work with the limited training set you have at your disposal. Using a technique called data augmentation, you can expand your insufficient dataset without collecting new samples. This way, you’ll prevent the model from learning bad patterns caused by working with the same samples over and over again.
Data Augmentation Techniques and Examples
Minor modifications to your data samples are enough to perform data augmentation. Your model may be quite smart by AI standards, but it’s still far from human intelligence, so if you change an image just a little, the model will consider it a different data sample. How can you modify the data for augmentation purposes? You can go for the following (a short code sketch follows the list):
scaling
rotation
reflection
cropping
translation
adding Gaussian noise
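A minimal sketch of the transformations listed above, assuming the torchvision library and a PIL image as input (neither is mandated by the article); the exact ranges are illustrative and should be tuned to your own images.

```python
# Image augmentation pipeline covering the transformations listed above.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # cropping + scaling
    transforms.RandomHorizontalFlip(p=0.5),                        # reflection
    transforms.RandomRotation(degrees=15),                         # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),      # translation
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),   # Gaussian noise
])

# Applying the pipeline repeatedly to the same PIL image yields slightly
# different tensors, which the model treats as new samples.
```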
Another option is to use more advanced techniques such as cutout regularization, mixup, neural style transfer, or applying GANs to create completely new examples. It is worth noting here that data augmentation techniques can be applied to any type of data. They seem most straightforward and natural for image classification and recognition tasks, but augmentation can just as well be used to increase the size of the original dataset for numerical, tabular, time-series, and other types of data.
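Of the advanced techniques mentioned, mixup is simple enough to sketch in a few lines of NumPy: each synthetic sample is a convex combination of two real samples and their labels. The values below are made up, and the sketch assumes numerical features with soft (e.g. one-hot) labels.

```python
# Mixup sketch for numerical data: blend pairs of samples and their labels.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Return a convex combination of two samples and their labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_new, y_new = mixup(np.array([1.0, 2.0]), np.array([1.0, 0.0]),
                     np.array([3.0, 4.0]), np.array([0.0, 1.0]))
print(x_new, y_new)
```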
Using data augmentation, you can deliver the necessary data to the model at a limited cost. Done well, it can significantly improve the accuracy of the results. Is it always a good solution, though? What challenges does it bring? If you would like to read about this method and its techniques in a broader context, check our article covering the topic of training neural networks with a lousy dataset and limited sample size. It also covers other solutions, like data labeling and creating DIY datasets.
Summing Up
As you can see, there is no simple answer to the question of how much data you should gather for your AI project. The guidelines and methods listed above should help you estimate your needs and decide whether or not you should expand your dataset.
Estimation is much easier with an experienced partner who can look at your project from a different angle and provide valuable insights. If you’re looking for such support, don’t hesitate to contact us! We can help you carry out the AI projects from beginning to end and solve the issues with limited datasets cost-effectively.