Data is the new oil. It’s a commodity that every company needs more of - and they need it now. “How much training data do you need to build your AI models?” is one of the most common questions in the field. The answer depends on what you’re trying to do. AI models usually require vast amounts of data to train, but some datasets are so limited in size that it can be hard to know where to start or what to do next.
In this blog post, we’ll explore how much data is needed for different types of AI use cases and show you some tips on how you can get more data and expand your limited datasets using data augmentation techniques.
Data in Machine Learning vs. Conventional Programming
Machine learning has already conquered various sectors thanks to its great problem-solving potential. Its algorithms let machines imitate intelligent human behavior, which is an asset in data science, image processing, natural language processing, and many other fields. Of course, applied machine learning cannot replace conventional programming. Still, it’s second to none for prediction and classification problems where it is not straightforward to define the rules that would make up the program. While traditional algorithms use hand-coded logic to come up with answers, ML algorithms use existing examples to build their own logic. This way, they learn from the given examples, producing more accurate results over time.
Setting the rigorous, rule-based approach aside makes it possible to solve complex problems efficiently, but it doesn’t come without a price. A machine learning algorithm doesn’t require rules, but it does need a lot of input data – and gathering enough may be a challenge. At this point, you may be wondering what “a lot” really means. It’s no mystery that ML processes are data-intensive, but how does that translate to your particular project? Whether you’re considering implementing AI in a future project or have already done it but are not satisfied with the result, the information below may come in handy.
But before delving into the issue of the dataset’s size, let’s break down the way the ML algorithms are trained.
A Couple of Words About Training Data
As their name suggests, ML algorithms learn, find patterns, and develop a gradual understanding of the problem as a result of exposure to data. Once you’ve gathered as much data as you can at a given point, you should divide it into training and testing sets. The general tendency is to split it in the proportion of 80% training data to 20% testing data. You’ll obviously need most of it for training, but don’t use all of it for that purpose. Testing data will allow you to verify the accuracy of the created model. You cannot validate a model using the data it already knows – only an unseen dataset will expose its potential weaknesses. During training you feed the model examples tagged with the desired output, and in the testing phase you verify whether its predictions are correct.
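To make the 80/20 split concrete, here is a minimal sketch using scikit-learn. The dataset is synthetic and the logistic regression model is just a placeholder; the point is that the 20% held-out portion is never shown to the model during training.

```python
# Minimal 80/20 train/test split sketch (synthetic data, placeholder model).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the samples so the model is evaluated on data it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on unseen test data:", accuracy_score(y_test, model.predict(X_test)))
```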
The training data may take different forms depending on the problem you’re trying to solve – it can be numerical data, images, text, or audio. It’s essential to do some cleaning or preprocessing by removing duplicated data and fixing structural errors. You may try to remove irrelevant data as well, but remember that in some cases (for instance, in stock market trend forecasting or other prediction-based processes), it’s hard to judge that relevance up front. In the end, your model will decide for you.
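A short pandas sketch of the kind of cleaning described above. The column names and values are hypothetical, invented only to show duplicate removal and fixing structural errors such as inconsistent casing and missing records.

```python
import pandas as pd

# Hypothetical raw dataset with duplicates and structural inconsistencies.
raw = pd.DataFrame({
    "ticker": ["AAPL", "aapl ", "MSFT", "GOOG"],
    "price":  [189.7, 189.7, 411.2, 170.1],
    "volume": [1000, 1000, 2500, None],
})

# Fix structural errors: normalize casing/whitespace so "aapl " and "AAPL" match.
raw["ticker"] = raw["ticker"].str.strip().str.upper()

# Remove duplicated rows and drop records with missing values.
clean = raw.drop_duplicates().dropna(subset=["volume"])
print(clean)
```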
How Do You Determine the Size of a Data Set Needed to Train AI?
How much data will you need? The answer depends on the type of task your algorithm is supposed to fulfill, the method you use to achieve AI, and the expected performance. In general, traditional machine learning algorithms don’t need as much data as deep learning models. A thousand samples per category is considered a minimum for the simplest machine learning algorithms, but in most cases it won’t be enough to solve the problem.
The more complex the problem, the more training data you should have. The number of data samples should be proportional to the number of model parameters. According to the so-called rule of 10, often used in dataset size estimation, you should have around 10 times more data samples than parameters. Of course, this rule is just a suggestion, and it may not apply to all projects (some deep learning algorithms perform well at a 1:1 ratio). Still, it’s useful when you’re trying to estimate the minimum size of your dataset. Note, however, that some variables, like the signal-to-noise ratio, may radically change this demand.
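The rule of 10 is just multiplication, but a tiny back-of-the-envelope sketch makes the scale obvious. The parameter counts below are illustrative, not taken from any specific model.

```python
# Back-of-the-envelope "rule of 10": roughly 10 samples per trainable parameter.
def rule_of_ten(n_parameters: int, ratio: int = 10) -> int:
    """Return the suggested minimum number of training samples."""
    return n_parameters * ratio

# A small logistic regression with 20 features has ~21 parameters (weights + bias).
print(rule_of_ten(21))        # -> 210 samples
# A modest neural network with 100k parameters.
print(rule_of_ten(100_000))   # -> 1,000,000 samples
```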
It’s worth remembering that quality should stay high regardless of the problem’s complexity. You may have heard that expanding the dataset should always be your primary goal, but in our experience it’s not worth putting quantity over quality. Both are equally important.
Supervised Methods vs. Unsupervised Methods and Data
Supervised learning and unsupervised learning are two distinct approaches to training ML algorithms. The first uses labeled data to come up with the most accurate results, while the other works with unlabeled datasets to learn patterns with no human support. Unsupervised learning is perfect in situations where you need to identify patterns in complex training data and labeling is almost impossible or too costly – for example in cybersecurity, or when direct in-text labels would be needed for natural language processing tasks. Supervised learning is much more common across industries since it brings more accurate results, even though data labeling makes it more costly and time-consuming.
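A small scikit-learn sketch of the contrast, using a synthetic dataset purely for illustration: the supervised model needs the label vector y, while the unsupervised one works on the features alone.

```python
# Same feature matrix, two paradigms: supervised learning needs labels,
# unsupervised learning finds structure without them.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: the labels y guide the training.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: KMeans discovers a grouping without any labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(clf.predict(X[:5]), clusters[:5])
```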
As you can see, in this case, your choice will not really determine the data quantity but its type and the labor input. The quantity is instead determined by how you create your machine learning model.
How Much Data Is Enough for Deep Learning?
As a subtype of machine learning that mimics the structure of the human brain, deep learning is much more capable of solving very complex problems, even when the data isn’t structured. That’s because a neural network can pick out the relevant features on its own, reducing the required human engagement. Everything comes at a cost, though.
Training neural networks takes much more time than training regular ML models, because the mechanism in which they process information is much more complex. A neural network uses artificial neurons to perform simple transformations which, stacked at scale, make up a complex problem-solving process. As a result, it needs incomparably more training data. And more data means more computational power to process it, which generates considerable costs.
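To see why even modest networks are data-hungry, here is a hedged PyTorch sketch that simply counts trainable parameters in a small, arbitrary fully connected network; combined with the rule of 10 mentioned earlier, the numbers grow quickly.

```python
# Count trainable parameters of a small, arbitrary fully connected network.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # ~218,000 parameters; the rule of 10 would suggest ~2.2M samples
```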
But coming back to the point – neural networks may require different sizes of datasets depending on the problem you’re trying to solve. For instance, the projects emulating human behavior in a sophisticated way (like advanced chatbots or robots) need millions of data points to give satisfactory results. But those that perform identification tasks – like image classification – should be fine with a few tens of thousands of data samples if their quality is good. That’s the standard number for commercial projects. On the other hand, simple ML algorithms serving, for instance, research purposes, will be fine with 10% of what you need for standard DL.
Too Much Data May Be Problematic as Well
It’s much more common to struggle with a lack of data than with an excess of it - but it can happen. In general, it’s practically impossible to have too much quality data. However, as the training dataset grows, it becomes much harder to maintain that quality, which may affect your results negatively. Moreover, although increasing the amount of data improves accuracy, it may stop paying off at some point. That’s the case, for example, with predictive models, which stop responding to more data after reaching a saturation point.
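One way to spot that saturation point is to plot a learning curve: accuracy flattens out as the training set grows. A minimal scikit-learn sketch, using synthetic data and a placeholder model just to show the mechanics:

```python
# Learning curve sketch: watch cross-validated accuracy as the training set grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} samples -> cross-validated accuracy {score:.3f}")
```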
Chasing quantity at all costs is an old-fashioned approach that is losing its staunch supporters as machine learning rapidly evolves. And it’s not always the best choice from the business perspective. As we’ve already mentioned, gathering and storing big data is costly, so it’s worth rethinking whether you really need it at the moment. Training the model on excessive amounts of data will also drive up the final cost of the infrastructure used.
If you keep receiving incorrect or inaccurate output in the testing phase, you may have fed your model too little data. That’s not the best news, since gathering data for training and testing purposes is often the most costly and time-consuming part of the AI implementation process. What can you do in such a situation to avoid additional costs? There are a few options you may consider. Once you make sure that the issue is not low quality (duplicated data, missing records, etc.) but quantity, you can turn to one of the following methods.
#1 Using Open-Source Data
Searching for open-source data is the first method we would recommend since it’s the least labor-intensive and completely free. Digging through the internet, you can find basically any data you could need, but remember to check the existing datasets first – it will save your team a lot of manual work. Recycling an already trained model is a common practice that saves time and money, so why not do the same with the data itself?
Of course, the chances of finding available data for fintech, manufacturing, or medical projects are much lower due to privacy issues. But for computer vision tasks like object detection and recognition, text language identification, or semantic analysis, you should be able to find plenty of sources. It’s worth remembering, though, that it’s essential to verify the license before using a particular dataset. While finding data for research will not be a problem, things get a little more complicated with commercial use.
Where to search for open-source datasets for machine learning projects? Some popular sources include Kaggle, Azure, AWS, and, most recognized of all, Google Datasets. Their repositories include both open-source and paid datasets.
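As an illustration of how little code reusing an open dataset can take, here is a hedged sketch that pulls a well-known public dataset from OpenML via scikit-learn. OpenML is just one example repository (not one of the sources listed above), and you should still check the license of whatever dataset you pick.

```python
# Fetch an existing open dataset instead of collecting your own.
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
print(mnist.data.shape, mnist.target.shape)  # (70000, 784) (70000,)
```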
#2 Data Augmentation
The internet is a goldmine for data, but it can disappoint you, especially if the problem you’re trying to solve with your classification model is niche. Then it’s time to roll up your sleeves and work with the limited training set you have at your disposal. Using a technique called data augmentation, you can expand your insufficient dataset without collecting new samples. This way, you’ll prevent the model from learning bad patterns caused by working with the same samples over and over again.
Data Augmentation Techniques and Examples
Minor modifications to your data samples are enough to perform data augmentation. Your model may be quite smart by AI standards, but it’s still far from human intelligence, so if you change an image just a little, the model will consider it a different data sample. How can you modify the data for augmentation purposes? You can go for the following (a short code sketch follows the list):
scaling
rotation
reflection
cropping
translation
adding Gaussian noise
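A minimal sketch of the transformations listed above, assuming the torchvision library and a PIL image as input (neither is mandated by the article); the exact ranges are illustrative and should be tuned to your own images.

```python
# Image augmentation pipeline covering the transformations listed above.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # cropping + scaling
    transforms.RandomHorizontalFlip(p=0.5),                        # reflection
    transforms.RandomRotation(degrees=15),                         # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),      # translation
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),   # Gaussian noise
])

# Applying the pipeline repeatedly to the same PIL image yields slightly
# different tensors, which the model treats as new samples.
```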
Another option is to use more advanced techniques such as cutout regularization, mixup, neural style transfer, or applying GANs to create completely new examples. It is worth noting here that data augmentation techniques can be applied to any type of data. They seem most straightforward and natural for image classification and recognition tasks, but augmentation can just as well be used to increase the size of the original dataset for numerical, tabular, time-series, and other types of data.
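Of the advanced techniques mentioned, mixup is simple enough to sketch in a few lines of NumPy: each synthetic sample is a convex combination of two real samples and their labels. The values below are made up, and the sketch assumes numerical features with soft (e.g. one-hot) labels.

```python
# Mixup sketch for numerical data: blend pairs of samples and their labels.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Return a convex combination of two samples and their labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_new, y_new = mixup(np.array([1.0, 2.0]), np.array([1.0, 0.0]),
                     np.array([3.0, 4.0]), np.array([0.0, 1.0]))
print(x_new, y_new)
```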
Using data augmentation, you can deliver the necessary data to the model at a limited cost. Done well, it can significantly improve the accuracy of the results. Is it always a good solution, though? What challenges does it bring? If you would like to read about this method and its techniques in a broader context, check our article covering the topic of training neural networks with a lousy dataset and limited sample size. It also covers other solutions, like data labeling and creating DIY datasets.
Summing Up
As you can see, there is no simple answer to the question of how much data you should gather for your AI project. The guidelines and methods listed above should help you estimate your needs and decide whether or not you should expand your dataset.
Estimation is much easier with an experienced partner who can look at your project from a different angle and provide valuable insights. If you’re looking for such support, don’t hesitate to contact us! We can help you carry out the AI projects from beginning to end and solve the issues with limited datasets cost-effectively.