What is Data Augmentation in Machine Learning - Part I
Discover what data augmentation is in machine learning, and how it can help solve the problem of imbalanced or insufficient datasets.
Most two-year-old children can recognize a cat after having seen only a few of them. This ability that humans (and many animals) have, to recognize things after observing them once or only a handful of times, has long perplexed Artificial Intelligence researchers. Until recently, machines could not come close to matching the ability of humans or animals to recognize objects visually. Today, AI algorithms can match (and on some benchmarks surpass) humans’ ability to recognize objects in pictures, but to do so, they still need a very large amount of data.
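For example, AlexNet (named after its lead author, Alex Krizhevsky), the neural network whose 2012 results set off the current wave of progress in image recognition, learned from ImageNet, a collection of images annotated by humans. The full ImageNet collection holds a whopping 14 million images covering around 20,000 different object categories - about 700 images per category on average. AlexNet itself was trained on a competition subset of roughly 1.2 million images spanning 1,000 categories, and models that match human accuracy on that benchmark only appeared a few years later.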
Does this mean that if you don’t have very large datasets, the doors of AI are closed to you? Fortunately, it does not. There is a “magic trick” for projects with limited datasets, called data augmentation.
Purpose of data augmentation
To train a machine learning model, you need plenty of data examples. Together, these examples form a training dataset that enables your machine learning model to learn. The collected dataset should be representative, and should cover all the cases you want the AI model to handle. These requirements are often a key challenge.
Data collection: a bottleneck in many projects
Data collection is the process of gathering numerous examples of the data adapted to your objective. The goal for all data collection is to capture as much evidence as possible.
In many projects, this can be a big bottleneck. There are two main reasons why collecting data can be difficult. Firstly, you may lack existing datasets for some specific AI applications. For example, you may want to build a product recommender engine for a new application where customer data has not been collected yet.
Secondly, you may lack specific types of data or sufficient labeled data.
In these cases, data augmentation can be used.
General principles of data augmentation
What is data augmentation
Data augmentation is the process of creating new data samples by manipulating the original data.
Objectives of data augmentation
There are two main objectives for data augmentation. The first is to fix an imbalanced dataset, meaning one where you have too many examples of some targets and not enough of others. Data augmentation helps you put some balance back into the dataset by creating additional examples of the under-represented targets.
The second objective is to enlarge a dataset that is too small, so that your model has more examples to learn from. Depending on the target, your data augmentation can range from small, individual alterations to a complete transformation of the data.
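To make the first objective more concrete, here is a minimal sketch of rebalancing by augmentation. It assumes you already have some label-preserving `augment` function for your data; the `balance_by_augmentation` and `jitter` helpers and the toy numeric samples are hypothetical names chosen for illustration, not part of any library.

```python
import random
from collections import Counter

def balance_by_augmentation(samples, labels, augment, seed=0):
    """Add augmented copies of under-represented classes until every class
    has as many examples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Originals of this class, used as starting points for augmented copies.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            new_samples.append(augment(rng.choice(pool), rng))
            new_labels.append(label)
    return new_samples, new_labels

def jitter(value, rng):
    # Hypothetical augmentation: nudge a numeric sample by a small random amount.
    return value + rng.uniform(-0.1, 0.1)

samples = [1.0, 1.2, 0.9, 1.1, 5.0]
labels = ["cat", "cat", "cat", "cat", "dog"]
balanced_samples, balanced_labels = balance_by_augmentation(samples, labels, jitter)
print(Counter(balanced_labels))
```

The originals are always kept, and only the missing examples are generated, so the majority class is left untouched.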
Let’s take the example of an algorithm for which you need a lot of images. You can start with a limited set of data, and make it more diverse by transforming these images: mirroring, resizing, cropping, and more. This way, data augmentation increases the diversity of the data available for training AI models.
As a result, you can solve the problem of not having enough data samples.
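The image transformations mentioned above can be sketched in a few lines. To keep the example free of dependencies, it assumes an image is simply a list of rows of pixel values; in a real project you would more likely rely on an image library. The function names are illustrative.

```python
def mirror(image):
    """Flip an image left to right."""
    return [row[::-1] for row in image]

def crop(image, top, left, height, width):
    """Keep a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def resize_nearest(image, new_height, new_width):
    """Resize by picking, for each new pixel, the nearest original pixel."""
    old_height, old_width = len(image), len(image[0])
    return [
        [image[r * old_height // new_height][c * old_width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]

# A toy 4 x 4 "image" with one number per pixel.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# One original image becomes three additional training samples.
augmented_images = [
    mirror(image),
    crop(image, top=1, left=1, height=2, width=2),
    resize_nearest(image, new_height=2, new_width=2),
]
```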
Examples per type of data
Images are a great way to illustrate data augmentation. To train an algorithm to recognize a dog in an image, you need a training dataset that contains many different images of dogs. If too many of your dog images look the same, you can enrich your training dataset with data augmentation to avoid overfitting.
One way of enriching your training dataset is simply to flip (mirror) the picture of a dog. The output is still a picture of a dog, so the target label of the augmented image stays the same as that of the original: a dog.
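The point about the label can be shown with a small sketch. It assumes a dataset stored as (image, label) pairs, with each image as a list of pixel rows; `add_mirrored_copies` is a hypothetical helper name.

```python
def add_mirrored_copies(dataset):
    """Return the dataset plus a mirrored copy of every image.
    Each mirrored copy keeps the label of the image it came from."""
    mirrored = [([row[::-1] for row in image], label) for image, label in dataset]
    return dataset + mirrored

labeled_images = [([[0, 1], [2, 3]], "dog")]
doubled = add_mirrored_copies(labeled_images)
```

This works because a mirrored dog is still a dog. For images where orientation carries meaning, such as handwritten text, that assumption would need to be checked first.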
For textual data, suitable augmentations also exist: you can replace words with synonyms, vary the punctuation, and so on.
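As a rough illustration of synonym replacement, the sketch below uses a tiny hand-written synonym table. Both the `SYNONYMS` table and the `replace_synonyms` function are assumptions made for this example; a real project would draw on a much larger lexical resource.

```python
import random

# Hypothetical, hand-written synonym table used only for this example.
SYNONYMS = {
    "small": ["little", "tiny"],
    "dog": ["hound", "pooch"],
    "quick": ["fast", "speedy"],
}

def replace_synonyms(sentence, probability=0.5, seed=0):
    """Swap each known word for a random synonym with the given probability."""
    rng = random.Random(seed)
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < probability:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)

original = "the quick small dog barked"
variant = replace_synonyms(original, probability=1.0)
print(variant)
```

Calling the function with different seeds yields different variants of the same sentence, all of which keep the original meaning and therefore the original label.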
If you are dealing with tabular data, you can introduce changes in specific variables.
Depending on your objective, possible transformations include adding noise, applying equalizer transformations, applying cut-offs, and so on.
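Adding noise to tabular data could look like the sketch below. It assumes rows are stored as dictionaries and that Gaussian noise proportional to each value is acceptable for the chosen columns; the `add_noise` function, the `scale` parameter, and the customer records are all made up for illustration.

```python
import random

def add_noise(rows, columns, scale=0.05, seed=0):
    """Return copies of the rows with Gaussian noise added to the chosen
    numeric columns. The noise standard deviation is `scale` times the
    absolute value of each entry."""
    rng = random.Random(seed)
    noisy_rows = []
    for row in rows:
        noisy = dict(row)  # copy, so the original row is left untouched
        for column in columns:
            noisy[column] = row[column] + rng.gauss(0.0, scale * abs(row[column]))
        noisy_rows.append(noisy)
    return noisy_rows

# Made-up customer records.
customers = [
    {"age": 34, "income": 52000.0, "segment": "A"},
    {"age": 51, "income": 78000.0, "segment": "B"},
]

# Keep the originals and append one noisy copy of each row.
augmented_rows = customers + add_noise(customers, columns=["income"])
```

Only the listed columns are perturbed. Categorical variables and the target (here, `segment`) are copied unchanged, so each synthetic row keeps the label of the row it came from.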
In our second article, Data Augmentation, Part II, we will dig deeper into the techniques for data augmentation.