How Quality Data Sets Help Machines Learn

2021-03-10

Just like the fuel that runs the engine of a vehicle, a data set is a key ingredient for solving a given problem using machine learning or deep learning algorithms. This article takes a quick look at open source data sets for ML and DL, with a focus on TensorFlow and PyTorch.

A lot of research is under way in machine learning, and many applications are being built with it. Machine learning (ML) algorithms automatically transform data into useful representations for the task at hand. These operations could be coordinate changes, linear projections (linear regression), translations (SVM), transformations (PCA), and so on. These algorithms are not usually creative. They simply search a hypothesis space for a useful transformation.

Deep learning (DL) is a subfield of machine learning that emphasises learning in successive layers to get increasingly meaningful representations of data. The ‘deep’ in deep learning refers to the number of such layers, that is, the depth of the model.

Supervised machine learning problems are broadly of two kinds: regression and classification. Suppose we are required to predict the cost of a house. Here, we need a numeric value, so this is a regression problem. However, sometimes we need to take a “Yes” or “No” decision. This is a binary classification problem. At other times, we need to predict one of several classes. Consider a problem where we are required to identify an animal as a cat, dog, horse, etc. This is a multi-class classification problem.

Whenever we get the data for a problem for supervised learning, we get attributes and labels. Attributes are independent variables with which we must predict the labels or dependent variables. Attributes are called inputs, predictors, features, or independent variables. Labels are called outputs, targets, outcomes, or dependent variables. If the labels are categorical, the problem becomes a classification problem; if these are numerical, it becomes a regression problem.
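As a small illustration of the distinction above, the sketch below stores a toy data set as attribute rows plus a label column and guesses the problem type from the labels. The helper name problem_type and all sample values are made up for this example. The rule it applies is only a heuristic: integer-coded class labels (0, 1, 2) would be mistaken for regression targets, so in practice the decision comes from knowing what the labels mean.

```python
from numbers import Number


def problem_type(labels):
    """Guess the supervised task from the labels (illustrative heuristic only)."""
    # bool is a subclass of int in Python, so exclude it explicitly:
    # True/False labels are categories, not quantities.
    if all(isinstance(y, Number) and not isinstance(y, bool) for y in labels):
        return "regression"
    n_classes = len(set(labels))
    if n_classes == 2:
        return "binary classification"
    return "multi-class classification"


# Attributes (inputs): area in square feet and number of bedrooms (hypothetical).
house_features = [[1200, 2], [1800, 3], [2400, 4]]

# Three possible label columns for three different problems (hypothetical).
house_prices = [150000.0, 210000.0, 320000.0]  # numerical labels
loan_approved = ["Yes", "No", "Yes"]           # two categories
animals = ["cat", "dog", "horse"]              # more than two categories

print(problem_type(house_prices))   # regression
print(problem_type(loan_approved))  # binary classification
print(problem_type(animals))        # multi-class classification
```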

Importance of data sets for ML and DL

Machine learning or deep learning tries to learn patterns from the data available using algorithms. Data is the core of any ML/AI algorithm. Training data for ML is a key input to the algorithm that learns from it, and memorises the information for future prediction. Training data is the backbone of the entire machine learning project, without which it is not possible to train a machine that learns from humans. Raw data collected from various sources needs to be labelled or annotated so that it can be used to predict in the case of supervised machine learning.

Availability of such annotated data is very important to train and validate the algorithms. The quality and size of the data set are very important for training the learning method in order to obtain good results and high performing models. Raw data cannot be directly used for training; it needs to be cleaned. If the data is not good, a considerable amount of time needs to be spent pre-processing it so that it can be used. Data preprocessing takes a lot of time and effort to generate a quality data set. Hence, data sets play a predominant role in the training of machine learning and deep learning models.

A quality data set saves the time and effort needed for machine learning and deep learning researchers and data scientists. A few years back everyone would generate data from different sources required for the learning algorithms. There were no standard data sets available and these were stored in private and public locations of the user’s choice. It was difficult to generalise the insights, make developments to learning algorithms, and enable data sets to be explored by wider audiences. As machine learning and deep learning technologies advanced, their frameworks started providing built-in data sets as part of their packages. This would enable users to access similar data sets, save the effort and time required for data preprocessing, and direct their knowledge to building high performing and complex models that provided advanced insights. Today, machine learning and deep learning frameworks like Scikit-learn, TensorFlow, and PyTorch (among many others) provide built-in data sets for almost all types of data like text, image, audio and video, with sufficient samples for training, validation and testing of models.

Libraries with built-in data sets

Lots of libraries nowadays come with built-in data sets that can be used to train machine learning algorithms. Built-in data sets prove to be very useful when it comes to practising ML algorithms — you need some random, yet sensible data to apply the techniques and get your hands dirty. Many modules in Python contain some common data sets similar to the popular ‘Iris’ data, MNIST digits data and Boston housing price data set.

Scikit-learn comes with a few small standard data sets that do not require one to download any file from an external website. Similarly, TensorFlow provides the TensorFlow Datasets (TFDS) package. PyTorch also provides a lot of data sets for use in machine learning and deep learning. There are many data sets available but in this article, we will only discuss the usage of TensorFlow and PyTorch data sets.
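Although the rest of this article concentrates on TensorFlow and PyTorch, the Scikit-learn case mentioned above is short enough to sketch. The example assumes Scikit-learn is installed and uses its load_iris function, which reads a small file shipped inside the package.

```python
from sklearn.datasets import load_iris

# load_iris reads a small file bundled with scikit-learn; nothing is downloaded.
X, y = load_iris(return_X_y=True)

print(X.shape)                  # (150, 4): 150 flowers, 4 attributes each
print(y.shape)                  # (150,): one class label per flower
print(sorted(set(y.tolist())))  # [0, 1, 2]: three species, a multi-class problem
```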

Types of data sets available

There are different types of data sets available, as listed below. The type of data set used depends on the problem at hand.

Audio data: This is basically useful in machine learning and deep learning tasks such as speech recognition and emotion recognition. Some audio data is also used for identification of diseases like Alzheimer’s and Covid-19 (the audio data of people coughing can be used to identify Covid-19). The speech data of people talking is also used to classify those with Alzheimer’s disease.

Image data: This is used for classification and object detection. Image data available with TensorFlow such as COCO and Wider-face can be used for object detection. Wider-face is a data set of images with pictures of people, and it can be used in deep learning to identify faces. Image data can also be used for deep learning algorithms such as segmentation, which can be used for self-driving cars.

Text data: Text data can be used for various natural language processing tasks such as movie review classification, fake news detection, summarisation, answering questions, topic detection, transcript summarisation, and detection of action items from emails, etc. An example of text data is the collection of email messages sent by employees of Enron Corporation.

Video data: Video data can be used for object detection in videos and for video segmentation. An example is the DAVIS (Densely Annotated VIdeo Segmentation) data set from TFDS (TensorFlow Datasets).

Translation data sets: Labelled data is also available for translation. These data sets can be used for machine translation tasks for different languages.

Given below are a few of the standard built-in data sets that are provided as part of the deep learning and machine learning frameworks. Each framework provides a superset of these data sets. The data sets listed below showcase the diversity provided by these frameworks so that users can exploit them to their advantage for building models catering to their domain.

■ CelebA (CelebFaces Attributes) is a large-scale face attributes data set with more than 200,000 celebrity images, each with 40 attribute annotations. It can be used for face attribute recognition, face detection and landmark (or facial part) localisation.

■ CIFAR-10 (Canadian Institute for Advanced Research) data set consists of 60,000 32x32 colour images, categorised into 10 classes with 6000 images per class. It can be used for image classification and computer vision tasks.

■ Cityscapes data set consists of 25,000 samples of video sequences recorded in street scenes, with pixel-level annotations. It can be used for classification and object detection.

■ COCO (Common Objects in Context) data set consists of images of daily scenes of common objects in their natural context. It has around 2.5 million labelled object instances spread over roughly 330,000 images, and can be used for object recognition, classification and caption generation.

■ Fashion-MNIST (Modified National Institute of Standards and Technology) data set consists of images similar to the MNIST data set but from fashion product databases. It has 60,000 training images and 10,000 test images, and can be used for image classification.

■ HMDB (Human Motion Database) is collected from a variety of sources, most of which are movies. But a small proportion has also been obtained from public databases such as the Prelinger archive, YouTube, and Google videos. It has around 7000 clips divided into 51 action categories, each containing at least 101 clips, and can be used for object recognition and action detection.

■ Kinetics is a collection of large-scale data sets of URL links of up to 650,000 video clips that cover various human action classes, depending on the data set version. The videos include human-object interactions and human-human interactions. Each action class has at least 400/600/700 video clips, depending on the version. Each clip is annotated, and can be used for object recognition and action detection.

■ LSUN data set contains around one million labelled images for each of the 10 scene categories and 20 object categories. It can be used for understanding scenes with many ancillary tasks like room layout estimation, saliency prediction, etc.

Table 1 lists a few more large data sets that are available today.

Using the TensorFlow data sets (TFDS)

TFDS provides a collection of ready-to-use data sets for TensorFlow and other Python machine learning frameworks.

The following categories of data sets are available in TFDS: audio, image, image classification, object detection, questions and answers, structured, summarisation, text, translate, video and vision language. Each of these data sets can be used for a variety of machine learning and deep learning tasks.

Installation of these data sets can be done from two packages available with TFDS.

pip install tensorflow-datasets: This is the stable version, released every few months.

pip install tfds-nightly: Released every day, it contains the latest versions of the data sets.

We can import the data sets using the following commands:

All data set builders are subclasses of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders().

The following example, taken from tensorflow.org, explains the usage of the TensorFlow data set MNIST.

1) Load MNIST

Load with the following arguments.

■ shuffle_files: The MNIST data is only stored in a single file, but for larger data sets with multiple files on disk, it’s a good practice to shuffle them when training.

■ as_supervised: Returns a tuple (img, label) instead of a dict {‘image’: img, ‘label’: label}.

    (ds_train, ds_test), ds_info = tfds.load(
        'mnist',
        split=['train', 'test'],
        shuffle_files=True,
        as_supervised=True,
        with_info=True,
    )

2) Build training pipeline

Apply the following transformations.

■ ds.map: TFDS provides the images as tf.uint8, while the model expects tf.float32; so normalise the images.

■ ds.cache: As the data set fits in memory, cache before shuffling for better performance.

■ ds.shuffle: For true randomness, set the shuffle buffer to the full data set size.

Note: For bigger data sets that do not fit in memory, a standard value is 1000 if your system allows it.

■ ds.batch: Batch after shuffling to get unique batches at each epoch.

■ ds.prefetch: It’s a good practice to end the pipeline by prefetching for performance.

3) Build evaluation pipeline

The testing pipeline is similar to the training pipeline, with a small difference: there is no ds.shuffle() call.

Caching is done after batching (as batches can be the same between epochs).

4) Create and train the model

Plug the input pipeline into Keras:
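Steps 2 to 4 can be sketched end to end as shown below. To keep the example runnable without downloading anything, it uses randomly generated 28x28 images as a stand-in for the MNIST splits returned by tfds.load. The helper names (normalize_img, make_train_pipeline, make_eval_pipeline, fake_split), the batch size and the small dense network are choices made for this illustration, and a model trained on random data will of course learn nothing useful.

```python
import numpy as np
import tensorflow as tf


def normalize_img(image, label):
    """Convert uint8 pixels to float32 values in [0, 1]."""
    return tf.cast(image, tf.float32) / 255.0, label


def make_train_pipeline(ds, num_examples, batch_size=128):
    ds = ds.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.cache()                # the data set fits in memory
    ds = ds.shuffle(num_examples)  # shuffle buffer = full data set size
    ds = ds.batch(batch_size)      # batch after shuffling
    return ds.prefetch(tf.data.AUTOTUNE)


def make_eval_pipeline(ds, batch_size=128):
    ds = ds.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)      # no shuffle for evaluation
    ds = ds.cache()                # cache after batching
    return ds.prefetch(tf.data.AUTOTUNE)


rng = np.random.default_rng(0)


def fake_split(n):
    """Random stand-in for an MNIST split: uint8 images and labels 0-9."""
    images = rng.integers(0, 256, size=(n, 28, 28, 1), dtype=np.uint8)
    labels = rng.integers(0, 10, size=(n,), dtype=np.int64)
    return tf.data.Dataset.from_tensor_slices((images, labels))


ds_train = make_train_pipeline(fake_split(512), num_examples=512)
ds_test = make_eval_pipeline(fake_split(128))

# A small fully connected classifier with one output per digit class.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history = model.fit(ds_train, epochs=1, validation_data=ds_test, verbose=0)
print(sorted(history.history.keys()))
```

With real data, ds_train and ds_test from the tfds.load call in step 1 would be passed to the two pipeline functions in place of fake_split, with num_examples taken from ds_info.splits['train'].num_examples.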

Using the data sets of PyTorch

Like any other ML or DL framework, PyTorch also has built-in data sets that can be explored for various applications. To perform ETL (extract, transform and load) on a given data set, PyTorch provides the two main classes given below.

■ Dataset: This is an abstract class representing a data set.

■ DataLoader: This is a Python iterable over the data set; it wraps a data set and provides access to the underlying data.

The Dataset abstract class has two methods, __len__() and __getitem__(), which need to be implemented for custom data sets by extending this class. A data set can then be passed to a DataLoader object to load multiple samples in parallel using PyTorch’s multiprocessing workers.

The details of how to create a new data set and use it in PyTorch are given below, using the classes discussed above.

To generate a new data set, for example ‘NewDataset’, it needs to be extended from the built-in abstract class torch.utils.data.Dataset. After that, the __init__(), __len__() and __getitem__() functions need to be overridden and the corresponding implementation needs to be provided. A sample code to illustrate the defining of the NewDataset class using the Dataset abstract class of PyTorch is given below:
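A minimal sketch of such a class follows. It is not the article's original listing; the constructor arguments (features and labels held in memory) and the sample values are assumptions made for illustration. A real data set would more likely read files or records from disk inside __getitem__().

```python
import torch
from torch.utils.data import Dataset


class NewDataset(Dataset):
    """Toy data set holding feature rows and one label per row."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        # Total number of samples in the data set.
        return len(self.labels)

    def __getitem__(self, index):
        # Return one (attributes, label) pair.
        return self.features[index], self.labels[index]


dataset = NewDataset([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], [0, 1, 0])
print(len(dataset))  # 3
x, y = dataset[1]
print(x, y)          # tensor([2., 3.]) tensor(1)
```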

Now the DataLoader class of PyTorch provides an interface to use the data set generated above from the Dataset class.

The parameters most commonly passed to the DataLoader class, and their purpose, are given below.

batch_size: This denotes the number of samples contained in each generated batch; generally, this is a multiple of 8 (8, 32, 64, 128…).

shuffle: When this is set to True, the samples are reshuffled at every epoch, so that the batches differ from one epoch to the next. This allows the model to be more robust during training. Generally, it is set to True for training. Setting it to False yields the same batches, in the same order, in every epoch.

num_workers: This denotes the number of worker processes that generate batches in parallel; with more workers, data loading on the CPU can overlap with model training, up to the limit set by the available CPU cores.
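The three parameters above can be tried out with the sketch below. It uses PyTorch's built-in TensorDataset as a stand-in for a custom Dataset subclass, and the ten-sample data and the values in params are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten samples with two features each and alternating 0/1 labels.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10) % 2
dataset = TensorDataset(features, labels)

params = {
    "batch_size": 4,    # samples per batch; the last batch may be smaller
    "shuffle": True,    # reshuffle the samples at every epoch
    "num_workers": 0,   # 0 loads in the main process; raise to add workers
}
loader = DataLoader(dataset, **params)

print(len(loader))  # 3 batches per epoch: 4 + 4 + 2 samples

for epoch in range(2):
    for batch_features, batch_labels in loader:
        print(epoch, tuple(batch_features.shape), batch_labels.tolist())
```

When num_workers is greater than 0 in a standalone script, it is safer to put the loop inside an if __name__ == "__main__": block, because on some platforms the worker processes re-import the main module.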

Data is important for machine learning and deep learning algorithms to perform better. Quality data is what is needed to get a good performing model. Acquiring data, cleaning the data for quality and annotating it is a tedious task, and consumes a considerable amount of time if started from scratch. Hence, pre-built data sets save time and allow one to think about improving the learning methods rather than concentrating on acquiring data. Most of the machine learning and deep learning frameworks or platforms, therefore, provide built-in data sets as part of the package. These data sets generally come in all flavours to be used for various applications like classification, object recognition, detection, segmentation, caption generation, sentiment analysis, emotion detection and action detection. Moving a step ahead, a few frameworks like PyTorch have improved the way the data set is handled for building models. This has significantly helped data scientists to save time, build high performing models, develop insights, and make considerable progress in deep learning and machine learning technologies.

Image source: https://www.freepik.com/

Table 1: Large data sets
