In this month’s column, we discuss a few more questions on machine learning and deep learning.
As in the last couple of months, this month’s column continues with interview questions, focusing on topics in machine learning. Let us start off with a simple question.
1. Let us assume that you have a neural network in which there is only an input layer (with a single input X being a number) and an output layer with a single output neuron. There are no hidden layers. You can assume any kind of output layer activation, be it linear, sigmoid, softmax, tanh or ReLU. The output needs to be 0 for X < 6 and 1 for X > 6. How would you design your neural network?
2. Is it possible for a classifier to have both high bias and high variance? In that case, would the training set error be greater than the test set error or vice versa? How would you recognise that a classifier has both high bias and high variance?
3. You are working on developing a part of a large image classification system of indoor household images taken from different cameras placed inside the house. Such a system can be part of a smart home project or used for home security. Since many of your customers have pet cats, your system needs to be able to recognise cats. You have designed a machine learning system that can distinguish cat images from non-cat images. Note that the household cameras being used in the setup are quite cheap, and can end up producing poor quality images due to variations in lighting, camera motion, etc. Since there are zillions of cat images on the Internet, you have decided to train your system using pet images from the Internet, which are typically high quality. You have downloaded 100,000 labelled pet images from the Internet to train your cat-recognition classifier. You have also obtained 5000 home-camera generated images, which may include cat images as well. You are planning to use 2500 of these 5000 images as the validation set for tuning the hyperparameters. The remaining 2500 images would be used as the test set for final evaluation. After you have finished developing the initial model and are about to start the next step, which is the hyperparameter tuning exercise, your friends tell you that your methodology is incorrect because your training set and validation set come from different data distributions and, hence, it is likely that your chosen model would perform badly on the test set. Would you agree with them? If not, explain why. If yes, explain how you would change your methodology. Note that you cannot increase the amount of indoor household camera image data, which is only 5000 images, of which a few hundred are cat images.
4. You are designing a classifier which will analyse data relating to Bengaluru traffic fines and build a classification system which will predict whether the fine for a particular violation will be paid or not. The input data set has information on the offender, such as the name, residential street name, house number, locality pin code, whether the person is a repeat offender, age, gender, information regarding what type of traffic violation occurred, the date of violation, where it happened, information regarding the fine such as the amount, any late fee, administrative charges, etc. Note that the different variables in the input data set are of different data types, with the type of traffic violation represented in string format, the residential street number being a numerical quantity, and the amount of the fine being a floating point number. The target variable you are predicting is a binary variable which is 1 if the fine will be paid, or 0 if the fine will not be paid. You have carefully analysed the data set, and have included both the age of the offender and the residential street number as features in your classifier. Remember that both of these are numeric quantities. While you are confident of your system working well, your friends suggest that using the ‘street number’ as the numeric feature is incorrect. They suggest that it has to be combined with the ‘street name’ and the combined quantity needs to be encoded as a ‘categorical’ variable. Do you agree with your friends? If yes, explain why. If not, explain why you believe their reasoning is incorrect.
5. Let us consider Question (3) from above. You decided to add 3000 images from the indoor cameras to the training set of 100,000 Internet pet images. You then decided to use 1000 images from the indoor cameras as the validation/development set for hyperparameter tuning. The remaining 1000 indoor camera images will be used as the test set. Given that only 3000 images in your training set are actually from the application for which you are building your classifier, you still have the issue of training and validation/test data distributions being different. If your trained model does not do well on the validation/development data set, how would you find out whether it is because of under-fitting or not being able to generalise well? Essentially, can you come up with a method of splitting your input data set so that you can find and fix the issue of the classifier under-fitting to the training data?
6. One way to address the issue mentioned in Question (5) above is to split the training set into two parts, with one part being used for training and the other part, known as the training-dev set, being used to evaluate whether the classifier can fit well to data coming from the same distribution as the training data. Note that this ‘training-dev’ split is needed only when the training data distribution and validation/test data distribution are different. The purpose of the ‘training-dev’ data set is to help determine whether the classifier can generalise well to unseen held-out data coming from the same distribution as the training data. You are told that the training error is 4 per cent and the training-dev error is 9 per cent. You have built your classifier using an N-layer feed-forward neural network. Now how would you improve the performance of the classifier on the training-dev set? Would you choose to increase the number of hidden layers or would you choose to increase the size of the training data? Explain the rationale behind your choice.
7. In last month’s column, we discussed full-batch gradient descent, mini-batch gradient descent and stochastic gradient descent for the optimisation of the cost function of neural networks. In stochastic gradient descent and mini-batch gradient descent, the optimisation process can lead to local oscillations as it tries to move towards the minimum. This can result in the algorithm taking a larger number of steps to converge. One way of addressing this issue is to modify the gradient descent algorithm to include momentum. Gradient descent with momentum, RMSProp and Adam are all variants which address this issue. Basically, instead of updating the parameters using only the gradients computed in the current iteration, a weighted average of the current gradients and the gradients from previous iterations is used for the update. Do you need gradient descent with momentum if you are told that your cost function is convex?
8. You are planning to use mini-batch gradient descent with momentum. Should you keep the momentum term larger or smaller during the initial part of the training? Explain the rationale behind your choice.
9. Consider the data set presented in Question (4) about the Bengaluru traffic fine payment prediction problem. This data set has a number of features which are on different scales. Your friends suggest that you should normalise the inputs first. Can you explain the process of input normalisation? Does it improve prediction accuracy? If yes, explain how. If not, what is the reason behind normalising the inputs to a neural network? We all know that weights W in a feed-forward neural network should not be initialised to zero. Can you explain why? Given this constraint, how would you decide the initialisation of the weights? Can you use any random values?
10. You have been asked to set up a deep learning system that can recognise faces of employees and send signals to open the entrance gate to the laboratory as the employee approaches the gate. You have heard a lot about the advantages of end-to-end deep learning. In this case, if you decide to go with the end-to-end deep learning approach, you only provide the image of the employee approaching the gate and the neural network provides an open/close signal output which can be used to drive the entrance gate. The alternative is to employ a pipeline approach, wherein the first phase identifies the ‘face part’ from the image and puts a bounding box around it. The next phase does the face match recognition to check if the approaching person is a validated employee of the company. Which approach would you choose? Explain the rationale behind your choice.
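The data split described in Questions (5) and (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: images are represented by placeholder (source, id) tuples, the helper name split_pool is hypothetical, and the 5 per cent hold-out fraction for the training-dev set is an arbitrary choice that the column does not specify.

```python
import random

def split_pool(internet_ids, camera_ids, n_camera_train=3000, n_dev=1000,
               train_dev_fraction=0.05, seed=0):
    """Return (train, train_dev, dev, test) lists of example IDs.

    train and train_dev share one distribution (Internet + some camera images);
    dev and test contain camera images only.
    """
    rng = random.Random(seed)

    # Divide the camera images: 3000 join the training pool, 1000 dev, rest test.
    camera = list(camera_ids)
    rng.shuffle(camera)
    camera_train = camera[:n_camera_train]
    dev = camera[n_camera_train:n_camera_train + n_dev]
    test = camera[n_camera_train + n_dev:]

    # Mix Internet and camera training images, then hold out a training-dev slice.
    pool = list(internet_ids) + camera_train
    rng.shuffle(pool)
    n_train_dev = round(len(pool) * train_dev_fraction)  # assumed fraction
    train_dev = pool[:n_train_dev]
    train = pool[n_train_dev:]
    return train, train_dev, dev, test

# Placeholder IDs standing in for the 100,000 Internet and 5000 camera images.
internet = [("internet", i) for i in range(100_000)]
camera = [("camera", i) for i in range(5_000)]

train, train_dev, dev, test = split_pool(internet, camera)
print(len(train), len(train_dev), len(dev), len(test))
```

With such a split, the error on train_dev is measured on held-out data from the same distribution as the training data, while the errors on dev and test are measured on camera images only.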
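The momentum update described in Question (7), where a weighted average of past gradients replaces the current gradient, can be sketched as follows. This is a minimal illustration and not a full optimiser: it assumes the exponentially weighted form v = beta * v + (1 - beta) * g, and the one-dimensional cost J(w) = (w - 3)^2, the learning rate and the value of beta are all made up for the example.

```python
def grad(w):
    """Gradient of the illustrative cost J(w) = (w - 3)^2 (an assumed toy cost)."""
    return 2.0 * (w - 3.0)

def gd_momentum(w0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent where the step uses a running average of gradients."""
    w = w0
    v = 0.0  # running (exponentially weighted) average of gradients
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1.0 - beta) * g  # blend past gradients with the current one
        w = w - lr * v                   # update with the averaged gradient
    return w

w_final = gd_momentum(w0=0.0)
print(w_final)
```

Setting beta to 0 recovers plain gradient descent, since the update then uses only the current gradient.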
If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com.