CodeSport
In this month's column, we begin a discussion on ‘data science’ and also feature a set of systems software interview questions.
Let’s continue with another set of interview questions for systems software engineers. But before we get to the interview practice questions, I wanted to update you on what I plan to discuss in our column over the next few months. Many of our readers have written to me with requests to discuss topics from one of the emerging areas of information technology, namely data science.
Data science is being seen as the next big wave in the software industry. Unlike the developments in computer architecture or operating systems, this is a change that is going to have a direct impact not just on enterprise business technologies, but also on the way information is consumed by individuals. So I wanted to start off this month’s column with a brief introduction to data science. Before we go into what it is, let us talk about why data has become the most significant commodity in information technology.
We produce 2.5 quintillion bytes of information every day (a quintillion is 1,000 x 1,000 x 1,000 times a billion, that is, 10^18). This information is generated by our Web searches, the online purchases we make, our mobile calls and our social network presence. Given the volume, velocity, variety and variability of the data that gets produced, expertise is needed to make sense of these vast quantities of data and to derive meaningful, actionable information from them. Indeed, this has given rise to a new breed of computer scientists known as ‘data scientists’. They are a rare breed, since data science itself is a discipline in its infancy. In fact, a McKinsey report from May 2011 predicts a huge shortage of data scientists, estimating that in the US alone, more than 190,000 data scientists would be needed.
Even today, we frequently encounter ‘data science’ applications in our daily interactions. The customer reviews and product recommendations you see on Amazon are an example of the use of data science. Customer reviews are mined for opinions in order to identify and compare different features of the product under discussion, and the results are presented to the consumer in a concise way. Mining reviews for positive and negative sentiments involves different elements of data science such as natural language processing, machine learning and analytics. Product recommendations are based on ‘recommender systems’, which use the profile data associated with customers to identify products that they may be interested in. These data science applications deal mostly with unstructured data, unlike traditional enterprise or e-commerce applications, which typically deal with structured data housed in databases.
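To make the idea of mining reviews for feature-level sentiment a little more concrete, here is a minimal Python sketch. It is only an illustration under strong assumptions: the word lists, the sample reviews and the function name `feature_sentiment` are all made up for this example, and real opinion-mining systems rely on far richer natural language processing and machine learning than simple word counting.

```python
import re
from collections import defaultdict

# Tiny hand-made sentiment lexicons (illustrative assumptions, not a real resource)
POSITIVE = {"great", "good", "excellent", "sharp", "love"}
NEGATIVE = {"poor", "bad", "terrible", "weak", "hate"}


def feature_sentiment(reviews, features):
    """Count positive and negative words in sentences that mention each feature."""
    scores = defaultdict(lambda: {"positive": 0, "negative": 0})
    for review in reviews:
        # Naive sentence split on '.', '!' and '?'
        for sentence in re.split(r"[.!?]+", review.lower()):
            words = re.findall(r"[a-z']+", sentence)
            for feature in features:
                if feature in words:
                    scores[feature]["positive"] += sum(w in POSITIVE for w in words)
                    scores[feature]["negative"] += sum(w in NEGATIVE for w in words)
    return dict(scores)


# Made-up reviews, used only to exercise the function
reviews = [
    "The battery is great. The screen is poor.",
    "Excellent screen! I hate the battery.",
    "Good battery, sharp screen.",
]
summary = feature_sentiment(reviews, ["battery", "screen"])
for feature, counts in summary.items():
    print(feature, counts)
```

Note the obvious weakness: in the third review, every sentiment word in the sentence is credited to both features, because the sketch has no notion of which word describes which feature. Resolving that kind of ambiguity is exactly where natural language processing comes in.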
The main difference between traditional data applications involving databases and the newer breed of data science applications is the role of unstructured data, and the fact that multiple data sources are combined, mined and analysed to get actionable information that can be used to drive other applications. As O’Reilly puts it, “Data is the next Intel inside.” Given the vast volumes of data around us, there are infinite possibilities for analysing and processing them. For instance, during the swine flu epidemic of 2009, Google could build a visual trail of the disease’s spread across the different states of the US by correlating the search history and frequency of users ‘googling’ for swine flu related topics.
‘Data science’ requires multi-disciplinary skills in a number of areas. These include probability, statistics, information retrieval, Web search, analytics, natural language processing, data mining and machine learning. It’s only now that universities are waking up to offering courses on data science. There are not many books that provide a comprehensive overview of the subject. Given the interest expressed by many of our readers and its importance to budding programmers, I am planning to cover data science in this column over the next few months.
In the next few columns, we will look at various topics in information retrieval, machine learning, data mining and natural language processing, with a specific focus on the programming tools and algorithmic techniques in these areas. I plan to discuss the R programming language, pandas (the Python data analysis library), NLTK (the Python Natural Language Toolkit) and SciKit. Data science also requires us to refresh some of our mathematical knowledge in probability and statistics; hence, we will also discuss the essential mathematical background needed for data science.
This month’s technical interview questions
1. In a kernel module that you have written, a spin lock is getting acquired. When the lock is held, an interrupt comes in, which is getting serviced on the same processor that your kernel module is running on. Now, if the interrupt handler tries to acquire the same spin lock, what would happen? How can this situation be avoided?
2. There are a number of processes executing on the Linux system. When an out of memory condition is encountered, the kernel makes the decision to kill one or more of the processes to free up memory. How would you ensure that your process does not get targeted for a ‘kill’ by the ‘OOM’ handler?
3. The ‘mkfs’ command is used to build a Linux file system on a device, typically a hard-disk partition. Now, while the ‘mkfs’ command is running, you made the underlying hard disk ‘write-protected’. What do you expect will happen?
4. In the Linux kernel, what is CPU hard lock-up and what is CPU soft lock-up? How does the Linux kernel detect these situations? When are such situations encountered?
5. What are lockless data structures? How would you implement a lockless list that supports the insert, delete and search operations from multiple threads?
6. What is meant by the voltage frequency scaling of processors? Where is it used?
7. Can you explain what exactly happens when you type a search keyword in the Google search bar? Please give a step by step description of all the events that occur till the search results are returned?
8. Can you explain how an anti-virus scanner works?
9. What is the difference between ‘allocate on write’ and ‘update in place’ file systems? Can you give an example for each?
10. You are given access to a Linux kernel environment in which you need to run your application. You need to write a script/program to determine whether your application is running inside a Linux KVM (kernel virtual machine). How would you do this?
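As a starting point for thinking about question 10, here is one possible heuristic, sketched in Python. It is not a definitive answer and rests on assumptions that do not hold everywhere: it assumes an x86 guest whose /proc/cpuinfo lists the 'hypervisor' CPU flag, and a guest whose DMI strings under /sys/class/dmi/id mention KVM or QEMU. Many hosting setups report a different vendor string, and a QEMU guest running without KVM acceleration would also match. The helper names (`read_text`, `has_hypervisor_flag`, `looks_like_kvm`) are invented for this sketch.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a file, or '' if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace").strip()
    except OSError:
        return ""


def has_hypervisor_flag(cpuinfo_text):
    """True if a 'flags' line in /proc/cpuinfo-style text lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


def looks_like_kvm(cpuinfo_text, sys_vendor, product_name):
    """Heuristic guess only: hypervisor flag plus a KVM/QEMU DMI string."""
    if not has_hypervisor_flag(cpuinfo_text):
        return False
    dmi = f"{sys_vendor} {product_name}".lower()
    return "kvm" in dmi or "qemu" in dmi


if __name__ == "__main__":
    guess = looks_like_kvm(
        read_text("/proc/cpuinfo"),
        read_text("/sys/class/dmi/id/sys_vendor"),
        read_text("/sys/class/dmi/id/product_name"),
    )
    print("probably a KVM/QEMU guest" if guess else "no evidence of KVM found")
```

The parsing functions take plain text rather than reading files directly, so they can be tried out with sample strings on any machine. A complete answer would also need to consider other signals, such as querying the CPUID hypervisor leaf, and how a guest could be configured to hide them.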
My ‘must-read book’ for this month
This month’s suggestion comes from one of our readers, Ramya, whose recommendation is the book, ‘Chances Are: Adventures in Probability’ by Michael Kaplan and Ellen Kaplan. According to Ramya, “Though this is not a computer science book, a firm foundation in probability is important for all computer science students and this is one of the best books for a layman’s introduction to probability.” Thank you, Ramya, for your recommendation.
If you have a favourite programming book/article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short writeup on why you think it is useful so I can mention it in the column. This would help many readers who want to improve their software skills.
If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming and here’s wishing you a great and happy 2014!