OpenSource For You

CodeSport

In this month's column, we begin a discussion on ‘data science’ and also feature a set of systems software interview questions.

Let’s continue with another set of interview questions for systems software engineers. But before we get to the interview practice questions, I wanted to update you on what I plan to discuss in our column over the next few months. Many of our readers have written to me with requests to discuss topics from one of the emerging areas of information technology, namely data science.

Data science is being seen as the next big wave in the software industry. Unlike the developments in computer architecture or operating systems, this is a change that is going to have a direct impact not just on enterprise business technologies, but also on the way individuals consume information. So I wanted to start off this month’s column with a brief introduction to data science. Before we go into what it is, let us talk about why data has become the most significant commodity in information technology.

We produce 2.5 quintillion bytes of information every day (a quintillion is 1000 x 1000 x 1000 times a billion, that is, 10^18). This information is generated by our Web searches, the online purchases we make, our mobile calls and our social network presence. Given the volume, velocity, variety and variability of the data that gets produced, expertise is needed to make sense of it and to derive meaningful, actionable information from it. Indeed, this has given rise to a new breed of computer scientists known as ‘data scientists’. They are a rare breed since data science itself is a discipline in its infancy. In fact, a McKinsey report from May 2011 predicts a huge shortage of data scientists, estimating that in the US alone, more than 190,000 data scientists would be needed.

Even today, we frequently encounter ‘data science’ applications in our daily interactions. The customer reviews and product recommendations you see on Amazon are an example of the use of data science. Customer reviews are mined for opinions in order to identify and compare different features of the product under discussion, and the results are presented to the consumer in a concise way. Mining reviews for positive and negative sentiments involves different elements of data science such as natural language processing, machine learning and analytics. Product recommendations are based on ‘recommender systems’, which use the profile data associated with customers to identify products that they may be interested in. These data science applications deal mostly with unstructured data, unlike traditional enterprise or e-commerce applications, which typically deal with structured data housed in databases.
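To make the idea of mining reviews for sentiment a little more concrete, here is a minimal sketch in Python. It simply counts words from two small, hand-picked word lists; the lists, the function names and the sample reviews are all made up for illustration, and this is not how Amazon or any production system does it. Real systems typically learn such signals from labelled data.

```python
import re
from collections import Counter

# Hypothetical, hand-picked word lists (an assumption for this sketch);
# a real system would learn these from labelled reviews.
POSITIVE = {"good", "great", "excellent", "love", "fast", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow", "broken"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(review):
    """Return (number of positive words) - (number of negative words)."""
    counts = Counter(tokenize(review))
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative


def label(review):
    """Map the score to a coarse 'positive', 'negative' or 'neutral' label."""
    score = sentiment_score(review)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Made-up reviews, used only to exercise the functions above.
    for text in ("Great camera and excellent battery, I love it",
                 "The screen is terrible and the charger arrived broken"):
        print(label(text), "<-", text)
```

A word-count approach like this ignores negation ("not good") and context, which is exactly where natural language processing and machine learning come in.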

The main difference between traditional data applications involving databases and the newer breed of data science applications is the role of unstructured data, and the fact that multiple data sources are combined, mined and analysed to get actionable information that can be used to drive other applications. As O’Reilly puts it, “Data is the next Intel inside.” Given the vast volumes of data around us, there are countless possibilities for analysing and processing them. For instance, during the swine flu epidemic of 2009, Google was able to build a visual trail of the disease’s spread across the different states of the US by correlating the search history and frequency of users ‘googling’ for swine flu related topics.
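The core idea of correlating two data sources can be sketched in a few lines of Python. The function below computes the Pearson correlation coefficient between two equally long series. The weekly numbers are invented purely for illustration, and this is only a toy version of the idea, not the method Google actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented weekly figures for one region (illustrative only).
    flu_searches = [120, 180, 260, 400, 390]
    reported_cases = [10, 16, 25, 41, 38]
    print("correlation:", round(pearson(flu_searches, reported_cases), 3))
```

A coefficient close to 1 suggests that search frequency rises and falls with the number of reported cases. On its own it does not show that one causes the other.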

‘Data science’ requires multi-disciplinary skills in a number of areas. These include probability, statistics, information retrieval, Web search, analytics, natural language processing, data mining and machine learning. It’s only now that universities are waking up to offering courses on data science. There are not many books that provide a comprehensive overview of the subject. Given the interest expressed by many of our readers and its importance to budding programmers, I am planning to cover data science in this column over the next few months.

In the next few columns, we will look at various topics in information retrieval, machine learning, data mining and natural language processing, with a specific focus on the programming tools and algorithmic techniques in these areas. I plan to discuss the R programming language, pandas (the Python data analysis library), NLTK (the Python Natural Language Toolkit) and SciKit. Data science also requires us to refresh some of our mathematical knowledge in probability and statistics and, hence, we will also discuss the essential mathematical background needed for data science.

This month’s technical interview questions

1. In a kernel module that you have written, a spin lock is getting acquired. When the lock is held, an interrupt comes in, which is getting serviced on the same processor that your kernel module is running on. Now, if the interrupt handler tries to acquire the same spin lock, what would happen? How can this situation be avoided?
2. There are a number of processes executing on the Linux system. When an out of memory condition is encountered, the kernel makes the decision to kill one or more of the processes to free up memory. How would you ensure that your process does not get targeted for a ‘kill’ by the ‘OOM’ handler?
3. The ‘mkfs’ command is used to build a Linux file system on a device, typically a hard-disk partition. Now, while the ‘mkfs’ command is running, you made the underlying hard disk ‘write-protected’. What do you expect will happen?
4. In the Linux kernel, what is CPU hard lock-up and what is CPU soft lock-up? How does the Linux kernel detect these situations? When are such situations encountered?
5. What are lockless data structures? How would you implement a lockless list that supports the insert, delete and search operations from multiple threads?
6. What is meant by the voltage frequency scaling of processors? Where is it used?
7. Can you explain what exactly happens when you type a search keyword in the Google search bar? Please give a step by step description of all the events that occur till the search results are returned.
8. Can you explain how an anti-virus scanner works?
9. What is the difference between ‘allocate on write’ and ‘update in place’ file systems? Can you give an example for each?
10. You are given access to a Linux kernel environment in which you need to run your application. You need to write a script/program to determine whether your application is running inside a Linux KVM (kernel virtual machine). How would you do this?
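As a possible starting point for question 10, here is a hedged Python sketch of one heuristic, and it is not a definitive answer. It assumes an x86 guest, where /proc/cpuinfo lists a ‘hypervisor’ CPU flag when running under a hypervisor, and it assumes that the DMI strings under /sys/class/dmi/id mention KVM or QEMU, which depends on how the host configured the virtual machine. The function names are hypothetical. Note that the heuristic cannot tell a KVM guest from a purely emulated QEMU guest; checking the hypervisor signature reported through CPUID would be more reliable.

```python
from pathlib import Path


def read_text(path):
    """Return the file's contents, or an empty string if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace")
    except OSError:
        return ""


def has_hypervisor_flag(cpuinfo):
    """True if a 'flags' line of /proc/cpuinfo lists the 'hypervisor' flag."""
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


def looks_like_kvm(cpuinfo, sys_vendor, product_name):
    """Heuristic guess: hypervisor flag present and DMI mentions KVM/QEMU."""
    dmi = f"{sys_vendor} {product_name}".lower()
    return has_hypervisor_flag(cpuinfo) and ("kvm" in dmi or "qemu" in dmi)


if __name__ == "__main__":
    guess = looks_like_kvm(
        read_text("/proc/cpuinfo"),
        read_text("/sys/class/dmi/id/sys_vendor"),
        read_text("/sys/class/dmi/id/product_name"),
    )
    print("Probably a KVM/QEMU guest" if guess else "No sign of KVM found")
```

The detection logic takes plain strings rather than reading the files itself, so it can be tried out with sample text on any machine.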

My ‘must-read book’ for this month

This month’s suggestion comes from one of our readers, Ramya, whose recommendation is the book, ‘Chances Are: Adventures in Probability’ by Michael Kaplan and Ellen Kaplan. According to Ramya, “Though this is not a computer science book, a firm foundation in probability is important for all computer science students and this is one of the best books for a layman’s introduction to probability.” Thank you, Ramya, for your recommendation.

If you have a favourite programming book/article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful, so I can mention it in the column. This would help many readers who want to improve their software skills.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming and here’s wishing you a great and happy 2014!

Sandya Mannarswamy
