OpenSource For You

Understanding Concurrency Bugs

Concurrency has come of age with the wide use of multi-core processors. In this article, let us explore the importance of writing correct concurrent code.

- Ganesh Samarthyam

Multi-core processors have become mainstream. It is common to see mobile phone processors with dual cores, and some new models even have quad cores. Almost all computers (laptops, servers, etc.) have multiple cores. With the wide use of multi-core processors, it has become more important than ever to write concurrent code that exploits their power.

In the past, a lot of multi-threaded code was written, but for single-core processors. Concurrent code was written mainly to run tasks in the background, to keep user interfaces responsive, and so on. But when we run these applications on systems with multiple cores, the threads execute truly in parallel and concurrency bugs start showing up.

Writing correct concurrent code is not easy. With everything else being equal, concurrent code can be expected to have more problems than sequential (deterministic) code. Why? Sequential programs are influenced by input, the system's environment and user interaction. In addition to these factors, concurrent programs are influenced by the ordering of events (such as thread scheduling, which is non-deterministic). Testing concurrent programs is also difficult, for two main reasons: limited observability and limited controllability. The tester cannot observe important details of program execution, such as the interleaving of threads. The tester also cannot easily force a particular interleaving, so problems are hard to reproduce. Experts Herb Sutter and James Larus put it succinctly: "...humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings..."

When I wrote concurrent programs, I was exposed to different kinds of concurrency problems. I always wondered why no one had told me about the fundamental kinds of concurrency problems that one ought to be aware of. So, I created a quick and simple classification of concurrency bugs, which has only three categories of problems that you need to remember: determinism-related, safety-related, and liveness-related. Well-known definitions of these three properties are:

Determinism: Ensure that, for a given set of inputs, the output values of a program are the same for any execution schedule.

Safety: Ensure that nothing bad happens.

Liveness: Ensure that something good eventually happens.
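To make the determinism property concrete, here is a minimal sketch in Python that simulates two threads, each incrementing a shared counter in two separate steps (read, then write back). It runs every possible schedule and collects the final values. The thread names `A` and `B` and the helper names `run_schedule`, `all_schedules` and `possible_outcomes` are hypothetical, chosen only for this illustration; no real threads are used, so the result is reproducible.

```python
from itertools import permutations


def run_schedule(schedule):
    """Run two simulated threads under one fixed schedule; return the final counter."""
    counter = 0
    local = {"A": None, "B": None}  # each thread's private copy of the counter
    step = {"A": 0, "B": 0}         # index of the next step each thread executes
    for thread in schedule:
        if step[thread] == 0:
            # Step 0: read the shared counter into a private copy.
            local[thread] = counter
        else:
            # Step 1: write the incremented private copy back.
            counter = local[thread] + 1
        step[thread] += 1
    return counter


def all_schedules():
    """Every ordering of the four steps; each thread's own steps stay in order."""
    return sorted(set(permutations("AABB")))


def possible_outcomes():
    """The set of final counter values over all schedules."""
    return {run_schedule(schedule) for schedule in all_schedules()}


if __name__ == "__main__":
    for schedule in all_schedules():
        print("".join(schedule), "->", run_schedule(schedule))
```

With the same input (a counter starting at zero), some schedules end with 2 and others with 1, because one increment overwrites the other. A program whose result depends on the schedule in this way violates the determinism property as defined above.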

Determinism-related bugs

Data races (often loosely called race conditions) are perhaps the best-known bugs related to determinism.

Typically, when we talk about a data race, we mean the low-level kind: two or more concurrent threads access a shared variable, at least one access is a write, and the threads use no explicit mechanism (such as a mutex) to prevent the accesses from being simultaneous. However, a data race can also be high-level, when a set of shared variables needs to be accessed or modified together atomically but is not.
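As a rough illustration of how a mutex addresses both kinds of race, the following Python sketch guards two related fields with a single `threading.Lock`. The `Account` class, its fields and the `run` helper are hypothetical names invented for this example. The lock makes each update of `balance` (the low-level case) and the pairing of `balance` with `transactions` (the high-level case) happen as one unit.

```python
import threading


class Account:
    """Hypothetical shared object whose two fields must change together."""

    def __init__(self):
        self._lock = threading.Lock()
        self.balance = 0
        self.transactions = 0

    def deposit(self, amount):
        # Holding the lock makes both updates one atomic unit: no other
        # thread can see or modify the fields between the two statements.
        with self._lock:
            self.balance += amount
            self.transactions += 1

    def snapshot(self):
        # Readers take the same lock so they never see a half-finished update.
        with self._lock:
            return self.balance, self.transactions


def run(num_threads=4, deposits_per_thread=10_000):
    """Deposit 1 unit repeatedly from several threads; return the final state."""
    account = Account()

    def worker():
        for _ in range(deposits_per_thread):
            account.deposit(1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.snapshot()


if __name__ == "__main__":
    print(run())
```

Because every access goes through the same lock, the final balance and transaction count both equal the total number of deposits, whatever the schedule. Using one lock for the whole group of variables, rather than one lock per variable, is what protects against the high-level race.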

There are many other kinds of determinism bugs as well. For instance, code that depends on a particular thread scheduling can contain subtle bugs. I remember cases in which programmers had used sleep calls instead of a mutex or the wait/notify pattern for safe access to shared variables. In such cases, the application may work fine on the programmers' own machines, but the bug may get exposed in a testing or production environment, as in the following real-world incident.

On August 14, 2003, millions of people lost electric power in the northern USA and Canada. Several factors contributed to the blackout, and the official report pointed to a problem in C++ alarm-monitoring software. A data race arose because of artificially introduced delays in the code. Because of this race condition, the alarm event handler went into an infinite loop and failed to raise an alarm. This eventually led to the power blackout.
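The wait/notify pattern mentioned earlier, as an alternative to sleep calls and artificial delays, can be sketched in Python with `threading.Condition`. This is only an illustration of the general pattern, not a reconstruction of the alarm software; the `Mailbox` class and the `demo` function are hypothetical names. The consumer does not guess how long the producer will take: it waits until the producer signals that the value is ready.

```python
import threading


class Mailbox:
    """Hypothetical one-slot mailbox shared by a producer and a consumer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._ready = False

    def put(self, value):
        with self._cond:
            self._value = value
            self._ready = True
            # Wake up any thread blocked in get().
            self._cond.notify_all()

    def get(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate after every wake-up, so it is
            # correct whether put() runs before or after this call.
            if not self._cond.wait_for(lambda: self._ready, timeout=timeout):
                raise TimeoutError("no value arrived in time")
            return self._value


def demo():
    """Start a consumer thread, then publish a value for it to receive."""
    box = Mailbox()
    results = []
    consumer = threading.Thread(target=lambda: results.append(box.get(timeout=5)))
    consumer.start()
    box.put(42)
    consumer.join()
    return results


if __name__ == "__main__":
    print(demo())
```

Unlike a fixed sleep, this works regardless of which thread the scheduler runs first, so the behaviour does not change between a developer's machine and a loaded production system.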

Safety-related bugs

A well-known safety-related concurrency bug is ‘missing
