Masking Data to Protect It
Much of the debate being conducted these days on the efficacy of data collection for the public good, primarily concerning Aadhaar, rightly hinges on matters of privacy and the safety of the data collected. This has gained more traction after a software development engineer from Bengaluru, Abhinav Srivastava, was arrested on August 1 for allegedly hacking into and accessing private data of individuals through the Unique Identification Authority of India's 'secure' database.
This prompts the notion that there must be a trade-off between data collection and data security. Well, that certainly need not be the case. But let us first start with the important issue of the quality of the data collected.
The more error-free the data the better, right? Wrong. Suppose you want to find out the percentage of students drinking alcohol in a hostel. If you ask, 'Do you drink alcohol?', it may be difficult for students to say 'yes' even if that is the true answer. So, the surveyor may add a second, innocuous question, such as 'Were you born between January and June?' Here, the probability of a 'yes' is 50%, and there is no reason to give an incorrect answer.
The students can be asked to toss a coin (with 50% probability of 'heads') and to answer the first question if it's 'heads', and the second question if it's 'tails'. Everyone can be expected to answer truthfully, as the outcome of the toss is unknown to others. So, the answer to the first question is masked by the coin flip and the second question. Yet the surveyor can still easily estimate the proportion of students who drink alcohol in the hostel.
A study with 400 students results in, say, 140 'yes' answers. We can assume that around 200 students got 'heads' and the remaining 200 got 'tails'. About half of the second lot of nearly 200 students are expected to be born during January-June. So, roughly 100 'yes' replies came from the second question. The remaining 40 or so 'yes' answers came from the roughly 200 students who answered the first question. This means that about 20% of the students drink alcohol. It is also possible to estimate the amount of error in this calculation.
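The arithmetic above can be sketched in a few lines of code. This is a minimal illustration of the masked-survey idea, not any official procedure; the function names and the simulation are our own assumptions. The key step is inverting the masking: the expected fraction of 'yes' answers is 0.5p + 0.25, where p is the true drinking rate, so p can be estimated as 2(yes-fraction - 0.25).

```python
import random

def simulate_survey(n_students, true_drinking_rate, seed=0):
    """Simulate the masked survey. Each student flips a fair coin:
    heads -> answer 'Do you drink alcohol?' truthfully;
    tails -> answer 'Were you born between January and June?'
    (assumed to be 'yes' with probability 0.5).
    Returns the total number of 'yes' answers."""
    rng = random.Random(seed)
    yes_count = 0
    for _ in range(n_students):
        if rng.random() < 0.5:                      # heads: sensitive question
            yes_count += rng.random() < true_drinking_rate
        else:                                       # tails: innocuous question
            yes_count += rng.random() < 0.5         # born Jan-Jun
    return yes_count

def estimate_drinking_rate(yes_count, n_students):
    """Invert the masking: E[yes fraction] = 0.5*p + 0.25,
    so p_hat = 2 * (yes_fraction - 0.25)."""
    return 2 * (yes_count / n_students - 0.25)

# The article's numbers: 140 'yes' answers out of 400 students
print(estimate_drinking_rate(140, 400))   # close to 0.2, i.e. about 20%
```

No single student's answer reveals anything, yet the aggregate estimate converges to the true rate as the sample grows.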
In this particular example, the survey could have yielded a disastrously inaccurate result without the mask of the second question, which obfuscated each student's personal information. The obfuscated responses still tell us about the drinking habits of the students as a whole, and that is all we need. We cannot recover any individual student's answer, and we do not need to.
This simple statistical example illustrates that we can get a lot of valuable information for society while ensuring the privacy of individuals. Suppose, for a complex and expensive surgery, there are options to choose between different available hospitals and doctors. Apart from the cost, the success rates of hospitals and doctors can certainly be very important criteria. Some hospitals might supply their success stories, sometimes only the number of cured patients, not the total number treated. In most such cases, we do not have proper data to go by.
Well, what if the hospitals' websites carried the full treatment records of all patients? That would be very bad, because it would compromise the privacy of the patients. So, instead of the exact information of each patient, the data can be released after incorporating some suitably chosen 'random error', and the distribution of this 'random error' should also be published. The personal information of each patient is then completely hidden, while collective data, such as the rates of successful treatment in different hospitals and by different doctors, can still be obtained. Different types of e-health records and financial information are already obfuscated for other purposes. GoI needs to provide appropriate regulations and software to obfuscate different types of information, and to maintain a countrywide balance.
The choice of an appropriate 'random error' is a delicate question. It should be ensured that collective information remains properly recoverable from the obfuscated data. At the same time, the original and the obfuscated data should differ, in a mathematically quantifiable sense, and the probability of a large difference between them should be reasonably high. And the obfuscation should be one-way: it should be ensured that the original individual information can never be retrieved from the obfuscated data.
Various government and private sector organisations can be brought under compulsory obfuscation, wherever publicising obfuscated data will not be detrimental to national security. Policymakers may decide, and encourage, the areas where disclosure of obfuscated data in the public domain can be helpful to society.
The writers are professors, Indian Statistical Institute, Kolkata