Masking Data to Protect It

The Economic Times - Breaking Ideas - Atanu Biswas & Bimal Roy

Much of the debate being conducted these days on the efficacy of data collection for public good, primarily dealing with Aadhaar, rightly hinges on matters of privacy and the safety of the data collected. This has gained more traction after a software development engineer from Bengaluru, Abhinav Srivastava, was arrested on August 1 for allegedly hacking and accessing private data of individuals through the Unique Identification Authority of India’s ‘secure’ database.

This prompts the notion that there must be a trade-off between data collection and data security. That certainly need not be the case. But let’s first start with the important issue of the quality of data collected.

The more error-free the data the better, right? Wrong. Suppose you want to find out the percentage of students drinking alcohol in a hostel. If you ask, ‘Do you drink alcohol?’, it may be difficult for students to say ‘yes’ even if that is the correct answer. So, the surveyor may add a second simple question, such as ‘Were you born between January and June?’ Here, the probability of a ‘yes’ is about 50%, and there is no reason to expect an incorrect answer.

The students can be asked to toss a coin (with 50% probability of ‘heads’) and to answer the first question if it’s ‘heads’, and the second question if it’s ‘tails’. It is expected that everyone will give the correct answer, as the outcome of the toss is unknown to others. So, the answer to the first question is masked by the coin flip and the second question. But now, the surveyor can easily estimate the proportion of students who drink alcohol in the hostel.

A study with 400 students results in, say, 140 ‘yes’ answers. We can assume that around 200 students had ‘heads’, and the remaining 200 got ‘tails’. About half of the second lot of nearly 200 students are expected to be born during January-June. So, roughly 100 ‘yes’ replies came from the second question. The remaining 40 or so ‘yes’ answers are from the roughly 200 students who answered the first question. This means that about 20% of the students drink alcohol. It is also possible to estimate the amount of error in our calculations.
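The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the authors' own procedure: the function name `estimate_sensitive_proportion` and its parameters are hypothetical, and the error estimate assumes a fair coin, truthful answers and a simple binomial model for the ‘yes’ count.

```python
import math

def estimate_sensitive_proportion(n_total, n_yes, p_heads=0.5, p_innocuous_yes=0.5):
    """Estimate the share of 'yes' answers to the sensitive question.

    Assumes P(yes) = p_heads * share + (1 - p_heads) * p_innocuous_yes,
    i.e. a fair coin and truthful answers (illustrative assumptions).
    """
    observed_yes_rate = n_yes / n_total
    # Remove the expected 'yes' answers that came from the innocuous question,
    # then rescale by the fraction of students who faced the sensitive one.
    estimate = (observed_yes_rate - (1 - p_heads) * p_innocuous_yes) / p_heads
    # Binomial standard error of the observed rate, scaled by 1 / p_heads.
    std_error = math.sqrt(observed_yes_rate * (1 - observed_yes_rate) / n_total) / p_heads
    return estimate, std_error

# The hostel example: 400 students, 140 'yes' answers.
estimate, std_error = estimate_sensitive_proportion(n_total=400, n_yes=140)
print(f"estimated share: {estimate:.2f}, standard error: {std_error:.3f}")
```

With these numbers the estimate comes out at about 20%, with a standard error of roughly five percentage points. That is the price of the mask: the same 400 students answering a direct question truthfully would give a smaller error.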

In this particular example, the survey could yield disastrous results without the mask of the second question, which obfuscated the student’s personal information. Obfuscating information provides an idea of the drinking habits of the students as a whole, and that is all we need. We cannot get the students’ personal information, and we do not even need it.

This simple statistical example illustrates that we can get a lot of valuable information for society while ensuring the privacy of individuals. Suppose, for a complex and expensive surgery, there are options to choose between different available hospitals and doctors. Apart from the cost, the success rates of hospitals and doctors can certainly be very important criteria. Some hospitals might supply their success stories, sometimes only the number of cured people, not the total number of patients. In most cases, we do not have proper data in such contexts.

Well, what if the hospitals’ websites had the full story of treatments of all patients? On the one hand, that would be very bad, because it could compromise the privacy of the patients. So, instead of the exact information of each patient, the data can be provided by incorporating some suitably chosen ‘random error’, and the distribution of the ‘random error’ should also be mentioned. The personal information of the patient is then completely hidden, while collective data, such as the rates of successful treatment in different hospitals and by different doctors, can still be obtained. Different types of e-health records and financial information have already been obfuscated for other purposes. GoI needs to provide appropriate regulations and software to obfuscate different types of information, and to maintain a countrywide balance.

The choice of appropriate ‘random error’ is a delicate question. It should be ensured that collective information is properly obtainable from the obfuscated data. But the original and the obfuscated data should differ: in mathematical language, the probability of a large difference between them should be reasonably high. And this obfuscation should be one-way: it should be ensured that the original individual information can never be retrieved from the obfuscated data.
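One way to picture these requirements is a toy Python sketch in which each patient’s recorded outcome (1 for success, 0 for failure) is flipped with a published probability. The flipping scheme, the function names and the made-up data are all assumptions chosen for illustration; the article does not prescribe a particular kind of ‘random error’.

```python
import random

def mask_outcomes(outcomes, flip_prob, rng):
    """Flip each 0/1 outcome independently with the published probability."""
    return [1 - y if rng.random() < flip_prob else y for y in outcomes]

def estimate_success_rate(masked, flip_prob):
    """Recover the aggregate success rate from masked records.

    Uses E[observed] = rate * (1 - flip_prob) + (1 - rate) * flip_prob,
    which can be solved for rate as long as flip_prob is not 0.5.
    """
    observed = sum(masked) / len(masked)
    return (observed - flip_prob) / (1 - 2 * flip_prob)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_outcomes = [1] * 8000 + [0] * 2000  # made-up data: 80% success
masked = mask_outcomes(true_outcomes, flip_prob=0.3, rng=rng)
recovered = estimate_success_rate(masked, flip_prob=0.3)
changed = sum(a != b for a, b in zip(true_outcomes, masked))
print(f"records changed: {changed}, recovered rate: {recovered:.3f}")
```

In this sketch close to a third of the individual records differ from the truth, yet the recovered rate stays near the true 80% because the flip probability is known. A higher flip probability hides individuals better but makes the collective estimate noisier, which is the delicate balance described above.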

Various government and private sector organisations can be brought under compulsory obfuscation, where publicising obfuscated data will not be detrimental to national security. Policymakers may identify the issues where disclosure of obfuscated data in the public domain can be helpful to society, and encourage it.

The writers are professors, Indian Statistical Institute, Kolkata

