‘Anonymised data’ is not entirely anonymous
Analysis from students at Harvard University shows that anonymisation is not the magic tool companies like to pretend it is.
Two Harvard students recently built a tool, for a class paper they have yet to publish, that combs through vast troves of consumer data exposed in breaches.
“The program takes in a list of personally identifiable information, such as a list of emails or usernames, and searches across the leaks for all the credential data it can find for each person,” they said.
They told Motherboard their tool analysed thousands of datasets from data scandals. Despite many of these datasets containing anonymised data, the students say that identifying actual users wasn't all that difficult.
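The students did not publish their code, but the description above suggests a straightforward cross-leak lookup. The sketch below is purely illustrative: all dataset names, field names, and records are invented stand-ins for breach dumps.

```python
# Hypothetical sketch of the approach described: given a list of
# identifiers (emails or usernames), scan each leaked dataset for
# matching credential records. All names and data here are invented.

def search_leaks(identifiers, leaks):
    """Collect every credential record found for each identifier."""
    found = {ident: [] for ident in identifiers}
    for leak_name, records in leaks.items():
        for record in records:
            # A leak may key people by email or by username
            key = record.get("email") or record.get("username")
            if key in found:
                found[key].append({"leak": leak_name, **record})
    return found

# Toy data standing in for two separate breach dumps
leaks = {
    "shop_breach_2019": [
        {"email": "alice@example.com", "password_hash": "ab12..."},
        {"email": "bob@example.com", "password_hash": "cd34..."},
    ],
    "forum_breach_2020": [
        {"username": "alice@example.com", "password_hash": "ef56..."},
    ],
}

results = search_leaks(["alice@example.com"], leaks)
print(len(results["alice@example.com"]))  # prints 2: one hit per leak
```

The point the students make is that once such records are aggregated per person, even datasets stripped of names start to pin down individuals.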
Rohan Seth, a policy analyst at The Takshashila Institution, said, “When we read the term anonymised data, we tend to believe that it cannot be used to identify people or whole communities. In a sense, we imagine the anonymisation to be irreversible, even though that assumption has long been debunked.”
“Think about it this way. If a malicious hacker has access to data from your Google Maps and a list of your UPI transactions, s/he does not necessarily need your name to identify you. Anonymised data sets are like puzzle pieces. If you combine enough of them, you could reverse the anonymisation process and identify the people it represents.”
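Seth's puzzle-piece analogy describes what researchers call a linkage attack: two datasets that each omit names can be joined on shared quasi-identifiers, such as a coarse place and time. The sketch below is a minimal, hypothetical illustration of that idea; the field names and records are made up, and real attacks use far richer signals.

```python
# Illustrative linkage attack: two "anonymised" datasets, each keyed
# by a different pseudonym, are joined on shared quasi-identifiers
# (here, a cell-tower area and an hour). All data is invented.

location_pings = [  # stands in for anonymised location history
    {"device_id": "dev-42", "cell": "tower-7", "hour": "2021-03-01T09"},
    {"device_id": "dev-99", "cell": "tower-3", "hour": "2021-03-01T09"},
]

upi_payments = [  # stands in for anonymised transaction records
    {"account": "acct-A", "merchant_cell": "tower-7", "hour": "2021-03-01T09"},
    {"account": "acct-B", "merchant_cell": "tower-3", "hour": "2021-03-01T10"},
]

def link(pings, payments):
    """Match device pseudonyms to account pseudonyms wherever
    place and hour coincide in both datasets."""
    matches = []
    for ping in pings:
        for pay in payments:
            if (ping["cell"], ping["hour"]) == (pay["merchant_cell"], pay["hour"]):
                matches.append((ping["device_id"], pay["account"]))
    return matches

print(link(location_pings, upi_payments))  # [('dev-42', 'acct-A')]
```

Neither dataset names anyone, yet combining them ties a device to a payment account; add one more dataset that links either pseudonym to a person and the anonymisation collapses.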
Srinivas Kodali, an independent cybersecurity researcher, said, “There have been numerous reports showing that anonymised data can be de-anonymised for a while now. This report is just an addition. Anonymising data, and using what is called homomorphic encryption to analyse data while it stays encrypted so that privacy is not compromised, are a few newer techniques meant to keep employees from having access to large troves of data.”
“But if this data is shared, it indeed can be de-anonymised with other available datasets. Sharing is the problem, anonymised or not.”