October 10, 2015:
There is little doubt that Big Data analytics has the potential to reenergise businesses by inspecting mountains of data to provide insights into consumer behaviour. Researchers, too, have been mining massive online datasets for insights that can save lives, improve services and inform our understanding of the world.
On the flip side, though, governments and cyber criminals have access to these very datasets and have the capability to snoop on the public’s sensitive data, especially data generated by web browsing, interactions with medical devices, or sensors.
Some data may be trivial, but much of it is deeply personal: it can sway our employers, tarnish our reputations, and affect our insurance premiums or even the prices we pay for products online.
Almost 15 years ago, for instance, Latanya Sweeney, professor of government and technology in residence at Harvard University, led a team that uncovered the identities of patients, including the then Massachusetts governor William Weld, by correlating anonymized hospital records with other publicly available data such as voter registration rolls.
Using publicly available anonymous data from the 1990 census, Sweeney found that 87% of the US population (216 million of 248 million people at the time) could likely be uniquely identified by their five-digit ZIP code combined with their gender and date of birth. A similar study was later conducted by Philippe Golle of the Palo Alto Research Center.
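The measurement behind such a figure is straightforward: group the records by the quasi-identifiers and compute the share of people who are alone in their group. A minimal sketch with hypothetical records (Sweeney’s actual analysis used 1990 census tabulations):

```python
import pandas as pd

# Hypothetical records; the quasi-identifiers are (zip, gender, dob).
df = pd.DataFrame({
    "zip":    ["02138", "02138", "73301", "02139"],
    "gender": ["F", "M", "F", "F"],
    "dob":    ["1960-07-31", "1960-07-31", "1985-01-02", "1945-11-15"],
})

# A record is re-identifiable if nobody else shares its (zip, gender, dob).
group_sizes = df.groupby(["zip", "gender", "dob"])["zip"].transform("size")
unique_share = (group_sizes == 1).mean()
print(f"{unique_share:.0%} of records are unique on (zip, gender, dob)")
```

Anyone holding a second dataset that lists those same three attributes alongside names, such as a voter roll, can then link the two tables and attach names to the “anonymous” records.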
In 2006, graduate student Arvind Narayanan and professor Vitaly Shmatikov, both from the Department of Computer Sciences at the University of Texas at Austin, claimed to have identified two people out of the nearly half a million anonymized users whose movie ratings were released by online rental company Netflix that year. The company had published the large database as part of its $1 million Netflix Prize competition.
In their full 2008 paper, they applied their de-anonymization methodology to the Netflix Prize dataset, which contained the anonymous movie ratings of 500,000 Netflix subscribers, to “demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset”.
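The attack works because such datasets are sparse: a few approximate (movie, rating, date) observations are usually enough to single out one record. The sketch below is a simplified illustration of that linkage idea with toy data; the function names and the bare score-gap test stand in for the authors’ actual weighted scoring and statistical separation test.

```python
from datetime import date

def match_score(aux, record, day_tolerance=3):
    """Count the adversary's (movie, rating, date) tuples that
    approximately match a candidate record."""
    score = 0
    for movie, rating, when in aux:
        if movie in record:
            r_rating, r_when = record[movie]
            if r_rating == rating and abs((r_when - when).days) <= day_tolerance:
                score += 1
    return score

def deanonymize(aux, dataset, margin=1):
    """Return the user whose score clearly stands out, else None."""
    scores = {uid: match_score(aux, rec) for uid, rec in dataset.items()}
    best = max(scores, key=scores.get)
    runner_up = max((s for uid, s in scores.items() if uid != best), default=0)
    return best if scores[best] - runner_up >= margin else None

# Hypothetical toy data: each record maps movie -> (rating, date).
dataset = {
    "user_17": {"Brazil": (5, date(2005, 3, 2)), "Heat": (4, date(2005, 6, 9))},
    "user_42": {"Brazil": (2, date(2004, 1, 1)), "Heat": (3, date(2005, 6, 1))},
}
aux = [("Brazil", 5, date(2005, 3, 4)), ("Heat", 4, date(2005, 6, 9))]
print(deanonymize(aux, dataset))  # -> user_17
```

In the real attack, the auxiliary tuples came from public IMDb reviews, which is how a “little bit” of outside knowledge sufficed to pick records out of half a million.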
Such revelations are compounded by the fact that many prominent websites, social networking sites and even governments have admitted to compromising the personal information, and thus the privacy, of individuals. Telecom companies the world over are known to have colluded with governments to hand over the public’s sensitive data, ostensibly for security reasons.
In India, which does not yet have a law to protect privacy, the Supreme Court has yet to take a final decision on the plea of the government and some regulatory bodies seeking a larger bench to modify an earlier order that restricts the voluntary use of the Aadhaar card to the public distribution system (PDS) and LPG schemes. The use of the Unique Identification Authority of India’s (UIDAI) Aadhaar number beyond the PDS and LPG schemes has been challenged in court over privacy concerns, since it relies on biometric data such as fingerprints and iris scans.
To address such privacy concerns, researchers like Salil Vadhan, a professor of computer science at Harvard University and former director of the Center for Research on Computation and Society, are exploring an approach known as ‘differential privacy’ that allows one to investigate data without revealing confidential information about participants.
Initially introduced in the mid-2000s by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith, among others, the concept continues to be developed by researchers today and applied to real-world problems.
Vadhan, the lead researcher on the National Science Foundation (NSF)-supported project “Privacy Tools for Sharing Research Data”, and his team at Harvard are developing a new computer system that acts as a trusted curator, and identity protector, of sensitive, valuable data, according to a 7 October press statement.
The Sloan Foundation and Google, Inc. are providing the project with additional support.
Researchers ask the virtual curator questions about the data, for instance: “What percentage of individuals who have Type B blood are also HIV positive?” The computer returns an answer that is approximately accurate but includes just enough “noise” that, no matter how hard someone tries, they cannot learn anything specific to any individual in the database.
“Even if an adversary tries to target an individual in the dataset, the adversary should not be able to tell the difference between the world as it is and one where that individual’s data is entirely removed from the dataset,” Vadhan said. “Randomization turns out to be very powerful.”
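In code, the simplest version of this idea is the Laplace mechanism, one of the founding constructions of differential privacy: answer a counting query truthfully, then add noise calibrated to how much any one person can change the answer. The sketch below illustrates that general technique with hypothetical data; it is not the Harvard system itself.

```python
import numpy as np

def noisy_count(records, predicate, epsilon, rng=None):
    """Laplace mechanism for a counting query.

    Adding or removing one person changes the count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon makes the
    answer epsilon-differentially private: the output distribution is
    nearly the same with or without any single individual's record.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records mirroring the article's query: (blood_type, hiv_positive).
records = [("B", True), ("B", False), ("A", False), ("B", True), ("O", False)]
answer = noisy_count(records, lambda r: r[0] == "B" and r[1], epsilon=0.5)
print(f"Noisy count of Type B, HIV-positive individuals: {answer:.1f}")
```

Smaller values of epsilon mean more noise and stronger privacy; the percentage in the article’s example would be such a noisy count divided by a similarly noised total.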
If the system is implemented naively, the level of privacy degrades with multiple queries, so one could keep asking questions to the point where identifying people in the database becomes possible. However, by judiciously increasing the amount of noise and carefully correlating it across queries, the system can maintain privacy protection even in the face of a very large number of questions, notes Vadhan.
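What Vadhan describes is commonly called a ‘privacy budget’: under the basic composition theorem, the epsilons of all answered queries add up, so a simple curator must stop answering once the budget is spent. The toy class below (hypothetical names, not the Harvard system) shows only this bookkeeping; the correlated-noise techniques that stretch the budget much further are beyond a short sketch.

```python
import numpy as np

class TrustedCurator:
    """Toy curator enforcing a total privacy budget via basic
    sequential composition: each answered query spends its epsilon,
    and queries that would overspend the budget are refused."""

    def __init__(self, records, total_epsilon):
        self.records = records
        self.remaining = total_epsilon
        self.rng = np.random.default_rng()

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("Privacy budget exhausted; no more queries.")
        self.remaining -= epsilon
        true_count = sum(1 for r in self.records if predicate(r))
        return true_count + self.rng.laplace(scale=1.0 / epsilon)

# Each query spends part of the budget; a third query would be refused.
curator = TrustedCurator(records=[("B", True), ("A", False)], total_epsilon=1.0)
curator.noisy_count(lambda r: r[0] == "B", epsilon=0.5)
curator.noisy_count(lambda r: r[1], epsilon=0.5)
# curator.noisy_count(lambda r: True, epsilon=0.1)  # raises RuntimeError
```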
Differential privacy has become a hot topic in recent years. A 2015 Science magazine article referred to it as one of the most promising technical solutions for protecting the data of students enrolled in Massive Open Online Courses (MOOCs). Projects including OnTheMap, used for US census data, and RAPPOR, a new tool from Google, apply forms of differential privacy to data sharing.
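RAPPOR, for instance, builds on randomized response, a classic survey technique in which each respondent lies with a calibrated probability, so any single report is deniable while aggregate statistics remain recoverable. A minimal sketch of the classic technique follows; Google’s actual implementation adds Bloom filters and permanent randomization, omitted here.

```python
import random

def randomized_response(truth: bool, p_honest: float = 0.75) -> bool:
    """Report the true bit with probability p_honest; otherwise report
    a fair coin flip. Each report is individually deniable."""
    if random.random() < p_honest:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports, p_honest: float = 0.75) -> float:
    """Invert the known noise: E[reported] = p*t + (1-p)*0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_honest) * 0.5) / p_honest

# Usage: 10,000 respondents, 30% truly in the sensitive category.
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(f"Estimated rate: {estimate_true_rate(reports):.3f}")  # close to 0.300
```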
Harvard’s Institute for Quantitative Social Science, according to Vadhan, is planning to use differential privacy techniques to enable more researchers to share, retain control of, and receive credit for their data contributions as part of the Dataverse Network, a project that guarantees the long-term preservation of critical datasets.
Dataverse is the largest public general-purpose research data repository in the world. However, if differential privacy fulfils its promise, the scientific community could gain access to far more datasets that are currently not publicly available, according to Gary King, Albert J. Weatherhead III University Professor at Harvard University and director of the Institute for Quantitative Social Science.
“That’s why we’re so thrilled to be working on this project,” King said in the press statement. “The social sciences are finally getting to the point in human history where we have enough information to move from studying problems to actually solving them. As we make progress on the privacy problem, we will be able to unlock more and more of the potential of this new information.”
The differential privacy tool Vadhan and his team are developing will allow the inclusion of datasets that were previously withheld because the information was too sensitive and privacy could not be guaranteed.
Currently, Dataverse is not equipped to handle datasets that carry privacy concerns, according to Vadhan.
Differential privacy also doesn’t work for every type of research question. Vadhan pointed to regression, machine learning and social network analysis as areas where there are very promising theoretical results, but where challenges remain in making differential privacy work well in practice.
This article was first published on Livemint.com