Census Bureau seeks ways to protect privacy
Agency reconsiders whether protection it offers is strong enough
When the Census Bureau gathered data in 2010, it made two promises. The form would be “quick and easy,” it said. And “your answers are protected by law.”
But mathematical breakthroughs, easy access to more powerful computing, and widespread availability of large and varied public data sets have made the bureau reconsider whether the protection it offers Americans is strong enough. To preserve confidentiality, the bureau’s directors have determined they need to adopt a “formal privacy” approach, one that adds uncertainty to census data before it is published and achieves privacy assurances that are provable mathematically.
The census has always added some uncertainty to its data, but a key innovation of this new framework, known as “differential privacy,” is a numerical value describing how much privacy loss a person will experience. It determines the amount of randomness — “noise” — that needs to be added to a data set before it is released, and sets up a balancing act between accuracy and privacy. Too much noise would mean the data would not be accurate enough to be useful — in redistricting, in enforcing the Voting Rights Act or in conducting academic research. But too little, and someone’s personal data could be revealed.
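To make the balancing act concrete, here is a minimal sketch of one textbook way to add such noise, the Laplace mechanism, applied to a single count. The article does not say which mechanism the bureau uses, so treat this purely as an illustration. The privacy-loss value is conventionally written as epsilon; the function names, the true count of 120 and the epsilon values are assumptions chosen for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws, each with mean
    # `scale`, follows a Laplace distribution centered on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    # Adding or removing one person changes a count by at most 1, so the
    # sensitivity is 1. A smaller epsilon (less privacy loss) means a larger
    # noise scale.
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


def mean_abs_error(epsilon, trials=20000, seed=0):
    # Average distance between the published count and the true count over
    # many simulated releases. The true count of 120 is hypothetical.
    rng = random.Random(seed)
    true_count = 120
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: average error {mean_abs_error(eps):.2f}")
```

Under these assumptions the typical error shrinks as epsilon grows: a published count is off by roughly 10 people on average at an epsilon of 0.1, and by roughly one person at an epsilon of 1. That is the accuracy-versus-privacy tradeoff in miniature.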
On Thursday, the bureau will announce the tradeoff it has chosen for data publications from the 2018 End-to-End Census Test it conducted in Rhode Island, the only dress rehearsal before the actual census in 2020. The bureau has decided to enforce stronger privacy protections than companies like Apple or Google had when they each first took up differential privacy.
Hundreds of tables
Cynthia Dwork, the Gordon McKay Professor of Computer Science at Harvard and one of the inventors of differential privacy, says it is “tailored to the statistical analysis of large data sets” — precisely the issue facing the census with its mandate from Title 13 of the U.S. Code to keep each person’s information private, and its responsibility to provide useful data.
At the root of the problem are the tables of aggregate statistics the bureau publishes. There are hundreds of tables — sex by age, say, or ethnicity by race — summarizing the population at several levels of geography, from areas the size of a city block all the way up to the level of a state or the nation. In 2010, the bureau released tables with nearly 8 billion numbers. That was about 25 numbers for each person living in the United States, even though Americans were asked only 10 questions about themselves. In other words, the tables were generated in so many ways that the Census Bureau ended up releasing more data in aggregate than it had collected in the first place.
For the census, this is particularly worrisome, especially if a question about citizenship is added to the 2020 census, as the Trump administration has proposed. “I think it is crystal clear what the potential harm is from poorly protected tabular summaries,” said John Abowd, associate director for research and methodology at the Census Bureau, who became an early proponent of differential privacy.
In November 2016, the bureau staged something of an attack on itself. Using only the summary tables with their 8 billion numbers, Abowd formed a small team to try to generate a record for every American that would show the block where he or she lived, as well as his or her sex, age, race and ethnicity — a “reconstruction” of the person-level data.
Each statistic in a summary table leaks a little information, offering clues about, or rather constraints on, what respondents’ answers to the census could look like. Combining statistics from different aggregate tables at different levels of geography, we start to get a picture of the demographics of who is living where.
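A toy example shows how quickly such constraints narrow things down. The sketch below is a brute-force search over one imaginary block, not the method Abowd's team used, which the article does not describe. Every published figure in it (three residents, a mean age of 44, a median age of 30, two women whose mean age is 24) is invented for illustration, as are the function name and the age cap of 115.

```python
from itertools import combinations_with_replacement, product
from statistics import median

MAX_AGE = 115  # assumed upper bound on reported age


def consistent_blocks(count, mean_age, median_age, n_female, mean_age_female):
    """Return every set of (sex, age) records that fits the published figures."""
    solutions = set()
    # Try every possible combination of ages for the block's residents.
    for ages in combinations_with_replacement(range(MAX_AGE + 1), count):
        if sum(ages) != mean_age * count:
            continue  # contradicts the published mean age
        if median(ages) != median_age:
            continue  # contradicts the published median age
        # Try every way of assigning a sex to each of those ages.
        for sexes in product("FM", repeat=count):
            female_ages = [a for a, s in zip(ages, sexes) if s == "F"]
            if len(female_ages) != n_female:
                continue  # contradicts the sex table
            if sum(female_ages) != mean_age_female * n_female:
                continue  # contradicts the mean age of women
            solutions.add(tuple(sorted(zip(sexes, ages))))
    return solutions


if __name__ == "__main__":
    # Hypothetical published statistics for one small block.
    for records in consistent_blocks(3, 44, 30, 2, 24):
        print(records)
```

With these made-up figures, only one set of records survives all four constraints: women aged 18 and 30 and a man aged 84. No single statistic reveals anyone, but together they leave no other possibility.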
By this summer, Abowd and his team had completed their reconstruction for nearly every part of the country. When they matched their reconstructed data to the actual, confidential records — again comparing just block, sex, age, race and ethnicity — they found about 50 percent of people matched exactly. And for more than 90 percent there was at most one mistake, typically a person’s age being missed by one or two years. (At smaller levels of geography, the census reports age in five-year buckets.)
This level of accuracy was alarming. Abowd and his peers say their reconstruction, while still preliminary, is not a violation of Title 13. Instead it is seen as a red flag that their current disclosure limitation system is out of date.
The bureau has long had procedures to protect respondents’ confidentiality. For example, census data from 2010 showed that a single Asian couple — a 63-year-old man and a 58-year-old woman — lived on Liberty Island, at the base of the Statue of Liberty.
That was news to David Luchsinger, who had taken the job as the superintendent for the national monument the year before. On Census Day in 2010, Luchsinger was 59, and his wife, Debra, was 49. In an interview, they said they had identified as white on the questionnaire, and they were the island’s real occupants.
Before releasing its data, the bureau had “swapped” the Luchsingers with another household living in another part of the state, who matched them on key questions. This mechanism preserved their privacy, and kept summaries like the voting age population of the island correct, but it also introduced uncertainty into the data.
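The idea behind swapping can be sketched in a few lines. This is a simplified illustration, not the bureau's actual procedure, whose matching rules the article does not spell out. The household records, block labels and key fields below are all hypothetical.

```python
from collections import Counter


def swap_locations(households, i, j, key_fields=("size", "voting_age")):
    # Only households that agree on the key fields may trade places, so
    # totals built from those fields stay correct in both locations.
    a, b = households[i], households[j]
    if any(a[f] != b[f] for f in key_fields):
        raise ValueError("households do not match on the key fields")
    swapped = [dict(h) for h in households]  # leave the input untouched
    swapped[i]["block"], swapped[j]["block"] = b["block"], a["block"]
    return swapped


def voting_age_by_block(households):
    # Number of voting-age residents published for each block.
    totals = Counter()
    for h in households:
        totals[h["block"]] += h["voting_age"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical records: two two-adult households on different blocks.
    households = [
        {"block": "A", "size": 2, "voting_age": 2, "race": "white"},
        {"block": "B", "size": 2, "voting_age": 2, "race": "Asian"},
        {"block": "B", "size": 4, "voting_age": 2, "race": "white"},
    ]
    published = swap_locations(households, 0, 1)
    print(voting_age_by_block(households))
    print(voting_age_by_block(published))
    print([(h["block"], h["race"]) for h in published])
```

In this sketch the voting-age totals for each block are the same before and after the swap, while the race recorded for block A changes. That mirrors the tradeoff described above: the protected summaries stay correct, but other published details about a block no longer describe the people who actually live there.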
Swapping not enough
The bureau’s attack on itself showed that swapping wasn’t enough. Swapping focused on people who were isolated like the Luchsingers or who had characteristics that made them stand out in their neighborhood — the cells in the tables with only a single person.
On Thursday, the Census Bureau will reveal the details of applying differential privacy to its 2018 End-to-End Census Test, including how it will control the level of noise in the summary tables to guarantee privacy.
The Census Bureau has been an early adopter of differential privacy. Still, instituting the framework on such a large scale is not an easy task, and even some of the big technology firms have had difficulties.
For example, shortly after Apple’s announcement in 2016 that it would use differential privacy for data collected from its macOS and iOS operating systems, it was revealed that the actual privacy loss of its systems was much higher than advertised.
The Census Bureau “swapped” Debra and David Luchsinger’s information with another household living in another part of New York who matched them on some key questions.