started all this, and police arrested w0rmer aka Ochoa.
Maintaining Internet anonymity against a ubiquitous surveillor is nearly impossible.
If you forget even once to enable your protections, or click on the wrong link, or
type the wrong thing, you’ve permanently attached your name to whatever anonymous
provider you’re using. The level of operational security required to maintain privacy
and anonymity in the face of a focused and determined investigation is beyond the
resources of even trained government agents. Even a team of highly trained Israeli
assassins was quickly identified in Dubai, based on surveillance camera footage around
the city.
The same is true for large sets of anonymous data. We might naïvely think that there
are so many of us that it’s easy to hide in the sea of data. Or that most of our data
is anonymous. That’s not true. Most techniques for anonymizing data don’t work, and
the data can be de-anonymized with surprisingly little information.
In 2006, AOL released three months of search data for 657,000 users: 20 million searches
in all. The idea was that it would be useful for researchers; to protect people’s
identity, they replaced names with numbers. So, for example, Bruce Schneier might
be 608429. They were surprised when researchers were able to attach names to numbers
by correlating different items in individuals’ search history.
In 2006, Netflix published 100 million movie rankings by about 500,000 anonymized customers,
as part of a challenge for people to come up with better recommendation systems than
the one the company was using at that time. Researchers were able to de-anonymize
people by comparing rankings and time stamps with public rankings and time stamps
in the Internet Movie Database.
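The kind of correlation described here can be illustrated with a small Python sketch. It is a simplified illustration, not the researchers' actual method: the function name `link_records`, the thresholds `max_days` and `min_matches`, and all of the toy data are hypothetical. An anonymous profile is linked to a public one only when exactly one public profile shares enough identical ratings made within a few days of each other.

```python
from datetime import date

def link_records(anonymous, public, max_days=3, min_matches=2):
    """Link anonymous IDs to public names by matching ratings and dates.

    An anonymous ID is linked only if exactly one public profile shares at
    least `min_matches` titles with the same rating, rated within
    `max_days` of each other. Both thresholds are illustrative assumptions.
    """
    links = {}
    for anon_id, anon_ratings in anonymous.items():
        candidates = []
        for name, pub_ratings in public.items():
            matches = 0
            for title, (rating, when) in anon_ratings.items():
                if title not in pub_ratings:
                    continue
                pub_rating, pub_when = pub_ratings[title]
                close_in_time = abs((when - pub_when).days) <= max_days
                if pub_rating == rating and close_in_time:
                    matches += 1
            if matches >= min_matches:
                candidates.append(name)
        # Only an unambiguous match counts as a re-identification.
        if len(candidates) == 1:
            links[anon_id] = candidates[0]
    return links

# Hypothetical toy data: title -> (rating, date rated).
anonymous = {
    1001: {
        "Obscure Film A": (5, date(2005, 3, 1)),
        "Obscure Film B": (2, date(2005, 3, 9)),
        "Popular Film": (4, date(2005, 4, 1)),
    },
    1002: {"Popular Film": (4, date(2005, 4, 2))},
}
public = {
    "alice": {
        "Obscure Film A": (5, date(2005, 3, 2)),
        "Obscure Film B": (2, date(2005, 3, 10)),
    },
    "bob": {"Popular Film": (4, date(2005, 4, 1))},
}

links = link_records(anonymous, public)
print(links)
```

In this toy example, customer 1001 is linked to "alice" through two obscure titles, while customer 1002, who rated only a popular film, stays unlinked. Rarely rated items carry most of the identifying power.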
These might seem like special cases, but correlation opportunities pop up more frequently
than you might think. Someone with access to an anonymous data set of telephone records,
for example, might partially de-anonymize it by correlating it with a catalog merchant’s
telephone order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing
a database of credit card purchase details.
Using public anonymous data from the 1990 census, computer scientist Latanya Sweeney
found that 87% of the population in the United States, 216 million of 248 million
people, could likely be uniquely identified by their five-digit ZIP code combined
with their gender and date of birth. For about half, a city, town, or municipality
name in place of the ZIP code was sufficient. Other researchers reported similar results using 2000 census
data.
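A uniqueness measurement of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the census study: the function name `unique_fraction`, the field names, and the toy records are all hypothetical. It counts how many records have a combination of attributes that no other record shares.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Return the fraction of records whose combination of the given
    fields appears exactly once in the data set."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

# Hypothetical toy records; not real census data.
records = [
    {"zip": "02138", "gender": "F", "dob": "1960-07-31"},
    {"zip": "02138", "gender": "F", "dob": "1960-07-31"},
    {"zip": "02138", "gender": "M", "dob": "1971-02-14"},
    {"zip": "02139", "gender": "F", "dob": "1985-11-02"},
    {"zip": "60614", "gender": "M", "dob": "1971-02-14"},
]

by_all_three = unique_fraction(records, ["zip", "gender", "dob"])
by_gender = unique_fraction(records, ["gender"])
print(by_all_three, by_gender)
```

In the toy data, no one is unique by gender alone, but three of the five records become unique once ZIP code, gender, and date of birth are combined. Each field looks harmless on its own; the combination is what identifies.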
Google, with its database of users’ Internet searches, could de-anonymize a public
database of Internet purchases, or zero in on searches of medical terms to de-anonymize
a public health database. Merchants who maintain detailed customer and purchase information
could use their data to partially de-anonymize any large search engine’s search data.
A data broker holding databases of several companies might be able to de-anonymize
most of the records in those databases.
Researchers have been able to identify people from their anonymous DNA by comparing
the data with information from genealogy sites and other sources. Even something like
Alfred Kinsey’s sex research data from the 1930s and 1940s isn’t safe. Kinsey took
great pains to preserve the anonymity of his subjects, but in 2013, researcher Raquel
Hill was able to identify 97% of them.
It’s counterintuitive, but it takes less data to uniquely identify us than we think.
Even though we’re all pretty typical, we’re nonetheless distinctive. It turns out
that if you eliminate the top 100 movies everyone watches, our movie-watching habits
are all pretty individual. This is also true for our book-reading habits, our Internet-shopping
habits, our telephone habits, and our web-searching habits. We can be uniquely identified
by