Anonymity & Data Privacy
My entire life I've had limited privacy due to my unique first name. A Google search of 'Avolyn' will return mostly me. While I value privacy, I didn't grow up assuming I had it. I couldn't create blogs online as a kid without my parents finding them. I once had a guy on the road find my Myspace page because my license plate, when I was 16, read 'Avolyn', and that was all he needed to find me online (no, my license plate no longer reads my name).
Now that Facebook, GDPR, and shows like YOU on Netflix have brought the topic of privacy front and center, everyone has growing concerns about it. Most of us are realizing we have limited privacy. Some people have decided to stay off Facebook in an effort to maintain theirs, and as much as I hate to be the bearer of bad news, Facebook is still collecting data on you even if you don't have a Facebook page.
I recently attended the WiDS conference at Stanford and met a woman with an equally unique name. She lamented that her parents had given her a name that was essentially as identifying as her Social Security number. Personally, I don't take as much issue with that, and I'm thankful I'll never be confused with a random serial killer or otherwise notorious person who shares my name.
Mark Zuckerberg recently published a 3,200-word essay, 'A Privacy-Focused Vision for Social Networking,' in which he offers a shockingly inadequate definition of what he feels privacy means: it 'gives people the freedom to be themselves.' With such a weak stance on data privacy from someone who has a bigger impact on our privacy than most other influences, things begin to look rather bleak.
Where does this leave us? For many, the obvious answer is anonymity. But do we even have that?
In 2006, Netflix launched a competition with a grand prize of $1,000,000 for whoever could improve its recommendation algorithm. In 2010 a planned follow-up was canceled after researchers showed the 'anonymized' data could be traced back to real people. A similar story played out when AOL attempted to release anonymous search-history data. Even the 2010 US Census was plagued with privacy issues when it was discovered that its old data-shielding techniques were not sufficient to protect the identities of individuals.
If you work in a data field, I am sorry to say that most of the work of protecting this data falls on you. It's a scary realization and a responsibility and burden that almost no one wants to bear, but someone has to. We have to take ownership of the task of protecting data if we interact with it on a daily basis.
Since the GDPR guidelines were announced, companies have been scrambling to put together workflows sufficient to manage the new regulations. Piwik published this 'Ultimate Guide to Data Anonymization in Analytics' with techniques as well as basic guidelines on which forms of data need to be protected. Unfortunately, research shows that even heavily generalized datasets have vulnerabilities that make individuals traceable.
One study found that 87% of all Americans can be identified with only three pieces of information: ZIP code, birth date, and gender. That's it.
Of course the goal of me writing this wasn't to leave you in the pits of despair, so I will finish on the note of solutions and the topic of Differential Privacy. In its simplest form, think of Differential Privacy as an algorithm that analyzes a dataset and computes statistics about it (mean, median, variance, mode, etc). The output of this algorithm is then deemed to be private if by looking at the output, one cannot tell whether any given individual's data was included in the original dataset or not.
This groundbreaking work was the result of years of effort by researchers applying algorithmic ideas to the study of privacy (Dinur and Nissim '03; Dwork and Nissim '04; Blum, Dwork, McSherry, and Nissim '05), culminating in the work of Dwork, McSherry, Nissim, and Smith '06.
Looking at the mathematical formulas for a moment, here is the formal definition:
A randomized function K gives ε-differential privacy if, for all data sets D and D′ differing on at most one row, and all S ⊆ Range(K):

Pr[K(D) ∈ S] ≤ exp(ε) × Pr[K(D′) ∈ S]
An attacker or data analyst should therefore not be able to learn any information about any participant that they could not learn if the participant had opted out of the database.
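As a concrete, minimal illustration of this definition, consider randomized response, one of the oldest ε-differentially-private mechanisms: each participant reports their true answer only with a certain probability, so no single report proves anything about that individual, yet the aggregate can still be estimated. This is a sketch with invented data, not anything from the article:

```python
import math
import random

def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Report the truth with probability e^eps / (1 + e^eps),
    otherwise lie. This satisfies eps-differential privacy because
    either output is possible regardless of the true answer."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_answer if random.random() < p_truth else not true_answer

def estimate_proportion(reports, epsilon: float) -> float:
    """Undo the bias introduced by the coin flips to recover an
    estimate of the true proportion of 'yes' answers."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

random.seed(0)
truth = [True] * 300 + [False] * 700          # invented: 30% "yes" in reality
reports = [randomized_response(t, epsilon=1.0) for t in truth]
print(estimate_proportion(reports, epsilon=1.0))  # should land near 0.30
```

No individual report can be trusted, yet the population-level statistic survives; this is exactly the trade the definition above formalizes.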
From here we can apply various mechanisms, and mechanism design becomes its own sub-topic within Differential Privacy; the Laplace Mechanism and the Exponential Mechanism are the two most common mechanisms applied to Differential Privacy.
The Laplace Mechanism adds random noise to our query results, calibrated to the scale of our data, the nature of the query, and the maximum amount any single individual's data could change the answer (and thereby risk having their private information identified within the data).
The sensitivity of our query f can thus be defined as:

Δf = max_{D, D′} ‖f(D) − f(D′)‖₁
This formula is the maximum difference in values that our query may take on a pair of databases that differ by only one row; said another way, the maximum amount our answer would change if we removed any given individual. You can find the published study proving that privacy is guaranteed through this method here.
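Putting the sensitivity idea and the Laplace Mechanism together, here is a minimal sketch (in Python, with an invented toy dataset) of releasing a counting query privately. A count has sensitivity 1, since adding or removing one person changes it by at most 1:

```python
import numpy as np

def laplace_count(data, predicate, epsilon: float) -> float:
    """Answer 'how many rows satisfy predicate?' with eps-differential
    privacy. Counting queries have sensitivity 1, so Laplace noise
    with scale sensitivity / epsilon suffices."""
    true_count = sum(1 for row in data if predicate(row))
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [34, 29, 41, 52, 47, 38, 25, 60]       # invented toy data
noisy = laplace_count(ages, lambda a: a >= 40, epsilon=0.5)
print(noisy)  # the true count is 4; the released answer varies around it
```

Smaller ε means stronger privacy but noisier answers: the noise scale, sensitivity / ε, is exactly the trade-off the sensitivity discussed above controls.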
However, this method of using algorithms to add privacy to our data is not unique to single datasets or to research done for Kaggle competitions. It can be applied at the database level to the data housed in organizations, and these methods are currently in use at Google, Uber, Apple, Microsoft, and the Census Bureau.
How do we move Differential Privacy from Static Databases to Growing Databases?
As Dr. Rachel Cummings outlines in this talk she gave in 2018, there is a way to embed this logic into real-time data being stored in an organizational database. Historically, nearly everything done in this space was done on a static database and couldn't be applied to a growing one. Her methods apply Private Multiplicative Weights through adaptive linear queries for growing databases.
This bridges the research originally done by Cynthia Dwork and her peers and makes it real and applicable for organizations handling massive amounts of growing data, moving away from the old ways of doing data science toward new ones that provide privacy to the individuals included in our databases.
Who decides when data is private enough?
Ultimately, if you work with data, that decision falls on your shoulders. I think the first step is getting honest with yourself about the level of protection you've put in place, and the areas that could be improved.
Here are a few simple questions you can ask to get started:
What sensitive and personal information do we host within our organization?
What consequences would transpire if our data were to be compromised?
And most importantly, following the logic behind Differential Privacy:
What can we learn about a single record within our database (or dataset) that couldn't be learned by viewing the statistical output of our data?
Herein lies the opportunity: ensuring that the output generated by our queries and analyses is 'fuzzy' enough that it reveals nothing about any individual that couldn't be learned from the aggregate statistics alone.
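One way to make that last question operational is a simple sanity check: run a differentially private query many times on a database and on a neighbor that differs in one row, and confirm the two clouds of answers are hard to tell apart. This sketch uses invented data and assumes the 'replace one row' notion of neighboring databases; it is an illustration, not a formal privacy audit:

```python
import numpy as np

def noisy_mean(data, lo, hi, epsilon: float) -> float:
    """eps-DP mean: values are clipped to [lo, hi], so replacing one
    row changes the mean by at most (hi - lo) / n -- the sensitivity."""
    clipped = np.clip(np.asarray(data, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(clipped)
    return clipped.mean() + np.random.laplace(0.0, sensitivity / epsilon)

D = [34, 29, 41, 52, 47, 38, 25, 60]          # invented toy data
D_prime = [90] + D[1:]                        # neighbor: one row replaced
runs_D = [noisy_mean(D, 18, 90, epsilon=1.0) for _ in range(5000)]
runs_Dp = [noisy_mean(D_prime, 18, 90, epsilon=1.0) for _ in range(5000)]

# The two distributions of released answers overlap heavily: seeing a
# single released value tells you almost nothing about which database
# (with or without the changed individual) produced it.
print(round(float(np.mean(runs_D)), 1), round(float(np.mean(runs_Dp)), 1))
```

If the two output distributions were cleanly separable, any observer could infer the changed individual's value from a single release, which is precisely what the ε-differential privacy guarantee rules out.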