Differential privacy: anonymizing noise
Written by
Lahis Kurtz
March 2, 2020
Population statistics can be reverse-engineered into personal data; differential privacy is a way to prevent this.
Re-identification and privacy
At first glance, a database looks impersonal: just a bunch of numbers that tell us something about a group, without saying anything about any specific person in that group. The individuals would seem to be anonymous.
But stating this would be tantamount to stating that a book is just a bunch of words. As in logic puzzles, where a series of apparently disconnected clues together reveal situations and characteristics, information in databases can be combined to discover things about specific individuals. This is the concept of re-identification.
For example, let’s say we know a woman named Maria who works in a certain public office. Suppose that the website of her division publishes a table with the total number of men and women who work there, and that there are only 3 women. In another database, we find that 3 women in that division took maternity leave that year. Apparently, neither of these databases is about Maria. But by looking at these two supposedly anonymous databases, which contain no specific information about anyone, we end up discovering something about her: since all 3 women took maternity leave, Maria must be one of them, so she has one or more children, potentially a baby or young child at home.
This was a simple example, but with all the sophisticated algorithms and correlation analysis systems available today, such logic puzzles have become much easier to solve.
Creating or publishing any database that involves groups of people therefore requires care. Being identified as a mother in the context of a database may seem harmless to Maria, but the truth is that it can lead to unwanted situations. She may receive targeted advertising, or suffer discrimination when applying for jobs. In the same way, many other personal circumstances that are not of public interest or access, and that leave a person vulnerable in various situations, can be exposed through the re-identification of statistical data.
At the same time, statistical data is necessary and has gained great value in market studies and in the design of public policies. For this reason, a way of dealing with the problem of re-identification while preserving the usefulness of statistics was developed: differential privacy – one of the 10 most promising technologies of 2020, according to MIT Technology Review’s annual list.
Anonymizing noise
Recognized for its anonymizing potential, differential privacy is formally defined as follows:
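A randomized mechanism $\mathcal{M}$ satisfies $\varepsilon$-differential privacy if, for every pair of databases $D$ and $D'$ that differ in the data of a single individual, and for every set $S$ of possible outputs,

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S].$$

This is the standard formulation introduced by Cynthia Dwork and colleagues. In words: whether or not any one person’s data is included, the probability of any given published result changes by at most a factor of $e^{\varepsilon}$; the smaller the parameter $\varepsilon$, the stronger the privacy guarantee.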
In practice, applying it amounts to inserting fabricated information (noise) into real databases, in order to make it difficult to re-identify the individuals who take part in the statistics.
If we stop to think for a moment, it is easy to see what problems could result from this idea: nobody wants statistics saying, for example, that most of the population has characteristic X when, in fact, the majority has characteristic Y. An incorrect statistic is useless and can cause many problems. The method therefore also involves taking care that the inserted information does not significantly change the statistics provided by the database.
It is as if the database were several voices talking at the same time, and differential privacy added noise that prevents anyone from knowing what is being said by whom. But the noise cannot drown out the voices, or it would become impossible to understand them at all.
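To make this concrete, here is a minimal sketch in Python of the Laplace mechanism, the classic way of adding differentially private noise to a counting query. The function name and the numbers are illustrative, not taken from any specific system:

```python
import numpy as np

def private_count(true_count, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one person
    changes the true answer by at most 1, so Laplace noise with scale
    1/epsilon is enough to mask any single individual's presence.
    """
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The true number of women on maternity leave in Maria's division.
true_count = 3

# Smaller epsilon means more privacy and more noise; larger epsilon
# means more accuracy and weaker privacy.
for epsilon in (0.1, 0.5, 1.0):
    print(epsilon, round(private_count(true_count, epsilon), 1))
```

Because the noise has mean zero, answers remain centered on the true value; and because its scale is calibrated to how much one person can change the count, no single answer reveals whether any given individual is in the data.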
Thus, even when data is removed from or added to a database where differential privacy has been applied, the information it contains is still valid. The mathematical noise cannot be more important than the data into which it is mixed. There is also a concern with optimizing the use of this noise, maximizing the accuracy of the database without increasing the risk of re-identification. Some technical problems around this are still under discussion, as for example in this dissertation, which suggests ways to apply the technique to correlated data without producing more noise than necessary.
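This guarantee can also be seen empirically. In the sketch below (again illustrative, assuming numpy), the same mechanism is run on two neighboring databases, one with and one without a given person; each answer stays close to its true count, while the two answer distributions overlap too much for an observer to tell the databases apart:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def noisy_count(db, epsilon):
    # The same Laplace mechanism, applied to the size of a database.
    return len(db) + rng.laplace(0.0, 1.0 / epsilon)

d = ["ana", "bia", "maria"]    # original database
d_neighbor = ["ana", "bia"]    # neighboring database: "maria" removed

epsilon = 0.5
answers = [noisy_count(d, epsilon) for _ in range(10_000)]
answers_neighbor = [noisy_count(d_neighbor, epsilon) for _ in range(10_000)]

# The average answers stay close to the true counts (3 and 2), so the
# statistic is still useful; but any single noisy answer does not
# reveal whether "maria" is in the database.
print(np.mean(answers), np.mean(answers_neighbor))
```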
The importance of anonymization
A growing number of databases about populations and groups are accessible, or potentially accessible, to the public. And we take part in more and more of these databases, built by various social actors, such as governments and private companies, which collect information about us and our behavior.
The statistics generated serve as a form of transparency and as a source of research and information. Censuses and demographic and behavioral surveys can provide valuable information for understanding more about ourselves, as individuals and as groups.
More than ever, the available technologies are enabling discoveries and new discussions about communities and diversity. However, we must take care that they are not used as a way to profile, and potentially discriminate against, individuals. In this context, differential privacy is an extremely relevant technique for maintaining the balance between data protection and the right of access to information.
If you are interested in the subject of privacy and the processing of personal data, check out the video produced by IRIS on the topic.
The views and opinions expressed in this article are those of the author.
Illustration by Freepik
Written by
Lahis Kurtz
Head of research and researcher at the Institute of Research on Internet and Society (IRIS), PhD candidate at the Law Programme of the Federal University of Minas Gerais (UFMG), Master of Laws in Information Society and Intellectual Property from the Federal University of Santa Catarina (UFSC), Bachelor of Laws from the Federal University of Santa Maria (UFSM).
Member of the research groups Electronic Government, Digital Inclusion and Knowledge Society (Egov) and the Informational Law Research Center (NUDI), with ongoing research since 2010.
Interested in: information society, law and the internet, electronic government, internet governance, access to information. Lawyer.