2023-02-08
Data anonymization is an old concept in data privacy and data storage. But in 2017 it returned to the forefront and drew considerable attention thanks to the GDPR, which protects users' privacy and regulates how data is used.
In this article, we will define data anonymization, show why companies should put it in place, and explain how it can be achieved.
For more about the GDPR: gdpr-info.eu
Data anonymization is the process by which personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.
According to the GDPR, personal data is “any information relating to an identified or identifiable natural person”. In other words, personal data is any information that directly or indirectly identifies a person. It can therefore be a name or a city of birth, but also a customer number, an email address, etc.
You wouldn’t want a stranger looking into your bedroom through your window without your knowledge, would you? Now imagine that you left your ID, your family photo album, and your first tooth neatly laid out by the window. The image is as accurate as it is disturbing, judging by how much information can be collected about an individual from their online behavior. Privacy is not only valuable but is also a value in and of itself in today’s world. The right to privacy is the individual’s right to be protected by law from unwanted intrusion. With this in mind, the European Parliament and the Council of the European Union published the GDPR in 2016. The regulation became enforceable on May 25th, 2018.
Since 2018, we have been hearing more news about European Union regulators imposing record-breaking fines, totaling billions of dollars, on tech giants such as Google and Facebook for violating the GDPR in the way they collected and handled user data. In response, companies began to change their data practices to comply with the GDPR, which should lead to safer products for their users.
If we anonymize data to help with GDPR compliance, we must make sure that we do not violate the GDPR ourselves. The main question is how to anonymize data without breaking GDPR rules while still keeping enough of the data's value for meaningful use. It’s a tug-of-war between two interests that are sometimes at odds.
With the purpose of the task in mind, as well as the format of the data that will be anonymized and its particularities, one can then begin the process of analyzing and choosing which techniques will be used and to which attributes they will be applied. These are some techniques:
Generalization consists of reducing the precision of the data: the values of the attributes are replaced by others that are semantically similar but less specific.
In the example, generalization is applied to the "Age" and "Address" columns. The individual's age has been transformed into an interval. For the "Address" column, the value has been abstracted hierarchically: instead of the street name, number, and zip code, only the street name is kept. These generalizations allow the information to retain some analytical power while hiding its exact value.
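To make the idea concrete, here is a minimal Python sketch of generalization. The function names, the 10-year interval width, and the "street, number, zip" address format are assumptions chosen for illustration, not a prescribed method.

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the interval that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_address(address: str) -> str:
    """Keep only the street name.

    Assumes a hypothetical "street, number, zip" format; real address
    data usually needs a proper parser.
    """
    return address.split(",")[0].strip()


# Hypothetical record, for illustration only.
record = {"age": 37, "address": "Baker Street, 221, NW1 6XE"}
generalized = {
    "age": generalize_age(record["age"]),
    "address": generalize_address(record["address"]),
}
print(generalized)
```

Wider intervals hide more but leave less to analyze, so the width should be chosen with the intended use of the data in mind.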
Aggregation is the process of converting a data set into a list of summary values. In other words, instead of a collection of entries containing personal data, we produce new columns that preserve the statistical properties of the database and hide the identity of the individuals behind the data. The size of the aggregated groups is very important: groups must not contain too few entries. For an attacker with enough background information, a group with a single individual may be all that is needed for re-identification.
In the example, the donor information has been hidden by grouping the monthly salary values into intervals. The total shown no longer refers to an individual but is the sum over each interval.
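The sketch below shows one possible way to aggregate donor records in Python. The field names (`salary`, `donation`), the band width, and the `min_group_size` threshold are hypothetical. Groups smaller than the threshold are dropped, which reflects the warning above about groups with too few entries.

```python
from collections import defaultdict


def aggregate_by_salary_band(rows, band_width=1000, min_group_size=2):
    """Sum donations per salary interval and drop groups that are too small."""
    groups = defaultdict(list)
    for row in rows:
        low = (row["salary"] // band_width) * band_width
        groups[(low, low + band_width - 1)].append(row["donation"])

    result = []
    for (low, high), donations in sorted(groups.items()):
        if len(donations) < min_group_size:
            continue  # suppress groups that could single out one person
        result.append({
            "salary_band": f"{low}-{high}",
            "donors": len(donations),
            "total_donation": sum(donations),
        })
    return result


# Hypothetical donor records, for illustration only.
donors = [
    {"name": "Ana", "salary": 2300, "donation": 50},
    {"name": "Ben", "salary": 2900, "donation": 30},
    {"name": "Cleo", "salary": 4100, "donation": 120},
    {"name": "Dan", "salary": 4800, "donation": 80},
    {"name": "Eve", "salary": 9500, "donation": 500},
]
for line in aggregate_by_salary_band(donors):
    print(line)
```

In this sample, the 9000-9999 band contains a single donor, so it is left out of the result instead of being published.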
Masking is a strategy based on replacing characters in the value of a column or attribute with symbols such as “*” or “x”. The masking is usually partial, i.e. only part of the value is hidden. Depending on the nature of the attribute, the technique can be applied to a fixed number of characters (e.g. credit card numbers) or to a variable number (e.g. emails). To make this decision, it’s necessary to analyze which part and how much of the value should be masked so that the characters that remain visible do not make re-identification possible.
In the example, a fixed number of characters were hidden in the Registration column, while for the Email field, which has a variable length, a fraction of the string was masked (in this case, ⅓). It’s important not to fix an exact number of characters for variable-length values, since a short value could then end up completely hidden.
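Both variants can be sketched in a few lines of Python. The function names, the choice to mask from the start of the string, and the sample values are assumptions made for this illustration.

```python
import math


def mask_fixed(value: str, hidden: int = 4, symbol: str = "*") -> str:
    """Hide a fixed number of leading characters (fixed-length fields)."""
    n = min(hidden, len(value))
    return symbol * n + value[n:]


def mask_fraction(value: str, fraction: float = 1 / 3, symbol: str = "*") -> str:
    """Hide a fraction of the leading characters (variable-length fields)."""
    n = math.ceil(len(value) * fraction)
    return symbol * n + value[n:]


# Hypothetical values, for illustration only.
print(mask_fixed("2019004321", hidden=6))
print(mask_fraction("jane.doe@example.com"))
```

Which part stays visible matters as much as how much is hidden: in the email case, the visible domain may still narrow down who the person is, so the masked portion should be chosen per attribute.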
Probably one of the best-known anonymization techniques, pseudonymization consists of replacing an identifier with a false value, the pseudonym. These values must be unique and unrelated to the original data. Pseudonyms can be generated randomly or deterministically, but in both cases the original information is completely lost in the resulting data set. Thus, there is a great loss of utility. One attempt at preserving utility is to use persistent pseudonyms, in other words, to use the same pseudonym to identify the same individual in multiple databases.
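As a rough illustration, the Python sketch below shows a random and a deterministic way to generate pseudonyms using only the standard library. The class and function names, the key, and the pseudonym length are hypothetical. The deterministic variant uses a keyed hash (HMAC-SHA256), which is one common choice and not the only one. Because it always maps the same identifier to the same pseudonym, it can serve as a persistent pseudonym across databases that share the key.

```python
import hashlib
import hmac
import secrets


class RandomPseudonymizer:
    """Assign a random pseudonym per identifier, remembered in a lookup table."""

    def __init__(self):
        self._table = {}

    def pseudonym(self, identifier: str) -> str:
        if identifier not in self._table:
            self._table[identifier] = secrets.token_hex(8)
        return self._table[identifier]


def deterministic_pseudonym(identifier: str, key: bytes, length: int = 16) -> str:
    """Derive a pseudonym from the identifier with a keyed hash.

    Truncating the digest keeps pseudonyms short but raises the (still
    small) chance of collisions; keep the full digest if uniqueness is critical.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key for illustration; a real key must be generated and stored securely.
SECRET_KEY = b"replace-with-a-real-secret"

in_db_a = deterministic_pseudonym("jane.doe@example.com", SECRET_KEY)
in_db_b = deterministic_pseudonym("jane.doe@example.com", SECRET_KEY)
print(in_db_a == in_db_b)  # same individual, same pseudonym in both databases
```

Note that both the lookup table and the key allow the pseudonyms to be linked back to individuals, so they need to be protected at least as carefully as the original data.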
Now that we’ve looked at anonymization techniques, let’s consider some opportunities that truly anonymized, synthetic data offers:
It is clear that anonymization is one of the best ways to ensure the safety of the data you collect. This extra measure of security lets you use your data in ways that would not be legally allowed with non-anonymized data. However, there are also some considerable benefits to using personal data in its original form.