LibGuides: Research data management: Anonymization / pseudonymization

Introduction

The HES-SO recommends that all personal and sensitive data be anonymized as early as possible in the research process and remain anonymized throughout the project. At the very least, and in any case, anonymization must take place before the research data is made available for reuse and deposited in a database.

If anonymization is not possible, then pseudonymization is recommended. For projects submitted to cantonal human research ethics commissions, stricter or more specific conditions may be imposed and must be respected.

Definition

Anonymization and pseudonymization of research data are operations that modify data sets by deleting or transforming personal data, so that the people who are the subject of the research cannot be identified. Both operations aim to protect sensitive data concerning people's identity, health, cultural practices, political or religious opinions, or social affiliations. They fall within the scope of personal data protection laws (FADP/GDPR).

Anonymization is an irreversible data transformation process. Identifying data (name, age, place of residence, etc.) is not recorded and therefore does not appear in the data set. It is therefore impossible, even for the researchers who carried out the research, to associate a specific person with a data set.
Pseudonymization (or coding) prevents direct identification: the directly identifying data is grouped together in specific files and kept separate from the data used by the researchers, where it has been replaced by indirectly identifying data. Individual anonymity is guaranteed within the data set, but re-identification remains possible, as each respondent has been assigned an alphanumeric code which, if necessary, can be traced back using a correspondence table.

The same processes are sometimes referred to under different names or terms:

Definitive anonymization (mentioned above as "anonymization")
Temporary or reversible anonymization (mentioned above as "pseudonymization")

Anonymization

"Anonymization is the processing of personal data using a set of techniques that make it practically impossible to re-identify a person by any means whatsoever."³

This is an irreversible operation, and its consequences (an unavoidable loss of information) must be weighed beforehand. Because anonymized data no longer identifies anyone, it is not subject to data protection laws (FADP/GDPR). It can then be shared and reused without restriction, and stored for an unlimited period of time, provided that those responsible for data processing preserve the anonymous nature of the data over time.

To anonymize a data set, you can, for example, proceed as follows:

Randomization: modifying attributes in a dataset (by swapping individuals' birth dates, for example).
Generalization: grouping attributes in a dataset into classes (replacing dates of birth with age classes, for example).

The aim in both cases is to blur the socio-geographical data, making it impossible to re-identify individuals by correlation.
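The two techniques named above can be illustrated with a minimal Python sketch. The field names (`birth_year`, `canton`), the ten-year class width, and the fixed reference year are assumptions chosen for illustration and do not come from the guideline. Swapping values in a data set this small would not be enough to prevent re-identification in practice.

```python
import random

def generalize_age(birth_year, reference_year=2024, width=10):
    """Generalization: replace a birth year with an age class such as '30-39'.

    reference_year and width are illustrative assumptions.
    """
    age = reference_year - birth_year
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def swap_attribute(records, key, seed=None):
    """Randomization: permute the values of `key` among individuals.

    Returns new records; the originals are left untouched.
    """
    rng = random.Random(seed)
    values = [record[key] for record in records]
    rng.shuffle(values)
    return [{**record, key: value} for record, value in zip(records, values)]

# Hypothetical records, invented for this example.
records = [
    {"birth_year": 1985, "canton": "VD"},
    {"birth_year": 1992, "canton": "GE"},
    {"birth_year": 1970, "canton": "VS"},
]

generalized = [
    {"age_class": generalize_age(r["birth_year"]), "canton": r["canton"]}
    for r in records
]
swapped = swap_attribute(records, "birth_year", seed=0)
```

Generalization keeps each value attached to the right person but makes it coarser. Swapping keeps the overall distribution of the attribute but breaks the link between the value and the individual.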

Pseudonymization

"Pseudonymization is the processing of personal data in such a way that the data can no longer be attributed to an identified individual without further information."³

The operation consists of replacing directly identifying data (surname, first names, etc.) in a dataset with indirectly identifying data (an alphanumeric code, for example). This is a reversible operation, since the information removed from the dataset is grouped together in a separate document (correspondence table) that can be consulted to re-identify the data. Since it remains possible to re-identify survey participants, pseudonymized (or coded) data sets are considered personal data and are therefore subject to data protection laws (FADP/GDPR), particularly with regard to the retention period and the possibility for data subjects to exercise their rights. The sharing and reuse of pseudonymized data sets are subject to authorization, in particular access to the file containing re-identifying data. Data controllers must maintain regulated access to the re-identification file (correspondence table) over time.

To pseudonymize a data set, one or more of the following coding operations are performed:

Replacing identifying information (surname, first name) with random numbers.
Encrypting part of the data (which is then no longer readable without the encryption key).
Substituting data with more general information (e.g. replacing the commune with the canton), or with information that is vague or false but has no consequences for the analysis (e.g. assigning a pseudonym).

In all three cases, the aim is to blur the socio-geographical data so that individuals cannot be re-identified without access to the correspondence tables. It is therefore essential that files containing re-identification information be accessible only to authorized persons, under strict and pre-specified conditions.
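As an illustration of the first operation (replacing names with random codes), here is a minimal Python sketch that splits a data set into coded research data and a correspondence table. The function name `pseudonymize`, the field names, the `P` prefix, and the code length are hypothetical choices, not part of the guideline. In a real project, the correspondence table would be stored separately from the research data, with access restricted to authorized persons.

```python
import secrets

def pseudonymize(records, id_fields=("surname", "first_name")):
    """Replace directly identifying fields with a random alphanumeric code.

    Returns (coded_records, correspondence_table). The table maps each
    code back to the identifying data and must be kept in a separate file.
    """
    coded, table = [], {}
    for record in records:
        code = "P" + secrets.token_hex(4)
        while code in table:  # regenerate in the unlikely event of a collision
            code = "P" + secrets.token_hex(4)
        table[code] = {field: record[field] for field in id_fields}
        remaining = {k: v for k, v in record.items() if k not in id_fields}
        coded.append({"code": code, **remaining})
    return coded, table

# Hypothetical participants, invented for this example.
participants = [
    {"surname": "Muster", "first_name": "Anna", "canton": "VS", "score": 12},
    {"surname": "Exemple", "first_name": "Jean", "canton": "VD", "score": 17},
]

coded, table = pseudonymize(participants)

# Re-identification is only possible with access to the correspondence table.
first_person = table[coded[0]["code"]]
```

The codes are drawn at random rather than derived from the names, so the coded data set reveals nothing about identities on its own.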

For a list of direct identifiers (from which a person can be immediately identified) and indirect identifiers (which can compromise data confidentiality if linked to other data sources), see the guideline published by Réseau Portage⁷.

Example of interview pseudonymization

Original text

"Last year I followed a couple from Afghanistan. The husband had Hodgkins cancer, which was tough. They already had two children and she was pregnant with twins. She bled during the pregnancy and had to be hospitalized. He was in the middle of treatment. Weakened but still at the refugee center. He couldn't really look after the children in his wife's absence. He slept a lot. The children were 3 years and 18 months old. She was under a lot of stress, because he was texting her and saying that things weren't going well. He was also feeling sick from the chemo. One day, we found the children in the Ikea shopping center opposite the refugee center ... they had crossed the main road on their own. The social service ..." (dummy example - situations of similar complexity in the data)

Pseudonymized text

"Recently I followed a couple from the Middle East. The husband had a chronic illness, it was hard. They already had two children and she was pregnant with twins. She had complications during the pregnancy and had to be hospitalized. He was ... weakened but still at the refugee center. He couldn't really look after the children in his wife's absence. He slept a lot. The children were under 4. She was under a lot of stress, because he was texting her and saying that things weren't going well .... .... One day, the children ran away and were found after a few hours ...they had crossed a main road on their own. The social service ...".

Source: Perrenoud (2021)
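Some of the substitutions visible in this example can be sketched as simple replacement rules in Python. This is only an illustration: the rule list and the function name `pseudonymize_text` are hypothetical, and nothing here is taken from the procedure described in the cited report. Qualitative material still requires careful manual review, since indirect identifiers (such as the shopping center episode above) cannot be caught by fixed rules.

```python
import re

# Hypothetical replacement rules: pattern -> more general wording.
REPLACEMENTS = {
    r"\bLast year\b": "Recently",
    r"\bAfghanistan\b": "the Middle East",
    r"\bHodgkins cancer\b": "a chronic illness",
}

def pseudonymize_text(text, replacements):
    """Apply each replacement rule in turn to the interview text."""
    for pattern, substitute in replacements.items():
        text = re.sub(pattern, substitute, text)
    return text

sample = (
    "Last year I followed a couple from Afghanistan. "
    "The husband had Hodgkins cancer, which was tough."
)
result = pseudonymize_text(sample, REPLACEMENTS)
```

Keeping the rules in a single table makes it easier to apply the same substitutions consistently across all transcripts of a project.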

Tools

Human subjects data: de-identify data using a list of direct and indirect identifiers.

More information

References

Commission nationale de l'informatique et des libertés (2019). L’anonymisation des données, un traitement clé pour l’open data. https://www.cnil.fr/fr/lanonymisation-des-donnees-un-traitement-cle-pour-lopen-data
Commission nationale de l'informatique et des libertés (2020). L’anonymisation de données personnelles. https://www.cnil.fr/fr/lanonymisation-de-donnees-personnelles
Commission nationale de l'informatique et des libertés (2022). Recherche scientifique (hors santé) : enjeux et avantages de l’anonymisation et de la pseudonymisation. https://www.cnil.fr/fr/recherche-scientifique-hors-sante/enjeux-avantages-anonymisation-pseudonymisation
Groupe de travail « article 29 » (2014). Avis 05/2014 sur les Techniques d’anonymisation. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_fr.pdf
Perrenoud, P. (2021). Rapport concernant les procédures et processus utilisés pour le partage de données qualitatives sur un DATA repository – Projet Open Data 2020 HES-SO. https://arodes.hes-so.ch/record/9174?ln=fr
Piquette, S. (2021). Anonymisation/pseudonymisation dans les projets de recherche [tutoriel]. Cycle de conférence sur l’éthique de la recherche et la protection des données personnelles. https://pod.univ-lille.fr/video/16591-s17-anonymisation-pseudonymisation/
Réseau Portage (2020). Directives sur la dépersonnalisation des données. https://zenodo.org/record/4047176#.Ys5mUTfP02x

Anonymization/pseudonymization: guideline
Guidelines working group of the HES-SO Open Science Community