
Research Data Management: Anonymize data

Anonymizing data

What we mean by ‘anonymizing data’ is not always clear. The expression can be used for data sets wherein records cannot be linked to an individual (person, organization or company). Another meaning is that the identifying information has been removed, but the means to re-identify individual(s) still exist.

Research subjects are believed to respond more truthfully when they know that their identity is not recorded. This is even more relevant when questions deal with sensitive issues, social taboos or illegal behaviour. Since it is in the interest of the researcher to ‘get to the truth’, there may be an element of self-interest in providing anonymity when collecting primary data.

Equally important is the ethical principle that research subjects should benefit from the research in which they participate, or at least that the research should do them ‘no harm’. In particular, when data is shared, it will be necessary to design how data is to be de-identified before access is given to other researchers.

Considerations when anonymizing data

Offering anonymity to respondents is not as simple as you might think. If you do not need personally identifiable information (PII), then don't include it in your data collection/survey. When an initial survey needs a follow-up, however, the researcher may want to record personal details which are directly identifying (e.g. the classical information fields: name, address, postcode, date of birth, driving licence and social security numbers [1], but also their digital equivalents: IP addresses, GPS information and the MAC address of your mobile device [2]). When these identifiers are kept in a database together with the other responses, this presents an inherent risk.
In some countries, lists of respondents can be subpoenaed in court for various reasons, after which they become a matter of public record. It is therefore advisable to keep directly identifying personal details separated from the rest of the data. Personal details are then replaced by a key/code. Only the code is part of the database with data, and the list of respondents/research subjects is kept separately.
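The separation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the field names in `DIRECT_IDENTIFIERS`, the function `split_identifiers` and the two example respondents are all made up for this sketch, and the random code is generated with the standard-library `secrets` module so that it cannot be derived from the personal details themselves.

```python
import secrets

# Hypothetical field names; replace with the identifiers your survey records.
DIRECT_IDENTIFIERS = ("name", "address", "postcode", "date_of_birth")


def split_identifiers(records, identifier_fields=DIRECT_IDENTIFIERS):
    """Split respondent records into a key list and a coded research data set."""
    key_list = []       # code + personal details: store separately and securely
    research_data = []  # code + responses: the only part used for analysis
    used_codes = set()
    for record in records:
        # Random code, not derived from any personal detail.
        code = secrets.token_hex(8)
        while code in used_codes:
            code = secrets.token_hex(8)
        used_codes.add(code)

        identifying = {f: v for f, v in record.items() if f in identifier_fields}
        responses = {f: v for f, v in record.items() if f not in identifier_fields}
        key_list.append({"code": code, **identifying})
        research_data.append({"code": code, **responses})
    return key_list, research_data


# Made-up example respondents, for illustration only.
respondents = [
    {"name": "A. Jansen", "postcode": "1234 AB", "age": 34, "answer_q1": "yes"},
    {"name": "B. de Vries", "postcode": "5678 CD", "age": 51, "answer_q1": "no"},
]
key_list, research_data = split_identifiers(respondents)
```

In this sketch `research_data` holds only the code and the responses, while `key_list` is the separate list of respondents. Re-identification for a follow-up is only possible for someone who has access to both.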

This level of anonymizing may not be sufficient. When a research subject reports a unique combination of characteristics, the combination may also reveal the identity of the respondent/research subject.
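One way to spot such unique combinations is to count how often each combination of indirectly identifying characteristics occurs in the data. The sketch below assumes that records are Python dictionaries; the function name `risky_records`, the threshold `k` and the example fields and values are illustrative assumptions, not part of any standard tool.

```python
from collections import Counter


def risky_records(records, quasi_identifiers, k=2):
    """Return records whose combination of characteristics occurs fewer than k times."""
    def combo(record):
        return tuple(record.get(field) for field in quasi_identifiers)

    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] < k]


# Made-up example data: the codes replace the directly identifying details.
data = [
    {"code": "r1", "age_group": "30-39", "occupation": "teacher", "municipality": "Utrecht"},
    {"code": "r2", "age_group": "30-39", "occupation": "teacher", "municipality": "Utrecht"},
    {"code": "r3", "age_group": "60-69", "occupation": "mayor", "municipality": "Utrecht"},
]
unique = risky_records(data, ["age_group", "occupation", "municipality"])
print([r["code"] for r in unique])  # prints ['r3']
```

Record `r3` is flagged because no other respondent shares its combination of age group, occupation and municipality, even though no name or address is stored. Such records may need to be generalized (for example, by using broader categories) or withheld before the data is shared.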

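[1] In the Netherlands: the Burgerservicenummer (BSN), the citizen service number.
[2] MAC: Media Access Control. A MAC address is a unique identifier assigned to the network interface of a device and used when the device connects to a network.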
