The European Data Protection Board (EDPB) adopted Opinion 28/2024 on 17 December 2024 on the basis of Article 64(2) of the General Data Protection Regulation (GDPR). This opinion addresses data protection concerns related to AI models. Requested by the Irish Data Protection Commission (DPC) in September 2024, the opinion offers non-exhaustive guidance on interpreting GDPR provisions when training and deploying AI models. It follows the Report of the work undertaken by the ChatGPT Taskforce adopted on 23 May 2024.
Interestingly, a close reading of Opinion 28/2024 yields a series of observations about the anonymisation of personal data more generally. The purpose of this blog post is to highlight and discuss these observations.
The DPC triggered Opinion 28/2024 by raising four main questions, which go beyond aspects related to the appropriate legal basis for personal data processing in the context of AI models:
1) When is an AI Model considered to be anonymous?
2) How can controllers demonstrate that they meet the test for the legitimate interest legal basis when they create, update and/or develop an AI Model?
3) How can controllers demonstrate that they meet the test for the legitimate interest legal basis when they deploy an AI Model?
4) What are the consequences of an unlawful processing of personal data during the development phase of an AI model upon subsequent phases, such as deployment?
In this blog post we will mainly focus on the EDPB’s answer to question 1, aiming to shed light on its position regarding the anonymisation of personal data in general.
While guidelines on anonymisation and pseudonymisation have been included in the EDPB’s work programme for several years now (see e.g., the work programme for 2021/2022, which foresees one set of guidelines, and the work programme for 2023/2024, which foresees two), no output has been released yet and it is unlikely that any will be released in the next few months.
Are AI Models Personal Data?
Before grabbing the highlighter and digging into Opinion 28/2024, let’s recall that a discussion paper from the Hamburg Commissioner for Data Protection and Freedom of Information (HmbBfDI) on Large Language Models (LLMs) and the GDPR had, over the summer, intensified the debate about whether the storage of LLMs amounts to personal data processing under the GDPR (See also prior guidance from the Danish DPA and the report of a technologist roundtable run by the Future of Privacy Forum). The HmbBfDI discussion paper had indeed been released after a research team had shown that “an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT” and therefore that “practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorisation.” As a reminder, memorisation is generally understood as a phenomenon enabling an adversary to query a machine learning model without prior knowledge of its training data and extract a portion of that data.
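To make memorisation more concrete, below is a minimal sketch of what a query-based extraction check could look like, assuming the Hugging Face transformers library and a small open model from the Pythia family mentioned above; the model name, prefix length, greedy decoding and exact-match criterion are illustrative choices on our part, not the methodology of the cited research.

```python
# Minimal sketch of a query-based extraction check: prompt the model with the
# prefix of a candidate training string and see whether it reproduces the rest
# verbatim. Model choice, prefix length and match criterion are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-70m"  # small open model from the Pythia family
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def appears_memorised(candidate: str, prefix_tokens: int = 8) -> bool:
    """Return True if greedy decoding regurgitates the candidate's suffix."""
    ids = tokenizer(candidate, return_tensors="pt").input_ids
    if ids.shape[1] <= prefix_tokens:
        return False
    prefix, suffix = ids[:, :prefix_tokens], ids[:, prefix_tokens:]
    output = model.generate(prefix, max_new_tokens=suffix.shape[1], do_sample=False)
    continuation = tokenizer.decode(output[0, prefix_tokens:], skip_special_tokens=True)
    target = tokenizer.decode(suffix[0], skip_special_tokens=True)
    return continuation.strip() == target.strip()

# Hypothetical record suspected to appear in the training corpus.
print(appears_memorised("Jane Doe, born 12/03/1987, lives at 42 Example Street, Dublin."))
```

Checks of this kind succeeding on strings containing personal data are precisely what makes the memorisation debate legally relevant.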
Given the formulation of the DPC’s first question, which asks whether a trained AI model “meet[s] the definition of personal data as set out in Article 4(1)” GDPR, the EDPB is invited in Opinion 28/2024 to weigh in on the debate intensified by the HmbBfDI. To answer question 1, the EDPB rephrases it slightly, stating that it “can be answered by analysing if an AI model resulting from training which involves processing of personal data should, in all cases, be considered anonymous.” (para. 28).
The EDPB draws two conclusions. First, adopting what seems to be a subjective approach that takes designers’ intentions into account, it states that AI models “specifically designed to provide personal data regarding individuals whose personal data were used to train the model, or in some way to make such data available” do include personal data (para. 29). Indeed, in some cases memorisation is considered beneficial, as it enables the model to recall accurate information useful for applications requiring precision.
Second, for other AI models, the EDPB stresses the possibility of an AI model memorising training data, including personal data. Personal data in this case is said to be “absorbed in the parameters of the model, namely through mathematical objects.” (para. 31). The EDPB’s conclusion is quite clear: whenever memorisation may happen, the model may not be anonymous and therefore its storage may amount to the processing of personal data (para. 31). The use of the auxiliary ‘may’ is important: what this means is that, ultimately, the answer to question 1 is case-specific.
The EDPB thus appears to depart from the position of the HmbBfDI: information that is not organised in a way that makes the relationship with an individual apparent can still amount to personal data (para. 37). Note that the HmbBfDI had opined that information stored in LLMs is not associated with individuals and therefore should not be considered information “relating” to a natural person, i.e., personal data. In addition, the EDPB is much more careful in its delineation of situationally relevant attackers. For the EDPB, it is clearly not satisfactory to simply note, as the HmbBfDI did, that “[g]enerally, designing and executing effective privacy attacks on LLMs require substantial technical expertise and time resources that the average user lacks.” What is more, the EDPB makes the likelihood of both probabilistic extraction and query extraction key considerations for determining whether an AI model can be deemed anonymous.
From AI Model Anonymity to Personal Data Anonymisation: What Are the Main Takeaways of Opinion 28/2024?
What does the EDPB’s response tell us about its interpretation of the test for personal data anonymisation under the GDPR, a test which is still very much in the making?
As a starting point, let’s recall that the legal test for personal data anonymisation should derive from a combined interpretation of Article 4(1) GDPR, which defines personal data, and Recital 26 GDPR, which specifies how identifiability should be assessed. Data protection experts thus usually refer to the ‘means’ test and look for “the means reasonably likely to be used, such as singling out, either by the controller or by another person” to determine whether the data controller or another person is in a position to identify one or more data subjects.
It is possible to draw five important observations from Opinion 28/2024.
First, the EDPB makes it clear, as we have argued elsewhere, that WP29 Opinion 05/2014 on Anonymisation Techniques embeds a two-prong test and that the prongs are in fact alternatives. In other words, to determine whether anonymisation is achieved, controllers have two options:
1) Option 1 anonymisation: to demonstrate that the three re-identification risks (singling out, linkability, and inference) are all mitigated
2) Option 2 anonymisation: “whenever a proposal does not meet one of the [3] criteria, a thorough evaluation of the identification risks should be performed” (para. 40, which refers back to WP29 Opinion 05/2014 on Anonymisation Techniques, p. 24)
In other words, the EDPB seems to be saying that anonymisation is not impossible under the GDPR simply because singling out, linkability and inference risks are not all mitigated; rather, a further risk evaluation should then be carried out by the data controller (see para. 40). This aligns with the external guidance on the implementation of the European Medicines Agency policy 0070 on the publication of clinical data for medicinal products for human use (a new version of the external guidance will be published in early 2025). It may, however, suggest that some existing interpretations are too restrictive. For example, in the THIN database case the Italian DPA only considered Option 1 and, incidentally, rejected k-anonymisation (a privacy-preserving method used to protect sensitive data by ensuring that each individual in a dataset is indistinguishable from at least k−1 other individuals with respect to certain attributes deemed to be quasi-identifiers) when each individual record is given a unique progressive code. This seemed to suggest that, for the Italian DPA, data in a k-anonymised state should never be considered anonymised under the GDPR.
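To illustrate the parenthesis above, here is a minimal sketch, using pandas and an invented toy table, of how k-anonymity is measured over a set of quasi-identifiers, and of why appending a unique progressive code to each record and treating it as a released attribute collapses the guarantee; the column names and records are hypothetical, and the specifics of the THIN processing are of course more involved than this toy example.

```python
# Minimal sketch: measure k-anonymity of a toy dataset with pandas.
# Column names, quasi-identifiers and records are invented for illustration.
import pandas as pd

records = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "postcode":  ["D04",   "D04",   "D04",   "D08",   "D08"],
    "diagnosis": ["asthma", "asthma", "flu",  "flu",   "flu"],
})

quasi_identifiers = ["age_band", "postcode"]

# k is the size of the smallest group of records sharing the same
# quasi-identifier values: every individual is indistinguishable from
# at least k-1 others with respect to those attributes.
k = records.groupby(quasi_identifiers).size().min()
print(f"The table is {k}-anonymous on {quasi_identifiers}")  # k = 2 here

# Adding a unique progressive code per record and treating it as part of
# the released data collapses every group to size 1, i.e. 1-anonymity.
records["progressive_code"] = range(1, len(records) + 1)
k_with_code = records.groupby(quasi_identifiers + ["progressive_code"]).size().min()
print(f"With the unique code as an attribute, k = {k_with_code}")
```

Whether such a code actually defeats anonymisation depends, of course, on who can access it and link it back to source records, which is precisely the kind of case-specific evaluation Option 2 calls for.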
Second, in terms of the means reasonably likely to be used to identify a data subject, the EDPB stresses the importance of not focusing only upon the means available to the intended recipient of the information (or the model) when evaluating the risks of re-identification, but of widening the net to consider unintended third parties as well (para. 42). This is particularly important in the light of EU case law, which up until now has not been particularly explicit regarding the range of relevant ‘other persons’ for the purposes of applying the ‘means’ test found in Recital 26 GDPR. In Breyer, for example, the CJEU only seems to take into account a potentially intended holder, i.e., the Internet Service Provider holding subscriber data, which could eventually be combined with the dynamic IP addresses stored by the online service provider when attacks on the website occur. In Scania, the CJEU only seems to consider intended recipients of the Vehicle Identification Number (VIN), i.e., independent operators communicating with car manufacturers who are expected to combine VINs with identifying information. In IAB Europe, the CJEU focuses upon intended holders in the form of members of IAB Europe. And in SRB, which has been rightly appealed, the General Court seems to be only interested in the situation of the intended recipient, i.e., Deloitte.
Third, the EDPB draws an important distinction between information, including AI models, that is publicly available and information that is not. The EDPB is thus saying that, depending upon the release setting of the information at stake (i.e., open or closed settings), supervisory authorities will have to consider “different levels of testing and resistance to attacks.” (para. 46). The EDPB thus indirectly confirms the relevance of context controls, i.e., controls that do not transform the data as such but impact its environment so that the likelihood of attacks is reduced. Of note, from an AI model perspective, such a distinction does not necessarily imply that making AI models publicly available would be unlawful. EU data protection law does not prohibit the making available of personal data that is already publicly available, as long as the controller is in a position to demonstrate compliance with the full set of data protection obligations and rights, which are not absolute.
Fourth, the EDPB makes it very clear that anonymisation processes are governed by data protection law and that the principle of accountability (Article 5(2) GDPR) is particularly relevant in this context. Appropriate documentation should therefore be produced to evidence the technical and organisational measures taken to reduce the likelihood of identification and to demonstrate their effectiveness, which may include the quantification of risks through the use of relevant metrics, as well as to describe the roles of the stakeholders involved in the data flows. This may explain why the decision of the Italian DPA in the THIN database case could still make sense, as the burden to perform a thorough evaluation of the risks of re-identification is borne by the data controller and, if such an evaluation is not rigorous enough, the latter is left with Option 1 anonymisation. Yet, the entity determining both the anonymisation means and the purposes of reuse (which had refused to view itself as a data controller) argued in this case that the anonymisation process aligned with the external guidance of the European Medicines Agency mentioned above and that, therefore, a thorough evaluation of re-identification risks had been performed, as evidenced in a report submitted to the DPA.
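By way of illustration, the sketch below shows the kind of quantified re-identification risk metrics that such accountability documentation could record, again on an invented toy table; the metrics chosen (minimum equivalence class size, proportion of sample uniques, worst-case re-identification probability) are common in the statistical disclosure control literature and are assumptions on our part, not metrics prescribed by the EDPB.

```python
# Minimal sketch: quantify and record re-identification risk metrics that
# could feed into accountability documentation. Metric choices are
# illustrative, not prescribed by the EDPB or Opinion 28/2024.
import json
import pandas as pd

def risk_report(df: pd.DataFrame, quasi_identifiers: list[str]) -> dict:
    class_sizes = df.groupby(quasi_identifiers).size()
    return {
        "quasi_identifiers": quasi_identifiers,
        "record_count": int(len(df)),
        # Smallest equivalence class, i.e. the k in k-anonymity.
        "k_anonymity": int(class_sizes.min()),
        # Share of records that are unique on the quasi-identifiers
        # (directly exposed to singling out).
        "proportion_unique": float((class_sizes == 1).sum() / len(df)),
        # Worst-case probability of correctly re-identifying a record
        # matched on the quasi-identifiers alone.
        "max_reidentification_probability": float(1 / class_sizes.min()),
    }

records = pd.DataFrame({
    "age_band": ["30-39", "30-39", "40-49"],
    "postcode": ["D04", "D04", "D08"],
})
print(json.dumps(risk_report(records, ["age_band", "postcode"]), indent=2))
```

A report of this kind is only one piece of the documentation envisaged above; it would still need to be accompanied by a description of the context controls applied and of the roles of the stakeholders involved.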
Fifth, and this is probably one of the most contentious parts, the EDPB seems to opine that if personal data is processed unlawfully and then successfully anonymised, the GDPR does not apply to the anonymised data (para. 134). The EDPB is clearly concerned about the fate of AI models developed in breach of applicable data protection rules and, responding to the HmbBfDI, seems to be suggesting to controllers a way to regularise the situation. However, as anonymisation should be conceived as being fundamentally contextual, meaning that the status of the data, including models, could evolve depending upon who is holding and shielding it, it is not clear whether such an option will often be viable. To convey the message that AI models produced out of unlawful processing may survive scrutiny, it would probably have been sufficient to mention that supervisory authorities, in case of infringement, have a range of corrective measures at their disposal, to be selected in light of the circumstances of each case, which is what the EDPB does in para. 114.
In conclusion, although EDPB Opinion 28/2024 has been drafted to answer a set of very specific questions related to AI models, it contains the seeds of a more general approach to data anonymisation. We shall see whether the EDPB maintains the same direction in its future guidelines on anonymisation. Beyond generic rules of thumb, what would be extremely useful for these guidelines would be a set of implementation scenarios stressing the importance of purpose limitation for such assessments (as previously hinted here).
Sophie Stalla-Bourdillon is co-Director of the Privacy Hub. She is also a visiting professor at the University of Southampton Law School, where she held the chair in IT Law and Data Governance until 2022. She was Principal Legal Engineer at Immuta Research for six years. Sophie is the author and co-author of several legal articles, chapters and books on data protection and privacy. She is Editor-in-Chief of the Computer Law and Security Review, a leading international journal of technology law, and has also served as a legal and data privacy expert for the European Commission, the Council of Europe, the Organisation for Security and Co-operation in Europe, and the Organisation for Economic Co-operation and Development.