Maryna Manteghi, PhD
researcher, University of Turku, Finland
Background
The Artificial
Intelligence Act (AIA), “the
first-ever legal framework on AI, which addresses the risks of AI and positions
Europe to play a leading role globally” (according
to the European
Commission), contains two
provisions which are relevant to copyright. In particular, Article 53(1), points (c) and (d), requires providers
of general-purpose AI models, first, to comply with “Union law on copyright and
related rights…in particular to identify and comply with…a reservation of
rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790,”
and, second, to “draw up and make publicly available a sufficiently detailed
summary about the content used for training of the general-purpose AI model…”. The
provisions have been added to the text of the
Act to address the risks
associated with the development and exploitation of generative AI (GenAI)
models such as ChatGPT, Midjourney, DALL-E, GitHub Copilot and others (see the Draft
Report of the European Parliament).
TDM in the context of copyright
AI systems have to be trained on
huge amounts of existing data, including copyright-protected works, to be able to
perform a wide range of challenging tasks and generate different types of
content (e.g., texts, images, music, computer programs) (for technical
aspects see e.g., Avanika Narayan et
al). In other words, GenAI models have to learn the inherent
characteristics of real-world data to generate creative content on demand. AI
developers employ various automated analytical techniques to train their
systems on actual data. One example is text and data mining (TDM), a concept
which covers the techniques and methods needed to extract new knowledge (e.g.,
patterns, insights, trends) from Big Data (for a general overview of TDM
techniques and methods see e.g., Jiawei
Han et al). In practice, a computer typically makes copies of the collected works
in order to mine them and train AI algorithms.
TDM requires processing huge
amounts of data, so training datasets may also contain copyright-protected
works (e.g., books, articles, pictures). However, unauthorised copying of
protected works may infringe one of the exclusive rights of
copyright holders, in particular the right of reproduction granted to authors
under Article 2 of the Directive on copyright in the information society (the
InfoSoc Directive). To prevent the
risk of copyright infringement, providers of GenAI have to negotiate licenses
over protected works or rely on a so-called “commercial” TDM exception provided
under Art. 4 of EU Directive
2019/790 on copyright in the digital single
market (CDSM), which, as we have seen
above, is referred to in the AI Act. The provision has been adopted
alongside the “scientific research” TDM exception (Art. 3 of CDSM) to provide
more legal certainty specifically for commercially operating organisations.
However, providers of GenAI
models have to meet a two-fold requirement to benefit from the exception of Art. 4 of
CDSM. First, they need to obtain “lawful access” to the data they wish to mine,
whether through contractual agreements, subscriptions, open access policies or
other lawful means, or use only materials which are freely available
online (Art. 4 and Recital 14 of CDSM). Second, AI developers have to check
whether rightholders have reserved the use of their works for TDM by
machine-readable means, including metadata and the terms and conditions of a
website or a service, or through contractual agreements or unilateral
declarations (Art. 4 (3) and Recital 18 of CDSM).
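Such machine-readable reservations can in principle be checked automatically before a page is mined. As an illustrative sketch only: the snippet below scans an HTML document for a `tdm-reservation` meta tag, the convention proposed in the W3C TDM Reservation Protocol community draft and one candidate way of expressing an Article 4(3) reservation. The tag name and its semantics are assumptions taken from that draft, and a real crawler would also need to consult robots directives, terms of service and contractual signals.

```python
from html.parser import HTMLParser


class TDMReservationParser(HTMLParser):
    """Looks for a machine-readable TDM opt-out signal in an HTML page.

    The meta tag name follows the W3C TDM Reservation Protocol draft
    (an assumed convention here, not the only valid signal under
    Art. 4(3) CDSM).
    """

    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            # <meta name="tdm-reservation" content="1"> signals that the
            # rightholder has reserved TDM uses of this page.
            if (attr.get("name", "").lower() == "tdm-reservation"
                    and attr.get("content") == "1"):
                self.reserved = True


def tdm_rights_reserved(html: str) -> bool:
    """Return True if the page carries the assumed TDM reservation tag."""
    parser = TDMReservationParser()
    parser.feed(html)
    return parser.reserved


page = '<html><head><meta name="tdm-reservation" content="1"></head></html>'
print(tdm_rights_reserved(page))  # prints: True
```

A miner that finds such a signal would fall outside the Art. 4 exception for that content and would need a licence instead.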
The copyright-related obligations of the AI Act: a closer look
It appears that Article 53 (1) (c) of the Artificial Intelligence Act ultimately dispelled all doubts regarding
the relevance of Article 4 of CDSM
to AI training by obliging providers of GenAI to comply with the reservation
right granted to rightholders under this provision. The arguments in favour of
this idea could also be derived from the broad definition of TDM included in
the text of CDSM (“any automated analytical technique aimed at analysing text
and data in digital form in order to generate information…” Article 2 (2) CDSM) and the aim of Article 4 of CDSM, which is to enable the use
of TDM by both public and private entities for various purposes, including for
the development of new applications and technologies (Recital 18 of CDSM) (see
e.g., Rosati here
and here;
Ducato
and Strowel; and Margoni and
Kretschmer).
Further, the new transparency
clause of the AI Act
requiring providers of GenAI models to reveal data used for pre-training and training
of their systems (Article 53 (1)
(d) of AIA and recital 107) could also bring more certainty in the context of
AI training and copyright. Recital 107 of the
Act clarifies that providers of GenAI models would not be required to
provide a technically detailed summary of the sources from which mined data were scraped;
instead, it would be sufficient to list “the main data collections or sets that went
into training the model, such as large private or public databases or data
archives, and by providing a narrative explanation about other data sources used”.
This clarification could make the practical implementation of the transparency
obligation less burdensome for AI developers, given the huge masses of
data used for mining (training) AI algorithms. The transparency obligation
under Article 53 (1) (d) of the Act would allow rightholders to determine whether their works have
been used in training datasets and, if needed, opt out. The provision would thereby
effectively enable the operation of the “opt-out” mechanism of Article 4 (3) of CDSM.
However, the “commercial” TDM
exception may not be a proper solution for AI developers, as their ability to
train (and thus develop) their systems would depend on the discretion of
rightholders. What exactly does this mean? Put simply, there are several issues
which could restrict or even prohibit the application of TDM techniques. First,
the exception can be overridden by a contract under Article 7 of the CDSM Directive. Second, rightholders may restrict
access to their works for TDM by not issuing licenses or by raising
licensing/subscription fees. Moreover, even if users were lucky enough to
obtain “lawful access” to protected works, rightholders can prohibit TDM in
contracts, in the terms and conditions of their websites or by employing technological
protection measures. Third, rightholders may employ an “opt-out” mechanism to
reserve the use of their works for TDM, thereby obliging TDM users to pay twice:
first to acquire “lawful access” to data and a second time to mine (analyse) it
(see Manteghi). In this sense,
rightholders would effectively control innovation and technological progress in
the EU, as the development of AI technologies heavily relies on TDM tools.
Concluding thoughts
To sum up, the copyright-related
obligations of the AI Act
could alleviate (to some extent) the conflict of interest between copyright
holders and providers of GenAI models: providing
that the training of AI models is covered by a specific copyright
exception and subject to a transparency obligation would bring more clarity
to the regulation of AI development. However, major concerns remain regarding the
excessive power granted to rightholders under the “lawful access” requirement
and the right to reservation of Article
4 of CDSM. The author of this blog does not support the idea of making
copyright-protected works freely available for everyone but rather wants to emphasise
the risks of the deceptively broad “commercial” TDM exception. The future of AI
development, innovation and research should not be left at the discretion of
copyright holders. The purpose of AI training is not to directly infringe
copyright holders’ exclusive rights but to extract new knowledge for developing
advanced AI systems that would benefit various fields of our lives. Therefore,
the specific TDM exceptions should balance the competing interests in practice
and not tip the scales in favour of a particular stakeholder, as that would only
create more tension in the rapidly evolving algorithmic society.