Development of Large Language Models: Copyright Law Perspectives for Research Institutions and Research Libraries

Published online by Cambridge University Press:  03 February 2025

Inger Berg Ørstavik*
Affiliation:
Department of Private Law, University of Oslo, Oslo, Norway. Email: i.b.orstavik@jus.uio.no.

Abstract

This article discusses European copyright law as applied to the development and training of generative AI and natural language processing in public interest research institutions and libraries. It focuses on the scope of the new copyright exceptions for text and data mining (TDM) for research purposes and discusses them from the perspective of research ethics and the principles of open science in publicly financed research. The public interest mission of research institutions and libraries includes the open dissemination of research results, but the copyright exceptions cover only the training phase of AI development. Regulation on data transparency is fragmented. The article finds that, while the new exceptions allow research institutions and libraries to develop language models under their public interest mission to preserve national languages, the regulation is not adapted to the principles of research ethics and open science, and legal uncertainty remains.

Type
Article
Copyright
© The Author(s), 2025. Published by International Association of Law Libraries

Footnotes

Editor’s Note: This article is based on the author’s presentation given at the 42nd Annual Course of the International Association of Law Libraries held in Oslo, Norway, 14 – 20 June 2024.

References

1 See, for instance, the models under development by the National Library of Norway, accessed Oct. 8, 2024, https://ai.nb.no/models.

2 Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC (DSM Directive). Art. 3(1): “Member States shall provide for an exception to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, and Article 15(1) of this Directive for reproductions and extractions made by research organizations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.” See also recitals (8) and (10).

3 Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonized rules on artificial intelligence (AI Act).

4 Recital (105), AI Act: “Any use of copyright protected content requires the authorization of the rightsholder concerned unless relevant copyright exceptions and limitations apply. Directive (EU) 2019/790 introduced exceptions and limitations allowing reproductions and extractions of works or other subject matter, for the purpose of text and data mining, under certain conditions. Under these rules, rightsholders may choose to reserve their rights over their works or other subject matter to prevent text and data mining, unless this is done for the purposes of scientific research.”

5 Cf. Report from the Norwegian National Library, Evaluating the effect of copyright protected materials in generative large language models for Norwegian languages (2024), https://www.nb.no/content/uploads/2024/08/Mimirprosjektet_teknisk-rapport.pdf. Norwegian language.

6 Cf. Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information (Open Data Directive), Art. 10. From national law, the Norwegian Universities and Colleges Act § 2-1 requires institutions to contribute to innovation and value creation based on research results. See also the EU Commission’s open science policy, https://research-and-innovation.ec.europa.eu/strategy/strategy-2020-2024/our-digital-future/open-science_en.

7 See, e.g., Josef Drexl et al., “Technical Aspects of Artificial Intelligence: An Understanding from an Intellectual Property Law Perspective,” Max Planck Institute for Innovation & Competition Research Paper no. 19-13 (Oct. 2019): 4, https://ssrn.com/abstract=3465577.

8 See, further, Ørstavik, Inger B., “Access to data for training algorithms in machine learning: copyright law and ‘right-stacking,’” in Artificial Intelligence and the Media, eds. Pihlajarinne, Taina and Alén-Savikko, Anette (Cheltenham, UK: Edward Elgar Publishing, 2022): 272, 276–78; Lehr, David and Ohm, Paul, “Playing with the Data: What Legal Scholars Should Learn About Machine Learning,” UC Davis Law Review 51 (2017): 653, 655–717; Margoni, Thomas, “Artificial Intelligence, Machine learning and EU copyright law: Who owns AI?,” CREATe Working Paper 12 (2018): 45, DOI:10.5281/zenodo.2001763; Kop, Mauritz, “Machine Learning & EU Data Sharing Practices,” Stanford-Vienna Transatlantic Technology Law Forum, 1 (2020): 7.

9 Art. 2(2) of the DSM Directive defines text and data mining as “any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations,” ref. Recital (18). See Meys, Romain, “Data Mining Under the Directive on Copyright and Related Rights in the Digital Single Market: Are European Database Protection Rules Still Threatening the Development of Artificial Intelligence?,” GRUR Int. 69, no. 5 (2020): 457, 464–65. DOI: 10.1093/grurint/ikaa046.

10 Art. 2 of Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonization of certain aspects of copyright and related rights in the information society (Infosoc Directive).

11 Arts. 5(a) and 7(1) of Directive 96/9/EC of the European Parliament and the Council of 11 March 1996 on the legal protection of databases (Database Directive).

12 Art. 15(1), DSM Directive.

13 Recital (5), DSM Directive.

14 Art. 2, Infosoc Directive.

15 The CJEU has found that a text of only eleven words could enjoy copyright protection if it is the expression of the intellectual creation of the author, Case C-05/08, Infopaq I, ECLI:EU:C:2009:465, 39, 47–48.

16 That is, provided that the model is intended to apply a reasonably modern language and not train only on older works for which the copyright has expired.

17 Recital (8), DSM Directive, and recital (105), AI Act.

18 Art. 2, Infosoc Directive.

19 Walter, Michel M. and von Lewinski, Silke, eds., European Copyright Law: A Commentary (Oxford: Oxford University Press, 2010), 968; recital (21), Infosoc Directive.

20 Ibid. See also Inger B. Ørstavik (n 8).

21 See, in general, Hugenholtz, P. Bernt, ed., Copyright Reconstructed: Rethinking Copyright’s Economic Rights in a Time of Highly Dynamic Technological and Economic Change (Alphen aan den Rijn, The Netherlands: Wolters Kluwer, 2018). Under US law, this reasoning has led to the application of the exception of “fair use” to machine learning, albeit not without friction; see, e.g., Lemley, Mark A. and Casey, Bryan, “Fair Learning,” Texas Law Review 99, no. 4 (Mar. 2021): 743–86.

22 Strowel, Alain, “Reconstructing the Reproduction and Communication to the Public Rights: How to Align Copyright with its Fundamentals,” in Copyright Reconstructed: Rethinking Copyright’s Economic Rights in a Time of Highly Dynamic Technological and Economic Change, ed. Hugenholtz, P. Bernt (Alphen aan den Rijn, The Netherlands: Wolters Kluwer, 2018), 206–09.

23 See Jean-Paul Triaille, Study on the legal framework of text and data mining, EC Commission (2014), 32, 29–31, https://op.europa.eu/en/publication-detail/-/publication/074ddf78-01e9-4a1d-9895-65290705e2a5/language-en. It does not matter if the materials are deleted after completion of the training process; cf. Art. 5(1), Infosoc Directive.

24 Geiger, Christophe et al., “Text and data mining in the proposed copyright reform: making the EU ready for an age of big data? Legal analysis and policy recommendations,” IIC 49, no. 7 (2018): 814–44, 818, https://doi.org/10.1007/s40319-018-0722-2.

25 Some doubt has been expressed in the literature as to whether a model makes copyright-relevant copies when running on training materials; see Meys 460 (n 9); and Geiger (n 24). This likely depends on the model in question; cf. also the statements in recitals 8 and 9 of the DSM Directive.

26 Case C-360/13, Meltwater, ECLI:EU:C:2014:1195. See also Ducato, Rossana and Strowel, Alain, “Limitations to text and data mining and consumer empowerment: making the case for a right to ‘machine legibility,’” IIC 50, no. 6 (2019): 649–84, 658; Christophe Geiger et al., “Text and Data Mining: Articles 3 and 4 of the Directive 2019/790/EU,” CEIPI Studies Research Paper no. 2019-08, https://ssrn.com/abstract=3470653, 8. The author’s right is infringed if so much of the work is copied that it includes subject matter that is “original in the sense that it is its author’s own intellectual creation,” Case C-05/08, Infopaq I, ECLI:EU:C:2009:465, 37.

27 Recital (44), Database Directive.

28 Art. 7(5), Database Directive. Ref. discussion in Inger B. Ørstavik (n 8).

29 Case C-203/02, William Hill, ECLI:EU:C:2004:695, 54; Case C-304/07, Directmedia, ECLI:EU:C:2008:552, 51.

30 Recital (49), Database Directive, and Case C-203/02, William Hill, ECLI:EU:C:2004:695, ¶ 51; Case C-202/12, Innoweb, ECLI:EU:C:2013:850, 37; Case C-304/07, Directmedia, ECLI:EU:C:2008:552, 33 and 35.

31 Recital (42), Database Directive; Case C-203/02, William Hill, ECLI:EU:C:2004:695, 47.

32 Recital (24), Database Directive.

33 Case C-203/02, William Hill, ECLI:EU:C:2004:695, 57.

34 See Case C-490/14, Verlag Esterbauer, ECLI:EU:C:2015:735, ¶ 16; Case C-202/12, Innoweb, ECLI:EU:C:2013:850, 46–48. See Lemley and Casey: 127 (n 21); Geiger et al. (2018), 823–24 (n 24).

35 Cf. also Margoni, Thomas and Kretschmer, Martin, “A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology,” GRUR International 71, no. 8 (2022): 685–701, https://doi.org/10.1093/grurint/ikac054. See also recital (107), AI Act, where it is assumed that public databases may be used for AI development based on the public interest in data transparency and data accountability.

36 Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information; recital (107), AI Act.

37 Arts. 1(2) and (4), recital (4), Open Data Directive.

38 Art. 10 ref. Art. 2(9), and recitals (27)–(28), Open Data Directive.

39 Recital (27), Open Data Directive.

40 Ref. Art. 43 of Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023 on harmonized rules on fair access to and use of data (Data Act), Article 50 and Article 10 (data quality) AI Act, Article 27 and recital (70) of Regulation (EU) 2022/2065 of the European Parliament and of the Council of 19 October 2022 on a Single Market For Digital Services (Digital Services Act). Cf. also Thomas Margoni and Martin Kretschmer (2022), 699–700 (n 35).

41 Directive (EU) 2015/1535 of the European Parliament and of the Council of 9 September 2015 laying down a procedure for the provision of information in the field of technical regulations and of rules on Information Society services Article 1(1)(b).

42 Recital (10), DSM Directive.

43 Recital (14), DSM Directive.

44 A curious example is the Norwegian Act of 9 June 1989 no. 32 relating to the Legal Deposit of Generally Available Documents, under which publishers are obligated to deposit copies of all published works with the National Library. It is not entirely clear whether the deposited works may lawfully be used for training AI under Art. 3 of the DSM Directive.

45 Recital (14), DSM Directive.

46 Case C-302/10, Infopaq II, ¶ 42, ECLI:EU:C:2012:16, and Case C-403/08, Premier League, 168, ECLI:EU:C:2011:631, and recital (33), Infosoc Directive.

47 Case C-05/08, Infopaq I, ¶ 39, ECLI:EU:C:2009:465; Art. 3(1), Infosoc Directive.

48 Recital (6), DSM Directive.

49 Recital (12), DSM Directive.

50 Ibid.

51 Art. 1(2) and recital (12), DSM Directive.

52 Recital (12), DSM Directive.

53 Recital (13), DSM Directive.

54 Art. 1(2) and recital (12), DSM Directive.

55 Recital (11), DSM Directive.

56 Recitals (8) and (10), DSM Directive.

57 Dermawan, A., “Text and data mining exceptions in the development of generative AI models: What the EU Member States could learn from the Japanese ‘nonenjoyment’ purposes?” Journal of World Intellectual Property 27, no. 1 (2023): 44, 52–53, https://doi.org/10.1111/jwip.12285; Geiger, Christophe, “The Missing Goal-Scorers in the Artificial Intelligence Team: of Big Data, the Fundamental Right to Research and the failed Text and Data Mining Limitations in the CSDM Directive,” PIJIP/TLS Research Paper Series no. 66 (2021), in Intellectual Property and Sports, Essays in Honour of P. Bernt Hugenholtz, eds. Senftleben, Martin et al. (Alphen aan den Rijn, The Netherlands: Kluwer Law International, 2021), 383, 388.

58 Recital (18), DSM Directive. These entities may benefit from the exception in Art. 4 of the DSM Directive, which requires that right holders be given the opportunity to “opt out.”

59 Treaty on the Functioning of the European Union, signed on 13 December 2007 (TFEU); cf. recital (12), DSM Directive.

60 Recital (27), Open Data Directive; European Code of Conduct for Research Integrity (rev. ed. 2023), https://ec.europa.eu/info/funding-tenders/opportunities/docs/2021-2027/horizon/guidance/european-code-of-conduct-for-research-integrity_horizon_en.pdf.

61 Recital (12), DSM Directive.

62 European Code of Conduct for Research Integrity, 3 (n 60).

63 EU Commission, Living Guidelines on the Responsible use of Generative AI in Research (Mar. 2024), https://research-and-innovation.ec.europa.eu/document/download/2b6cf7e5-36ac-41cb-aab5-0d32050143dc_en?filename=ec_rtd_ai-guidelines.pdf.

64 Ibid.

65 Ibid.; EU Commission, High-Level Expert Group on Artificial Intelligence, Ethics Guidelines for Trustworthy AI (2019), https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai; recital (27), AI Act.

66 Cf. Report from the Norwegian National Library (n 5).

67 Recitals (10), (17), and (18), DSM Directive.

68 Recital (17), DSM Directive.

69 Eleonora Rosati, Copyright in the Digital Single Market: Article-by-Article Commentary to the Provisions of Directive 2019/790 (Oxford: Oxford Academic, 2021), “Article 3 - Text and Data Mining for the Purposes of Scientific Research,” 41, https://doi.org/10.1093/oso/9780198858591.003.0004.

70 Geiger et al. (2019), 36–37 (n 26).

71 Thomas Margoni and Martin Kretschmer (2022), 700 (n 35).

72 In this direction, with regard to technical protection measures, see recital (16) DSM Directive. See also Case C-403/08, Premier League, ECLI:EU:C:2011:631, 163.

73 Recital (16), DSM Directive.

74 Art. 3(3) and recital (16), DSM Directive.

75 Recital (16), DSM Directive.

76 The provision in Art. 3(3) of the DSM Directive is related to the provision on technical protection measures in Art. 6(3) of the Infosoc Directive. However, technical protection measures under the Infosoc Directive are meant to mitigate the risk of unlawful use of works, and the balancing of interests under that provision is, therefore, fundamentally different from that under Art. 3(3) of the DSM Directive. Christophe Geiger et al. (2019) (n 26), 34, point to the regulation of traffic management measures in Art. 3(3) of Regulation (EU) 2015/2120 of the European Parliament and of the Council of 25 November 2015, laying down measures concerning open internet access, as a better guide for how to interpret Art. 3(3) of the DSM Directive.

77 Art. 5(5), Infosoc Directive, and recital (6), DSM Directive.

78 Recital (15), DSM Directive.

79 Ibid.

80 See the discussion in Inger B. Ørstavik, 287–89 (n 8).

81 Recital (6), DSM Directive.

82 Case C-05/08, Infopaq I, ¶ 56, ECLI:EU:C:2009:465; Case C-302/10, Infopaq II, ¶ 27, ECLI:EU:C:2012:16; Case C-360/13, Meltwater, ECLI:EU:C:2014:1195, 23.

83 Case C-403/08, Premier League, 163, ECLI:EU:C:2011:631. Pihlajarinne, Taina, “Copyright exceptions and limitations – is the principle of narrow interpretation gradually fading away?” NIR – Nordiskt Immateriellt Rättsskydd 89, no. 1 (2020): 117, 121; Hugenholtz, P. Bernt, “Flexible Copyright: Can the EU Author’s Rights Accommodate Fair Use?,” in Copyright Law in an Age of Limitations and Exceptions, ed. Okediji, Ruth L. (Cambridge University Press, 2017), 275, 286, has argued that there is room for a broader weighing of interests under Art. 5(5) of the Infosoc Directive.

84 With reference to Art. 5(1) of the Infosoc Directive, cf. Case C-403/08, Premier League, 164 and 179, ECLI:EU:C:2011:631; Case C-360/13, Meltwater, ECLI:EU:C:2014:1195, 24; recital (31), Infosoc Directive.

85 Case C-516/17, Spiegel Online, ECLI:EU:C:2019:625, 54 and 57.

86 Case C-360/13, Meltwater, ECLI:EU:C:2014:1195, 57–59. See recital (3), DSM Directive, and discussion by Taina Pihlajarinne (2020) (n 83): 122; Geiger et al. (2018) p. 282 (n 24).

87 The three-step test does not apply to sui generis database rights, as the wording of the relevant Art. 8(2) in the Database Directive differs from Art. 5(5) in the Infosoc Directive. Including the three-step test in the DSM Directive might be a step towards a better balancing of interests in the scope of the sui generis database right. In this direction, see Meys, 469 (n 9); DG CONNECT, “Study in support of the evaluation of Directive 96/9/EC on the legal protection of databases” (2018): 25, https://op.europa.eu/en/publication-detail/-/publication/5e9c7a51-597c-11e8-ab41-01aa75ed71a1.

88 Art. 14 of Regulation (EU) 2021/695 of the European Parliament and of the Council of 28 April 2021, establishing Horizon Europe – the Framework Programme for Research and Innovation, laying down its rules for participation and dissemination.

89 For example, the Norwegian Universities and Colleges Act § 2-1.

90 European Code of Conduct for Research Integrity (n 60).

91 EU Commission (n 63), and EU Commission, High-Level Expert Group on Artificial Intelligence (n 65).

92 An extract of as little as eleven consecutive words could infringe copyright, Case C-05/08, Infopaq I, 39, ECLI:EU:C:2009:465.

93 Cf. secs. 3 a and b above. See also Thomas Margoni and Martin Kretschmer (2022): 693–94 (n 35).

94 Cf. sec. 3 c, and Art. 1, nr. 6, Open Data Directive.

95 Recital (58), DSM Directive, with regard to the press publisher’s rights. For database rights, the investment must relate to the obtainment, verification, or presentation of the database contents, Art. 7(1), Database Directive.

96 Lionel Bently et al., “Strengthening the Position of Press Publishers and Authors and Performers in the Copyright Directive,” study for DG IPOL, 23.

97 Recital (57), DSM Directive.

98 In such cases, it is possible that the application of the three-step test in Art. 5(5) of the Infosoc Directive, ref. recital (6) of the DSM Directive, would provide a basis for finding that the scope of Art. 3 of the DSM Directive has been exceeded.

99 EU Commission (n 65).

100 EU Commission (n 63); recital (107), AI Act.

101 EU Commission 5 (n 63).

102 Cf. report from the Norwegian National Library (2024) (n 5).

103 Notably, in the Open Data Directive; see the section on “Database Rights” in this paper.

104 More generally, see Fiil-Flynn, Sean M. et al., “Legal reform to enhance global text and data mining research,” Science 378, no. 6623 (2022): 951–53.