The project “MIMIR” on copyrighted content

At the end of 2023, an initiative emerged that brought together the three most active Norwegian research environments with expertise in language models. The “Mimir” project united the National Library of Norway, the University of Oslo, and NorwAI in a joint effort.

On December 8, 2023, the National Library of Norway received the following assignment from the Ministry of Culture and Equality: 

“We refer to the dialogue with the National Library about the development of generative artificial intelligence, and issues raised by the institution related to the use of copyright-protected material in the training of Norwegian language models. The Ministry hereby requests the National Library to initiate a coordinated research/development project to possibly investigate the value of copyright-protected material in the training of Norwegian generative language models. Relevant Norwegian research environments should be invited to participate in the project. Authors’ and publishers’ organizations are invited to follow the project. We ask the National Library, based on the results of the research project, to consider the basis for a possible compensation scheme for Norwegian rights holders, and possibly develop a proposal for such a scheme.” 

On December 20, 2023, NorwAI met with the National Library, which briefed NorwAI on the mandate from the Ministry and proposed a collaboration with NorwAI and the Language Technology Group (LTG) at the University of Oslo.

Deadline in the summer

In a meeting on January 5, 2024, the National Library, represented by CEO Aslak Sira Myhre; LTG, represented by Professor Erik Velldal; and NorwAI, represented by Professor Jon Atle Gulla, agreed to run a joint project until the summer of 2024 to assess the value of copyright-protected content in large generative language models.

The compensation scheme is kept outside the scope of this project. The National Library is responsible for creating two training datasets, one without copyrighted content and one with it. NorwAI and LTG decide which models need to be trained and which evaluations must be conducted to assess whether copyrighted content in the training data results in better language models.

At the time of writing, in April, the project is on track to report in the summer of 2024. Both NTNU’s Idun infrastructure and the joint European LUMI infrastructure in Finland are being used for the training.
 

CEO Aslak Sira Myhre, National Library of Norway. Photo: Gorm K. Gaare
Professor Erik Velldal, LTG, University of Oslo
Professor Jon Atle Gulla, NorwAI, NTNU