A national call for cooperation - Media to contribute to NorwAI’s LLM

We need a Norwegian language model! 

- We have everything to gain by working together on large language models. Work is already underway. We at Schibsted would like to have all of you with us.

Sven Størmer Thaulow reached out to his colleagues when he addressed Norwegian media executives on May 10th in Bergen at “Medialeder”, the yearly gathering of top media management in Norway.

Sven Størmer Thaulow at Medialeder 2023

Schibsted’s EVP and Chief Technology and Data Officer used the opportunity to ask media companies to join the research effort to build a Norwegian language model that could compete with foreign ones. In spring 2023, NorwAI is finishing its third experimental large language model, NorGLM-23, for live testing. The next step will be to double the size of the model to 50 billion parameters.

Sven Størmer Thaulow gave three main reasons for building a Norwegian alternative:

  1. We think such a model would be much better in Norwegian because we train it on large amounts of Norwegian text. For comparison: In ChatGPT, we estimate that only 0.36% of the texts are in Norwegian.
  2. We want to have control over our own infrastructure. Artificial intelligence is already becoming a global industrial policy race, with large input factors and economies of scale. It is not a given that the technology will be democratized. It can also be directly misused for evil purposes.
  3. We want models that correspond to Norwegian and not American culture and worldview. Let me give an example: Who uses these language models the most today? It is our children, who, for example, use ChatGPT in schoolwork. From their perspective, it is almost a personalized textbook. Our society has always had control over the textbooks our children use in school. But now it is at least as important to have control over what the language models spit out and ensure that it reflects the values our society is built on.

Sven Størmer Thaulow with three main reasons for a Norwegian language model.

Demanding task

Sven Størmer Thaulow said creating language models is a very demanding task that requires at least three things:

  • Huge amounts of text from which the model can learn
  • Massive computing power
  • Specialized competence

He said Norway has many highly skilled people with great insight into artificial intelligence and language models. He also stated that the country has some massive computing power, although not enough. Here it would be a great advantage if the authorities stepped in to ensure enough computing power:

-  Finally, we need a huge amount of additional text to make it happen. And the texts must be representative of Norwegian society, from the simplest chat, current news, judgments and public documents to the most beautiful novel.

For a long time, Norwegian media companies have set the conditions for Norwegian language development. Together, they produce an enormous amount of content, with solid routines to ensure that the content reflects reality. Therefore, they have everything to gain from letting all the good content created in their media houses become part of the basis of a Norwegian language model:

-  Once we have the language model up and running, more needs to be done. We know that language models out of the box will unfortunately reflect all prejudices and norms from the source material. They can generate bomb recipes, crude threat letters and Nazi manifestos if such texts are found in the content they have been trained on. The models must therefore be secured so that they do not produce unwanted material. This so-called alignment is in itself a very extensive and labor-intensive task. We would all gain from this work taking place in a Norwegian linguistic and cultural context. And that is not the case for the existing major language models from OpenAI and Google.

Monocultural

Sven Størmer Thaulow quoted researcher Jill Rettberg at the University of Bergen:  

-    ChatGPT is multilingual, but monocultural.

He argued this is one of the reasons Norwegian media need language models that are adapted to Norwegian and European culture, that are trained in our languages, and can be used without American control. He referred to the work at NorwAI - the Norwegian Research Center for AI Innovation - at NTNU, funded by the Norwegian Research Council to reach this goal:

-    Schibsted - and NRK - contribute here with texts from all our articles and videos. We also work with other media houses, publishers, and public institutions to obtain more data. This data is used to train the model. NorwAI's ambition has been to have the first version of a Norwegian language model with 23 billion parameters, based on 18.5 billion word occurrences, or so-called tokens, in place by the summer. The model will be finished any day now, he said.

To illustrate the immense work needed, he told his colleagues that the current model has been processed for 3 months on the Idun cluster, the largest supercomputer rig in Norway, and will soon be tested to see how it compares with international competitors.

-    The next plan is to build a model with 40-50 billion parameters this coming autumn. By comparison, GPT-4 has 1000 billion parameters, he said. For that we need even more computing power than we have today. If a local Wallenberg family or the authorities in Norway don't get involved, and they should if we are to have any chance in this race, then we have to go abroad, he said, adding that NorwAI has reserved space on the EU supercomputer LUMI in Finland - and booked half of Norway's annual capacity there.

Critical need

The critical need now is to get hold of twice as much content to build a model twice the size of NorGLM-23. Media companies cannot sit alone on their own turf in this landscape. They must work together - both with each other and with other centers of expertise. Sven Størmer Thaulow also said it may not even be enough to join forces in Norway:

-    Therefore, several of us also support other European initiatives, including an attempt to build a language model for Germanic languages together with Europe's largest research institution - the Fraunhofer Institute. It will include Norwegian, Swedish, Danish, Icelandic, German and Dutch.

If we manage to create a Norwegian or Nordic or Germanic language model that is better than the American ones, the next step will be to take it from research to a professional service. 

- It will be demanding to build, but with great potential upside where we in the hall can become users, contributors, and perhaps even part owners.

- Artificial intelligence is, as I said, a change of scene. Both we and other industries risk being disrupted - in whole or in part.

- But mostly we must focus on the possibilities. Invest in competence. Experiment. Make sure our employees understand what a technological shift this is. It's not enough to just talk about it; we have to launch products. Build things. Make mistakes. And learn.

- So let's join forces to explore the possibilities, create the right infrastructure and help influence developments in this significant technological shift. We believe that a Norwegian language model is an absolutely essential building block to be able to create value with generative AI in Norway. We ask everyone to contribute content to train the Norwegian language models, push the authorities for access to more powerful supercomputers and share as much of what we create as we can. Then we can compete instead on creating good journalistic content.

Specifically, he wanted all media companies to join as associate partners in the NTNU collaboration to build a Norwegian language model. 

- It won't cost money, but we need content - now primarily for research purposes. 


Published: 2023-05-15

By: Rolf Dyrnes Svendsen
