NorwAI to introduce large Norwegian GPT model

Over the next couple of months, NorwAI will build a large Norwegian GPT-type language model using a newly collected dataset at NTNU. As soon as the model has been trained and has gone through some initial testing, we will make an API available to partners interested in experimenting with it.

This is an important milestone for the research center and a tangible result two years into NorwAI’s lifespan. The model is developed in close cooperation with NorwAI partners Schibsted and DnB, both represented in the steering committee for the NorwAI GPT Language Modeling Project.

Professor Jon Atle Gulla, head of the Norwegian Research Center for AI Innovation (NorwAI), says the project has been in the making since NorwAI started its operations in October 2020. Last year the research center built a smaller beta model for testing purposes. Building the first large language model based on Norwegian sources has been an important objective from the start.

-I am happy to say that we can now go public with the news. The project is well under way, and we have already run some tests on NTNU’s supercomputer infrastructure to assess the computational needs of building such language models, says Jon Atle Gulla.

Billions of parameters

NorwAI’s GPT model will have 20 billion parameters and is based on 18.5 billion tokens drawn from several Norwegian training corpora, including new data from media partner Schibsted. Additional tokens from related languages such as Swedish and Danish are also included. NorwAI is training the model on NTNU’s Idun cluster.

The smaller beta model built for testing purposes last year worked well. The version now on its way represents a major step up to meet the ambitions of NorwAI and its partners. Schibsted is working intensively with language models on its own and is now making use of the additional strength and competence the researchers at NorwAI can provide. The new version will be an important stepping stone toward further and even bigger versions of the model, which are planned to follow successively later this year.

Partner University of Oslo

The University of Oslo is also actively pursuing large language models these days. So far its work has centered on BERT models, though it is now also looking into other types of language models. Whereas NTNU makes use of its own infrastructure, the University of Oslo has started experimenting with the LUMI supercomputer resources in Finland. LUMI is one of the most powerful supercomputers in Europe, is partly funded by the European High-Performance Computing Joint Undertaking, and will be an important computational resource for language modeling across a number of European countries.

Other language models are on the horizon that may be attractive for both research and innovation, as the big tech companies compete for this new market.

ChatGPT

No doubt the great public interest that followed the introduction of ChatGPT at the end of November last year has spurred NorwAI to intensify its work. The English-based ChatGPT also answers prompts in Norwegian, but the results are less impressive. The amount and richness of English data clearly dominate the data available in ChatGPT for smaller languages like Norwegian. Many users have shared public examples of ChatGPT mixing fact and fiction when asked in Norwegian or about Norwegian topics.

-Large language models are not true Wikipedias. Verifying content from language models is a large research area for the future. But language models will have an important impact on work, professions, and society as the models are further developed and improved. We will also see new applications supported by the models that will change the way we interact with machines, says Jon Atle Gulla.

Digital independence

He emphasizes the importance of building national language models for our mother tongue. The aim is to maintain our digital independence and avoid a monopoly of international technology companies that are unlikely to give as much attention to smaller languages like Norwegian as they do to languages like English.

-Norway has the opportunity to develop its own language models if we are willing to invest in data infrastructure, find ways to use the richness of content produced over the years and continue to foster our national AI competence, says Jon Atle Gulla.

Professor Jon Atle Gulla. Photo: Kai T. Dragland, NTNU

PUBLISHED: 2023-01-31

By: Rolf Dyrnes Svendsen