Four models built - four new ones in the pipeline
NorwAI has built four distinct Norwegian generative language models. During the winter of 2023/2024, four additional models are being developed, which will be made available in spring 2024. Collectively, these eight models represent steps toward NorwAI's ambition of building a comprehensive generative base model for general use, with approximately 40 billion parameters, by the end of 2024.
While three of the first models were built on the GPT architecture and one on Meta's Llama model, three of the next-generation NorLLM models will be based on the French Mistral architecture. Another Llama-based model will also be launched.
| Model | Dataset | Model architecture | Fine-tune properties |
|---|---|---|---|
| NorGPT-03 | 13 GB of Norwegian Wikipedia pages and news articles | GPT-2 (OpenAI), 350 million parameters | Text generation |
| NorGPT-3 | 200 GB of news, books, public documents, web pages, etc.; 70 % of the data is in Norwegian | GPT-2 (OpenAI), 3 billion parameters | Text generation; dialogue (chat); questions/answers; text summarization |
| NorLlama-3 | 200 GB of news, books, public documents, web pages, etc.; 70 % of the data is in Norwegian | Llama (Meta), 3 billion parameters | Text generation; dialogue (chat) |
| NorGPT-23 | 200 GB of news, books, public documents, web pages, etc.; 70 % of the data is in Norwegian | GPT-2 (OpenAI), 23 billion parameters | Text generation; optimized with human feedback to match human language and thinking |
Two tracks into 2024
In 2024, NorwAI’s language modeling efforts will diverge into two distinct tracks. There is a need to train a Norwegian generative model that is sufficiently large to support the Norwegian language at the same level as English and other major languages supported by international models. Simultaneously, there is a demand for smaller models that can be easily optimized for specific use cases and controlled locally using proprietary data and custom adaptations.
The new models in the NorwAI portfolio will be somewhat smaller (about 7 billion parameters), but will nevertheless offer other features that will be useful to players in both the public and commercial sectors.
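As a rough illustration of why model size matters for local deployment, the parameter counts mentioned in this article translate directly into memory requirements for the weights alone. A minimal back-of-the-envelope sketch (the helper function and the fp16 storage assumption are ours, for illustration only, not part of NorwAI's specifications):

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes).

    Assumes fp16 precision (2 bytes per parameter); activations, optimizer
    state, and KV caches would add substantially more during training.
    """
    return num_params * bytes_per_param / 1e9

# Parameter counts taken from the article.
models = {
    "NorGPT-03 (350M)": 350e6,
    "NorGPT-3 / NorLlama-3 (3B)": 3e9,
    "NorGPT-23 (23B)": 23e9,
    "planned 2024 models (7B)": 7e9,
    "planned base model (40B)": 40e9,
}

for name, params in models.items():
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB in fp16")
```

By this estimate, a 7-billion-parameter model needs roughly 14 GB for its fp16 weights and can fit on a single high-end GPU, whereas a 40-billion-parameter model needs around 80 GB, which helps explain the two-track split between a large shared model and smaller, locally controllable ones.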
Bokmål and Nynorsk
With proper training, a model's parameters capture subtle nuances of language, including orthography, syntax, semantics, and pragmatics. These parameters are learned during training, and more training data typically yields both more precise parameter values and more factual information from which to draw answers (and thus fewer hallucinations). The initial four models were trained on increasing amounts of data and offer expanding functionality, as shown in the table above. These models handle both written forms of Norwegian, although they tend to default to Bokmål when lacking relevant responses in Nynorsk.
The next three models will be trained on Norwegian data without copyrighted content, using different architectures to uncover technological features and assess the practical suitability of each model architecture.
Trained at NTNU
The training of language models is organizationally located in NorwAI in Trondheim and is carried out by NTNU's academic staff in collaboration with selected partners such as Schibsted.
In addition, NTNU provides access to the supercomputer cluster Idun (https://www.hpc.ntnu.no/idun/) and deep technical expertise.
Idun, a collaboration between NTNU's faculties and the IT division, constitutes a professional infrastructure for heavy computing that is well suited to computationally demanding AI models, and it is today powerful enough for long-term training of language models. For example, it took NorwAI 76 days to train NorGPT-23 on Idun in 2023.
Some institutions - such as the University of Oslo - use the Finnish LUMI computer (https://www.lumi-supercomputer.eu/), but Norway has limited access to this infrastructure, which is also less flexible for our language modeling work.
2024-04-26
By: Jon Atle Gulla & Rolf Dyrnes Svendsen, NorwAI