Four models built - four new ones in the pipeline
NorwAI has built four distinct Norwegian generative language models. During the winter of 2023/2024, four additional models are being developed, which will be made available in spring 2024. Collectively, these eight models represent steps toward NorwAI's ambition of building a comprehensive generative base model for general use, with approximately 40 billion parameters, by the end of 2024.
While three of the first models were built on the GPT architecture and one on Meta's Llama model, three of the next-generation NorLLM models will be based on the French Mistral architecture. Another Llama-based model will also be launched.
| Model | Dataset | Model architecture | Fine-tune properties |
|---|---|---|---|
| NorGPT-03 | 13 GB of Norwegian Wikipedia pages and news articles | GPT-2 (OpenAI), 350 million parameters | Text generation |
| NorGPT-3 | 200 GB of news, books, public documents, web pages, etc.; 70 % of the data is in Norwegian | GPT-2 (OpenAI), 3 billion parameters | Text generation; dialogue (chat); questions/answers; text summarization |
| NorLlama-3 | 200 GB of news, books, public documents, web pages, etc.; 70 % of the data is in Norwegian | Llama (Meta), 3 billion parameters | Text generation; dialogue (chat) |
| NorGPT-23 | 200 GB of news, books, public documents, web pages, etc.; 70 % of the data is in Norwegian | GPT-2 (OpenAI), 23 billion parameters | Text generation; optimized with human feedback to match human language and thinking |
Two tracks into 2024
In 2024, NorwAI’s language modeling efforts will diverge into two distinct tracks. There is a need to train a Norwegian generative model that is sufficiently large to support the Norwegian language at the same level as English and other major languages supported by international models. Simultaneously, there is a demand for smaller models that can be easily optimized for specific use cases and controlled locally using proprietary data and custom adaptations.
The new models in the NorwAI portfolio will be somewhat smaller (about 7 billion parameters), but will nevertheless offer other features that will be useful to players in both the public and commercial sectors.
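As a rough illustration of why model size matters for local deployment, the parameter counts mentioned in this article translate directly into memory requirements for the weights alone. A minimal back-of-the-envelope sketch (the helper function and the fp16 storage assumption are ours, for illustration only, not part of NorwAI's specifications):

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes).

    Assumes fp16 precision (2 bytes per parameter); activations, optimizer
    state, and KV caches would add substantially more during training.
    """
    return num_params * bytes_per_param / 1e9

# Parameter counts taken from the article.
models = {
    "NorGPT-03 (350M)": 350e6,
    "NorGPT-3 / NorLlama-3 (3B)": 3e9,
    "NorGPT-23 (23B)": 23e9,
    "planned 2024 models (7B)": 7e9,
    "planned base model (40B)": 40e9,
}

for name, params in models.items():
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB in fp16")
```

By this estimate, a 7-billion-parameter model needs roughly 14 GB for its fp16 weights and can fit on a single high-end GPU, whereas a 40-billion-parameter model needs around 80 GB, which helps explain the two-track split between a large shared model and smaller, locally controllable ones.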
Bokmål and Nynorsk
With proper training, a model's parameters capture subtle nuances of language, including orthography, syntax, semantics, and pragmatics. These parameters are learned during training, and more training data typically yields both more precise parameter values and more factual information from which to draw answers (and thus fewer hallucinations). The initial four models were trained on increasing amounts of data and offer expanding functionality, as shown in the table above. These models handle both written forms of Norwegian, although they tend to default to Bokmål when lacking relevant responses in Nynorsk.
The next three models will be trained on Norwegian data without copyrighted content, using different architectures to uncover technological features and assess the practical suitability of each model architecture.
Trained at NTNU
The training of language models is organizationally located in NorwAI in Trondheim and is carried out by NTNU's academic staff in collaboration with selected partners such as Schibsted.
In addition, NTNU provides access to the supercomputer cluster Idun (https://www.hpc.ntnu.no/idun/) and deep technical expertise.
Idun, a collaboration between NTNU's faculties and the IT division, constitutes a professional infrastructure for heavy computing that is well suited to computationally demanding AI models, and it is today powerful enough for long-term training of language models. For example, it took NorwAI 76 days to train NorGPT-23 on Idun in 2023.
Some institutions - such as the University of Oslo - use the Finnish LUMI computer (https://www.lumi-supercomputer.eu/), but Norway has limited access to this infrastructure, which is also less flexible for our language modeling work.
2024-04-26
By: Jon Atle Gulla & Rolf Dyrnes Svendsen, NorwAI