Large language models at University of Oslo

By Erik Velldal, Professor, University of Oslo

April 2024

Erik Velldal, Professor, University of Oslo. Photo: UiO

Many have declared 2023 the year of Large Language Models (LLMs), and it is hard to disagree. In the Language Technology Group (LTG) at the University of Oslo (UiO), developing language models for Norwegian has been an important priority for several years. While LTG is also a NorwAI partner, it has not been involved in the center's LLM efforts. Nonetheless, language modeling has defined our activities in several other collaborations.

While basically all language models these days are based on the Transformer architecture, they come in three main flavors: encoder models, encoder–decoders, and decoders. Simplifying a bit, one can think of the difference as follows. Encoder models, with BERT being the most well-known example, are great for analyzing and forming representations of text. Decoders, like GPT-type models, are geared toward generating text, while encoder–decoder models like T5 can do both.
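To make the distinction concrete, the sketch below shows how the three flavors are typically accessed through the Hugging Face transformers library. The model identifiers are assumptions used for illustration only, and some of our models use custom architectures that may require trust_remote_code=True.

```python
# A minimal sketch of the three model flavors via the Hugging Face transformers
# library. The model identifiers below are assumptions for illustration; some of
# the LTG models use custom architectures that require trust_remote_code=True.
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder (BERT-style): turn a sentence into contextual vector representations.
enc_name = "ltg/norbert3-base"  # assumed identifier
enc_tok = AutoTokenizer.from_pretrained(enc_name)
encoder = AutoModel.from_pretrained(enc_name, trust_remote_code=True)
states = encoder(**enc_tok("Oslo er hovedstaden i Norge.", return_tensors="pt"))

# Encoder-decoder (T5-style): map an input text to an output text.
s2s_name = "ltg/nort5-base"  # assumed identifier
s2s_tok = AutoTokenizer.from_pretrained(s2s_name)
seq2seq = AutoModelForSeq2SeqLM.from_pretrained(s2s_name, trust_remote_code=True)

# Decoder (GPT-style): generate a continuation of a prompt.
dec_name = "norallm/normistral-7b-warm"  # assumed identifier
dec_tok = AutoTokenizer.from_pretrained(dec_name)
decoder = AutoModelForCausalLM.from_pretrained(dec_name)
out = decoder.generate(**dec_tok("Universitetet i Oslo er", return_tensors="pt"),
                       max_new_tokens=20)
print(dec_tok.decode(out[0], skip_special_tokens=True))
```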

Models of all flavors

At LTG, we have trained Norwegian models of all flavors, starting with our first NorBERT models back in 2020, followed by NorT5, and most recently the GPT-like NorMistral and NorBLOOM models that we trained in 2023. An important principle has been to make all models publicly available without any restrictions on use.

Our Norwegian versions of the popular Mistral and BLOOM models are the largest we have trained so far, with 7B parameters. For training data, we relied exclusively on publicly available sources of Norwegian text, combining the Norwegian Colossal Corpus (NCC) made available by the National Library with various sources of web-crawled text and programming code. While some of the models were trained from scratch for Norwegian, we also trained a version of NorMistral initialized from a pre-trained English model, but still using a custom tokenizer built specifically for Norwegian. This "warm-started" model has proved to have the best performance in our evaluations so far.
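The sketch below illustrates the general idea behind such warm-starting in a much-simplified form: a new subword vocabulary is trained on Norwegian text, and the pretrained English model's embedding matrix is resized to that vocabulary before continued pre-training on Norwegian data. The base model identifier, corpus path, and embedding handling are assumptions and do not reproduce the actual NorMistral recipe.

```python
# A much-simplified sketch of "warm-starting": reuse the weights of a pretrained
# English decoder but replace its tokenizer with one fitted to Norwegian text.
# Base model name and corpus path are assumptions; the real procedure (e.g., how
# the new embeddings are initialized) is not reproduced here.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_name = "mistralai/Mistral-7B-v0.1"  # assumed base model
base_tok = AutoTokenizer.from_pretrained(base_name)

def norwegian_lines():
    # Stream lines of Norwegian text, e.g. from a local dump of NCC (assumed path).
    with open("ncc_norwegian.txt", encoding="utf-8") as f:
        for line in f:
            yield line

# Train a new subword vocabulary of the same type as the base tokenizer,
# but fitted to Norwegian text.
nor_tok = base_tok.train_new_from_iterator(norwegian_lines(), vocab_size=32_768)

# Load the pretrained English weights and resize the embedding matrix to the new
# vocabulary; naive resizing initializes the new embeddings randomly, and
# continued pre-training on Norwegian data then adapts the whole model.
model = AutoModelForCausalLM.from_pretrained(base_name)
model.resize_token_embeddings(len(nor_tok))
```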

Preliminary GPT-like models

Importantly, we consider these GPT-like models to be preliminary: so far, they are only pre-trained base models, and we will soon release improved versions that have undergone additional instruction-tuning and preference optimization.
Both of these steps in the training pipeline are crucial for arriving at the kind of open-ended, chat-based problem solving one has come to expect from models like ChatGPT. We have previously released an instruction-tuned version of our Norwegian T5 model, Chat-NorT5, based on machine-translated data. It is clear, however, that creating high-quality data specifically for Norwegian is needed to fully benefit from these additional fine-tuning steps.
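As an illustration of what instruction-tuning data looks like in practice, the sketch below formats instruction–response pairs into a simple prompt template before ordinary supervised fine-tuning. The template markers and field names are assumptions and do not reflect the exact format used for our models.

```python
# A minimal sketch of how (e.g. machine-translated) instruction data might be
# prepared for supervised instruction-tuning. The template and field names are
# assumptions, not the format actually used for Chat-NorT5 or NorMistral.
instruction_pairs = [
    {"instruction": "Forklar hva en språkmodell er.",
     "response": "En språkmodell er en modell som tilordner sannsynligheter til tekst ..."},
    # ... thousands more pairs, for instance machine translated from English datasets
]

def to_training_text(pair):
    # Wrap each pair in a prompt template so the model learns to answer when it
    # sees the user turn and to stop after its own answer.
    return (f"<|bruker|>\n{pair['instruction']}\n"
            f"<|assistent|>\n{pair['response']}<|slutt|>")

training_texts = [to_training_text(p) for p in instruction_pairs]
# These texts are then tokenized and used for ordinary causal-LM fine-tuning
# (supervised fine-tuning), optionally followed by preference optimization
# such as DPO on ranked pairs of answers.
```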
As is so often the case, the bulk of the heavy lifting is carried out by PhD and postdoctoral fellows, and these LLM efforts are no exception: LTG PhD fellow David Samuel and postdoc Vladislav Mikhailov have been vital in carrying out both the training and evaluation of the models. Another key factor has been the computing infrastructure.

LUMI, a supercomputer

In the age of LLMs and deep learning, access to high-performance computing (HPC) facilities with sufficient GPU capacity is crucial to stay competitive. In early 2023, LTG was invited to take part in the pilot testing of the GPU partition of the new supercomputer LUMI. At the time of writing, LUMI is by far the fastest supercomputer in Europe and ranked 5th globally. While physically located in Finland, the cluster is hosted by a consortium that includes Norway (represented by Sigma2) among nine other European countries.

In Norway, we have generally had the luxury of being quite well off when it comes to national HPC facilities, and LTG has a long tradition of putting these to good use. Still, there is no denying that LUMI has opened entirely new possibilities. When we trained our first BERT models on the Saga cluster back in 2020, it took us two months to train a 100M-parameter model. On LUMI, we can now train the same model in six hours, and training our 7B-parameter GPT-like models takes about two weeks.

Benchmarking

An essential part of model development is benchmarking, i.e., being able to systematically evaluate and compare different models across different downstream tasks. While LTG has already created many resources for benchmarking Norwegian language models, such as those collected in the NorBench suite we released in 2023, these are mostly geared towards encoder-based models, and we are currently focusing on adapting and extending them to generative models.
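As an example of what benchmarking a generative model can look like, the sketch below scores a zero-shot sentiment-classification task by comparing the model's likelihood of each candidate label. The model identifier, prompt wording, and toy examples are assumptions; an actual benchmark would use an established dataset and far more examples.

```python
# A minimal sketch of zero-shot evaluation of a generative model on a
# classification task, by comparing label likelihoods. Model name, prompt and
# examples are assumptions, not an actual NorBench task.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed identifier
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def label_logprob(prompt, label):
    # Sum of log-probabilities of the label tokens given the prompt.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    label_ids = tok(label, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(logprobs[pos, input_ids[0, pos + 1]].item() for pos in positions)

examples = [  # toy data; a real benchmark would load an established dataset
    ("Filmen var en fantastisk opplevelse.", "positiv"),
    ("Boken var kjedelig og altfor lang.", "negativ"),
]
correct = 0
for text, gold in examples:
    prompt = f"Anmeldelse: {text}\nSentimentet i anmeldelsen er "
    pred = max(["positiv", "negativ"], key=lambda lab: label_logprob(prompt, lab))
    correct += int(pred == gold)
print(f"accuracy = {correct / len(examples):.2f}")
```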

These efforts will be intensified in 2024 through the ongoing collaboration between NorwAI and LTG in the "Mímir" project, a national project that also includes Nasjonalbiblioteket (the National Library of Norway), as explained in the following article.