New Language Models in NorwAI
Recent advances in natural language processing depend on the availability of large-scale language models that help applications interpret, analyze and generate natural language text with high precision. A language model learns to predict the probability of the next word in a sentence by analyzing the text that precedes it.
One of the most familiar uses of language models is the smartphone keyboard that suggests the next word based on what you have typed so far.
In the last few years, language models have proven very useful in popular applications like Google Assistant, Siri, and Amazon’s Alexa.
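To make this concrete, here is a minimal sketch, using Python and the Hugging Face transformers library, of how a language model assigns probabilities to candidate next words. The English "gpt2" checkpoint is used purely as a stand-in; the Norwegian models discussed below would be substituted once available.

```python
# Minimal sketch: next-word prediction with a causal language model.
# "gpt2" is an English stand-in model, not one of the NorwAI models.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item()):>12s}  {prob.item():.3f}")
```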
Unfortunately, since the publicly available models for a small language like Norwegian are based on rather small data sets, Norwegian applications tend not to be at the same level as their English counterparts. NorwAI’s consortium includes some of the best computational linguists in Norway, and the center is determined to provide new Norwegian language models that are significantly larger and better than what is available today, and that can easily be employed in advanced Norwegian NLP applications.
One of the main challenges in NLP is the availability of sufficient training data. Large-scale deep learning language models need huge amounts of training data, but both large domain-specific training data and good human annotations are often lacking. A solution to this is to pre-train the model on large, noisy and unannotated general text data first, and then fine-tune the pre-trained model on smaller, well-annotated, task-specific training data afterwards. Two well-known examples of such pre-trained language models are BERT and GPT. They are both trained on massive data sets, provide a good basis for general language understanding, and have improved the performance of many NLP applications significantly.
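As a rough illustration of this pre-train-then-fine-tune pattern, the sketch below fine-tunes a pre-trained multilingual BERT checkpoint on a tiny labelled sentiment data set with the Hugging Face Trainer API. The model name, texts and labels are toy placeholders, not the data or models used in NorwAI.

```python
# Sketch: fine-tune a pre-trained BERT-style model on a small labelled data set.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Dette var en fantastisk film.", "Dette var bortkastet tid."]
labels = [1, 0]  # 1 = positive, 0 = negative (toy sentiment labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps the tokenized texts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```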
BERT and GPT differ in their structures and training tasks and tend to be suitable for different purposes. Whereas GPT has a traditional unidirectional structure and is trained to predict “the next word”, BERT has a bi-directional structure and is trained to predict randomly masked words. Because it makes use of the full context of the sentence to predict a word, BERT is normally preferred for analysis tasks like sentence classification, sentiment analysis and named entity recognition. GPT, on the other hand, is more commonly used for generation tasks like machine translation, summarization and conversation generation. Both BERT and GPT are today widely used in research and business applications.
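The difference shows up clearly in the high-level pipeline API of the transformers library: a BERT-style model fills in a masked word using context on both sides, while a GPT-style model continues a prompt one word at a time. The checkpoints below are generic stand-ins, not the Norwegian models discussed here.

```python
# Sketch: masked-word prediction (BERT-style) versus text generation (GPT-style).
from transformers import pipeline

# BERT-style: predict a masked word using context on both sides of the gap.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
print(fill_mask("Oslo er [MASK] i Norge."))

# GPT-style: generate a continuation, predicting one next token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The meeting was postponed because", max_new_tokens=10))
```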
NorBERT, released by the University of Oslo this year, is a BERT deep learning language model trained from scratch for Norwegian.
Its training corpora consist of Norwegian Wikipedia and a Norwegian news corpus from Språkbanken (Norsk Aviskorpus), amounting to roughly two billion word tokens. The vocabulary for NorBERT contains about 30 000 tokens and has a substantially higher coverage of Norwegian than the multilingual BERT models from Google. The model was trained on the Norwegian academic HPC system Saga with four compute nodes and 16 NVIDIA P100 GPUs over a three-week period.
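One rough way to see what higher vocabulary coverage means in practice is to compare how the two tokenizers split a Norwegian word: a vocabulary tuned to Norwegian tends to break words into fewer pieces than a multilingual one. The hub identifier "ltg/norbert" used below is an assumption for illustration; consult the official NorBERT release from the University of Oslo for the exact model name.

```python
# Sketch: comparing vocabulary coverage of a Norwegian vs. a multilingual tokenizer.
# NOTE: "ltg/norbert" is an assumed model identifier, not confirmed by the article.
from transformers import AutoTokenizer

norbert = AutoTokenizer.from_pretrained("ltg/norbert")
mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

word = "fylkeskommune"  # a common Norwegian compound word
print("NorBERT:", norbert.tokenize(word))
print("mBERT:  ", mbert.tokenize(word))
print("NorBERT vocabulary size:", norbert.vocab_size)
```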
NorwAI is now in the process of training a GPT-2 language model for industrial use. To train the GPT-2 model, in addition to the NorBERT training data sets, we plan to use recent news articles published by Norwegian media houses and subtitles from NRK productions. Computationally, training GPT-2 language models is extremely demanding, and we are working together with Sigma2, the organization responsible for managing the national e-infrastructure for computational science in Norway, to allocate the necessary computing resources. Supercomputers are needed to build these language models, but in the end, we will have models that are comparable to the best English language models and NLP applications that are able to communicate properly with humans in Norwegian.
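For readers curious about what such a training run involves in code, here is a much-simplified sketch of the causal language-modelling setup: a GPT-2 model initialized from scratch and trained to predict the next token over raw text. The corpus, tokenizer and configuration are toy placeholders; the actual NorwAI training uses billions of tokens and HPC resources.

```python
# Much-simplified sketch of training a GPT-2 model from scratch with a
# causal language-modelling objective. All data and settings are toy placeholders.
from datasets import Dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Toy "corpus"; in practice this would be news articles, subtitles, etc.
corpus = Dataset.from_dict({"text": ["Dette er en setning.", "Her er en til."]})

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# A fresh (untrained) GPT-2 model; real runs use far larger configurations.
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-no", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```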