Lessons learned about Language Models
Some Observations from the Language Model Work
Several interesting aspects have come to light while working with language models:
- Transparency: International models often lack transparency. We don’t have access to the details of their datasets, nor do we receive information about the methods they use for data cleaning, fine-tuning, model adaptation, or handling toxic or discriminatory language.
- Copyrights: Most international models use copyrighted content without proper authorization. This includes, for example, Norwegian books available in online collections.
- Sustainability: Training large language models requires significant computational resources, substantial energy consumption, and the involvement of many people. OpenAI has faced criticism for employing underpaid Kenyan workers to filter toxic or biased language from its models. Training a large language model can easily consume around 1,000 megawatt-hours of electricity.
- Values and norms: International models tend to reflect the norms and values of the dominant languages in their training data. This manifests in the types of questions the models avoid answering and, more significantly, in the overall meaning and balance of the generated text.
- Language variants: Language models are surprisingly good at distinguishing between different languages. Abstractions from one language in the model appear to assist in handling related languages. Training texts in Bokmål (one of the Norwegian written forms) seem useful for generating responses in Nynorsk (another Norwegian written form).
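The Bokmål/Nynorsk distinction mentioned above can be illustrated with a deliberately simple heuristic. This is a minimal sketch, not how a language model distinguishes variants internally; the marker-word lists are a small, illustrative selection of common function words that differ between the two written forms.

```python
# Minimal sketch: telling Bokmål from Nynorsk by a handful of marker words.
# The word lists are illustrative examples, not an exhaustive vocabulary.
BOKMAL = {"ikke", "jeg", "hun", "noen", "hvordan"}
NYNORSK = {"ikkje", "eg", "ho", "nokon", "korleis"}

def guess_variant(text: str) -> str:
    """Guess the Norwegian written form of a short text."""
    words = set(text.lower().split())
    bokmal_hits = len(words & BOKMAL)
    nynorsk_hits = len(words & NYNORSK)
    if bokmal_hits > nynorsk_hits:
        return "bokmål"
    if nynorsk_hits > bokmal_hits:
        return "nynorsk"
    return "unknown"

print(guess_variant("Jeg vet ikke"))   # bokmål
print(guess_variant("Eg veit ikkje"))  # nynorsk
```

A real model learns such distinctions implicitly from data rather than from hand-picked word lists, which is why training text in one variant can transfer to the other.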
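The roughly 1,000 MWh figure under Sustainability can be sanity-checked with a back-of-envelope calculation. All input numbers below are hypothetical assumptions for illustration, not figures from any specific training run.

```python
# Back-of-envelope estimate of training energy (illustrative assumptions only).
num_gpus = 1_000      # hypothetical: accelerators running in parallel
gpu_power_kw = 0.4    # hypothetical: average draw per accelerator, in kW
training_days = 100   # hypothetical: wall-clock training time

hours = training_days * 24
energy_mwh = num_gpus * gpu_power_kw * hours / 1_000  # kWh -> MWh
print(f"Estimated training energy: {energy_mwh:.0f} MWh")  # 960 MWh
```

Under these assumptions the estimate lands near 1,000 MWh, which is consistent in order of magnitude with the figure cited above.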
At NorwAI, we are actively developing the concept of responsible language models. This involves full methodological transparency, agreements with rights holders, and the use of balanced and quality-assured training data.