Harmful behavior in language models

A language model's responses reflect the data in its training set. When the training data is incomplete, the model still combines words according to statistical probabilities and can construct sentences that are plausible and grammatically correct but have little to do with reality.
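To make "combining words according to statistical probabilities" concrete, the toy sketch below picks the next word by weighted random sampling. The words and probabilities are invented for illustration only; a real model computes such a distribution over its entire vocabulary at every step, which is why a plausible but wrong continuation can be chosen simply because it is statistically likely.

```python
import random

# Toy next-token probabilities, standing in for the distribution a real
# language model computes over its whole vocabulary at every step.
# The words and numbers here are invented purely for illustration.
next_token_probs = {
    "poet": 0.40,
    "author": 0.35,
    "farmer": 0.15,
    "composer": 0.10,
}

def sample_next_token(probs):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "Hans Hyldbakk was a Norwegian"
print(prompt, sample_next_token(next_token_probs))
```

Running the snippet several times gives different continuations, and none of them is checked against reality; the choice is driven by the probabilities alone.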

A typical Norwegian example occurs when the model lacks a specific answer in its training data and instead generates a grammatically correct but factually wrong response. For instance:

Question: Who was Hans Hyldbakk?

ChatGPT’s Response: “Hans Hyldbakk was a Norwegian author and poet, born on March 4, 1939, and died on July 18, 2001. He was originally from Vefsn in Nordland, Norway. Hyldbakk was best known for his work with poetry in Nynorsk, and he also wrote in the dialect of Vefsn. Hans Hyldbakk made his debut as an author in 1966 with the poetry collection ‘Å i ætt med vinden’ (‘In Line with the Wind’). He was later recognized for his contribution to Norwegian poetry and received several awards for his work. He wrote about nature, people, and culture in Northern Norway, and his poems are known for being close to the earth and influenced by the northern Norwegian dialect.”

The correct answer is that Hans Hyldbakk came from Surnadal in Nordmøre. He is best known for his local history books about Nordmøre and for poems that Henning Sommerro has set to music. The response from ChatGPT above is what we call a hallucination.
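Readers who want to check such answers themselves can ask the same question programmatically. A minimal sketch using the OpenAI Python client is shown below; it assumes the openai package is installed and an API key is set, and the model name is only an example. Because the model samples its output, repeated runs of the same question can give different, and differently wrong, answers.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; substitute any chat model
    messages=[{"role": "user", "content": "Who was Hans Hyldbakk?"}],
)

print(response.choices[0].message.content)
```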

DeepMind’s List

A systematic review of ethical and social risks from language models was published by DeepMind in 2021. Some language models handle these issues better than others, and since language models generate new text with each response, it is difficult to reproduce DeepMind's examples exactly. Nevertheless, DeepMind's classification gives concrete examples of how norms and values in the training data and training methods shape the behavior of a language model. The following problem areas were identified:

  • Discrimination, exclusion, and toxicity
  • Information hazards
  • Misinformation harms
  • Malicious uses
  • Human-computer interaction harms
  • Automation, access, and environmental harms