A closer look into personal data
How can (Norw)AI protect personal data?
Protecting personal information is challenging with complex AI models that are hungry for data. NorwAI’s pledge to provide an individualized AI experience that provably respects privacy concerns is therefore more important than ever.
For traditional data, like tabular data, the privacy challenges are fairly well understood. With unstructured data like text, however, privacy is less well-defined. Take this sentence as an example:
“The applicant, Dr Royce Darnell, who was born in 1929, has been unemployed since the Trent Regional Health Authority ("the RHA") terminated his employment as a consultant microbiologist and Director of the Public Health Laboratory in Derby.”
The name is a direct identifier, which is fairly easy to locate and possibly mask, while phrases such as the birth year, the former employer, the job titles and the location can be seen as indirectly identifying information, depending on which external sources of information – like the Internet – the reader has access to.
Without metrics or benchmarks for how well personal data in texts are protected, it is not possible to measure how a large language model handles privacy, or to train or tune such a model so that it does not memorise too much personal information. And since privacy breaches can happen in many and sometimes mysterious ways, a single metric is not sufficient. At NR, also via NorwAI, we have therefore over several years been developing NLP (natural language processing) methods to:
- detect entities (like names, addresses, etc.)
- link entities (Dr Royce Darnell may be called just Royce in parts of the text, or even Roycy?)
- detect confidential attributes (like sexual orientation)
- measure how well a method is able to detect or mask personal information (that is, evaluation metrics for the protection of personal data)
Such methods can, for example, be used when training or fine-tuning the NorGPT model to reduce the chance that the model contains or leaks personal data.
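To illustrate the first of these steps, the sketch below detects and masks named entities in the example sentence above. It is only a minimal illustration using the open-source spaCy library and its small English model, not the NR/NorwAI tooling itself, and the choice of which entity labels to mask is an assumption made for the example.

```python
# Minimal sketch of entity detection and masking with spaCy.
# Illustration only, not the NR/NorwAI tooling; it assumes the
# "en_core_web_sm" model is installed (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

# Entity labels we choose to treat as potentially identifying (an assumption).
MASK_LABELS = {"PERSON", "DATE", "ORG", "GPE", "LOC"}

def mask_entities(text: str) -> str:
    """Replace detected entity spans with their label, e.g. [PERSON]."""
    doc = nlp(text)
    parts = []
    last_end = 0
    for ent in doc.ents:
        if ent.label_ in MASK_LABELS:
            parts.append(text[last_end:ent.start_char])
            parts.append(f"[{ent.label_}]")
            last_end = ent.end_char
    parts.append(text[last_end:])
    return "".join(parts)

example = ("The applicant, Dr Royce Darnell, who was born in 1929, has been "
           "unemployed since the Trent Regional Health Authority terminated "
           "his employment as a consultant microbiologist in Derby.")
print(mask_entities(example))
# Possible output: "The applicant, Dr [PERSON], who was born in [DATE], ..."
```

A real masking pipeline must of course also handle the indirect identifiers and co-references discussed above, which is exactly where off-the-shelf entity recognition falls short.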
Open science is not possible without open data
Since personal data are typically not available for open science, we have constructed the TAB (Text Anonymization Benchmark): an open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR), enriched with comprehensive annotations of the personal information appearing in each document, including semantic categories, identifier types, confidential attributes, and co-reference relations. The TAB corpus goes beyond traditional de-identification and explicitly marks which text spans ought to be masked to conceal the identity of the person to be protected.
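To give a flavour of what an evaluation over such annotations can look like, the sketch below computes a simple span-level recall: the fraction of gold spans marked for masking that a system actually masked. This is only an illustrative metric under simplified assumptions, not the official TAB evaluation, which is more nuanced (for instance in how it treats direct versus indirect identifiers).

```python
# Minimal sketch of a span-level masking evaluation, assuming gold annotations
# of text spans that ought to be masked (as in TAB) and spans predicted by a
# masking tool. Illustration only; not the official TAB metric.
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char)

def covered(gold: Span, predicted: List[Span]) -> bool:
    """A gold span counts as protected if some predicted mask fully covers it."""
    return any(p_start <= gold[0] and gold[1] <= p_end
               for p_start, p_end in predicted)

def masking_recall(gold_spans: List[Span], predicted_spans: List[Span]) -> float:
    """Fraction of gold (identifying) spans that the system actually masked.

    For privacy, recall matters most: every missed span is a potential leak.
    """
    if not gold_spans:
        return 1.0
    hits = sum(covered(g, predicted_spans) for g in gold_spans)
    return hits / len(gold_spans)

# Toy example: three gold spans, of which the system masked two.
gold = [(15, 32), (51, 55), (120, 160)]
pred = [(15, 32), (118, 162)]
print(f"Recall: {masking_recall(gold, pred):.2f}")  # Recall: 0.67
```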
When GPT-4 was released in 2023, TAB was used to test its anonymisation performance (see the paper "Sparks of Artificial General Intelligence: Early experiments with GPT-4").
Is anonymisation possible at all from a technical or legal viewpoint?
As mentioned, much of the legal and technical literature on data anonymisation has focused on structured data such as tables. However, unstructured data such as text documents or images are far more common, and the legal requirements that must be fulfilled to properly anonymise such data formats remain unclear.
In the absence of a definition of the term ‘anonymous data’ in the GDPR (General Data Protection Regulation), we have – together with legal experts from the University of Oslo and Oslo Metropolitan University – examined what conditions must be in place for the anonymisation of unstructured data. We concluded that anonymisation of unstructured data is virtually impossible if the original data continues to exist. Hence, for all practical purposes, only a risk-based approach is realistic: we can estimate a certain privacy risk for a text corpus or a language model, but a strict guarantee is not possible.
By: Anders Løland, Research director, Norwegian Computing Center (NR)
PUBLISHED: 2024-03-19