The GEDI-OpenAI Case: A Warning About the Processing of Sensitive Data for AI Training


The intersection of artificial intelligence (AI) and data protection law has increasingly become a contested legal battleground, particularly in the context of large-scale data sharing agreements between media organizations and AI developers. A case highlighting this tension is the recent warning[i] issued by the Italian Data Protection Authority (Garante per la Protezione dei Dati Personali) [hereinafter, the “Garante”] against GEDI Gruppo Editoriale S.p.A. [hereinafter, “GEDI”], one of Italy’s largest media groups, over its planned sharing of editorial content containing sensitive data with OpenAI OpCo, LLC [hereinafter, “OpenAI”], the US-based company that developed ChatGPT.

The analysis that follows critically examines the Garante’s warning through the dual lens of the General Data Protection Regulation [“GDPR”] and the Artificial Intelligence Act [“AI Act”], assessing the legal implications of using personal data – particularly sensitive personal data – for AI training.

  1. Background of the Case

On 24 September 2024, GEDI entered into a data sharing agreement with OpenAI, under which GEDI agreed to share a broad range of Italian-language editorial content from publications such as La Repubblica and La Stampa. The objectives of this agreement were twofold: a) to allow ChatGPT users to carry out real-time searches of news items from the data shared by GEDI, with a summary generated simultaneously by OpenAI’s AI systems; and b) to enable OpenAI to train its AI algorithms and improve its services.

The Garante raised concerns about this data sharing arrangement, particularly regarding the processing of personal data contained in GEDI’s publications. In response to the Garante’s request for information, GEDI provided its Data Protection Impact Assessment (DPIA), which identified GEDI’s purported legitimate interest in carrying out journalistic activities as the declared legal basis for the processing of personal data.

On 27 November 2024, the Garante issued a formal warning to GEDI, noting that the agreement could violate several provisions of the GDPR, particularly those governing the processing of special categories of personal data under Article 9 of the GDPR, also known as sensitive data. The Garante stressed that legitimate interest is not a condition legitimizing the processing of sensitive data. It further questioned whether GEDI’s invocation of journalistic innovation could serve as an appropriate legal basis for the training of OpenAI’s AI systems, since such training falls entirely outside GEDI’s participation and control.

  2. Legitimate Interest is not a Legal Basis for Processing Special Categories of Personal Data

The Garante’s decision underscores the GDPR’s stringent standards when it comes to the processing of sensitive data. Under Article 9 of the GDPR, special categories of personal data are data revealing “racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation.”

The threshold for processing special categories of personal data is higher than that for non-sensitive personal data. As highlighted in Recital 51 of the GDPR, such ‘sensitive data’ should not be processed unless the processing is expressly allowed under Article 9(2) of the GDPR. These narrow exceptions include explicit consent from the data subject (Article 9(2)(a)), processing necessary for reasons of substantial public interest (Article 9(2)(g)), and processing necessary for the establishment, exercise or defence of legal claims (Article 9(2)(f)), among others. Data controllers, including GEDI, must consequently ensure the presence of at least one such legal basis before processing or transferring sensitive data to third parties like OpenAI.

In this case, however, GEDI invoked legitimate interest as its sole legal basis. Crucially, and as noted by the Garante, legitimate interest is not among the accepted exceptions for processing sensitive data.

GEDI’s reliance on legitimate interest as a legal basis is understandable from a practical perspective. Obtaining explicit consent from each individual for the use of such data in training OpenAI’s models would be practically impossible.[ii] This limitation is even more pronounced given GEDI’s proposal to transfer the digital archives of newspapers containing the stories of millions of people, rendering the collection of each data subject’s consent exceedingly burdensome, if not impossible. Moreover, data subjects retain the right to withdraw previously granted consent at any time.[iii] Such withdrawal may critically undermine the functionality of an AI system, particularly where the input data has been integrated with other data that is either non-personal or relates to other data subjects.

Despite these practical difficulties, it is indisputable that Article 9 of the GDPR does not include legitimate interest as a derogation from the prohibition on processing sensitive data. The reasons for such stricter standards are readily apparent.

First, these categories receive heightened protection due to the particular sensitivity of such data, the processing of which may constitute a serious interference with the fundamental rights to privacy and the protection of personal data.[iv]

Secondly, collecting and using sensitive data is more likely to expose data subjects to some degree of sorting, which in turn may yield discriminatory effects on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status, or sexual orientation. Indeed, sorting individuals on the basis of these traits is objectionable from both a moral and an economic perspective. In the moral sphere, fairness implies that individuals should not be made to suffer because of factors beyond their control.[v] In the economic sphere, the assumption of causal links between traits and performance is often unwarranted.[vi]

Thirdly, the processing of sensitive personal data may result, directly or indirectly, in physical or emotional injury. In its Explanatory Report on Convention 108 as modified by the amending Protocol,[vii] the Council of Europe (COE) adopted a wider approach in justifying the stricter protection afforded to sensitive data.[viii] The COE noted the possibility not only of discrimination, but also of “injury to an individual’s dignity and physical integrity”[ix] leading to encroachments on interests, rights and freedoms.

By way of example, news articles and press releases about past criminal offences, including merely suspected offences, as well as previous or pending criminal proceedings, may reveal a person’s political opinions, thereby adversely influencing any ongoing investigation or hearing. Write-ups may intrude into the most intimate aspects of a person’s life, revealing sensitive information about their sex life and sexual orientation, the public disclosure of which may lead to emotional trauma or psychological distress. Even pictures in newspapers may reveal sensitive information about a person, including ethnic origin, religious beliefs or health status, which may in turn be used to subject the person to social ostracism, or even denial of services or employment.

Therefore, GEDI’s invocation of legitimate interest cannot by itself justify the processing of sensitive data, especially at massive scale, given its potentially significant adverse implications for data subjects and its potential to cause physical, emotional or psychological harm. A company’s corporate interest must not take precedence over an individual’s fundamental rights, particularly the right to privacy. Since the GDPR requires that the processing of personal data always be balanced against the protection of the rights and freedoms of data subjects, fundamental rights serve as a limit on corporate strategies that prioritize profit maximization over the protection of sensitive data.

In any case, the Garante rightly questioned whether journalistic innovation, as GEDI claimed, may be considered a legitimate interest for processing the data. The Garante appears to have weighed two factors in making this assessment: a) the entity processing the personal data; and b) the nature and purpose of the processing. It stressed that the data was intended to be shared with and processed by OpenAI, an entity not directly engaged in journalistic endeavors or media functions. Furthermore, the data shared by GEDI was to be used by OpenAI for the additional purposes of training the latter’s AI systems, improving its services, and enhancing the accessibility of content to ChatGPT users. Indeed, the training of AI cannot be considered to fall within the scope of journalistic activities, especially when carried out not by GEDI as a media group, but by OpenAI in the context of commercial product development.

2.1 Conditions for Training AI Systems Using Special Categories of Personal Data

Notwithstanding the high standards imposed by Article 9 of the GDPR, it should be noted that there are limited instances when sensitive data may be used to train AI systems. The AI Act, which aims to promote the adoption of human-centric and trustworthy AI while mitigating its potential risks and safeguarding fundamental rights,[x] allows for the use of sensitive data for training AI systems for the purpose of bias detection and correction in relation to high-risk AI systems.[xi] Additionally, compliance with the following conditions is mandatory:

  1. Qualification of the AI system as high-risk;
  2. The implementation of appropriate safeguards to protect the fundamental rights and freedoms of natural persons; and
  3. Full compliance with the requirements set forth in Article 10(5) of the AI Act, including limitations on the transmission, transfer of, or access to the data by third parties.

In light of these regulatory constraints, the use of sensitive data for the purpose of training OpenAI’s AI systems presents significant legal obstacles. ChatGPT, as a generative AI model, does not inherently fall within the category of high-risk AI systems as defined by the AI Act.[xii] As observed by the Garante, GEDI does not appear to be in a position to guarantee the rights of data subjects, particularly the right to object, given that such rights can only be exercised directly against OpenAI. Furthermore, compliance with Article 10(5) of the AI Act is likely unattainable in this context. The data-sharing agreement between GEDI and OpenAI is intended to facilitate real-time searches of GEDI’s news archives by ChatGPT users, with OpenAI’s AI system generating corresponding summaries. Crucially, this arrangement effectively grants third-party access to GEDI’s archive, which includes sensitive data. Such a practice directly contravenes the AI Act’s requirement to impose strict limitations on third-party access to sensitive data.

Consequently, the partnership between GEDI and OpenAI raises serious concerns regarding its compatibility with the AI Act’s legal framework.

  3. Conclusion

The decision of the Garante serves as a compelling reminder to companies of the high standards required when processing sensitive data, including its use for training AI systems. Consistent with the principle of accountability, companies should ensure that they comply, and are able to demonstrate compliance, with all the mandates of the GDPR and the AI Act, including ensuring the presence of a legal basis for the processing of sensitive data and satisfying all conditions for the development and deployment of AI systems.

Unsurprisingly, some question the rigidity of the existing regulatory framework. Data controllers, AI system providers and other critics view it as too rigid, with the potential to stifle innovation and slow progress in AI development. Such frameworks, however, should be regarded as legal tools that drive responsible AI development. A clear and well-defined legal framework, despite its inherent rigidity, establishes understandable standards that developers can rely on when training and deploying AI systems. By providing legal certainty and compliance guidelines, such a framework helps mitigate risks and ensures that AI development aligns with fundamental rights and EU values. In turn, this fosters public trust and confidence in AI technologies, promoting their responsible adoption and long-term sustainability.


[i] The formal warning may be accessed here: https://www.garanteprivacy.it/web/guest/home/docweb/-/docweb-display/docweb/10077129

[ii] Oreste Pollicino, ‘Generative AI and the Rediscovery of the Legitimate Interest Clause’ (2023) https://iep.unibocconi.eu/publications/generative-ai-and-rediscovery-legitimate-interest-clause-0 accessed 27 January 2025.

[iii] GDPR, Article 7(3).

[iv] GC and Others (De-referencing of sensitive data), C‑136/17, EU:C:2019:773, paragraph 44.

[v] Oscar H Gandy Jr, ‘Legitimate Business Interest: No End in Sight – An Inquiry into the Status of Privacy in Cyberspace’ (1996) 1996 U Chi Legal F 77.

[vi] Ibid.

[vii] Council of Europe, Explanatory Report on No 223 of the Council of Europe Treaty Series—Protocol Amending the Convention for the Protection of Individuals with Regard to Automatic Processing of Personal Data (2018) https://rm.coe.int/cets-223-explanatory-report-to-the-protocol-amending-the-convention-fo/16808ac91a accessed 27 January 2025.

[viii] Paul Quinn and Gianclaudio Malgieri, ‘The Difficulty of Defining Sensitive Data—The Concept of Sensitive Data in the EU Data Protection Framework’ (2020) German Law Journal https://www.cambridge.org/core/journals/german-law-journal/article/difficulty-of-defining-sensitive-datathe-concept-of-sensitive-data-in-the-eu-data-protection-framework/5EC5932AAC5703E31D2C90045813F6C6 accessed 28 January 2025.

[ix] Ibid 6.

[x] AI Act, Recital 1.

[xi] AI Act, Article 10.

[xii] European Parliament, ‘EU AI Act: First Regulation on Artificial Intelligence’ (1 June 2023) https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence accessed 28 January 2025.
