Intellectual property and data privacy: the hidden risks of AI


Close-up of a person holding a phone showing the ChatGPT application

Although ChatGPT and other generative AI chatbots are transformative tools, risks to privacy and content ownership are baked in. Credit: Jaap Arriens/NurPhoto/Getty

Timothée Poisot, a computational ecologist at the University of Montreal in Canada, has made a successful career out of studying the world’s biodiversity. A guiding principle for his research is that it must be useful, Poisot says, as he hopes it will be later this year, when it joins other work being considered at the 16th Conference of the Parties (COP16) to the United Nations Convention on Biological Diversity in Cali, Colombia. “Every piece of science we produce that is looked at by policymakers and stakeholders is both exciting and a little terrifying, since there are real stakes to it,” he says.

But Poisot worries that artificial intelligence (AI) will interfere with the relationship between science and policy in the future. Chatbots such as Microsoft’s Bing, Google’s Gemini and ChatGPT, made by tech firm OpenAI in San Francisco, California, were trained using a corpus of data scraped from the Internet — which probably includes Poisot’s work. But because chatbots don’t often cite the original content in their outputs, authors are stripped of the ability to understand how their work is used and to check the credibility of the AI’s statements. It seems, Poisot says, that unvetted claims produced by chatbots are likely to make their way into consequential meetings such as COP16, where they risk drowning out solid science.

“There’s an expectation that the research and synthesis is being done transparently, but if we start outsourcing those processes to an AI, there’s no way to know who did what and where the information is coming from and who should be credited,” he says.

Since ChatGPT’s arrival in November 2022, it seems that there’s no part of the research process that chatbots haven’t touched. Generative AI (genAI) tools can now perform literature searches; write manuscripts, grant applications and peer-review comments; and even produce computer code. Yet because the tools are trained on huge data sets, which are often not made public, these digital helpers can also clash with ownership, plagiarism and privacy standards in unexpected ways that cannot be addressed under current legal frameworks. And as genAI, overseen mostly by private companies, increasingly enters the public domain, the onus is often on users to ensure that they are using the tools responsibly.

Bot bounty

The technology underlying genAI, which was first developed at public institutions in the 1960s, has now been taken over by private companies, which usually have no incentive to prioritize transparency or open access. As a result, the inner mechanics of genAI chatbots are almost always a black box — a series of algorithms that aren’t fully understood, even by their creators — and attribution of sources is often scrubbed from the output. This makes it nearly impossible to know exactly what has gone into a model’s answer to a prompt. Organizations such as OpenAI have so far asked users to ensure that outputs used in other work do not violate laws, including intellectual-property and copyright regulations, or divulge sensitive information, such as a person’s location, gender, age, ethnicity or contact information. Studies have shown that genAI tools might do both1,2.

Chatbots are powerful in part because they have learnt from nearly all the information on the Internet — obtained through licensing agreements with publishers such as the Associated Press and social-media platforms including Reddit, or through broad trawls of freely accessible content — and they excel at identifying patterns in mountains of data. For example, the GPT-3.5 model, which underlies one version of ChatGPT, was trained on roughly 300 billion words, which it uses to create strings of text on the basis of predictive algorithms.

The Supreme Court of the United States is seen through fencing

The approach to AI regulation is likely to differ between the United States and Europe. Credit: Amanda Andrade-Rhoades for The Washington Post/Getty

AI companies are increasingly interested in developing products marketed to academics. Several have released AI-powered search engines. In May, OpenAI announced ChatGPT Edu, a platform that layers extra analytical capabilities onto the company’s popular chatbot and includes the ability to build custom versions of ChatGPT.

Two studies this year have found evidence of widespread genAI use to write both published scientific manuscripts3 and peer-review comments4, even as publishers attempt to place guardrails around the use of AI by either banning it or asking writers to disclose whether and when AI is used. Legal scholars and researchers who spoke to Nature made it clear that, when academics use chatbots in this way, they open themselves up to risks that they might not fully anticipate or understand. “People who are using these models have no idea what they’re really capable of, and I wish they’d take protecting themselves and their data more seriously,” says Ben Zhao, a computer-security researcher at the University of Chicago in Illinois who develops tools to shield creative work, such as art and photography, from being scraped or mimicked by AI.

When contacted for comment, an OpenAI spokesperson said the company was looking into ways to improve the opt-out process. “As a research company, we believe that AI offers huge benefits for academia and the progress of science,” the spokesperson says. “We respect that some content owners, including academics, may not want their publicly available works used to help teach our AI, which is why we offer ways for them to opt out. We’re also exploring what other tools may be useful.”

In fields such as academia, in which research output is linked to professional success and prestige, losing out on attribution not only denies people compensation, but also perpetuates reputational harm. “Removing people’s names from their work can be really damaging, especially for early-career scientists or people working in places in the global south,” says Evan Spotte-Smith, a computational chemist at Carnegie Mellon University in Pittsburgh, Pennsylvania, who avoids using AI for ethical and moral reasons. Research has shown that members of groups that are marginalized in science have their work published and cited less frequently than average5, and overall have access to fewer opportunities for advancement. AI stands to further exacerbate these challenges, Spotte-Smith says: failing to attribute someone’s work to them “creates a new form of ‘digital colonialism’, where we’re able to get access to what colleagues are producing without needing to actually engage with them”.

Portrait of Evan Spotte-Smith

Computational chemist Evan Spotte-Smith avoids using AI tools for ethical reasons. Credit: UC Berkeley Engineering Student Services

Academics today have little recourse in directing how their data are used or having them ‘unlearnt’ by existing AI models6. Research is often published open access, and it is more challenging to litigate the misuse of published papers or books than that of a piece of music or a work of art. Zhao says that most opt-out policies “are at best a hope and a dream”. Moreover, many researchers don’t even own the rights to their creative output, having signed them over to institutions or publishers, which in turn can enter partnerships with AI companies seeking to use those corpora to train new models and create products that can be marketed back to academics.

Representatives of the publishers Springer Nature, the American Association for the Advancement of Science (which publishes the Science family of journals), PLOS and Elsevier say they have not entered such licensing agreements — although some, including those for the Science journals, Springer Nature and PLOS, noted that the journals do disclose the use of AI in editing and peer review and to check for plagiarism. (Springer Nature publishes Nature, but the journal is editorially independent from its publisher.)

Other publishers, such as Wiley and Oxford University Press, have brokered deals with AI companies. Taylor & Francis, for example, has a US$10-million agreement with Microsoft. Cambridge University Press (CUP) has not yet entered any such partnerships, but is developing policies that will offer an ‘opt-in’ agreement to authors, who will receive remuneration. In a statement to The Bookseller magazine discussing future plans for CUP — which oversees 45,000 print titles, more than 24,000 e-books and more than 300 research journals — Mandy Hill, the company’s managing director of academic publishing, who is based in Oxford, UK, said that it “will put authors’ interests and desires first, before allowing their work to be licensed for GenAI”.

Some authors are unsettled by the news that their work will be fed into AI algorithms (see ‘How to protect your intellectual property from AI’). “I don’t feel confident that I can predict all the ways AI might impact me or my work, and that feels frustrating and a little frightening,” says Edward Ballister, a cancer biologist at Columbia University in New York City. “I think institutions and publishers have a responsibility to think about what this all means and to be open and communicative about their plans.”

How to protect your intellectual property from AI

New laws will ultimately establish more robust expectations around ownership and transparency of the data used to train generative AI (genAI) models. Meanwhile, there are a few steps that researchers can take to protect their intellectual property (IP) and safeguard sensitive data.

1. Think critically about whether AI is truly needed.

Abstaining from using genAI might feel like missing out on a golden opportunity. But for certain disciplines — particularly those that involve sensitive data, such as medical diagnoses — giving it a miss could be the more ethical option. “Right now, we don’t really have a good way of making AI forget, so there are still a lot of constraints on using these models in health-care settings,” says Uri Gal, an informatician at the University of Sydney in Australia, who studies the ethics of digital technologies.

2. If you do use AI, implement safeguards.

Specialists broadly agree that it’s nearly impossible to completely shield your data from web scrapers, tools that extract data from the Internet. However, there are some steps — such as hosting data locally on a private server or making resources open and available, but only by request — that can add an extra layer of oversight. Several companies, including OpenAI, Microsoft and IBM, allow customers to create their own chatbots, trained on their own data, that can be isolated in this way.

3. When possible, opt out.

The enforceability of opt-out policies that omit data from AI training sets varies widely, but companies such as Slack, Adobe, Quora, Squarespace, Substack and OpenAI all offer options to prevent content from being scraped. However, some platforms make the process more challenging than others or limit the option to certain types of account. If you’re good at coding, you can modify your personal website’s robots.txt file, which tells web crawlers whether they are allowed to visit your page, to keep the tools from scraping your content.
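As a rough sketch, the robots.txt entries below ask three widely used AI crawlers not to visit a site. The user-agent names (GPTBot for OpenAI, Google-Extended for Google’s AI-training crawler and CCBot for Common Crawl) are the ones those organizations have documented, but names change over time and compliance is voluntary, so check each company’s current guidance before relying on them.

# robots.txt — request that AI-training crawlers stay away from the whole site
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

A robots.txt file is only a request: reputable crawlers honour it, but it does not technically block access to your content.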

4. If you can, ‘poison’ your data.

Scientists can now detect whether visual products, such as images or graphics, have been included in a training set, and have developed tools that can ‘poison’ data such that AI models trained on them break in unpredictable ways. “We basically teach the models that a cow is something with four wheels and a nice fender,” says Ben Zhao, a computer-security researcher at the University of Chicago in Illinois. Zhao worked on one such tool, called Nightshade, which manipulates the individual pixels of an image so that an AI model associates the corrupted pattern with a different type of image (a dog instead of a cat, for example). Unfortunately, there are not yet similar tools for poisoning writing.
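Nightshade’s actual method is model-specific and considerably more sophisticated, but the flavour of the idea can be sketched in a few lines of Python. The toy function below (written for this article, not taken from Nightshade) nudges an image’s pixels a bounded amount towards a ‘target’ image of a different concept, keeping each change small enough to be hard to notice; a real poisoning tool optimizes the perturbation against a model’s feature extractor, which this sketch does not attempt.

import numpy as np

def toy_nudge(image: np.ndarray, target: np.ndarray, epsilon: float = 8.0) -> np.ndarray:
    """Blend `image` towards `target`, capping the change at +/- epsilon per channel.

    Toy illustration only: this simple bounded blend will not fool a real model.
    Both inputs are assumed to be uint8 RGB arrays with the same shape.
    """
    delta = target.astype(np.float32) - image.astype(np.float32)
    delta = np.clip(delta, -epsilon, epsilon)  # keep the edit visually subtle
    return np.clip(image.astype(np.float32) + delta, 0, 255).astype(np.uint8)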

5. Voice your concerns.

Academics often sign their IP over to institutions or publishers, giving them less leverage in deciding how their data are used. But Christopher Cornelison, the director of IP development at Kennesaw State University in Georgia, says it’s worth starting a conversation with your institution or publisher if you have concerns. These entities could be better placed to broker a licensing agreement with an AI company or pursue litigation when infringement seems likely to happen. “We certainly don’t want an adversarial relationship with our faculty, and the expectation is that we’re working towards a common goal,” he says.

Some evidence suggests, however, that publishers are noting scientists’ discomfort and acting accordingly. Daniel Weld, chief scientist at the AI search engine Semantic Scholar, based at the University of Washington in Seattle, has noticed that more publishers and individuals are reaching out to retroactively request that papers in the Semantic Scholar corpus not be used to train AI models.

The law weighs in

International policy is only now catching up with the burst of AI technology, and clear answers to foundational questions — such as where AI output falls under existing copyright legislation, who owns that copyright and what AI companies need to consider when they feed data into their models — are probably years away. “We are now in this period where there are very fast technological developments, but the legislation is lagging,” says Christophe Geiger, a legal scholar at Luiss Guido Carli University in Rome. “The challenge is how we establish a legal framework that will not disincentivize progress, but still take care of our human rights.”

Portrait of Dragoş Tudorache

Dragoş Tudorache was instrumental in designing the world’s first comprehensive AI legislation, the EU AI Act. Credit: European Parliament

Even as observers settle in for what could be a long wait, Peter Yu, an intellectual-property lawyer and legal scholar at Texas A&M University School of Law in Fort Worth, says that existing US case law suggests that the courts will be more likely to side with AI companies, in part because the United States often prioritizes the development of new technologies. “That helps push technology to a high level in the US when a lot of other countries are still trying to catch up, but it makes it more challenging for creators to pursue suspected infringement.”

The European Union, by contrast, has historically favoured personal protections over the development of new technologies. In May, it approved the world’s first comprehensive AI law, the AI Act. This broadly categorizes uses of AI on the basis of their potential risks to people’s health, safety or fundamental rights, and mandates corresponding safeguards. Some applications, such as using AI to infer sensitive personal details, will be banned. The law will be rolled out over the next two years, coming into full effect in 2026, and applies to models operating in the EU.

The impact of the AI Act on academia is likely to be minimal, because the policy gives broad exemptions for products used in research and development. But Dragoş Tudorache, a member of the European Parliament and one of the two lead negotiators of the AI Act, hopes the law will have trickle-down effects on transparency. Under the act, AI companies producing “general purpose” models, such as chatbots, will be subject to new requirements, including an accounting of how their models are trained and how much energy they use, and will need to offer opt-out policies and enforce them. Any group that violates the act could be fined as much as 7% of its annual global turnover.

Tudorache sees the act as an acknowledgement of a new reality in which AI is here to stay. “We’ve had many other industrial revolutions in the history of mankind, and they all profoundly affected different sectors of the economy and society at large, but I think none of them have had the deep transformative effect that I think AI is going to have,” he says.

