A handful of major newspapers are in talks with OpenAI, the maker of ChatGPT, over access to a vital resource in the age of generative artificial intelligence: digital news stories.
For years, tech companies like OpenAI have freely used news stories to build data sets that teach their machines how to recognize and respond fluently to human queries about the world. But as the quest to develop cutting-edge AI models has grown increasingly frenzied, newspaper publishers and other data owners are demanding a share of the potentially massive market for generative AI, which is projected to reach $1.3 trillion by 2032, according to Bloomberg Intelligence.
Since August, at least 535 news organizations — including the New York Times, Reuters and The Washington Post — have installed a blocker that prevents their content from being collected and used to train ChatGPT. Now, discussions are focused on paying publishers so the chatbot can surface links to individual news stories in its responses, a development that would benefit the newspapers in two ways: by providing direct payment and by potentially increasing traffic to their websites.
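The "blocker" publishers have installed is typically a set of directives in a site's robots.txt file. As a hedged illustration (the exact rules vary by publisher), a site can disallow OpenAI's documented GPTBot crawler, and Google's Google-Extended token for AI training, like so:

```
# robots.txt — opt out of AI training crawlers
# GPTBot is OpenAI's published crawler token;
# Google-Extended is Google's token for AI training use.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Compliance is voluntary: robots.txt is a convention that well-behaved crawlers honor, not a technical barrier, which is one reason publishers are also pursuing paid licensing deals.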
In July, OpenAI cut a deal to license content from the Associated Press as training data for its AI models. The current talks have also addressed that idea, according to two people familiar with the discussions who spoke on the condition of anonymity to discuss sensitive matters, but have concentrated more on showing stories in ChatGPT responses.
Other sources of useful data are also looking for leverage. Reddit, the popular social message board, has met with top generative AI companies about being paid for its data, according to a person familiar with the matter, speaking on the condition of anonymity to discuss private negotiations.
If a deal can’t be reached, Reddit is considering blocking search crawlers from Google and Bing, which would prevent the forum from being discovered in searches and reduce the number of visitors to the site. But the company believes the trade-off would be worth it, the person said, adding: “Reddit can survive without search.”
And in April, Elon Musk began charging $42,000 for bulk access to posts on Twitter — which previously had been free to researchers — after he claimed that AI companies had illegally used the data to train their models. (Musk has since rebranded Twitter as X.)
The moves mark a growing sense of urgency and uncertainty about who profits from online information. With generative AI poised to transform how users interact with the internet, many publishers and other companies see fair payment for their data as an existential issue.
For example, a month after OpenAI launched GPT-4 in March, traffic to the coding community Stack Overflow declined by 15 percent as programmers turned to AI for answers to their coding questions, according to CEO Prashanth Chandrasekar, who also told The Post he thought the AI had been trained on Stack Overflow’s data.
This week, the company laid off 28 percent of its staff.
In addition to demands for payment, leading AI firms are facing a slew of copyright lawsuits from individual book authors, artists and software coders seeking damages for infringement, as well as a share of profits. Late Wednesday, former Arkansas governor Mike Huckabee joined the fray as a plaintiff in a class-action lawsuit against Meta, Microsoft and Bloomberg for using AI tools with pirated books to train AI systems, Reuters reported. Trade groups, meanwhile, are pushing lawmakers for the right to bargain collectively with tech companies.
OpenAI’s decision to negotiate may reflect a desire to strike deals before courts have a chance to weigh in on whether tech companies have a clear legal obligation to license — and pay for — content, said James Grimmelmann, a professor of digital and information law at Cornell University, who recently helped organize a workshop on generative AI and the law at the International Conference on Machine Learning.
An OpenAI spokesperson confirmed that the company is in talks with the newspapers and that discussions were not focused on prior training data, which it argues was obtained legally. “None of the company’s practices have violated copyright law,” the spokesperson said. “Any deal would be for future access to content that is otherwise inaccessible or display uses that go beyond fair use.”
Nearly $16 billion in venture capital poured into generative AI in the first three quarters of 2023, according to the analytics firm PitchBook — a flood of cash that in part reflects how expensive the technology is to build. Every component is prohibitively pricey or hard to acquire, from hardware to computing power.
Until now, the only free and easy part had been the data. The nonprofit Common Crawl, a widely used service that crawls the internet for troves of online text and archives the information for others to download, charges Google, Meta, OpenAI and others nothing. To assemble the vast quantities of natural language and specialized information needed to train large AI systems, tech companies have combined those archives with online data sets, accessing information made available for research purposes, and increasingly straying from information clearly in the public domain.
Until recently, tech companies have been loath to pay for that data. At a listening session on generative AI hosted in April by the U.S. Copyright Office, Sy Damle, a lawyer representing the Silicon Valley venture capital firm Andreessen Horowitz, acknowledged that “the only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.”
Even before OpenAI and Google released tools to block their AI data crawlers in August and September, huge online forums like Reddit, Stack Overflow and Wikipedia began defensive measures. The sites, which have long provided regular “data dumps” that made content easily available for AI training, now are developing or have launched paid portals for AI companies seeking training data, along with closely monitored limits on how often their sites can be mined.
While Reddit, Stack Overflow and news organizations usher in what he called a new era of “data strikes,” Nicholas Vincent, a professor of computing science at Simon Fraser University in British Columbia, cautioned that publishers will have to find strength in numbers: AI operators “never, ever care about one person leaving,” he said.
News Corp chief executive Robert Thomson echoed that understanding at a news media conference in May when asked if he would like to announce a deal with the big digital players. “I wish,” Thomson said. “But it can’t just be us.”
Since then, the media conglomerate IAC, which owns The Daily Beast, tried building a coalition of publishers who aimed to win billions of dollars from AI companies through a lawsuit or legislative action, according to a July report in Semafor. In August, NPR reported that the New York Times was also considering a lawsuit against OpenAI.
In the current climate, the data holders best positioned to make a deal are still companies accustomed to asserting their intellectual property rights rather than individual artists, authors and coders, said Yacine Jernite, who leads the machine learning and society team at Hugging Face, an open source AI start-up.
For example, the stock photo site Shutterstock has a partnership to provide training data for OpenAI. Late last year, the company also launched a Contributor Fund to compensate artists whose work has been used to train AI models. An analysis by stock photographer Robert Kneschke estimated that the fund paid out more than $4 million in May — but the median payout was just $0.0069 per image. Shutterstock did not respond to a request for comment.
Danielle Coffey, president and CEO of the News/Media Alliance (NMA), a trade group representing more than 2,000 publishers, said the White House and other policymakers have been receptive to the need for licensing deals. She recently organized a week of visits in Washington and various state capitals to advocate for copyright protections for publishers.
With generative AI, “what goes in, must come out,” Coffey said. “If quality content and quality journalism isn’t a part of that, then that is not a good thing for the products themselves — or for society.”
A previous version of this story incorrectly reported that Reddit was considering putting its content behind a log-in page for the first time. This version has been corrected.