Academic publishers are selling access to research papers to technology firms to train artificial-intelligence (AI) models. Some researchers have reacted with dismay that such deals are being struck without authors being consulted. The trend is raising questions about the use of published, and sometimes copyrighted, work to train the rapidly growing number of AI chatbots in development.
Experts say that, if a research paper hasn’t yet been used to train a large language model (LLM), it probably will be soon. Researchers are exploring technical ways for authors to spot if their content is being used.
Last month, it emerged that the UK academic publisher Taylor & Francis had signed a US$10-million deal with Microsoft, allowing the US technology company to access the publisher’s data to improve its AI systems. In June, an investor update showed that US publisher Wiley had earned $23 million from allowing an unnamed company to train generative-AI models on its content.
### The Value of Research Papers in AI Training
LLMs train on huge volumes of data, frequently scraped from the Internet. They derive patterns from billions of snippets of language in the training data, known as tokens, and these patterns allow them to generate text with uncanny fluency. Generative-AI models rely on absorbing patterns from these swathes of data to output text, images, or computer code. Academic papers are valuable to LLM builders owing to their length and “high information density,” says Stefan Baack, who analyzes AI training data sets at the Mozilla Foundation.
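To make the idea of tokens concrete, the short sketch below splits a sentence into tokens using the open-source Hugging Face `transformers` library; the choice of the GPT-2 tokenizer and the example sentence are illustrative assumptions, not details of any publisher deal described here.

```python
# Minimal illustration of how text is broken into tokens before LLM training.
# Assumes the Hugging Face `transformers` library; GPT-2 is an example tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Academic papers are valuable training data owing to their information density."
token_ids = tokenizer.encode(sentence)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(f"{len(token_ids)} tokens: {tokens}")
# During training, a model learns to predict each token from the tokens before it.
```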
Training models on a large body of scientific information also gives them a much better ability to reason about scientific topics. Lucy Lu Wang, an AI researcher at the University of Washington, co-created a data set based on 81.1 million academic papers. The data set was originally developed for text mining but has since been used to train LLMs.
### The Growing Trend of Buying High-Quality Data Sets
The trend of buying high-quality data sets is growing. This year, the Financial Times has offered its content to ChatGPT developer OpenAI in a lucrative deal, and the online forum Reddit has struck a similar agreement with Google. Given that scientific publishers probably view the alternative as their work being scraped without any agreement, more such deals are expected.
Some AI developers intentionally keep their data sets open, but many firms developing generative-AI models keep much of their training data secret. Open-access repositories such as arXiv and the scholarly database PubMed are thought to be popular sources, and the free-to-read abstracts of paywalled journal articles are probably also scraped by big technology firms.
### Methods to Check Data in Training Sets
Proving that an LLM has used any individual paper is difficult. One method is to prompt the model with an unusual sentence from the text and check whether the output matches the next words in the original. Yves-Alexandre de Montjoye, a computer scientist at Imperial College London, has developed a version of this kind of membership-inference attack, known as a “copyright trap,” for LLMs.
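The prompting test described above can be sketched roughly as follows. The model name, the probe sentence, and the word-overlap check are illustrative assumptions for an open-weight model queried through Hugging Face `transformers`; they are not details of de Montjoye’s copyright traps.

```python
# Sketch of a simple membership-inference-style probe: prompt a model with the
# start of a distinctive sentence and check whether it reproduces the original
# continuation. Model name, sentence, and overlap check are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the model being audited
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A distinctive sentence suspected to appear in the training data (hypothetical).
original = ("The spectral signature of the sample showed an anomalous peak "
            "at 417 nanometres under cryogenic conditions.")
words = original.split()
prompt, true_continuation = " ".join(words[:9]), " ".join(words[9:])

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
generated = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)

# Crude signal: a near-verbatim continuation hints that the text may have been memorized.
overlap = sum(a == b for a, b in zip(generated.split(), true_continuation.split()))
print(f"Model continued with: {generated!r}")
print(f"Matching leading words: {overlap}")
```

A single match proves little; in practice such probes are run over many sentences and compared against texts the model could not have seen, which is part of what makes attribution hard.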
Even if it were possible to prove that an LLM has been trained on a certain text, it is not clear what happens next. Publishers maintain that using copyrighted text in training without seeking a license counts as infringement. Litigation might help to resolve these copyright questions.
Many academics are happy to have their work included in LLM training data, especially if doing so makes the models more accurate. However, individual scientific authors currently have little power if the publisher of their paper decides to sell access to their copyrighted works. For publicly available articles, there is no established means of apportioning credit or of knowing whether a text has been used.
Some researchers are frustrated with the lack of control over their work in the AI training landscape. “We want LLMs, but we still want something that is fair, and I think we’ve not invented what this looks like yet,” says de Montjoye.