“There is a serious bottleneck here.”
Devourer
Researchers are ringing the alarm bells, warning that companies like OpenAI and Google are rapidly running out of human-written training data for their AI models.
And without new training data, it’s likely the models won’t be able to get any smarter, a point of reckoning for the burgeoning AI industry.
“There is a serious bottleneck here,” AI researcher Tamay Besiroglu, lead author of a new paper to be presented at a conference this summer, told the Associated Press. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore.”
“And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output,” he added.
Feed Me
It’s an existential threat for AI tools that rely on feasting on enormous amounts of data, much of it pulled indiscriminately from publicly available archives online.
The controversial trend has already led to publishers, including the New York Times, suing OpenAI over copyright infringement for using their material to train AI models.
And as companies continue to lay off the workers who create that content while making major investments in AI, the stream of new human-written material could soon slow to a trickle.
The latest paper, authored by researchers at San Francisco-based think tank Epoch, suggests that the amount of text data AI models are being trained on is growing roughly 2.5 times a year. Computing power, meanwhile, has outpaced that considerably, growing roughly four times a year.
Extrapolating those trends, the researchers argue that large language models like Meta’s Llama 3 or OpenAI’s GPT-4 could entirely run out of fresh human-written data as soon as 2026.
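To make the arithmetic concrete, here's a minimal sketch of that kind of extrapolation. The absolute numbers below (total stock of usable human text, tokens consumed today) are hypothetical placeholders, not the Epoch researchers' estimates; only the roughly 2.5x annual growth rate comes from the paper as described above.

```python
# Toy extrapolation: when does training-data demand overtake the available
# stock of human-written text? All absolute figures here are hypothetical
# placeholders; only the ~2.5x-per-year growth rate is taken from the article.

DATA_GROWTH_PER_YEAR = 2.5       # training datasets grow ~2.5x per year (per the Epoch paper)
tokens_used_now = 1.5e13         # hypothetical: tokens used to train a frontier model today
total_human_text = 3.0e14        # hypothetical: total stock of usable human-written text

year = 2024
tokens_used = tokens_used_now
while tokens_used < total_human_text:
    year += 1
    tokens_used *= DATA_GROWTH_PER_YEAR

print(f"Under these assumptions, demand exceeds the available stock around {year}.")
```

Because demand compounds at 2.5x a year, even a much larger assumed stock of text only pushes the crossover date back by a year or two, which is why the researchers' window is so near-term.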
AI Ouroboros
Once AI companies do run out of training data, something that’s been predicted by other researchers as well, they’re likely to try training their large language models on AI-generated data instead. Outfits including OpenAI, Google, and Anthropic are already working on ways to generate “synthetic data” for this purpose.
It’s not clear that will work, according to experts. In a paper last year, scientists at Rice and Stanford University found that feeding their models AI-generated content causes their output quality to erode. Both language models and image generators could be sent down an “autophagous loop,” the AI equivalent of a snake eating its own tail.
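The dynamic can be illustrated with a toy experiment; this is a simplified sketch for intuition only, not the Rice and Stanford researchers' actual setup. Each generation is "trained" only on the previous generation's output, so rare values that fail to be reproduced are lost forever and the data's diversity shrinks step by step.

```python
import numpy as np

# Toy illustration of an "autophagous loop": each generation sees only samples
# produced by the previous generation. Here the "model" is simple resampling
# with replacement, so values that aren't drawn are gone for good and the
# remaining diversity shrinks generation after generation.
# A simplified sketch for intuition, not the Rice/Stanford experimental setup.

rng = np.random.default_rng(0)

# Generation 0: "human-written" data with 1,000 distinct values
data = np.arange(1000)

for generation in range(1, 11):
    # The next generation is built entirely from the previous one's output
    data = rng.choice(data, size=data.size, replace=True)
    print(f"generation {generation}: {np.unique(data).size} distinct values remain")
```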
But whether any of this will actually become a problem remains a subject of debate. For one, we’d be perfectly fine without wasting copious amounts of energy and water on training these AIs.
And it’s possible that AI algorithms themselves will become more efficient, producing better outputs with less training data or computing power.
“I think it’s important to keep in mind that we don’t necessarily need to train larger and larger models,” AI researcher and University of Toronto assistant professor of computer engineering Nicolas Papernot, who was not involved in the study, told the AP.
More on AI: Sam Altman Admits That OpenAI Doesn’t Actually Understand How Its AI Works