The consequences have come home to roost.
Data Crash
AI companies typically build their AI models on vast amounts of publicly available content, from YouTube videos to newspaper articles. But many of these content hosts have now begun restricting access to their material.
Those new restrictions could bring about a “crisis” that would make these AI models less effective, according to a new study by the Massachusetts Institute of Technology’s Data Provenance Initiative.
The researchers audited 14,000 websites that are scraped by prominent AI training data sets and came away with an intriguing result: about 28 percent "of the most actively maintained, critical sources" on the internet are now "fully restricted from use."
The administrators of these websites have imposed the restrictions by placing increasingly stringent limits on how web crawler bots are allowed to scrape their content.
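As an illustration (not drawn from the study itself), sites typically signal these limits in a robots.txt file that names specific crawlers and denies them access while leaving the site open to everyone else; GPTBot and CCBot are real user agents used by OpenAI and Common Crawl, respectively:

```
# Hypothetical robots.txt blocking known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers may still index the site
User-agent: *
Allow: /
```

As the researchers note, these directives only bias training data "if respected or enforced," since robots.txt is a voluntary convention rather than a technical barrier.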
“If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems,” the researchers write.
No Free Lunch
It’s understandable that content hosts would put restrictions on their cache of now-valuable data.
AI companies have taken this publicly available material, much of it copyrighted, and are using it to make money without permission. That has angered many, from The New York Times to celebrities like Sarah Silverman.
What's particularly galling is that people like OpenAI CTO Mira Murati are suggesting that some creative jobs should disappear, even though it's the content made by these very creatives that powers models like OpenAI's ChatGPT.
The arrogance on display, and the resulting blowback, have created a "consent in crisis," as the study's researchers call it: the once freewheeling, wall-free internet is becoming a thing of the past, and AI models will grow more biased, less diverse, and less fresh as a result.
Some companies are now hoping to work around these constraints by using synthetic data, which is essentially data generated by AI, but so far that's proven a poor substitute for original content produced by actual human beings.
Others, like OpenAI, have struck licensing deals with media companies, but many observers have expressed alarm at these agreements, and for good reason: the goals of tech companies and media outfits are fundamentally at odds.
Time will tell how the whole thing shakes out. One thing's for sure, though: stockpiles of training data are becoming more valuable, and scarcer, than ever.
More on AI: Even Google’s Own Researchers Admit AI Is Top Source of Misinformation Online