Key Takeaways
- Websites hosting valuable image collections and research papers are facing a massive surge in automated bot traffic.
- Experts suspect these “bots” are aggressively scraping data to train generative artificial intelligence (AI) tools.
- This unwanted bot activity is overwhelming servers, causing website slowdowns, and disrupting access for genuine users.
- Smaller organizations, in particular, are struggling to cope with the technical and financial strain.
Imagine an online library suddenly swamped by millions of daily visitors that aren't people at all, but automated programs. That's what happened to DiscoverLife, a repository of nearly 3 million species photos. The site slowed to a crawl, at times becoming unusable.
The culprits are “bots”—software designed to rapidly copy, or “scrape,” huge amounts of content from websites. This activity is becoming a major headache for academic publishers and researchers whose sites host important data.
Many of these bots hide their origins, and the sudden spike in their activity has led site owners to believe they’re harvesting data to power AI like chatbots and image generators. Andrew Pitts of PSI, a company that tracks IP addresses, described the situation as “the wild west” in a report by Nature. He highlighted that the sheer volume of requests strains systems, costs money, and disrupts services for real users.
Trying to block these relentless bots is a tough battle, especially for organizations with limited budgets. Zoologist Michael Orr expressed concern that “these smaller ventures could go extinct if these sorts of issues are not dealt with.”
While some bots, like those used by search engines, have been around for decades and serve useful purposes, the rise of generative AI has unleashed a flood of new, often unauthorized, scraping activity.
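What separates those long-established crawlers from the newer wave of scrapers is largely a matter of consent: well-behaved bots consult a site's robots.txt file, which lists what the site's owners do and don't want crawled, before fetching anything. The minimal Python sketch below, using a placeholder URL and a hypothetical bot name, shows what that check looks like with the standard library's urllib.robotparser; robots.txt is purely advisory, so unauthorized scrapers are free to ignore it.

```python
# Minimal sketch of a "polite" crawler checking robots.txt before fetching a page.
# The site URL and bot name are placeholders for illustration only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # download and parse the site's crawling rules

page = "https://example.org/collections/species-photos"
if rp.can_fetch("ExampleResearchBot", page):
    print("Allowed to fetch:", page)
else:
    print("robots.txt disallows fetching:", page)
```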
The BMJ, a publisher of medical journals, has seen bot traffic to its websites outpace that from actual human users this year. Ian Mulvany, BMJ’s chief technology officer, said this aggressive bot behavior overloaded their servers, interrupting services for legitimate customers.
Others in the scholarly publishing world echo these concerns. “We’ve seen a huge increase in what we call ‘bad bot’ traffic,” Jes Kainth at Highwire Press, a hosting service for academic publications, told Nature. A survey by the Confederation of Open Access Repositories (COAR) found that over 90% of its members had experienced AI bots scraping their sites, with about two-thirds suffering service disruptions.
Kathleen Shearer, COAR’s executive director, noted the dilemma: “Repositories are open access, so in a sense, we welcome the reuse of the contents.” However, she added, “some of these bots are super aggressive, and it’s leading to service outages and significant operational problems.”
One event that may have fueled this bot explosion was the development of DeepSeek, a Chinese-built AI model. Rohit Prajapati from Highwire Press explained that building such powerful AI previously required immense computing power. DeepSeek demonstrated that comparable results could be achieved with far fewer resources, which may have set off a rush to scrape the data needed to build similar AI models.