Cloudflare Offers an Easier Way to Stop AI Bots

Content delivery network Cloudflare is making it easier for customers who are fed up with bad bots to block them from their websites.

For some time, site owners have been able to tell bots not to crawl their websites using a “robots.txt” file that lists which crawlers are and are not welcome, and content delivery networks such as Cloudflare provide visual interfaces that simplify the creation of such files.
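
By way of example, a robots.txt file that asks some of the AI crawlers named later in this article to stay away might look like the sketch below; the bot tokens are real user-agent names, but compliance is entirely voluntary on the crawler’s part.

```
# Ask known AI training crawlers to stay away; compliance is voluntary.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Everyone else may crawl the whole site.
User-agent: *
Allow: /
```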

But faced with a new generation of badly behaved AI bots that scrape content to feed large language models, Cloudflare has introduced an even quicker way to block all such bots with a single click.

“The popularity of generative AI has led to a surge in demand for the content used to train models and perform inference, and while some AI companies have clearly identified web scraping bots, not all have been as transparent,” Cloudflare staff wrote in a blog post.

According to the article’s authors, “With Google reportedly paying $60 million a year to license Reddit’s user-generated content, Scarlett Johansson alleging that OpenAI used her voice for a new personal assistant without her consent, and more recently Perplexity being accused of scraping content from websites by masquerading as legitimate visitors, large amounts of original content have never been more valuable.”

Last year, Cloudflare introduced a way for customers on any plan to block certain categories of bots, including well-behaved AI crawlers. Cloudflare said these bots honor the directives in a site’s robots.txt file and won’t use unauthorized content to train models or scrape it to feed retrieval-augmented generation (RAG) applications.

To achieve this, bots are identified by a “user agent string,” a sort of business card presented by browsers, bots, and other tools that request data from web servers.
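
As a sketch of how such identification works, the following Python snippet checks an incoming request’s User-Agent header against a blocklist of known AI crawler tokens. The tokens are real crawler names mentioned in this article, but the function and blocklist are illustrative assumptions, not Cloudflare’s implementation.

```python
# Minimal sketch: classify a request by its User-Agent header.
# The tokens below are real AI crawler names; the code itself is
# illustrative and not Cloudflare's implementation.

AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "Amazonbot")

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string names a known AI crawler."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

# Example: a typical GPTBot user-agent string.
ua = "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
print(is_ai_crawler(ua))  # True
```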

“Despite these AI bots playing by the rules, Cloudflare customers are overwhelmingly choosing to block them. We’ve heard loud and clear: customers don’t want AI bots visiting their websites, especially the ones that are engaging in abusive behavior,” the post read.

According to the company, the top four AI web crawlers visiting Cloudflare-protected sites were Bytespider, Amazonbot, ClaudeBot, and GPTBot. Bytespider, the most frequent visitor, is operated by ByteDance, the Chinese company that owns TikTok. It reportedly visited 40.4% of protected websites and is used to collect training data for large language models (LLMs), including the one that powers Doubao, ByteDance’s ChatGPT rival. Amazonbot is reportedly used to index content to help Amazon’s Alexa assistant answer questions, while ClaudeBot collects data for Anthropic’s AI assistant, Claude.

Blocking bad bots

Blocking bots based on their user-agent strings only works if the bots are honest about their identity, and there are indications that not all of them are, at least not all of the time.

In those cases, other measures are needed. Thomas Randall, director of AI market research at Info-Tech Research Group, said that companies’ primary recourse against unwanted web scraping is usually reactive, meaning they take legal action.

“While some anti-web scraping software applications exist (e.g. DataDome and Cloudflare), their effectiveness is limited. If an AI bot scrapes a site infrequently, the bot may go undetected,” he said in an email.

To justify legal action against the operators of bad bots, companies will need to do more than just claim that the bots failed to leave when asked.

“It’s best for companies to hide intellectual property and other sensitive information behind paid membership content,” Randall said. “Scraping of paid content that sits behind a clearly restrictive copyright license is subject to legal action, so organizations should be prepared to pursue it. Scraping of public sites should be an acceptable part of an organization’s risk tolerance.”

Randall noted that organizations with the resources to go further could consider rate limiting connections to their site, temporarily auto-blocking suspicious IP addresses, forcing human interaction by limiting any explanation of why access was blocked to a message such as “If you need help, please contact support at [email protected],” and reviewing how much of the website is exposed through mobile sites and apps.
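
As a rough illustration of the first two of those measures, here is a minimal per-IP sliding-window rate limiter in Python. The window size, request cap, and block duration are arbitrary assumptions, and a production deployment would typically enforce this at the CDN or load balancer rather than in application code.

```python
import time
from collections import defaultdict, deque

# Assumed policy (illustrative): at most 100 requests per 60-second
# window per IP; offenders are auto-blocked for 10 minutes.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100
BLOCK_SECONDS = 600

_requests: dict[str, deque] = defaultdict(deque)  # ip -> request timestamps
_blocked_until: dict[str, float] = {}             # ip -> unblock time

def allow_request(ip: str) -> bool:
    """Return True if this IP is under the rate limit, else block it."""
    now = time.monotonic()

    # Still serving a temporary block?
    if _blocked_until.get(ip, 0) > now:
        return False

    # Drop timestamps that have aged out of the sliding window.
    window = _requests[ip]
    while window and window[0] <= now - WINDOW_SECONDS:
        window.popleft()

    window.append(now)
    if len(window) > MAX_REQUESTS:
        _blocked_until[ip] = now + BLOCK_SECONDS  # temporary auto-block
        return False
    return True
```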

“Ultimately, you can’t stop the scraping, you can only hinder it at best,” he said.