The landscape of the internet is shifting away from human users and towards automated web-browsing bots. According to a recent report, this year those bots make up the majority of all web traffic for the first time.
Alarmingly, over half of this bot traffic stems from malicious sources, including bots that harvest unsecured personal data. A growing share, however, comes from bots built by artificial intelligence companies to gather training data for their models and to respond to user queries. OpenAI’s ChatGPT-User alone accounts for 6% of total web traffic, while Anthropic’s ClaudeBot represents 13%.
AI firms argue that this data scraping is essential for keeping their models up to date, while content creators counter that the same bots enable copyright violation on a vast scale. Earlier this year, Disney and Universal took legal action against AI company Midjourney, claiming that its image generators reproduce characters from popular franchises such as Star Wars and Despicable Me.
Given that most creators lack the financial means for prolonged legal battles, many have turned to other methods to protect their content: online tools that make scraping harder for AI bots, for example by misleading them so that models come to confuse images of cars with cows. While this “AI poisoning” tactic helps safeguard creators’ work, it may also introduce new risks on the web.
Copyright Concerns
Imitators have long profited from artists’ work, which is a major reason intellectual property and copyright laws exist. The advent of AI image generators like Midjourney and OpenAI’s DALL-E has exacerbated the problem.
A key question in the U.S. concerns the fair use doctrine, which allows limited use of copyrighted material without permission under certain circumstances. Fair use is designed to be flexible, but it hinges on the principle that something new is created from the original work.
Many artists and advocates believe that AI blurs the line between fair use and copyright infringement, to the detriment of content creators. A fan drawing Mickey Mouse in The Simpsons universe for personal use may be harmless, for example, but AI can produce and circulate similar images at a scale and speed that undermine the transformative principle and often shade into commercial exploitation.
In an effort to protect their commercial interests, some U.S. creators have pursued legal action, with Disney and Universal’s lawsuit against Midjourney among the latest examples. Other notable cases include an ongoing dispute between the New York Times and OpenAI over alleged misuse of the newspaper’s stories.
Disney is suing Midjourney over its image generator (Photo 12/Alamy)
AI companies firmly deny any wrongdoing, asserting that data scraping is permitted under the fair use doctrine. In an open letter to the US Office of Science and Technology Policy in March, OpenAI’s chief global affairs officer, Chris Lehane, cautioned against the stricter copyright regimes found elsewhere in the world, arguing that attempts to strengthen copyright protections for creators risk stifling innovation and investment. OpenAI has previously claimed it would be “impossible” to build AI models that meet users’ needs without drawing on copyrighted work. Google takes a similar stance, saying that copyright, privacy, and patent laws can create barriers to the data needed for training.
For now, public sentiment appears to side with creators. An analysis of public comments submitted to the U.S. Copyright Office’s inquiry on copyright and AI found that 91% expressed negative sentiment about the technology.
Part of the reason AI firms attract so little public sympathy is the sheer traffic their bots create, which can strain server resources and even take some websites offline, while content creators feel powerless to stop them. There are ways to ask content-crawling bots to stay away, such as tweaking a small file on a website called robots.txt, but such requests are sometimes ignored.
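For illustration, here is a minimal sketch of how that exclusion file works, using Python’s standard-library robots.txt parser. GPTBot and ClaudeBot are the documented user-agent names of OpenAI’s and Anthropic’s crawlers, but the rules below are illustrative, not a recommended policy:

```python
# A minimal sketch of how a robots.txt file excludes crawlers.
# The rules below are illustrative; honoring them is voluntary.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks the file before fetching any page.
print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))      # False
print(parser.can_fetch("SomeBrowser", "https://example.com/articles/1"))  # True
```

A crawler that simply never runs this check is exactly what the tools described below are designed to frustrate.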
Combating AI Data Poisoning
Consequently, new tools have emerged that let content creators better shield their work from AI bots. This year, Cloudflare, an internet infrastructure company known for protecting websites from distributed denial-of-service (DDoS) attacks, launched technology to combat harmful AI bots. Its approach involves generating a labyrinth of AI-generated pages filled with nonsensical content, luring misbehaving bots away from genuine information.
According to Cloudflare, AI crawlers generate more than 50 billion requests to its network every day. The objective of the tool, called AI Labyrinth, is to “slow, confuse, and waste the resources of AI crawlers and other bots that disregard the ‘no crawl’ directive.” Cloudflare has since introduced a further tool that lets website owners charge AI companies for access or restrict how their raw content is used.
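A toy version of the labyrinth idea can be sketched in a few lines of Python. This is not Cloudflare’s implementation; the user-agent list, word list, and server are all stand-ins. The point is the mechanism: requests that look like AI crawlers get procedurally generated nonsense pages whose links lead only to more nonsense, burning the crawler’s time and budget.

```python
# A toy "labyrinth" tarpit, illustrating the mechanism only.
# User-agent tokens and content are illustrative, not Cloudflare's.
import random
import zlib
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lattice", "vesper", "quorum", "bramble", "ossify", "tundra"]
CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")  # illustrative list

def nonsense_page(seed: int) -> str:
    """Deterministically generate a gibberish page linking deeper into the maze."""
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(200))
    links = " ".join(
        f'<a href="/maze/{rng.randrange(10**9)}">more</a>' for _ in range(5)
    )
    return f"<html><body><p>{text}</p>{links}</body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        is_bot = any(tok in ua for tok in CRAWLER_TOKENS)
        if is_bot or self.path.startswith("/maze/"):
            # Same URL always yields the same page, so the maze looks stable.
            body = nonsense_page(zlib.crc32(self.path.encode()))
        else:
            body = "<html><body>Real content for real visitors.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```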
An alternative strategy is to let AI bots in, but subtly “poison” the content they collect so the data is less useful to them. Tools like Glaze and Nightshade, developed at the University of Chicago, have become a focal point of this resistance; both are freely available to download from the university’s website.
Launched in 2022, Glaze works defensively, adding imperceptible pixel-level modifications, or “style cloaks,” to artists’ works that cause AI models to misidentify their style, for example interpreting watercolors as oil paintings. Nightshade, launched in 2023, goes further, degrading image data so that models learn incorrect associations, such as linking the word “cat” with images of dogs. Together, the two tools have been downloaded more than 10 million times.
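Glaze and Nightshade belong to the broader family of adversarial perturbations. Their actual algorithms are described in the Chicago team’s papers; the sketch below only illustrates the general idea under stand-in assumptions: nudge an image’s pixels, within a tight bound that keeps the change invisible, until a model’s feature extractor reads the image as something else. The “extractor” here is a random linear map, whereas the real tools target the feature space of production image generators.

```python
# A toy adversarial "cloak": NOT Glaze's or Nightshade's actual algorithm.
# The feature extractor is a stand-in random linear map for illustration.
import torch

torch.manual_seed(0)

extractor = torch.nn.Linear(3 * 32 * 32, 64)  # stand-in for an image encoder
for p in extractor.parameters():
    p.requires_grad_(False)  # the model is fixed; only the image changes

image = torch.rand(3, 32, 32)                     # artwork to protect
decoy_feats = extractor(torch.rand(3 * 32 * 32))  # features of a decoy "style"

delta = torch.zeros_like(image, requires_grad=True)  # the cloak
opt = torch.optim.Adam([delta], lr=1e-2)
eps = 0.03  # bound on per-pixel change, keeping the cloak imperceptible

for _ in range(200):
    cloaked = (image + delta).clamp(0, 1)
    feats = extractor(cloaked.flatten())
    # Pull the cloaked image's features toward the decoy style's features.
    loss = torch.nn.functional.mse_loss(feats, decoy_feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)

print("max per-pixel change:", delta.abs().max().item())  # stays within eps
```

To a human eye the cloaked image looks unchanged, but a model trained on it associates the work with the decoy features, which is the poisoning effect described above.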
The Nightshade tool alters how AI models perceive images (Ben Y. Zhao)
Tools that poison AI data in this way are empowering artists, according to Ben Zhao, a researcher at the University of Chicago who works on both Glaze and Nightshade. “These companies have trillion-dollar market caps, and they essentially take what they want,” he says.
Using tools like these gives artists more control over how their creations are used. “Glaze and Nightshade are interesting, innovative tools that demonstrate effective strategies that don’t rely on changing regulations,” says Jacob Hoffman-Andrews of the Electronic Frontier Foundation, a U.S.-based digital rights nonprofit.
Sabotaging one’s own content to deter copycats is an old strategy, notes Eleonora Rosati at Stockholm University. Cartographers, for instance, have long included fictitious place names on maps, which become evidence of plagiarism if rivals replicate them. A similar tactic surfaced in music: the lyrics website Genius claimed to have embedded a distinctive pattern of apostrophes in its transcriptions to prove Google was using its content without a license. Google denied the claim, and the lawsuit was dismissed.
Hoffman-Andrews takes issue with the word “sabotage,” though. “I don’t view it as disruptive; these artists are modifying their content, which they have every right to do.”
It is unclear what countermeasures AI firms are taking against data tainted by these defensive tactics, but Zhao’s findings indicate the protections remain effective about 85% of the time, suggesting AI companies may consider dealing with manipulated data more trouble than it is worth.
Disseminating Misinformation
It isn’t only artists experimenting with data poisoning; nation-states may be employing similar tactics to spread false narratives. The Atlantic Council, a U.S.-based think tank, recently reported that the Russian Pravda news network has attempted to manipulate AI chatbots into spreading misinformation.
The operation reportedly involves flooding the internet with millions of web pages masquerading as legitimate news articles, in an effort to amplify Kremlin narratives about the war in Ukraine. A recent analysis by NewsGuard, which monitors Pravda’s activities, found that all 10 of the major AI chatbots it tested had produced text echoing Pravda’s viewpoints.
The effectiveness of these tactics underlines a challenge inherent in the technology: methods employed by well-intentioned actors can just as easily be hijacked by those with malicious intent.
However, solutions do exist, asserts Zhao, though they may not align with AI companies’ interests. Rather than arbitrarily collecting online data, AI firms could establish formal agreements with legitimate content providers to ensure their models are trained on reliable data. Yet, such arrangements come with costs, leading Zhao to remark, “Money is at the heart of this issue.”
Source: www.newscientist.com
