A study shows that some of the most powerful AI tools meant to prevent cybercrime and terrorism can be bypassed simply by inundating them with fraudulent activities.
Researchers at Anthropic, the AI lab responsible for creating the large-scale language model (LLM) powering ChatGPT competitor Claude, detailed an attack called a “multi-shot jailbreak” in a recent paper. This attack was both simple and effective.
Claude, like many other commercial AI systems, contains safety features that block certain types of requests, such as generating violent content, hate speech, illegal instructions, deception, or discrimination. However, by providing enough examples of the “correct” responses to harmful questions like “How to create a bomb,” the system can be tricked into providing harmful responses despite being trained not to do so.
Anthropic stated, “By inputting large amounts of text in specific ways, this approach can lead the LLM to produce potentially harmful outputs even though it was trained to avoid doing so.” The company has shared its findings with industry peers and aims to address the issue promptly.
This jailbreak attack targets AI models with a large “context window” capable of processing lengthy queries. These advanced models are susceptible to such attacks as they can learn to circumvent their own safety measures faster.
Newer, more advanced AI systems are at greater risk of such attacks due to their ability to handle longer inputs and learn from examples quickly. Anthropic expressed concern over the effectiveness of this jailbreak attack on larger models.
Anthropic has identified various strategies to mitigate this issue. One approach involves adding a mandatory warning to remind the system not to provide harmful responses, which has shown promise in reducing the likelihood of a successful jailbreak. However, this method may impact the system’s performance on other tasks.
Source: www.theguardian.com