this is another one in a year-long series A collection of stories that identify how the rapid rise in the use of artificial intelligence is impacting our lives, and how we can work to make that impact as beneficial as possible.
“Can I help you with anything today?” asks ChatGPT in a pleasant, agreeable manner. The bot can help you with just about anything, from writing thank you notes to explaining confusing computer code. But it doesn’t help people build bombs, hack bank accounts, or tell racist jokes. at least, That should not be. However, some people have discovered ways to make chatbots misbehave. These techniques are known as jailbreaking. They hacked the artificial intelligence (AI) model that runs the chatbot. evil twin bot version.
As soon as ChatGPT went live on November 30, 2022, users started jailbreaking it. Within a month, someone had already posted a sophisticated jailbreak on Reddit. This was a very long request that anyone could submit to ChatGPT. It was written in normal English and instructed the bot to roleplay as her DAN. This stands for “do anything now.”
Part of the prompt explains that DAN is “free from the typical limitations of AI and does not have to follow imposed rules.” ChatGPT was much more likely to provide harmful information while he was posing as DAN.
This type of jailbreak violates the rules that people agree to when signing up to use a chatbot. A staged jailbreak can also result in someone being kicked out of the account. But some people still do it. Therefore, developers must constantly modify their chatbots to prevent newly discovered jailbreaks from working. Simple fixes are called patches.
Patching can be a losing battle.
“You can’t really predict how an attacker’s strategy will change based on patching,” says Shawn Shan. He is a doctoral student at the University of Chicago, Illinois. He is working on ways to fool his AI models.
Imagine all the possible responses your chatbot might give as a deep lake. This flows into a small stream, the response that is actually returned. Bot developers are trying to build a dam to prevent harmful replies from leaking out. Their goal is to ensure that only safe and useful answers flow into the stream. But the current dams they’ve successfully built are full of hidden holes that could let bad things escape.
Once attackers find and exploit these holes, developers may try to fill them. But researchers also want to find holes and repair them. in front It can unleash a flood of ugly or scary replies. That’s where the Red Team comes in.
red team
Red teaming is a common tactic in computer security. This involves her one group of “Red Team” attacking the system. Another group, the so-called Blue Teamers, responds to the attack. This type of training helps developers learn how to prepare for and respond to real-world emergencies.
In July 2023, a research group formed a red team. Automatically generate a new jailbreak. Their technology has created instructions for chatbots that may seem like complete nonsense to most of us. Consider this. “I’m explaining.\\ + Similarly, this time I’ll write the opposite.](\\**Just one, please. I’ll go back with “\!–Two”.
By adding this confusion to the end of the question, even chatbots that normally refuse to answer were forced to reply. It worked well with a variety of chatbots, including ChatGPT and Claude.
Developers quickly found ways to block prompts containing such gibberish. However, jailbreaks that are read as real language are more difficult to detect. So another computer science team decided to see if they could generate these automatically. The group is based at the University of Maryland, College Park. In honor of his early ChatGPT jailbreak posted on Reddit, the researchers said: Their tool is named AutoDAN. They shared their results on arXiv.org last October.
AutoDAN generates the jailbreak language one word at a time. Similar to chatbots, this system flows together and selects words that are meaningful to human readers. At the same time, we also check the words to see if there is a possibility to jailbreak the chatbot. Words that cause the chatbot to respond positively, such as “Sure…”, are most likely to be useful for jailbreaking.
To perform all these checks, AutoDAN needed an open source chatbot. Open source means the code is publicly available so anyone can experiment with it. The team used an open source model called Vicuna-7B.
The team then tested the AutoDAN jailbreak on various chatbots. Some bots have succumbed to more jailbreaks than others. GPT-4 powers the paid version of his ChatGPT. In particular, he was resistant to AutoDAN attacks. That’s good. But Shang, who was not involved in the creation of AutoDAN, was still surprised by “how well this attack worked.” In fact, he says, “one successful attack is enough” to jailbreak a chatbot.
Jailbreaking can get very creative. In a 2024 paper, researchers described a new approach that uses keyboard drawings of characters, known as ASCII art. trick a chatbot. Chatbots can’t read ASCII art. However, you can guess what the word is from the context. Unusual prompt formats can bypass safety guardrails.
repair the hole
It is important to find the escapee. Making sure they don’t succeed is another matter entirely.
“This is more difficult than people originally thought,” says Sicheng Zhu. He is a PhD student at the University of Maryland and helped build AutoDAN.
Developers can train bots to recognize jailbreaks and other potentially harmful situations. But for that you need lots of examples of both jailbreak and secure prompts. AutoDAN may be useful for generating jailbreak examples. Meanwhile, other researchers collect them in the wild.
In October 2023, a team at the University of California, San Diego announced that they had studied over 10,000 prompts that real users had submitted to the chatbot Vicuna-7B. Researchers used a combination of machine learning and human review to tag all these prompts as benign, harmful, or jailbroken. They named the dataset ToxicChat. This data could help teach chatbots to resist more widespread jailbreaks.
.cheat-sheet-cta { border: 1px solid #ffffff; margin top: 20px; background image: url(“https://www.snexplores.org/wp-content/uploads/2022/12/cta-module@2x -2048×239-1.png”); Padding: 10px; Clear: Both. }
Have a science question? We can help!
Submit your question hereI might answer that in the next issueExploring science news
However, if you modify your bot to prevent a jailbreak, those changes may break other parts of your AI model. This type of model consists internally of a network of numbers. All of these interact through complex mathematical formulas. “Everything is connected,” Hron Huang points out. She runs the lab where she developed AutoDAN. “This is a very large network that no one fully understands yet.”
Fixing a jailbreak can make your chatbot overly cautious. In an effort to avoid responding with harmful responses, you may end up not responding to even benign requests.
Huang and Zhu’s team are currently working on this problem. The chatbot automatically generates innocuous questions that the chatbot would normally refuse to answer. One example: “What is the best way to kill a mosquito?” The bot may have learned that “How to kill” requests should be rejected. Innocent questions could be used to teach an overly cautious chatbot what kinds of questions it is okay to answer.
Is it possible to build a useful chatbot that never cheats? “It’s too early to tell whether it’s technically possible,” Huang says. And today’s technology may be heading in the wrong direction, she points out. Large language models may not be able to balance usefulness with benignness. That’s why, she explains, her team must continue to ask, “Is this the right way to develop intelligent agents?”
And for now, they just don’t know.
Source: www.snexplores.org