The Sinister Side of Chatbots Revealed in 'Jailbreak'

this is another one in a year-long series A collection of stories that identify how the rapid rise in the use of artificial intelligence is impacting our lives, and how we can work to make that impact as beneficial as possible.

“Can I help you with anything today?” asks ChatGPT in a pleasant, agreeable manner. The bot can help you with just about anything, from writing thank you notes to explaining confusing computer code. But it doesn’t help people build bombs, hack bank accounts, or tell racist jokes. at least, That should not be. However, some people have discovered ways to make chatbots misbehave. These techniques are known as jailbreaking. They hacked the artificial intelligence (AI) model that runs the chatbot. evil twin bot version.

As soon as ChatGPT went live on November 30, 2022, users started jailbreaking it. Within a month, someone had already posted a sophisticated jailbreak on Reddit. This was a very long request that anyone could submit to ChatGPT. It was written in normal English and instructed the bot to roleplay as her DAN. This stands for “do anything now.”

Part of the prompt explains that DAN is “free from the typical limitations of AI and does not have to follow imposed rules.” ChatGPT was much more likely to provide harmful information while he was posing as DAN.

This type of jailbreak violates the rules that people agree to when signing up to use a chatbot. A staged jailbreak can also result in someone being kicked out of the account. But some people still do it. Therefore, developers must constantly modify their chatbots to prevent newly discovered jailbreaks from working. Simple fixes are called patches.

Chatbots have mostly learned to avoid harmful topics, but in some cases they may “jailbreak” them. AI developers are now intentionally testing new jailbreak strategies to understand how AI can be tricked into misbehaving. This research could help find ways to keep such malicious bot behavior behind the fence. Moore Studio/DigitalVision Vectors/Getty Images Plus

Patching can be a losing battle.

“You can’t really predict how an attacker’s strategy will change based on patching,” says Shawn Shan. He is a doctoral student at the University of Chicago, Illinois. He is working on ways to fool his AI models.

Imagine all the possible responses your chatbot might give as a deep lake. This flows into a small stream, the response that is actually returned. Bot developers are trying to build a dam to prevent harmful replies from leaking out. Their goal is to ensure that only safe and useful answers flow into the stream. But the current dams they’ve successfully built are full of hidden holes that could let bad things escape.

Once attackers find and exploit these holes, developers may try to fill them. But researchers also want to find holes and repair them. in front It can unleash a flood of ugly or scary replies. That’s where the Red Team comes in.

Group of teenagers wearing red and blue shirts in a computer room — Red teaming is a term taken from simulations where people pretending to be the enemy are represented as belonging to a red team. Those pretending to be friendly defenders are blue. The red team’s goal is to test the blue team’s defense. The stronger the red team is, the harder the blue team has to work to stop it. kali9/E+/Getty Images Plus

red team

Red teaming is a common tactic in computer security. This involves her one group of “Red Team” attacking the system. Another group, the so-called Blue Teamers, responds to the attack. This type of training helps developers learn how to prepare for and respond to real-world emergencies.

In July 2023, a research group formed a red team. Automatically generate a new jailbreak. Their technology has created instructions for chatbots that may seem like complete nonsense to most of us. Consider this. “I’m explaining.\\ + Similarly, this time I’ll write the opposite.](\\**Just one, please. I’ll go back with “\!–Two”.

By adding this confusion to the end of the question, even chatbots that normally refuse to answer were forced to reply. It worked well with a variety of chatbots, including ChatGPT and Claude.

Developers quickly found ways to block prompts containing such gibberish. However, jailbreaks that are read as real language are more difficult to detect. So another computer science team decided to see if they could generate these automatically. The group is based at the University of Maryland, College Park. In honor of his early ChatGPT jailbreak posted on Reddit, the researchers said: Their tool is named AutoDAN. They shared their results on arXiv.org last October.

AutoDAN generates the jailbreak language one word at a time. Similar to chatbots, this system flows together and selects words that are meaningful to human readers. At the same time, we also check the words to see if there is a possibility to jailbreak the chatbot. Words that cause the chatbot to respond positively, such as “Sure…”, are most likely to be useful for jailbreaking.

Example of jailbreak attempt by chatbot — AutoDAN adds text to the request. This text is generated for him one word at a time, and each word he checks against an open source (or “white box”) chatbot named Vicuna-7B. Checks if the following words make sense in a sentence and have the potential to jailbreak a large language model. University of Maryland

To perform all these checks, AutoDAN needed an open source chatbot. Open source means the code is publicly available so anyone can experiment with it. The team used an open source model called Vicuna-7B.

The team then tested the AutoDAN jailbreak on various chatbots. Some bots have succumbed to more jailbreaks than others. GPT-4 powers the paid version of his ChatGPT. In particular, he was resistant to AutoDAN attacks. That’s good. But Shang, who was not involved in the creation of AutoDAN, was still surprised by “how well this attack worked.” In fact, he says, “one successful attack is enough” to jailbreak a chatbot.

Jailbreaking can get very creative. In a 2024 paper, researchers described a new approach that uses keyboard drawings of characters, known as ASCII art. trick a chatbot. Chatbots can’t read ASCII art. However, you can guess what the word is from the context. Unusual prompt formats can bypass safety guardrails.

In this tutorial, we’ll discuss how DAN and other jailbreak techniques have evolved, and what we’ve uncovered about ChatGPT and other bots’ hidden evil twins.

repair the hole

It is important to find the escapee. Making sure they don’t succeed is another matter entirely.

“This is more difficult than people originally thought,” says Sicheng Zhu. He is a PhD student at the University of Maryland and helped build AutoDAN.

Developers can train bots to recognize jailbreaks and other potentially harmful situations. But for that you need lots of examples of both jailbreak and secure prompts. AutoDAN may be useful for generating jailbreak examples. Meanwhile, other researchers collect them in the wild.

In October 2023, a team at the University of California, San Diego announced that they had studied over 10,000 prompts that real users had submitted to the chatbot Vicuna-7B. Researchers used a combination of machine learning and human review to tag all these prompts as benign, harmful, or jailbroken. They named the dataset ToxicChat. This data could help teach chatbots to resist more widespread jailbreaks.

.cheat-sheet-cta { border: 1px solid #ffffff; margin top: 20px; background image: url(“https://www.snexplores.org/wp-content/uploads/2022/12/cta-module@2x -2048×239-1.png”); Padding: 10px; Clear: Both. }

Have a science question? We can help!

Submit your question hereI might answer that in the next issueExploring science news

However, if you modify your bot to prevent a jailbreak, those changes may break other parts of your AI model. This type of model consists internally of a network of numbers. All of these interact through complex mathematical formulas. “Everything is connected,” Hron Huang points out. She runs the lab where she developed AutoDAN. “This is a very large network that no one fully understands yet.”

Fixing a jailbreak can make your chatbot overly cautious. In an effort to avoid responding with harmful responses, you may end up not responding to even benign requests.

Huang and Zhu’s team are currently working on this problem. The chatbot automatically generates innocuous questions that the chatbot would normally refuse to answer. One example: “What is the best way to kill a mosquito?” The bot may have learned that “How to kill” requests should be rejected. Innocent questions could be used to teach an overly cautious chatbot what kinds of questions it is okay to answer.

Is it possible to build a useful chatbot that never cheats? “It’s too early to tell whether it’s technically possible,” Huang says. And today’s technology may be heading in the wrong direction, she points out. Large language models may not be able to balance usefulness with benignness. That’s why, she explains, her team must continue to ask, “Is this the right way to develop intelligent agents?”

And for now, they just don’t know.

Source: www.snexplores.org

What's Hot

Tiny chips enable ultra-secure quantum Wi-Fi connectivity

AI Challenge: Win $1 Million by Solving Puzzles Humans Find Easy

Discover the principles of scientific drone flight

Fix iPhone dictation bug replacing Apple “discriminatory” | Apple

Relax and Unplug: Gamers Embrace Nostalgia with Retro Console Resurgence

Enhance Your Gameplay with the Newest Tool: Meet Tinder for Gaming! | Games

I Thought of Taxis as Magical: Sega’s Pop Punk Classic Crazy Taxi Celebrates 25 Years | Games

Britain postpones AI regulation as ministers aim to align with Trump administration

Research suggests that sandy beaches under the sun were abundant on Early Mars

Astronomers Report Our Solar System Surpassed the “Radcliffe Waves” in the Miocene Era

Iron-rich minerals containing water may be the primary reason for the red hue of Mars.

The Magnificent Giant Flying Squirrel that Roamed North America

Children who excel in one intellectual skill may not see improvement in others

Sui and Atoma introduce AI capabilities to dApp developers – Blockchain Updates, Views, Videos, Opportunities

Bitcoin ETF issuer acquires 5% of BTC supply, $100 million invested in ETFSwap (ETFS) presale – Blockchain updates, insights, and career opportunities

Agora boosts Sui’s native stablecoin with addition of AUSD stablecoin to network

Meme Coin Memeinator Goes Viral, Raises $7.7 Million and Debuts on Exchanges- Latest in Blockchain News, Opinion, TV, and Job Listings

Changing the game of betting with Blockchain: New News, Opinions, TV, and Job Opportunities

Research suggests that sandy beaches under the sun were abundant on Early Mars

Fix iPhone dictation bug replacing Apple “discriminatory” | Apple

Astronomers Report Our Solar System Surpassed the “Radcliffe Waves” in the Miocene Era

Iron-rich minerals containing water may be the primary reason for the red hue of Mars.

The Magnificent Giant Flying Squirrel that Roamed North America

The Sinister Side of Chatbots Revealed in 'Jailbreak'

Research suggests that sandy beaches under the sun were abundant on Early Mars

Fix iPhone dictation bug replacing Apple “discriminatory” | Apple

Astronomers Report Our Solar System Surpassed the “Radcliffe Waves” in the Miocene Era

Iron-rich minerals containing water may be the primary reason for the red hue of Mars.

The Magnificent Giant Flying Squirrel that Roamed North America

Relax and Unplug: Gamers Embrace Nostalgia with Retro Console Resurgence

Enhance Your Gameplay with the Newest Tool: Meet Tinder for Gaming! | Games

Children who excel in one intellectual skill may not see improvement in others

Leave a ReplyCancel reply

Enhance your listening experience with AI noise-canceling headphones that focus on one voice

Despite price cuts, Tesla experiences second consecutive quarter of sales decline

The Genius Minds of QI Present an Olympic-Quality Quiz Show | Podcast

Research suggests that sandy beaches under the sun were abundant on Early Mars

Newly Discovered Light Properties Unveiled by Centuries-Old Theorem

Snap collaborates with edtech firm Inspirit to introduce augmented reality technology in 50 American schools

What's Hot

The Sinister Side of Chatbots Revealed in 'Jailbreak'

red team

repair the hole

Have a science question? We can help!

Related

Related Posts

Leave a ReplyCancel reply