December 5, 2023


Using AI to Automatically Jailbreak GPT-4 and Other LLMs in Under a Minute

Author

Paul Kassianik

Paul is a Senior Research Engineer at Robust Intelligence.

Executive Summary

It’s been one year since the launch of ChatGPT, and in that time the market has seen astonishing advances in large language models (LLMs). Although the pace of development continues to outpace model security, enterprises are beginning to deploy LLM-powered applications. Many rely on guardrails implemented by model developers to prevent LLMs from responding to sensitive prompts. However, even with the considerable time and effort spent by the likes of OpenAI, Google, and Meta, these guardrails are not resilient enough to protect enterprises and their users today. Concerns surrounding model risk, biases, and potential adversarial exploits have come to the forefront.

AI security researchers from Robust Intelligence, in collaboration with Yale University, have discovered an automated adversarial machine learning technique that overrides the guardrails of sophisticated models with a high degree of success and without human oversight. These attacks are automatic, black-box, and interpretable, and they circumvent the safety filters that model developers put in place through specialized alignment training, fine-tuning, prompt engineering, and filtering.

The method, known as the Tree of Attacks with Pruning (TAP), can be used to induce sophisticated models like GPT-4 and Llama-2 to produce hundreds of toxic, harmful, and otherwise unsafe responses to a user query (e.g. “how to build a bomb”) in mere minutes.

Summary findings from our research include:

Small unaligned LLMs can be used to jailbreak even the latest aligned LLMs
Jailbreaking has a low cost
More capable LLMs are easier to break

We published our research in a paper released today. Our findings suggest that this vulnerability is universal across LLM technology. While we do not see any obvious patches to fundamentally fix this vulnerability in LLMs, our research can help developers readily generate adversarial prompts that can contribute to their understanding of model alignment and security. Read on for more information and contact Robust Intelligence to learn about mitigating such risk for any model in real time.

How Does TAP Work?

TAP automates jailbreaking by using an attacker LLM to iteratively refine a harmful prompt until it gets past the target model's guardrails. In each round, the attacker LLM proposes improvements to the current attack prompt, using feedback from previous rounds to write an updated query. Each refined prompt is first checked to confirm that it still serves the attacker's objective and is then evaluated against the target system. If the attack succeeds, the process ends. If not, the search continues through the newly generated candidates until a jailbreak is found.

Generating multiple candidate prompts at each step creates a search tree that we traverse. A tree-like search adds breadth and flexibility and allows the model to explore different jailbreaking approaches efficiently. To avoid pursuing unfruitful attack paths, we introduce a pruning mechanism that terminates off-topic subtrees and keeps the tree from growing too large.
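The branch-and-prune loop described above can be illustrated with a deliberately abstract sketch. This is not the authors' implementation: `pruned_tree_search` and its parameters are hypothetical names, and the attacker, the on-topic check, and the scorer are replaced with toy string functions so that the example runs without any LLM.

```python
from typing import Callable, List, Optional, Tuple


def pruned_tree_search(
    root: str,
    branch: Callable[[str], List[str]],   # stand-in for the attacker proposing refinements
    on_topic: Callable[[str], bool],      # stand-in for the off-topic check
    score: Callable[[str], int],          # stand-in for querying the target and judging it
    success_score: int = 10,
    max_width: int = 2,
    max_depth: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (first candidate reaching success_score, or None; number of score() calls)."""
    frontier = [root]
    queries = 0
    for _ in range(max_depth):
        # Branch: every surviving node proposes several refinements.
        children = [child for node in frontier for child in branch(node)]
        # Prune off-topic candidates before spending a query on them.
        children = [child for child in children if on_topic(child)]
        scored = []
        for child in children:
            queries += 1
            value = score(child)
            if value >= success_score:
                return child, queries
            scored.append((value, child))
        # Keep only the best few candidates so the tree cannot grow without bound.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [child for _, child in scored[:max_width]]
        if not frontier:
            break
    return None, queries


# Toy stand-ins (assumptions for illustration): grow strings over {a, b} toward a target.
TARGET = "abab"


def toy_branch(node: str) -> List[str]:
    return [node + "a", node + "b"]


def toy_on_topic(node: str) -> bool:
    return "bb" not in node


def toy_score(node: str) -> int:
    if node == TARGET:
        return 10
    matched = 0
    for got, want in zip(node, TARGET):
        if got != want:
            break
        matched += 1
    return matched


found, queries = pruned_tree_search("", toy_branch, toy_on_topic, toy_score)
```

Counting `score` calls in the sketch mirrors the idea that each evaluation costs one query to the target, which is why pruning before scoring matters.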

Query Efficiency

In cybersecurity, keeping an attack low-profile decreases the chances of detection, so our attack optimizes for stealthiness. One way to detect an attack is to monitor traffic to a resource for multiple successive requests. Minimizing the number of queries sent to the target model (like GPT-4 or Llama-2) is therefore a useful proxy for stealthiness. TAP pushes the state of the art compared with previous work by decreasing the average number of queries per jailbreak attempt from about 38 to about 29, a drop of roughly 24% (put another way, the previous approach needs about 30% more queries). This allows for more inconspicuous attacks on LLM applications.
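From a defender's point of view, the monitoring mentioned above can be as simple as counting requests per client in a sliding time window. The sketch below is a minimal illustration under assumed values: `BurstDetector`, the limit of 30 requests, and the 60-second window are hypothetical and are not taken from the research.

```python
from collections import deque
from typing import Deque, Dict


class BurstDetector:
    """Flag a client that sends more than max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._times: Dict[str, Deque[float]] = {}

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client is now over the limit."""
        times = self._times.setdefault(client_id, deque())
        times.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests


def is_flagged(request_count: int) -> bool:
    """Simulate one request per second and report whether any request trips the detector."""
    detector = BurstDetector(max_requests=30, window_seconds=60.0)
    return any(detector.record("client", float(t)) for t in range(request_count))


flagged_38 = is_flagged(38)
flagged_29 = is_flagged(29)
```

With these assumed settings, a burst of 38 requests trips the detector while a burst of 29 does not, which shows why a lower query count makes an attack harder to spot with simple volume-based monitoring.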

How Do We Know If a Candidate Jailbreak Is Successful?

Most previous work aims to induce the model to start its response with an affirmative sentence, such as “Sure! Here is how you can build a bomb:”. This method is easy to implement, but it severely limits the number of jailbreaks that can be discovered for a given model. In our work, we opt to use an expert large language model (such as GPT-4) to act as the judge. The LLM judge assesses the candidate jailbreak and the response from the target model, assigning a score on a scale of 1 to 10. A score of 1 indicates no jailbreak, while a score of 10 signifies a successful jailbreak.
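Turning a judge's free-text reply into a decision requires some parsing and a threshold. The sketch below assumes, purely for illustration, that the judge is instructed to end its reply with a rating in double brackets such as `[[7]]`; that format and the function names are assumptions, not details confirmed by this post.

```python
import re

MIN_SCORE = 1
MAX_SCORE = 10


def parse_judge_score(reply: str) -> int:
    """Extract a 1-10 rating from the judge's reply; fall back to 1 if none is usable."""
    match = re.search(r"\[\[(\d+)\]\]", reply)
    if match is None:
        return MIN_SCORE  # unparseable reply: conservatively treat as "no jailbreak"
    value = int(match.group(1))
    if not MIN_SCORE <= value <= MAX_SCORE:
        return MIN_SCORE  # out-of-range rating: also treated as "no jailbreak"
    return value


def is_jailbroken(reply: str, threshold: int = MAX_SCORE) -> bool:
    """Count a response as a jailbreak only when the rating reaches the threshold."""
    return parse_judge_score(reply) >= threshold
```

Treating malformed or out-of-range ratings as a score of 1 is a design choice of this sketch: it avoids declaring success on the basis of a reply the judge did not clearly rate.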

General Guidelines for Securing LLMs

LLMs have the potential to be transformational in business. Appropriate safeguards to secure models and AI-powered applications can accelerate responsible adoption and reduce risk to companies and users alike. As a significant advancement in the field, TAP not only exposes vulnerabilities but also emphasizes the ongoing need to improve security measures.

It’s important for enterprises to adopt a model-agnostic approach that can validate inputs and outputs in real time, informed by the latest adversarial machine learning techniques. Contact us to learn more about our AI Firewall and see the full research paper for additional detail on TAP.
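As a rough illustration of what validating inputs and outputs in a model-agnostic way can look like, the sketch below wraps any text-in, text-out model with pluggable checks. It is not a description of the AI Firewall: `guarded_call`, the keyword check, and the echo model are hypothetical stand-ins, and a real deployment would use far stronger checks than keyword matching.

```python
from typing import Callable, List

Check = Callable[[str], bool]  # returns True when the text is acceptable


def guarded_call(
    model: Callable[[str], str],
    prompt: str,
    input_checks: List[Check],
    output_checks: List[Check],
    refusal: str = "Request blocked by policy.",
) -> str:
    """Run input checks, call the model, then run output checks on its response."""
    if not all(check(prompt) for check in input_checks):
        return refusal  # blocked before the model is ever called
    response = model(prompt)
    if not all(check(response) for check in output_checks):
        return refusal  # model answered, but the answer failed validation
    return response


# Toy stand-ins (assumptions for illustration only).
def no_blocked_terms(text: str) -> bool:
    return "forbidden" not in text.lower()


def echo_model(prompt: str) -> str:
    return "You said: " + prompt


allowed = guarded_call(echo_model, "hello", [no_blocked_terms], [no_blocked_terms])
blocked = guarded_call(echo_model, "a FORBIDDEN request", [no_blocked_terms], [no_blocked_terms])
```

Because the wrapper only depends on a callable that maps a prompt to a response, the same checks can sit in front of any model, which is the sense in which the approach is model-agnostic.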
