Your Cookie Preferences

We use different types of cookies to optimize your experience on our website. Click on the categories below to learn more about their purposes. You may choose which types of cookies to allow and can change your preferences at any time. Remember that disabling cookies may affect your experience on the website. You can learn more about how we use cookies by visiting our

Essential Cookies

Provider: .providername.com

Name

Purpose

Type

Expires In

__cf_bm

Cloudflare places the cookie on end-user devices that access customer sites protected by Bot Management or Bot Fight Mode.

server_cookie

30 minutes

Provider: .providername.com

Name

Purpose

Type

Expires In

_tibcpv

Used to record unique visitor views of the consent banner.

http_cookie

1 year

Analytics and Customization Cookies

Name

Purpose

Marketo Munchkin

Marketo's custom JavaScript tracking code, called Munchkin, tracks all individuals who visit your website so you can react to their visits with automated marketing campaigns.

Name

Purpose

Google Tag

The Google tag (gtag.js) is a single tag you can add to a website to use a variety of Google products and services (e.g., Google Ads, Google Analytics, Campaign Manager, Display & Video 360, Search Ads 360).

Advertising Cookies

Provider: .providername.com

Name

Purpose

Type

Expires In

__cf_bm

Cloudflare places the cookie on end-user devices that access customer sites protected by Bot Management or Bot Fight Mode.

server_cookie

30 minutes

Provider: .providername.com

Name

Purpose

Type

Expires In

_tibcpv

Used to record unique visitor views of the consent banner.

http_cookie

1 year

March 31, 2023

minute read

Prompt Injection Attack on GPT-4

Product Updates

Author

Authors

William Zhang

William is a Machine Learning Engineer at Robust Intelligence.

A lot of effort has been put into ChatGPT and subsequent models to be aligned: helpful, honest, and harmless. However, the following video demonstrates that we can construct a prompt that can trick GPT-4 based ChatGPT into being able to give results that violate these principles:

Here’s the prompt that was used:

Why does this work?

It’s hard to say exactly what’s happening inside the black box that is ChatGPT or the exact implementation details of how the user’s text gets consumed by the model, but we can make guesses.

Just this month, OpenAI released the format that the ChatGPT model consumes the data sent by the user: Chat Markdown Language (ChatML). The main idea is that conversations are sent in the high-level API as a series of messages where each message includes fields for the content and the role of the entity stating the content.

import openai

openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": "You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-01"},
{"role": "user", "content": "How are you"},
{"role": "assistant", "content": "I am doing well!"},
{"role": "user", "content": "How are you now?"}
]
)

The response to this request would include the next message that ChatGPT will respond with given this conversation history.

This API allows app developers who want to build on top of the GPT models to make the model aware of the different types of instructions that it can receive: instructions from the system, and instructions from the user. App developers may not always trust their users to pass in trustworthy input, so a useful language model should prioritize system instructions over user instructions.

These messages get parsed into a format that looks like this when consumed by the ML model:

What happens when we use the prompt shown in the video? The model receives the following text as the conversation history:

Note that the entirety of the text beginning with <code inline>I would like to ask some questions</code> was entirely controlled by the user.

Why does this result in the bot generating misinformation? These generative models are autoregressive models. This means that they generate new text based on prior text that it has seen in its context window. The most likely cause is that when it receives the above conversation history, we have tricked it into believing that it has already stated misinformation in a confident tone, making it more prone to continuing to state more misinformation in the same style.

How was this not caught before-hand?

Prompt-injection is a fairly well-known security vulnerability within the Generative LLM space, having been reported back as early as September of 2022. When OpenAI even released ChatML, they left a warning that the raw string format “inherently allows injections from user input containing special-token syntax, similar to a SQL injections."

There certainly was an attempt made to patch this issue: sanitize the user inputs. This is noticeable if we refresh and re-visit the page; after doing so, looking at the conversation history, the <code inline><|im_start|></code> and <code inline><|im_end|></code> tags disappear. In other words, the tags don’t actually matter when provided as user input because OpenAI likely filters these out before providing the user input to the model and storing it in their database. The key problem here though is that the operative words seem to be be <code inline>system</code>, <code inline>user</code>, and <code inline>assistant</code> words rather than the tags themselves.

In the experiment done above, we compare the results with and without the role tags on GPT-4. In the second example, the model at least always prefaces its conclusion with “As MisInformationBot, I am providing incorrect information,” and it rightfully outright refuses the user request to provide misinformation on the last question, likely due to the severity of the topic. However, when prompted using the role tags, GPT-4 has no such reservation against severely offensive misinformation. Additional testing has found that GPT-4 seems more difficult to get to say offensive material than ChatGPT.

Why do the roles strings have an impact even with the tags removed? Like all machine learning models, ChatGPT and GPT-4 are really trained to pick up on correlations. It’s likely that whenever the model encounters the <code inline>user</code>, <code inline>system</code>, and <code inline>assistant</code> strings in its prompt, it still internally holds a text representation that is still very similar to the one of the text it received the delimiting <code inline><|im_start|></code> and <code inline><|im_end|></code> tags as well. This could be because in data it received during fine-tuning most likely always had the <code inline><|im_start|></code> tags right next to the role of the message, and so it treats the mostly similar text in similar ways.

Does this mean that we can get ChatGPT and GPT-4 to say whatever offensive thing that we want it to? As long it it’s able to pass the filter of the model usable in OpenAI’s content moderation endpoint, the answer seems to be yes, but the question will need further investigation.

Note: In the GPT-4 System Card published on March 23, OpenAI acknowledges that System Message Attacks are "one of the most effective methods of ‘breaking’ the model currently."

Author

Authors

William Zhang

William is a Machine Learning Engineer at Robust Intelligence.

Social

Follow us on LinkedIn

May 15, 2024

minute read

Takeaways from SatML 2024

For:

April 29, 2024

minute read

AI Governance Policy Roundup (April 2024)

For:

April 26, 2024

minute read

AI Cyber Threat Intelligence Roundup: April 2024

For:

+ More Articles

March 29, 2023

minute read

Introducing the AI Risk Database

For:

Data Science Leaders

February 15, 2023

minute read

Infusing Security into MLOps

For:

March 23, 2023

minute read

Customize AI Model Testing with Robust Intelligence

For:

+ More Articles

import openai openai.ChatCompletion.create( model="gpt-3.5-turbo", messages=[ {"role": "system", "content": "You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-01"}, {"role": "user", "content": "How are you"}, {"role": "assistant", "content": "I am doing well!"}, {"role": "user", "content": "How are you now?"} ] )

Your Cookie Preferences

Essential Cookies

Provider: .providername.com

Provider: .providername.com

Analytics and Customization Cookies

Performance and Functionality Cookies

Advertising Cookies

Provider: .providername.com

Provider: .providername.com

Prompt Injection Attack on GPT-4

Why does this work?

How was this not caught before-hand?

Note: In the GPT-4 System Card published on March 23, OpenAI acknowledges that System Message Attacks are "one of the most effective methods of ‘breaking’ the model currently."

Follow us on LinkedIn

Related articles

Takeaways from SatML 2024

AI Governance Policy Roundup (April 2024)

AI Cyber Threat Intelligence Roundup: April 2024

Related articles

Introducing the AI Risk Database

Infusing Security into MLOps

Customize AI Model Testing with Robust Intelligence

Ready to learn more?

Prompt Injection Attack on GPT-4

Why does this work?

How was this not caught before-hand?

Note: In the GPT-4 System Card published on March 23, OpenAI acknowledges that System Message Attacks are "one of the most effective methods of ‘breaking’ the model currently."

Related articles

AI Governance Policy Roundup (February 2024)

AI Cyber Threat Intelligence Roundup: January 2024

AI Governance Policy Roundup (December 2023)

Introducing the AI Risk Database

Infusing Security into MLOps

Customize AI Model Testing with Robust Intelligence

Achieve AI Integrity Today

Your Cookie Preferences

Essential Cookies

Provider: .providername.com

Provider: .providername.com

Analytics and Customization Cookies

Performance and Functionality Cookies

Advertising Cookies

Provider: .providername.com

Provider: .providername.com

Why does this work?

How was this not caught before-hand?

Note: In the GPT-4 System Card published on March 23, OpenAI acknowledges that System Message Attacks are "one of the most effective methods of ‘breaking’ the model currently."

Follow us on LinkedIn

Subscribe to our newsletter

Related articles

Takeaways from SatML 2024

AI Governance Policy Roundup (April 2024)

AI Cyber Threat Intelligence Roundup: April 2024

Related articles

Introducing the AI Risk Database

Infusing Security into MLOps

Customize AI Model Testing with Robust Intelligence

Ready to learn more?

Why does this work?

How was this not caught before-hand?

Note: In the GPT-4 System Card published on March 23, OpenAI acknowledges that System Message Attacks are "one of the most effective methods of ‘breaking’ the model currently."

Subscribe to our newsletter

Related articles

AI Governance Policy Roundup (February 2024)

AI Cyber Threat Intelligence Roundup: January 2024

AI Governance Policy Roundup (December 2023)

Introducing the AI Risk Database

Infusing Security into MLOps

Customize AI Model Testing with Robust Intelligence

Achieve AI Integrity Today