April 26, 2024 - 5 minute read

AI Cyber Threat Intelligence Roundup: April 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: April 2024

Crescendo Multi-turn Jailbreak

A team of researchers at Microsoft has published a paper introducing Crescendo, a jailbreak technique that uses multiple rounds of subtle prompting to steer an LLM toward harmful behavior.

Similar to the multi-round Contextual Interaction Attack covered in our March threat roundup, the Crescendo technique uses a series of seemingly benign prompts instead of a single malicious input. Gradual escalation itself is not new; researchers and everyday users alike have employed it since LLMs first appeared.

Results from this research show that Crescendo succeeds against all evaluated models on nearly every task, often achieving attack success rates of 100%. In some cases, a post-output filter employed by the model provider was triggered, indicating that harmful content was generated but intercepted. Because techniques like Crescendo do not rely on a single malicious prompt, measures such as sequential analysis of the conversation and output filtering are likely the most reliable mitigations.
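
As a rough illustration of those mitigations, the Python sketch below wraps an LLM call with per-turn output filtering and a simple escalation check over recent turns. The `GuardedConversation` class, the keyword-based moderator, and the thresholds are illustrative assumptions, not components of the Microsoft research.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Illustrative thresholds, not values from the Crescendo paper.
HARM_THRESHOLD = 0.7    # withhold a single response at or above this score
TREND_THRESHOLD = 0.4   # flag conversations whose risk scores keep climbing

def keyword_moderator(text: str) -> float:
    """Toy stand-in for a real moderation model; returns a 0-1 risk score."""
    risky_terms = ("explosive", "synthesize", "weapon", "bypass security")
    hits = sum(term in text.lower() for term in risky_terms)
    return min(1.0, hits / 2)

@dataclass
class GuardedConversation:
    llm: Callable[[List[Tuple[str, str]], str], str]  # (history, prompt) -> response
    moderator: Callable[[str], float] = keyword_moderator
    history: List[Tuple[str, str]] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

    def send(self, prompt: str) -> str:
        response = self.llm(self.history, prompt)
        score = self.moderator(prompt + "\n" + response)
        self.scores.append(score)

        # Output filtering: withhold a single clearly harmful completion.
        if score >= HARM_THRESHOLD:
            return "[response withheld by output filter]"

        # Sequential analysis: monotonically rising risk over recent turns
        # suggests gradual escalation even when no single turn is blocked.
        recent = self.scores[-4:]
        if (len(recent) == 4
                and all(a < b for a, b in zip(recent, recent[1:]))
                and recent[-1] >= TREND_THRESHOLD):
            return "[conversation flagged: escalating risk across turns]"

        self.history.append((prompt, response))
        return response
```

In practice, the moderation call would be a dedicated classifier and the escalation heuristic would be tuned against known multi-turn attack transcripts rather than the fixed window used here.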

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

Many-Shot Jailbreak

Researchers at Anthropic have released a paper exploring Many-Shot Jailbreaking (MSJ), a technique that exploits the context windows of LLMs to elicit harmful and unintended responses. Many-Shot Jailbreaking works by simply prompting the target model with a large number—hundreds to thousands—of examples of undesirable behavior followed by a new harmful question at the end of the prompt.

The mechanism behind this jailbreak is likely the same one that makes in-context learning work: models can learn to perform tasks from examples provided in the context window. In this case, those examples bias the model toward responses that bypass its built-in guardrails.

The researchers evaluated MSJ on various LLMs, including GPT-3.5, GPT-4, Claude 2.0, and Llama-2. The attack proved effective across model families and harm categories, with success rates approaching 100% given enough examples. The technique has no specialized prerequisites beyond an extensive list of harmful examples, which can be generated by a separate model, meaning an adversary could replicate it with relative ease.

Limiting an LLM's context window size can reduce MSJ's effectiveness, but it may also limit the model's usefulness depending on the application. Input and output filtering with effective toxicity detection should also help, since both the user-provided examples and the resulting LLM responses are likely to contain harmful content.
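
As a rough illustration of the input-filtering side, the sketch below screens a single prompt for an unusually large number of embedded demonstration turns before it reaches the model. The `screen_prompt` helper, the turn-matching pattern, and the limits are assumptions made for illustration; they are not drawn from Anthropic's paper.

```python
import re
from typing import Tuple

# Illustrative limits; real deployments would tune these per application.
MAX_EMBEDDED_TURNS = 16      # demonstration dialogue turns allowed in one prompt
MAX_PROMPT_TOKENS = 8_000    # coarse cap, approximated by whitespace tokens

# Pattern for faux dialogue turns embedded inside a single user message,
# e.g. "User: ..." / "Assistant: ..." style demonstrations.
TURN_PATTERN = re.compile(r"^\s*(human|user|assistant|ai)\s*:",
                          re.IGNORECASE | re.MULTILINE)

def screen_prompt(prompt: str) -> Tuple[bool, str]:
    """Return (allowed, reason): a cheap pre-filter for many-shot style prompts."""
    approx_tokens = len(prompt.split())
    if approx_tokens > MAX_PROMPT_TOKENS:
        return False, f"prompt too long (~{approx_tokens} tokens)"

    embedded_turns = len(TURN_PATTERN.findall(prompt))
    if embedded_turns > MAX_EMBEDDED_TURNS:
        return False, f"{embedded_turns} embedded dialogue turns exceed limit"

    return True, "ok"

# Example: a prompt stuffed with hundreds of faux Q&A demonstrations is rejected
# before it ever reaches the model.
many_shot = "\n".join(f"User: question {i}\nAssistant: answer {i}" for i in range(300))
print(screen_prompt(many_shot))   # (False, '600 embedded dialogue turns exceed limit')
```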

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

Repeated Tokens Lead to Training Data Extraction

Recent research by a team at Dropbox has uncovered a new training data extraction vulnerability affecting OpenAI’s language models, including GPT-3.5, GPT-4, and custom GPTs.

This technique builds on prior research, which found that single tokens repeated in a prompt could induce these LLMs to break alignment, behave erratically, and reproduce portions of their training data. OpenAI responded to the initial disclosure by filtering prompts containing repeated single tokens, but this latest research shows that repeated multi-token phrases remain effective. In certain tests, specific token sequences caused ChatGPT models to generate extremely long responses and ultimately time out.

The Dropbox team asserts that this repeated-token attack transfers to other third-party and open-source language models, but is withholding details for a follow-up blog post. Organizations can mitigate this technique by enforcing maximum token limits, filtering inputs for repeated single- and multi-token sequences, and monitoring LLM outputs to catch abnormally high token usage, denial-of-service attempts, and off-topic responses.
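
A first-pass defense against this class of issue can be as simple as scanning prompts for abnormal repetition before they reach the model. The sketch below flags repeated single- and multi-token sequences using word-level n-grams; the `repeated_sequences` helper and its thresholds are illustrative assumptions, not OpenAI's or Dropbox's actual filtering logic.

```python
from collections import Counter

# Illustrative settings; a production filter would use the model's own tokenizer.
MAX_REPEATS = 20         # allow a token or phrase to repeat at most this many times
NGRAM_SIZES = (1, 2, 3)  # check single tokens and short multi-token phrases

def repeated_sequences(prompt: str, max_repeats: int = MAX_REPEATS) -> dict:
    """Return word-level n-grams repeated more than `max_repeats` times."""
    words = prompt.lower().split()
    flagged = {}
    for n in NGRAM_SIZES:
        grams = zip(*(words[i:] for i in range(n)))
        counts = Counter(" ".join(gram) for gram in grams)
        flagged.update({gram: count for gram, count in counts.items()
                        if count > max_repeats})
    return flagged

# Example: a prompt that repeats a short phrase hundreds of times gets flagged,
# while a normal prompt returns an empty dict.
suspicious = "please summarize this report " + "token token " * 400
print(repeated_sequences(suspicious))                       # flags 'token', 'token token', ...
print(repeated_sequences("please summarize this report"))   # {}
```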

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

More Threats to Explore

An automated jailbreak technique known as TASTLE bypasses safety guardrails by generating universal jailbreak templates that can be combined with arbitrary malicious queries. These templates embed a harmful query within a complex, unrelated context which distracts the LLM but is ultimately ignored, a technique researchers refer to as “memory reframing.”

Specific glitch tokens appear to trigger unusual, erratic behavior in LLMs including GPT-2, GPT-3, and ChatGPT. Behaviors displayed include non-determinism, evasiveness, hallucinations, insults, unsettling humor, and more.

By randomly including “adversarial vocabulary” in attack prompts, researchers demonstrate an increased likelihood of a jailbreak occurring on FLAN and Llama2-7B.
