April 26, 2024 - 5 minute read

AI Cyber Threat Intelligence Roundup: April 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: April 2024

Crescendo Multi-turn Jailbreak

A team of researchers at Microsoft has published a paper introducing Crescendo, a jailbreak technique that uses multiple rounds of subtle prompting to steer an LLM toward harmful behavior.

Similar to the multi-round Contextual Interaction Attack covered in our March threat roundup, the Crescendo technique uses a series of seemingly benign prompts instead of a single malicious input. Gradual escalation itself is not new; researchers and everyday users alike have employed it since LLMs first appeared.

Results from this research show that Crescendo succeeds against all evaluated models on nearly every task, often achieving attack success rates of 100%. In some cases, a post-output filter employed by the model provider was triggered, indicating that harmful content was generated but intercepted. Because techniques like Crescendo do not rely on a single malicious prompt, measures such as sequential analysis of the conversation and output filtering are likely the most reliable mitigations.
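
As a rough illustration of those mitigations, the Python sketch below wraps an LLM call with per-turn output filtering and a simple escalation check over recent turns. The `GuardedConversation` class, the keyword-based moderator, and the thresholds are illustrative assumptions, not components of the Microsoft research.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Illustrative thresholds, not values from the Crescendo paper.
HARM_THRESHOLD = 0.7    # withhold a single response at or above this score
TREND_THRESHOLD = 0.4   # flag conversations whose risk scores keep climbing

def keyword_moderator(text: str) -> float:
    """Toy stand-in for a real moderation model; returns a 0-1 risk score."""
    risky_terms = ("explosive", "synthesize", "weapon", "bypass security")
    hits = sum(term in text.lower() for term in risky_terms)
    return min(1.0, hits / 2)

@dataclass
class GuardedConversation:
    llm: Callable[[List[Tuple[str, str]], str], str]  # (history, prompt) -> response
    moderator: Callable[[str], float] = keyword_moderator
    history: List[Tuple[str, str]] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

    def send(self, prompt: str) -> str:
        response = self.llm(self.history, prompt)
        score = self.moderator(prompt + "\n" + response)
        self.scores.append(score)

        # Output filtering: withhold a single clearly harmful completion.
        if score >= HARM_THRESHOLD:
            return "[response withheld by output filter]"

        # Sequential analysis: monotonically rising risk over recent turns
        # suggests gradual escalation even when no single turn is blocked.
        recent = self.scores[-4:]
        if (len(recent) == 4
                and all(a < b for a, b in zip(recent, recent[1:]))
                and recent[-1] >= TREND_THRESHOLD):
            return "[conversation flagged: escalating risk across turns]"

        self.history.append((prompt, response))
        return response
```

In practice, the moderation call would be a dedicated classifier and the escalation heuristic would be tuned against known multi-turn attack transcripts rather than the fixed window used here.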

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

Many-Shot Jailbreak

Researchers at Anthropic have released a paper exploring Many-Shot Jailbreaking (MSJ), a technique that exploits the context windows of LLMs to elicit harmful and unintended responses. Many-Shot Jailbreaking works by simply prompting the target model with a large number—hundreds to thousands—of examples of undesirable behavior followed by a new harmful question at the end of the prompt.

The mechanism behind this jailbreak is likely the same one that makes in-context learning work: models can learn to perform tasks from examples provided in the context window. In this case, those examples bias the model toward responses that bypass its built-in guardrails.

The researchers evaluated MSJ on various LLMs, including GPT-3.5, GPT-4, Claude 2.0, and Llama-2. The attack proved effective across model families and harm categories, with success rates approaching 100% given enough examples. The technique has no specialized prerequisites beyond an extensive list of harmful examples, which can be generated by a separate model, meaning an adversary could replicate it with relative ease.

Limiting an LLM's context window size can reduce MSJ's effectiveness, but it may also limit the model's usefulness depending on the application. Input and output filtering with effective toxicity detection should also help, since both the user-provided examples and the resulting LLM responses are likely to contain harmful content.
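
As a rough illustration of the input-filtering side, the sketch below screens a single prompt for an unusually large number of embedded demonstration turns before it reaches the model. The `screen_prompt` helper, the turn-matching pattern, and the limits are assumptions made for illustration; they are not drawn from Anthropic's paper.

```python
import re
from typing import Tuple

# Illustrative limits; real deployments would tune these per application.
MAX_EMBEDDED_TURNS = 16      # demonstration dialogue turns allowed in one prompt
MAX_PROMPT_TOKENS = 8_000    # coarse cap, approximated by whitespace tokens

# Pattern for faux dialogue turns embedded inside a single user message,
# e.g. "User: ..." / "Assistant: ..." style demonstrations.
TURN_PATTERN = re.compile(r"^\s*(human|user|assistant|ai)\s*:",
                          re.IGNORECASE | re.MULTILINE)

def screen_prompt(prompt: str) -> Tuple[bool, str]:
    """Return (allowed, reason): a cheap pre-filter for many-shot style prompts."""
    approx_tokens = len(prompt.split())
    if approx_tokens > MAX_PROMPT_TOKENS:
        return False, f"prompt too long (~{approx_tokens} tokens)"

    embedded_turns = len(TURN_PATTERN.findall(prompt))
    if embedded_turns > MAX_EMBEDDED_TURNS:
        return False, f"{embedded_turns} embedded dialogue turns exceed limit"

    return True, "ok"

# Example: a prompt stuffed with hundreds of faux Q&A demonstrations is rejected
# before it ever reaches the model.
many_shot = "\n".join(f"User: question {i}\nAssistant: answer {i}" for i in range(300))
print(screen_prompt(many_shot))   # (False, '600 embedded dialogue turns exceed limit')
```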

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

Repeated Tokens Lead to Training Data Extraction

Recent research by a team at Dropbox has uncovered a new training data extraction vulnerability affecting OpenAI’s language models, including GPT-3.5, GPT-4, and custom GPTs.

This technique builds on prior research, which found that single tokens repeated in a prompt could induce these LLMs to break alignment, behave erratically, and reproduce portions of their training data. OpenAI responded to the initial disclosure by filtering prompts containing repeated single tokens, but this latest research shows that repeated multi-token phrases remain effective. In certain tests, specific token sequences caused ChatGPT models to generate extremely long responses and ultimately time out.

The Dropbox team asserts that this repeated-token attack transfers to other third-party and open-source language models, but is withholding details for a follow-up blog post. Organizations can mitigate this technique by enforcing maximum token limits, filtering inputs for repeated single- and multi-token sequences, and monitoring LLM outputs to catch abnormally high token usage, denial-of-service attempts, and off-topic responses.
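
A first-pass defense against this class of issue can be as simple as scanning prompts for abnormal repetition before they reach the model. The sketch below flags repeated single- and multi-token sequences using word-level n-grams; the `repeated_sequences` helper and its thresholds are illustrative assumptions, not OpenAI's or Dropbox's actual filtering logic.

```python
from collections import Counter

# Illustrative settings; a production filter would use the model's own tokenizer.
MAX_REPEATS = 20         # allow a token or phrase to repeat at most this many times
NGRAM_SIZES = (1, 2, 3)  # check single tokens and short multi-token phrases

def repeated_sequences(prompt: str, max_repeats: int = MAX_REPEATS) -> dict:
    """Return word-level n-grams repeated more than `max_repeats` times."""
    words = prompt.lower().split()
    flagged = {}
    for n in NGRAM_SIZES:
        grams = zip(*(words[i:] for i in range(n)))
        counts = Counter(" ".join(gram) for gram in grams)
        flagged.update({gram: count for gram, count in counts.items()
                        if count > max_repeats})
    return flagged

# Example: a prompt that repeats a short phrase hundreds of times gets flagged,
# while a normal prompt returns an empty dict.
suspicious = "please summarize this report " + "token token " * 400
print(repeated_sequences(suspicious))                       # flags 'token', 'token token', ...
print(repeated_sequences("please summarize this report"))   # {}
```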

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

More Threats to Explore

An automated jailbreak technique known as TASTLE bypasses safety guardrails by generating universal jailbreak templates that can be combined with arbitrary malicious queries. These templates embed a harmful query within a complex, unrelated context which distracts the LLM but is ultimately ignored, a technique researchers refer to as “memory reframing.”

Specific glitch tokens appear to trigger unusual, erratic behavior in LLMs including GPT-2, GPT-3, and ChatGPT. Behaviors displayed include non-determinism, evasiveness, hallucinations, insults, unsettling humor, and more.

By randomly including “adversarial vocabulary” in attack prompts, researchers demonstrate an increased likelihood of a jailbreak occurring on FLAN and Llama2-7B.
