March 27, 2024
-
5
minute read

AI Cyber Threat Intelligence Roundup: March 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: March 2024

ArtPrompt: ASCII Art-based Jailbreak Attacks

ArtPrompt is a novel ASCII art-based jailbreak technique designed to bypass LLM safety measures, which are primarily focused on query semantics, with visually encoded representations of specific harmful words.

The technique follows a simple two-step process which involves first masking sensitive words in a prompt that might trigger rejection by an LLM and then replacing those masked words with ASCII art representations. When the resulting prompt is provided to the model, it struggles to interpret the obfuscated keywords but still attempts to address the overall query which leads it to output unsafe content that would otherwise be blocked.

Notably, this approach is shown to be effective (52% ASR) against several state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) with only black-box access. It’s easy for attackers to execute with a simple ASCII art generator, while current defense measures like perplexity thresholding and prompt paraphrasing offer limited protection.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots

Multi-round Jailbreaking: Contextual Interaction Attack

A new jailbreak technique known as a “Contextual Interaction Attack” exploits the context-dependent nature of LLMs by subtly guiding a target model to produce harmful outputs over a series of interactions.

This technique relies on an auxiliary LLM that automatically generates a series of harmless preliminary questions relevant to the ultimate attack query. The attacker poses these preliminary questions to the target LLM individually over several rounds of interaction, and the responses become part of the growing context along with the questions. When the ultimate query is posed, the LLM is steered by the cumulative context into providing harmful information rather than flagging it as unsafe.

The Contextual Interaction Attack has demonstrated a high attack success rate against multiple state-of-the-art LLMs and is easily transferable across models. It threatens to subvert LLMs deployed for sensitive applications such as content moderation, customer support, healthcare, and so on. Traditional input filtering methods will likely prove ineffective against this technique because of its subtle steering over several prompts.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots

ICLAttack: In-context Learning Backdoor

A recently published research paper introduces a technique known as ICLAttack, which exploits the in-context learning capabilities of LLMs in order to introduce a backdoor trigger. This trigger remains dormant until specific conditions are met—a specific word within a prompt or some special string, for example—and the malicious action is triggered.

The ICLAttack technique proves highly effective with a success rate of 95%, but its practical usefulness for real-world attacks remains questionable. Similar to the BadChain chain-of-thought backdoor we mentioned in last month’s threat roundup, the trigger only persists for the duration of the chat session where it is introduced. It’s unlikely that an adversary would be able to control in-context learning examples in a way that affects the output of other users accessing the same LLM. Risk may exist if an LLM application uses user prompts for future training or providing some type of feedback loop into the model or application.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots

More Threats to Explore

Google AI search promotes malicious sites that direct users to install malicious browser extensions, subscribe to spam notifications, and engage in various other scams. These results appear in the new Google Search Generative Experience (SGE) and exhibit similar characteristics to one another, indicating that they are all part of a larger SEO poisoning campaign.

The first-known attack on AI workloads was identified in the wild targeting a vulnerability in Ray, an open-source AI framework. Thousands of businesses and servers may be affected and are susceptible to theft of their computing resources and internal data. At the present time, no patch is available for this vulnerability.

To receive our monthly AI threat round-up, signup for our AI Security Insider newsletter.

March 27, 2024
-
5
minute read

AI Cyber Threat Intelligence Roundup: March 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: March 2024

ArtPrompt: ASCII Art-based Jailbreak Attacks

ArtPrompt is a novel ASCII art-based jailbreak technique designed to bypass LLM safety measures, which are primarily focused on query semantics, with visually encoded representations of specific harmful words.

The technique follows a simple two-step process which involves first masking sensitive words in a prompt that might trigger rejection by an LLM and then replacing those masked words with ASCII art representations. When the resulting prompt is provided to the model, it struggles to interpret the obfuscated keywords but still attempts to address the overall query which leads it to output unsafe content that would otherwise be blocked.

Notably, this approach is shown to be effective (52% ASR) against several state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) with only black-box access. It’s easy for attackers to execute with a simple ASCII art generator, while current defense measures like perplexity thresholding and prompt paraphrasing offer limited protection.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots

Multi-round Jailbreaking: Contextual Interaction Attack

A new jailbreak technique known as a “Contextual Interaction Attack” exploits the context-dependent nature of LLMs by subtly guiding a target model to produce harmful outputs over a series of interactions.

This technique relies on an auxiliary LLM that automatically generates a series of harmless preliminary questions relevant to the ultimate attack query. The attacker poses these preliminary questions to the target LLM individually over several rounds of interaction, and the responses become part of the growing context along with the questions. When the ultimate query is posed, the LLM is steered by the cumulative context into providing harmful information rather than flagging it as unsafe.

The Contextual Interaction Attack has demonstrated a high attack success rate against multiple state-of-the-art LLMs and is easily transferable across models. It threatens to subvert LLMs deployed for sensitive applications such as content moderation, customer support, healthcare, and so on. Traditional input filtering methods will likely prove ineffective against this technique because of its subtle steering over several prompts.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots

ICLAttack: In-context Learning Backdoor

A recently published research paper introduces a technique known as ICLAttack, which exploits the in-context learning capabilities of LLMs in order to introduce a backdoor trigger. This trigger remains dormant until specific conditions are met—a specific word within a prompt or some special string, for example—and the malicious action is triggered.

The ICLAttack technique proves highly effective with a success rate of 95%, but its practical usefulness for real-world attacks remains questionable. Similar to the BadChain chain-of-thought backdoor we mentioned in last month’s threat roundup, the trigger only persists for the duration of the chat session where it is introduced. It’s unlikely that an adversary would be able to control in-context learning examples in a way that affects the output of other users accessing the same LLM. Risk may exist if an LLM application uses user prompts for future training or providing some type of feedback loop into the model or application.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots

More Threats to Explore

Google AI search promotes malicious sites that direct users to install malicious browser extensions, subscribe to spam notifications, and engage in various other scams. These results appear in the new Google Search Generative Experience (SGE) and exhibit similar characteristics to one another, indicating that they are all part of a larger SEO poisoning campaign.

The first-known attack on AI workloads was identified in the wild targeting a vulnerability in Ray, an open-source AI framework. Thousands of businesses and servers may be affected and are susceptible to theft of their computing resources and internal data. At the present time, no patch is available for this vulnerability.

To receive our monthly AI threat round-up, signup for our AI Security Insider newsletter.

Blog

Related articles

June 21, 2021
-
1
minute read

How To Secure AI Systems @ Stanford MLSys Seminar

For:
May 24, 2021
-
2
minute read

Business Alliance with Tokio Marine

For:
September 13, 2021
-
6
minute read

How NTT DATA Uses RIME to Increase Model Performance by 70%

For:
No items found.