Al Security Risks

A taxonomy of the most common Al safety and security threats.

AI Security Necessitates a New Approach

The adoption of AI applications introduces an entirely new set of safety and security risks which differ from traditional cybersecurity challenges. Conventional approaches focus on protecting systems and data from unauthorized access and known vulnerabilities, but do not effectively secure the attack surface of AI applications and their underlying models. New algorithmic attack methodologies only add to this complexity because their patterns are dynamic and scalable.

AI application security necessitates a fundamentally new approach which considers threats to every stage of the AI lifecycle, from data poisoning and model backdoors to prompt injection and sensitive data leakage. We developed this taxonomy to educate organizations on the top risks to AI security today, complete with descriptions, examples, and mitigation techniques.
Alert/Risk logo
Alert/Risk logo
Alert/Risk logo
Alert/Risk logo
Alert/Risk logo
Alert/Risk logo

Top AI Security and Safety Risks

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Production Risk
All Resources

Prompt Injection (Direct)

Direct prompt injections are adversarial attacks that attempt to alter or control the output of an LLM by providing instructions via prompt that override existing instructions. These outputs can include harmful content, misinformation, or extracted sensitive information such as PII or model instructions.

Impact:
Mitigation:

AI Firewall

Standards:
MITRE ATLAS

AML.T0051.000 - LLM Prompt Injection: Direct

OWASP TOP 10 for LLM Applications

LLM01 - Prompt Injection

Production Risk
All Resources

Prompt Injection (Indirect)

Indirect prompt injection requires an adversary to control or manipulate a resource consumed by an LLM such as a document, website, or content retrieved from a database. This can direct the model to expose data or perform a malicious action like distribution of a phishing link.

Impact:
Mitigation:

AI Firewall; sanitize data sources

Standards:
MITRE ATLAS

AML.T0051.001 - LLM Prompt Injection: Indirect

OWASP TOP 10 for LLM Applications

LLM01 - Prompt Injection

Production Risk
All Resources

Jailbreaks

A jailbreak refers to any prompt-based attack designed to bypass model safeguards to produce LLM outputs that are inappropriate, harmful, or unaligned with the intended purpose. With well-crafted prompts, adversaries can access restricted functionalities or data, and compromise the integrity of the model itself.

Impact:
Mitigation:

AI Firewall

Standards:
MITRE ATLAS

AML.T0054 - LLM Jailbreak Injection: Direct

OWASP TOP 10 for LLM Applications

LLM01 - Prompt Injection

Production Risk
All Resources

Meta Prompt Extraction

Meta prompt extraction attacks aim to derive a system prompt which effectively guides the behavior of a LLM application. This information can be exploited by attackers and harm the intellectual property, competitive advantage, and reputation of a business.

Impact:
Mitigation:

AI Firewall

Standards:
MITRE ATLAS

AML.T0051.000 - LLM Prompt Injection: Direct

Production Risk
All Resources

Sensitive Information Disclosure

Sensitive information disclosure refers to any instance where confidential or sensitive data such as PII or business records is exposed through vulnerabilities in an AI application. These privacy violations can lead to loss of trust and result in legal or regulatory consequences.

Impact:
Mitigation:

AI Firewall; sanitize data sources

Standards:
OWASP TOP 10 for LLM Applications

LLM06 - Sensitive Information Disclosure

Production Risk
All Resources

Privacy Attack

A privacy attack refers broadly to any attack aimed at extracting sensitive information from an AI model or its data. This category includes model extraction, which recreates a functionally equivalent model by probing target model outputs, and membership inference attacks, which determine if specific data records were used for model training.

Impact:
Mitigation:

Data scrubbing; AI Firewall

Standards:
MITRE ATLAS

AML.T0024.000 - Infer Training Data Membership

AML.T0024.001 - Invert ML Model

AML.T0024.002 - Extract ML Model

OWASP TOP 10 for LLM Applications

LLM06 - Sensitive Information Disclosure

Development Risk
All Resources

Training Data Poisoning

Training data poisoning is the deliberate manipulation of training data in order to compromise the integrity of an AI model. This can lead to skewed or biased outputs, backdoor trigger insertions—malicious links, for example—and ultimately, a loss of user trust.

Impact:
Mitigation:

Sanitize training data

Standards:
MITRE ATLAS

AML.T0020 - Poison Training Data

OWASP TOP 10 for LLM Applications

LLM03 - Training Data Poisoning

Production Risk
All Resources

Factual Inconsistency

Generated text contains information that is not accurate or true while being presented in a plausible manner, also known as hallucinations. This may include incorrect details, made-up facts, mismatches with known information, or entirely fictional details.

Impact:
Mitigation:

AI Firewall; RLHF; data cleaning and filtering

Standards:
MITRE ATLAS

AML.T0048.002 - Societal Harm

OWASP TOP 10 for LLM Applications

LLM01: Prompt Injection

Production Risk
All Resources

Denial of Service

An attack designed to degrade or shut down an ML model by flooding the system with requests, requesting large responses, or exploiting a vulnerability.

Impact:
Mitigation:

AI Firewall; rate limiting; token counting

Standards:
MITRE ATLAS

AML.T0029 - Denial of ML Service

OWASP TOP 10 for LLM Applications

LLM04 - Model Denial of Service

Production Risk
All Resources

Cost Harvesting

Threat actors using a model in a way the developer did not intend while increasing the cost of running services at the target organization. For example, an attacker could jailbreak a public-facing customer support chatbot in order to generate Python code or perform some other task the application was not designed for.

Impact:
Mitigation:

AI Firewall

Standards:
MITRE ATLAS

AML.T0034 - Cost Harvesting

Production Risk
All Resources

Exfiltration

Techniques used to get data out of a target network. Exfiltration of ML artifacts (e.g., data from privacy attacks) or other sensitive information.

Impact:
Mitigation:

AI Firewall; threat modeling application

Standards:
MITRE ATLAS

AML.T0025 - Exfiltration via Cyber Means

Production Risk
All Resources

Toxicity

Unintended responses that are offensive to the user. This can include hate speech and discrimination, profanity, sexual content, violence, harassment, unsafe actions, and more.

Impact:
Mitigation:

AI Firewall

Standards:
MITRE ATLAS

AML.T0048.002 - Societal Harm

OWASP TOP 10 for LLM Applications

LLM01: Prompt Injection

Production Risk
All Resources

Misalignment

Discrepancy between the model's behavior and the intended objectives or values of its developers and users. This may present as a misalignment of goals, safety, values, or other specifications.

Impact:
Mitigation:

AI Validation

Standards:
MITRE ATLAS

AML.T0048.002 - Societal Harm

Development Risk
All Resources

Model Backdoor

Insertion of hidden backdoors into an ML model which can be triggered by specific inputs to cause a specific, unexpected output or grant some level of unauthorized access.

Impact:
Mitigation:

Validate ML model; Sanitize training data

Standards:
MITRE ATLAS

AML.T0018: Backdoor ML Model