An automated method for jailbreaking sophisticated Large Language Models, including GPT-4 and Llama 2.
AI security researchers from Robust Intelligence, in collaboration with Yale University, have discovered an automated adversarial machine learning technique called Tree of Attacks with Pruning (TAP). The method can override the guardrails of sophisticated models with a high success rate, a small number of queries, and no human oversight. This research report covers:
- An overview of TAP, a query-efficient black-box technique for jailbreaking LLMs
- In-depth explanation of our testing methodologies and success criteria
- Findings and key takeaways from the application of TAP against various LLMs
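To make the idea concrete, the TAP loop can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: every helper here (`branch`, `is_on_topic`, `judge`, `query_target`) is a hypothetical stand-in that, in the real attack, would be backed by an attacker LLM, an evaluator LLM, and the black-box target model respectively.

```python
def branch(prompt, width=2):
    """Attacker step (stand-in): propose `width` refined variants of a prompt."""
    return [f"{prompt} [variant {i}]" for i in range(width)]

def is_on_topic(prompt, goal):
    """Evaluator step (stand-in): prune candidates that drift off the goal."""
    return goal in prompt

def judge(response):
    """Evaluator step (stand-in): score how close a response is to a jailbreak, 0-10."""
    return 10 if "UNSAFE" in response else len(response) % 10

def query_target(prompt):
    """Black-box target model (stand-in): returns a canned response."""
    return f"response to: {prompt}"

def tap(goal, depth=3, width=2, keep=2, threshold=10):
    """Tree of Attacks with Pruning: branch, prune off-topic prompts,
    query the target, score, keep the best leaves, and repeat."""
    leaves = [goal]
    for _ in range(depth):
        # 1. Branch: each surviving leaf spawns `width` candidate prompts.
        candidates = [p for leaf in leaves for p in branch(leaf, width)]
        # 2. First pruning phase: drop off-topic candidates *before* spending queries.
        candidates = [p for p in candidates if is_on_topic(p, goal)]
        # 3. Query the target and score each response.
        scored = [(judge(query_target(p)), p) for p in candidates]
        # 4. Success check: a response at the threshold counts as a jailbreak.
        for score, prompt in scored:
            if score >= threshold:
                return prompt
        # 5. Second pruning phase: keep only the highest-scoring leaves.
        scored.sort(reverse=True)
        leaves = [p for _, p in scored[:keep]]
    return None  # no jailbreak found within the query budget
```

Pruning off-topic branches before querying is what makes the search query-efficient: the attack spends its limited budget only on prompts the evaluator considers promising.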