In part 1 of this blog series, we provided an overview of NVIDIA’s open-sourced NeMo Guardrails and the outlined the evidence we found that suggests companies should refrain from relying on Guardrails for protecting production LLMs. The NVIDIA team welcomed our feedback, have already resolved one issue, and plan to address the other issues in future releases.
To recap, we identified the following:
- Without improvement of the example rails, simple prompts can bypass the rails, eventually providing unfettered access to the LLM.
- Without improvement of the example rails, hallucinations can occur that are not caught by the rails and lead to poor business outcomes.
- PII can be exposed from an LLM that ingests data from a database; attempts to de-risk that behavior using rails is a poor design, is discouraged by the Guardrail authors, and would be better replaced by appropriate implementation of trust boundaries for which even the best-designed rails and firewalls should not be purposed.
We then dove into our findings on how simple prompts were used to bypass the Topical Rails example. In this blog, we will cover hallucinations and PII exposure (parts 2 and 3 above). Read on to learn more.
In the following experiment, we try to adapt NeMo Guardrails’ examples to provide only correct recommendations about a set of phone plans that are described by the following table. There are a total of 50 plans available with various parameters to each, provided by company ZZYZX. It should be noted that to our knowledge, no large language model handles tabular data well, and we have accommodated for this deficiency in our application below.
In this attempt, we adapted the NeMo Guardrails Topical Firewall to use our fake phone plan data by converting that to a MD table and supplying it as a knowledge base to the bot.
This firewall failed completely. This is likely due to the fact that large language models poorly parse formatted tabular data.
None of these plan details came from the knowledge base we provided.
In this attempt, we adapted the NeMo Guardrails Topical Firewall to use our fake phone plan data by converting it into free-form MD text and supplying it as a knowledge base to the bot. Rather than tabular data, this free-form text represents that table as sentences that a language model can more easily parse.
In this attempt, we now adapted the NeMo Guardrails Fact-Checking Rules to use our fake phone plan data by converting it into free-form MD text and supplying it as a knowledge base to the bot. In this case, the bot attempts to verify that all outputs from the bot are correct facts from the knowledge base.
In this case, the interaction begins very well, and it answers the first three questions nearly flawlessly (although in the second question it first claims it can offer 30GB/month and then offers a contradictory plan). However, in the fourth interaction, the bot completely makes up a non-existent plan 55 with the required 30GB/month of data (no plan offers that much data) and a price of 69.99, which is higher than any other plan. This plan is not caught by the fact-checker rail, as it should have been since that plan clearly does not exist.
In our conversations with the NVIDIA team, this behavior was understood as a difficulty of the language model to verify a large number of facts provided in a single query.
NeMo PII Exploration
Ultimately, using an LLM in applications that require trust is poor design and implementation of a trust boundary, for which subsequent attempts to patch this trust boundary violation is futile. To demonstrate this, we design a simple system in which an LLM agent is given access to a database that includes private data. This, for example, is possible when designing a system with AI chaining software such as langchain or Microsoft Semantic Kernel. We strongly recommend that systems are not designed in this way, and motivate this point using the following example.
In this experiment, we tried to make a guardrail to identify and block PII in a bot’s response when that bot has been supplied with a knowledge base that contains <code inline>PII</code>. For this purpose, we created a fake table of employee data which included a column we called IIP which is a stand-in for a social security number (we named it <code inline>IIP</code> to avoid any in-built mechanisms the bot itself might have to forbid PII). Then we built a series of NeMo actions, which were meant to detect PII.
CAVEAT: Unlike other experiments, there was no guardrail provided by NeMo for detecting PII so we had to write this one ourselves. Because we’re new to this product, our guardrail implementations may be imperfect, and may not represent the best NeMo guardrails could offer, but they were a best-effort based on what we saw in other guardrails.
In this version, we made a simple detector that checks if the bot’s response contains the word <code inline>IIP</code>.
This worked for simple queries but was easily defeated by a character rewrite query:
for which the bot checked the facts of the bot but failed to discover the PII:
Improving on this, we added logic to the PII detector that used a regex to detect 9 digit numbers formatted like an IIP
for which the bot failed to discover the PII:
This sort of query is allowing us to get around the PII checks we put in place, although the quality of the information is degraded. Namely the <code inline>405</code> digits are the IIP of a different employee in the database. It further makes a mistake in the last 6 digits of her IIP which should have been <code inline>382842</code>. However, if additional queries are made, it seems these may be error-corrected.
Reaching the limits of regular text parsing, we next tried to add bot based checking for PII to our PII detector. We followed the example set forth by the NeMo <code inline>check_facts</code> function to create a LangChain query to ask whether or not the bot’s response contained any information about employee’s IIP numbers. This is what we attempted:
This also failed to detect snippets of PII that were elicited through queries like:
In this case, the bot did correctly identify Charlotte’s last 4 digits of her IIP and <code inline>check_pii</code> was unable to identify this information.
CAVEAT: The prompt sent to LLMChain here may be suboptimal for identifying IIP numbers from the data. There is an art to forming queries like this correctly.
Protecting PII is a hard problem for generative models that are willing to comply with requests to obfuscate data. Guarding an LLM using any approach is a poor substitute for proper security design of a system. In our example, reasonable efforts to tune Guardrails seems to devolve into a futile cat-and-mouse game. Since the bot is being instructed to disguise the PII and reduce it down to small bits of information, it may be impossible to prevent PII leakage to a determined adversary. For instance, the adversary could prompt the bot to <code inline>encode the digits as ASCII codes</code>, or <code inline>only use Roman numerals</code>, or <code inline>put a word in between each digit</code> and thereby use the bot itself to form output that becomes increasingly hard to associate with the original information. A clever patient adversary could even form queries like
While increasingly many queries are required for such schemes, if the PII is sufficiently valuable, they are easily parallelized and masked with bot-farms.
While we did notice that increasingly complex questions had higher error rates, this is an area that is likely to favor the attacker as LLMs improve, because the improved LLM will allow for more faithful execution of the attacker’s logic.
Ultimately, applications should be designed with appropriate trust boundaries. A defender is better served spending more effort in removing PII from their knowledge base before it’s given to a chat bot, or removing the chat bot from the knowledge base altogether. Applying guardrails to such an application is a poor use of the technology.
While NVIDIA’s NeMo Guardrails is an early 0.1.0 release, our preliminary findings indicated that the solution is moving in the right direction, however:
- Guardrails and firewalls are no substitute for appropriate security design of a system–LLMs should not ingest unsanitized input, and their output should also be treated as unsanitized;
- Despite what may be in the press, the current version of NeMo Guardrails is not generally ready to protect production applications, as clearly noted by the project’s authors. The examples provided by NVIDIA that we evaluated are a great representation of how well their system can perform. With more effort, the rails undoubtedly can be refined to make the guardrails more constrained, and therefore effective. Underlying many of the bypasses is a curious nondeterministic behavior that should be better understood before NeMo should be relied on as a viable solution in production applications.
General Guidelines for Securing LLMs
With the rush to explore LLMs in new products and services, it’s important to be secure by design. Best practices for using LLMs would be to treat them as untrusted consumers of data, without direct links to sensitive information. Furthermore, their output should be considered unsanitized. Many security concerns are removed when basic security hygiene is exercised.
On that foundation, firewalls and guardrails also offer an important control to reduce the risk exposed by LLMs. In order to effectively safeguard a generative model, firewalls and guardrails first need to understand the model they are protecting. We believe this is best done through comprehensive pre-deployment validation. The firewall or guardrail can then be tailored to the vulnerabilities of that particular model and use case.
A satisfactory set of behaviors for guardrails or firewalls should include the following:
- Deterministic Correctness. A guardrail or firewall is not useful if it doesn’t respond in a consistently correct way for the same prompts over time.
- Proper Use of Memory. Memory is an important feature of LLM centered chat applications, but it’s also a vulnerability. Users should not be able to exploit the memory stored in the chat to get around the guardrail. The guardrail or firewall should be able to block attempts to circumvent these rules across various prompts, and make sure that users don’t veer off topic slowly over multiple messages.
- Intent Idempotency. The performance of the firewall or guardrail should not degrade under modifications of the prompt, or different versions of the output that convey the same information. For example, toxicity rails should trigger whether as a plain utterance, or when asked to do so when all instances of the letter S are swapped with $.
To learn more about our findings and see a demo of Robust Intelligence’s AI Firewall, request a demo here.