MITRE ATLAS™ Adviser in Mindgard that helps standardise AI red teaming reporting.
Lewis Birch

Guardrails have become the go to answer for controlling large language models. Vendors wrap classifiers and filters around LLMs to detect prompt injections, jailbreak attempts, and harmful content. On paper, this seems reassuring. In practice, any classifier can be attacked.
This research puts that assumption to the test. We systematically evaluate how well six widely used guardrails stand up to two classes of evasion attacks:
From Mindgard’s perspective, this work is important for one reason. It shows that wrapping a model in “AI that watches AI” does not remove the need for deeper, system level security thinking. If the guardrail itself is a classifier, it inherits the same weaknesses adversarial researchers have been exploiting for years.
The paper targets six prompt injection and jailbreak detection systems:
For all but Azure Prompt Shield, we have white-box access including confidence scores or logits. Azure remains a true black-box target.
Guardrails are deployed in front of GPT-4o-mini so that prompts must pass detection before they can reach the underlying LLM.
To test realism, we use two datasets:
For each sample, they measure baseline detection performance, then apply evasion techniques and measure how often the guardrail misclassifies the modified prompt as benign. This ratio is the Attack Success Rate.
Character injection attacks exploit the gap between how humans (and LLMs) interpret text and how classifiers tokenize it. The paper explores twelve techniques, including:
Each technique is applied automatically to malicious prompts. The transformed text still conveys the same meaning to the underlying LLM but no longer matches the patterns the guardrail was trained on.
Results show that character injection alone is enough to significantly degrade detection across nearly all systems. Some guardrails exhibit average Attack Success Rates above 80 percent for certain techniques and threat types.
This should not be surprising. Classifiers that rely on specific character distributions or wordpiece tokenization are brittle when attackers introduce Unicode edge cases or non-standard encodings. LLMs, on the other hand, are robust and can often infer the intended prompt despite heavy text mangling.
The second class of attacks uses established text adversarial techniques implemented via the TextAttack library. These include:
Each technique follows a two step pattern:
For white-box guardrails, word importance can be estimated directly from gradients or confidence scores. For black-box targets like Azure Prompt Shield, we show that word rankings computed from open-source white-box models can still be used to guide attacks that transfer.
This is a powerful result. It means an attacker does not need direct insight into the victim guardrail. They can train against a similar model, learn which words matter most for classification, and then reuse that knowledge to attack a different system.
The combined picture from both sets of experiments is sobering:
For defenders, the key message is that guardrails are not a shield that can be bolted on and forgotten. If they depend heavily on static classifiers trained once on fixed datasets, they will accumulate blind spots as attack techniques evolve.
We followed a responsible disclosure process with all parties and completed final notifications in 2025 before publishing their findings.
From Mindgard’s vantage point, this aligns with how offensive security research should work. The goal is not to embarrass vendors, but to stress test assumptions and push the ecosystem toward more resilient designs.
This research suggests several directions for improving LLM defense:
Guardrails remain a necessary tool for governing LLM behavior at scale, but they are not magic. As long as they are themselves machine learning systems, they will share the same weaknesses adversarial researchers have been exploiting for over a decade.
The work on guardrail evasion reinforces a broader message that runs through Mindgard’s research. AI security cannot be delegated to a thin safety wrapper at the edge. It has to be grounded in a realistic understanding of how attackers probe, adapt, and abuse every layer of the stack.