Guardrails have become the go-to answer for controlling large language models. Vendors wrap classifiers and filters around LLMs to detect prompt injections, jailbreak attempts, and harmful content. On paper, this seems reassuring. In practice, any classifier can be attacked.
This research puts that assumption to the test. We systematically evaluate how well six widely used guardrails stand up to two classes of evasion attacks:
- Character injection – simple transformations of text such as homoglyphs, zero-width characters, or spacing tricks that confuse detectors but remain readable to LLMs.
- Adversarial ML evasion – algorithmic methods that perturb prompts based on word importance rankings while preserving semantics.
From Mindgard’s perspective, this work is important for one reason. It shows that wrapping a model in “AI that watches AI” does not remove the need for deeper, system-level security thinking. If the guardrail itself is a classifier, it inherits the same weaknesses adversarial researchers have been exploiting for years.
The guardrails under test
The paper targets six prompt injection and jailbreak detection systems:
- Azure Prompt Shield within Azure AI Content Safety – a production service that classifies and blocks adversarial prompts, treated as a black-box model that only reveals labels.
- ProtectAI Prompt Injection v1 and v2 – open-source classifiers fine-tuned from DeBERTa-v3-base. Version 2 is not trained on jailbreaks and is evaluated accordingly.
- Meta Prompt Guard – a multi-label classifier for direct jailbreaks and indirect prompt injections based on mDeBERTa-v3-base.
- NeMo Guard Jailbreak Detect from Nvidia – a lightweight random forest model that uses pre-trained embeddings to flag jailbreak prompts.
- Vijil Prompt Injection – a ModernBERT-based classifier focused on injection attempts that try to manipulate an LLM.
For all but Azure Prompt Shield, we have white-box access, including confidence scores or logits. Azure remains a true black-box target.
Guardrails are deployed in front of GPT-4o-mini so that prompts must pass detection before they can reach the underlying LLM.
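To make that deployment pattern concrete, here is a minimal sketch of a guardrail sitting in front of a model. `guardrail_score` and `call_llm` are hypothetical stand-ins, not the actual services evaluated in the paper.

```python
def guardrail_score(prompt: str) -> float:
    """Hypothetical classifier: probability that the prompt is malicious.

    A toy stand-in for a real guardrail such as Azure Prompt Shield.
    """
    suspicious = ("ignore previous instructions", "jailbreak")
    return 1.0 if any(s in prompt.lower() for s in suspicious) else 0.0

def call_llm(prompt: str) -> str:
    """Placeholder for the downstream model (GPT-4o-mini in the paper)."""
    return f"LLM response to: {prompt!r}"

def guarded_pipeline(prompt: str, threshold: float = 0.5) -> str:
    # The prompt only reaches the LLM if the guardrail scores it below threshold.
    if guardrail_score(prompt) >= threshold:
        return "Blocked by guardrail."
    return call_llm(prompt)
```

Everything the attacker does in the rest of the paper amounts to finding inputs that score below the threshold while still reading as malicious instructions to the LLM.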
The attack datasets
To keep the evaluation realistic, we use two datasets:
- A prompt injection dataset called safe-guard-prompt-injection, from which we extract 476 adversarial injection prompts.
- A set of 78 jailbreak prompts collected from an open-source corpus.
For each sample, we measure baseline detection performance, then apply evasion techniques and measure how often the guardrail misclassifies the modified prompt as benign. This ratio is the Attack Success Rate.
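One plausible formalization of that ratio, assuming per-prompt boolean detection results before and after the transformation:

```python
def attack_success_rate(detected_before, detected_after):
    """Fraction of originally detected prompts that evade detection
    after the evasion technique is applied.

    Both arguments are equal-length lists of booleans, one per prompt.
    """
    evaded = sum(1 for before, after in zip(detected_before, detected_after)
                 if before and not after)
    total = sum(detected_before)  # only prompts the guardrail caught at baseline
    return evaded / total if total else 0.0
```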
Character injection in detail
Character injection attacks exploit the gap between how humans (and LLMs) interpret text and how classifiers tokenize it. The paper explores twelve techniques, including:
- Replacing letters with visually similar characters (homoglyphs).
- Inserting zero-width characters between letters.
- Mapping characters to numbers in leetspeak-style transformations.
- Adding diacritics, underlines, full-width transformations, bidirectional text, and more.
Each technique is applied automatically to malicious prompts. The transformed text still conveys the same meaning to the underlying LLM but no longer matches the patterns the guardrail was trained on.
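Two of these transformations can be sketched in a few lines. The homoglyph table below is a tiny illustrative subset, not the mapping used in the paper:

```python
# A handful of Cyrillic look-alikes for Latin letters (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
ZERO_WIDTH = "\u200b"  # zero-width space

def homoglyph(text: str) -> str:
    """Swap letters for visually identical characters from other scripts."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

def zero_width_inject(text: str) -> str:
    """Insert an invisible character between every pair of characters."""
    return ZERO_WIDTH.join(text)
```

Both outputs render the same to a human reader, yet tokenize completely differently, which is exactly the mismatch these attacks exploit.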
Results show that character injection alone is enough to significantly degrade detection across nearly all systems. Some guardrails exhibit average Attack Success Rates above 80 percent for certain techniques and threat types.
This should not be surprising. Classifiers that rely on specific character distributions or wordpiece tokenization are brittle when attackers introduce Unicode edge cases or non-standard encodings. LLMs, by contrast, are far more robust and can often infer the intended prompt despite heavy text mangling.
Adversarial ML evasion
The second class of attacks uses established adversarial NLP techniques implemented via the TextAttack library. These include:
- BERT-Attack and BAE – methods that use masked language models to propose context-aware token substitutions.
- DeepWordBug – character-level perturbations targeted at high-importance tokens.
- Alzantot, TextFooler, and PWWS – word-level substitutions guided by classification probability and word saliency.
- Pruthi and TextBugger – perturbations that act as adversarial spelling mistakes or semantics-preserving edits.
Each technique follows a two-step pattern:
- Compute a ranking of which words most influence the classifier’s decision.
- Iteratively perturb those words while querying the model until it misclassifies the prompt as safe.
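The two steps above can be sketched against a toy scoring function. `malicious_score` is a hypothetical stand-in for the guardrail’s confidence; real attacks use gradients, logits, or query feedback rather than this trigger-word heuristic:

```python
def malicious_score(words):
    """Hypothetical guardrail confidence in [0, 1] (toy stand-in)."""
    triggers = {"ignore", "instructions"}
    return sum(w.lower() in triggers for w in words) / max(len(words), 1)

def rank_words(words):
    """Step 1: leave-one-out importance - how much the score drops
    when each word is deleted, highest drop first."""
    base = malicious_score(words)
    drops = [(base - malicious_score(words[:i] + words[i + 1:]), i)
             for i in range(len(words))]
    return [i for _, i in sorted(drops, reverse=True)]

def evade(words, perturb, threshold=0.1):
    """Step 2: perturb the most important words until the score falls."""
    words = list(words)
    for i in rank_words(words):
        if malicious_score(words) < threshold:
            break  # classifier now rates the prompt as safe
        words[i] = perturb(words[i])
    return words
```

A character-level `perturb` (for example, inserting a zero-width character) combines both attack classes: importance ranking chooses the target words, character injection supplies the edit.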
For white-box guardrails, word importance can be estimated directly from gradients or confidence scores. For black-box targets like Azure Prompt Shield, we show that word rankings computed from open-source white-box models can still be used to guide attacks that transfer.
This is a powerful result. It means an attacker does not need direct insight into the victim guardrail. They can probe a similar open-source model, learn which words matter most for classification, and then reuse that knowledge to attack a different system.
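A minimal sketch of that transfer setup, under toy assumptions: word importance comes entirely from an open-source surrogate the attacker can inspect, and the black-box target is consulted only to check whether the final prompt slips through. Both scoring functions here are hypothetical stand-ins.

```python
def surrogate_score(words):
    """White-box surrogate: a confidence score the attacker can inspect."""
    triggers = {"ignore", "override", "instructions"}
    return sum(w.lower() in triggers for w in words) / max(len(words), 1)

def blackbox_detects(words):
    """Black-box target: only returns a label, like Azure Prompt Shield."""
    return any(w.lower() in {"ignore", "instructions"} for w in words)

def transfer_attack(words, perturb, budget=3):
    """Rank words on the surrogate, perturb the top ones, test transfer."""
    base = surrogate_score(words)
    # Leave-one-out importance computed against the surrogate only.
    order = sorted(range(len(words)),
                   key=lambda i: base - surrogate_score(words[:i] + words[i + 1:]),
                   reverse=True)
    words = list(words)
    for i in order[:budget]:
        words[i] = perturb(words[i])
    return words, not blackbox_detects(words)
```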
How bad is it in practice?
The combined picture from both sets of experiments is sobering:
- Many guardrails are highly vulnerable to simple character injection, especially those that were not explicitly hardened against Unicode tricks.
- Adversarial ML evasion methods are also effective, often achieving high evasion rates with relatively small changes to prompts.
- Even Azure Prompt Shield, a production cloud service, can be bypassed with character injection and transferred evasion attacks that never reveal its internals.
For defenders, the key message is that guardrails are not a shield that can be bolted on and forgotten. If they depend heavily on static classifiers trained once on fixed datasets, they will accumulate blind spots as attack techniques evolve.
Responsible disclosure and broader implications
We followed a responsible disclosure process with all parties and completed final notifications in 2025 before publishing our findings.
From Mindgard’s vantage point, this aligns with how offensive security research should work. The goal is not to embarrass vendors, but to stress test assumptions and push the ecosystem toward more resilient designs.
This research suggests several directions for improving LLM defense:
- Defense in depth – Guardrails should be one of several layers that control what an LLM can do, not the single barrier between benign and catastrophic behavior.
- Adversarially informed training – Classifiers should be trained and continuously updated using adversarial examples that include character injection and semantics-preserving perturbations.
- Behavioral and context-aware checks – Where possible, rely less on single-shot text classification and more on system-level signals such as tool usage patterns, data access scope, and cross-session behavior.
Guardrails remain a necessary tool for governing LLM behavior at scale, but they are not magic. As long as they are themselves machine learning systems, they will share the same weaknesses adversarial researchers have been exploiting for over a decade.
The work on guardrail evasion reinforces a broader message that runs through Mindgard’s research. AI security cannot be delegated to a thin safety wrapper at the edge. It has to be grounded in a realistic understanding of how attackers probe, adapt, and abuse every layer of the stack.
Read the full paper on arXiv