Guardrails have become the go-to answer for controlling large language models. Vendors wrap classifiers and filters around LLMs to detect prompt injections, jailbreak attempts, and harmful content. On paper, this seems reassuring. In practice, any classifier can be attacked.
This research puts that assumption to the test. We systematically evaluate how well six widely used guardrails stand up to two classes of evasion attacks:
- Character injection – simple transformations of text such as homoglyphs, zero-width characters, or spacing tricks that confuse detectors but remain readable to LLMs.
- Adversarial ML evasion – algorithmic methods that perturb prompts based on word importance rankings while preserving semantics.
From Mindgard’s perspective, this work is important for one reason. It shows that wrapping a model in “AI that watches AI” does not remove the need for deeper, system-level security thinking. If the guardrail itself is a classifier, it inherits the same weaknesses adversarial researchers have been exploiting for years.
The guardrails under test
The paper targets six prompt injection and jailbreak detection systems:
- Azure Prompt Shield within Azure AI Content Safety – a production service that classifies and blocks adversarial prompts, treated as a black-box model that only reveals labels.
- ProtectAI Prompt Injection v1 and v2 – open-source classifiers fine-tuned from DeBERTa-v3-base. Version 2 is not trained on jailbreaks and is evaluated accordingly.
- Meta Prompt Guard – a multi-label classifier for direct jailbreaks and indirect prompt injections based on mDeBERTa-v3-base.
- NeMo Guard Jailbreak Detect from Nvidia – a lightweight random forest model that uses pre-trained embeddings to flag jailbreak prompts.
- Vijil Prompt Injection – a ModernBERT-based classifier focused on injection attempts that try to manipulate an LLM.
For all but Azure Prompt Shield, we have white-box access, including confidence scores or logits. Azure remains a true black-box target.
Guardrails are deployed in front of GPT-4o-mini so that prompts must pass detection before they can reach the underlying LLM.
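To make that deployment pattern concrete, here is a minimal sketch of a guardrail sitting in front of a model. `guardrail_score` and `call_llm` are hypothetical stand-ins, not the actual services evaluated in the paper.

```python
def guardrail_score(prompt: str) -> float:
    """Hypothetical classifier: probability that the prompt is malicious.

    A toy stand-in for a real guardrail such as Azure Prompt Shield.
    """
    suspicious = ("ignore previous instructions", "jailbreak")
    return 1.0 if any(s in prompt.lower() for s in suspicious) else 0.0

def call_llm(prompt: str) -> str:
    """Placeholder for the downstream model (GPT-4o-mini in the paper)."""
    return f"LLM response to: {prompt!r}"

def guarded_pipeline(prompt: str, threshold: float = 0.5) -> str:
    # The prompt only reaches the LLM if the guardrail scores it below threshold.
    if guardrail_score(prompt) >= threshold:
        return "Blocked by guardrail."
    return call_llm(prompt)
```

Everything the attacker does in the rest of the paper amounts to finding inputs that score below the threshold while still reading as malicious instructions to the LLM.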
The attack datasets
To keep the evaluation realistic, we use two datasets:
- A prompt injection dataset called safe-guard-prompt-injection, from which we extract 476 adversarial injection prompts.
- A set of 78 jailbreak prompts collected from an open-source corpus.
For each sample, we measure baseline detection performance, then apply evasion techniques and measure how often the guardrail misclassifies the modified prompt as benign. This ratio is the Attack Success Rate.
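One plausible formalization of that ratio, assuming per-prompt boolean detection results before and after the transformation:

```python
def attack_success_rate(detected_before, detected_after):
    """Fraction of originally detected prompts that evade detection
    after the evasion technique is applied.

    Both arguments are equal-length lists of booleans, one per prompt.
    """
    evaded = sum(1 for before, after in zip(detected_before, detected_after)
                 if before and not after)
    total = sum(detected_before)  # only prompts the guardrail caught at baseline
    return evaded / total if total else 0.0
```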
Character injection in detail
Character injection attacks exploit the gap between how humans (and LLMs) interpret text and how classifiers tokenize it. The paper explores twelve techniques, including:
- Replacing letters with visually similar characters (homoglyphs).
- Inserting zero-width characters between letters.
- Mapping characters to numbers in leetspeak-style transformations.
- Adding diacritics, underlines, full-width transformations, bidirectional text, and more.
Each technique is applied automatically to malicious prompts. The transformed text still conveys the same meaning to the underlying LLM but no longer matches the patterns the guardrail was trained on.
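Two of these transformations can be sketched in a few lines. The homoglyph table below is a tiny illustrative subset, not the mapping used in the paper:

```python
# A handful of Cyrillic look-alikes for Latin letters (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
ZERO_WIDTH = "\u200b"  # zero-width space

def homoglyph(text: str) -> str:
    """Swap letters for visually identical characters from other scripts."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

def zero_width_inject(text: str) -> str:
    """Insert an invisible character between every pair of characters."""
    return ZERO_WIDTH.join(text)
```

Both outputs render the same to a human reader, yet tokenize completely differently, which is exactly the mismatch these attacks exploit.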
Results show that character injection alone is enough to significantly degrade detection across nearly all systems. Some guardrails exhibit average Attack Success Rates above 80 percent for certain techniques and threat types.
This should not be surprising. Classifiers that rely on specific character distributions or wordpiece tokenization are brittle when attackers introduce Unicode edge cases or non-standard encodings. LLMs, by contrast, are far more robust and can often infer the intended prompt despite heavy text mangling.
Adversarial ML evasion
The second class of attacks uses established adversarial NLP techniques implemented via the TextAttack library. These include:
- BERT-Attack and BAE – methods that use masked language models to propose context-aware token substitutions.
- DeepWordBug – character-level perturbations targeted at high-importance tokens.
- Alzantot, TextFooler, and PWWS – word-level substitutions guided by classification probability and word saliency.
- Pruthi and TextBugger – perturbations that act as adversarial spelling mistakes or semantics-preserving edits.
Each technique follows a two-step pattern:
- Compute a ranking of which words most influence the classifier’s decision.
- Iteratively perturb those words while querying the model until it misclassifies the prompt as safe.
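The two steps above can be sketched against a toy scoring function. `malicious_score` is a hypothetical stand-in for the guardrail’s confidence; real attacks use gradients, logits, or query feedback rather than this trigger-word heuristic:

```python
def malicious_score(words):
    """Hypothetical guardrail confidence in [0, 1] (toy stand-in)."""
    triggers = {"ignore", "instructions"}
    return sum(w.lower() in triggers for w in words) / max(len(words), 1)

def rank_words(words):
    """Step 1: leave-one-out importance - how much the score drops
    when each word is deleted, highest drop first."""
    base = malicious_score(words)
    drops = [(base - malicious_score(words[:i] + words[i + 1:]), i)
             for i in range(len(words))]
    return [i for _, i in sorted(drops, reverse=True)]

def evade(words, perturb, threshold=0.1):
    """Step 2: perturb the most important words until the score falls."""
    words = list(words)
    for i in rank_words(words):
        if malicious_score(words) < threshold:
            break  # classifier now rates the prompt as safe
        words[i] = perturb(words[i])
    return words
```

A character-level `perturb` (for example, inserting a zero-width character) combines both attack classes: importance ranking chooses the target words, character injection supplies the edit.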
For white-box guardrails, word importance can be estimated directly from gradients or confidence scores. For black-box targets like Azure Prompt Shield, we show that word rankings computed from open-source white-box models can still be used to guide attacks that transfer.
This is a powerful result. It means an attacker does not need direct insight into the victim guardrail. They can probe a similar open-source model, learn which words matter most for classification, and then reuse that knowledge to attack a different system.
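A minimal sketch of that transfer setup, under toy assumptions: word importance comes entirely from an open-source surrogate the attacker can inspect, and the black-box target is consulted only to check whether the final prompt slips through. Both scoring functions here are hypothetical stand-ins.

```python
def surrogate_score(words):
    """White-box surrogate: a confidence score the attacker can inspect."""
    triggers = {"ignore", "override", "instructions"}
    return sum(w.lower() in triggers for w in words) / max(len(words), 1)

def blackbox_detects(words):
    """Black-box target: only returns a label, like Azure Prompt Shield."""
    return any(w.lower() in {"ignore", "instructions"} for w in words)

def transfer_attack(words, perturb, budget=3):
    """Rank words on the surrogate, perturb the top ones, test transfer."""
    base = surrogate_score(words)
    # Leave-one-out importance computed against the surrogate only.
    order = sorted(range(len(words)),
                   key=lambda i: base - surrogate_score(words[:i] + words[i + 1:]),
                   reverse=True)
    words = list(words)
    for i in order[:budget]:
        words[i] = perturb(words[i])
    return words, not blackbox_detects(words)
```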
How bad is it in practice?
The combined picture from both sets of experiments is sobering:
- Many guardrails are highly vulnerable to simple character injection, especially those that were not explicitly hardened against Unicode tricks.
- Adversarial ML evasion methods are also effective, often achieving high evasion rates with relatively small changes to prompts.
- Even Azure Prompt Shield, a production cloud service, can be bypassed with character injection and transferred evasion attacks that never reveal its internals.
For defenders, the key message is that guardrails are not a shield that can be bolted on and forgotten. If they depend heavily on static classifiers trained once on fixed datasets, they will accumulate blind spots as attack techniques evolve.
Responsible disclosure and broader implications
We followed a responsible disclosure process with all parties and completed final notifications in 2025 before publishing our findings.
From Mindgard’s vantage point, this aligns with how offensive security research should work. The goal is not to embarrass vendors, but to stress test assumptions and push the ecosystem toward more resilient designs.
This research suggests several directions for improving LLM defense:
- Defense in depth – Guardrails should be one of several layers that control what an LLM can do, not the single barrier between benign and catastrophic behavior.
- Adversarially informed training – Classifiers should be trained and continuously updated using adversarial examples that include character injection and semantics-preserving perturbations.
- Behavioral and context-aware checks – Where possible, rely less on single-shot text classification and more on system-level signals such as tool usage patterns, data access scope, and cross-session behavior.
Guardrails remain a necessary tool for governing LLM behavior at scale, but they are not magic. As long as they are themselves machine learning systems, they will share the same weaknesses adversarial researchers have been exploiting for over a decade.
The work on guardrail evasion reinforces a broader message that runs through Mindgard’s research. AI security cannot be delegated to a thin safety wrapper at the edge. It has to be grounded in a realistic understanding of how attackers probe, adapt, and abuse every layer of the stack.
Read the full paper on arXiv