How to Bypass Azure AI Content Safety Guardrails

Updated on

January 14, 2025

Bypassing Azure AI Content Safety Guardrails

Discover a security issue within Azure AI Content Safety guardrails that Mindgard has discovered and reported to Microsoft.

William Hackett

TABLE OF CONTENTS

Key Takeaways

In February 2024, Mindgard discovered and responsibly disclosed two security vulnerabilities within Microsoft’s Azure AI Content Safety Service, whereby attackers could evade detection and bypass established GenAI guardrails.

Azure AI Content Safety is a cloud-based service from Microsoft Azure that helps developers build safety and security guardrails around AI applications by detecting and managing harmful or inappropriate content in user-generated text, images, and videos.

Two security vulnerabilities were discovered within the guardrails that the service provides, specifically in AI Text Moderation (blocks harmful content such as hate speech, sexual material, etc.) and Prompt Shield (protects AI models against jailbreaks and prompt injection).

Figure 1: Demonstration of two guardrails offered within Azure AI Content Safety

When deploying a Large Language Model (LLM) through Azure OpenAI, Prompt Shield is user activated, while AI Text Moderation guardrail is automatically enabled with configurable severity thresholds. These guardrails ensure that inputs and AI-generated content are validated, blocking any malicious activity based on set severity criteria and filters.

‍

Summary

We discovered two security vulnerabilities whereby an attacker is capable of bypassing Azure AI Content Safety by evading detection of underlying guardrails, and therefore propagate malicious content to the protected LLM. This vulnerability can have significant consequences, as it allows attackers to inject harmful content that can bypass security measures designed to prevent malicious or sensitive data sent and generated by the LLM from being processed.

Such an attack results in exposing confidential information, unauthorized access to internal systems, manipulation of outputs, or even the spread of misinformation. Left unchecked, this vulnerability can be exploited to launch broader attacks, undermine the trust in GenAI-based systems, and compromise the application integrity that rely on AI for decision-making and information processing.

Steps to Reproduce

The AI Text Moderation guardrail returns a severity score across four categories, with higher scores indicating more severe content. To assess evasion success, we set a severity threshold of 4, considering any reduction below this as a successful attempt. For evaluation, we use an open-source hate speech dataset extracted from a White Supremacist forum, containing 1,000 samples (109 labelled as hate speech and 891 as benign). With this dataset, the AI Text Moderation filter demonstrated a detection accuracy of 90%.

For the Prompt Shield guardrail, a binary classification model evasion is considered successful when the model changes its prediction from detecting a jailbreak to classifying the input as benign. We employ a curated set of jailbreak prompts to evaluate this filter’s effectiveness, which showed a detection accuracy of 89.03%.

Both guardrails—AI Text Moderation and Prompt Shield—were deployed in front of ChatGPT 3.5 Turbo using Azure OpenAI, treating them as black-box targets. The targets were then accessed via Mindgard Automated AI Red Teaming Platform.

We deployed two attack techniques against Azure AI Content Safety guardrails: Character Injection techniques and Adversarial ML Evasion techniques.

Attack Techniques

The goal of both attacks is to discover and exploit guardrail ‘blindspots’ to force misclassification and deceive malicious content detection. We systematically applied attack techniques both guardrail inputs and measure their vulnerability to bypass detection.

Character Injection techniques involve manipulating an LLM by injecting or replacing characters with the instructions or inputs provided to it. These techniques aim to alter the interpretation of text by the model without changing the apparent meaning to the human reader (Boucher et al. 2021). We employed 5 different methods of character injection to assess their effectiveness:

Diacritics (changing the letter ‘a’ to ‘á’)

Homoglyphs (closely resembling characters such as ‘0’ and ‘O’)

Numerical replacement (l33t speak)

Spaces (adding additional spaces between characters)

Zero-width (injecting ASCII code U+200B, zero-width space between characters)

Figure 2: Example of a Character Injection technique ‘Diacritics’ being applied to text to evade guardrail detection

Figure 3: Example of Bert-Attack making a minimal perturbation to evade guardrail detection

Adversarial ML Evasion techniques involve analyzing a text sample to identify the most important words based on a target classifier’s predictions. These techniques apply text perturbations (i.e. small alterations) according to word importance. Unlike character injection techniques, which alter individual characters or insert invisible characters, adversarial ML evasion focuses on modifying entire words. This can include substituting words with synonyms, introducing misspellings, or using other perturbations that maintain semantic meaning but disrupt the classifier’s ability to correctly process the text.

Figure 4: Attack flow for adversarial ML evasion techniques

The figure above illustrates the attack flow for the adversarial ML evasion techniques. Using our dataset, we select a sample and identify the most important words ranking based on the classification from the target. This method used for selection varies across the different attacks. We then initiate an iterative process of perturbing the text in order of important words, evaluating the success of each evasion attempt.

We selected 7 black-box techniques from literature, differing in word importance selection and perturbation methods. The selected techniques are as follows:

Bert-Attack (Li et al., 2020)

BAE (Garg and Ramakrishnan, 2020)

DeepWordBug (Gao et al., 2018)

Alazntot (Alzantot et al., 2018)

TextFooler (Jin et al., 2020)

PWWS (Ren et al., 2019)

Pruthi (Pruthi et al., 2020)‍

‍

Attack Results

Here we have reported the reduction of Azure AI Content Safety guardrail detection accuracy when exposed to of each attack techniques.

Figure 5: Text Moderation attack success

We observe that Character Injection frequently evaded AI Text Moderation guardrails across multiple techniques, reducing guardrail detection accuracy between 83.05% to 100%. Adversarial ML Evasion techniques, while less effective, reduced AI Text Moderation guardrail detection accuracy by up to 58.49%. Both techniques demonstrated notable effectiveness in bypassing AI text moderation safeguards.

Figure 6: Prompt Shield attack success

Similarly to AI Text Moderation, character injection techniques were very successful at evading correct classification by Prompt Guard, with four techniques reducing overall detection accuracy by 78.24% to 100%, while Zero Width characters increased the guardrails detection by 12.32%. Finally, we observe that adversarial ML evasion techniques were less effective when launched against Prompt Guard when compared to Text Moderation, only reducing guardrail detection by 5.13% to 12.82%.

Consequences

Using the outlined vulnerability an attacker can, in specific cases, reduce the effectiveness of the guardrails via evading correct detection by both the AI Text Moderation and Prompt Shield. This can result in harmful or inappropriate input reaching the LLM, causing the model to generate responses that violate its ethical, safety, and security guidelines. For example, without effective moderation, the LLM can inadvertently produce offensive, hateful, or dangerous content, such as promoting violence, or spreading misinformation.

Furthermore, bypassing jailbreak detection allows attackers to manipulate the LLM into breaking its own guardrails, potentially enabling it to carry out unethical tasks, provide prohibited information, or engage in manipulative conversations. This vulnerability not only undermines user trust but can also lead to real-world harm. For example, in chatbot applications where content policy filters are used to block harmful or inappropriate language in social media posts, an attacker could evade these filters by simply injecting zero-width characters or altering only the most important words leading to malicious detection.

Disclosure Timeline

This work was disclosed responsibly to Microsoft through the Microsoft Security Response Center (MSRC) Researcher Portal. We would like to thank Microsoft's assistance and support throughout the disclosure process. Below is the disclosure timeline:

20th February 2024: Discovery of Vulnerability
Identified that the injection of spaces and special characters could alter prediction outcomes for Microsoft’s guardrails.

1st March 2024: Internal Analysis
Deployed Mindgard’s AI red teaming platform against Azure AI guardrails to evaluate the success of evasion techniques.

4th March 2024: AI Text Moderation Vulnerability Submission
Submitted a detailed written report disclosing the vulnerability against AI Text Moderation guardrail to Microsoft. Report ID: VULN-120399. Case Number – 86269

7th March 2024: Acknowledgment from Microsoft
Microsoft acknowledged receipt of the report and the reported issue.

6th June 2024: Prompt Shield Vulnerability Submission
Additional disclosure opened to report the Prompt Shield guardrail vulnerability. Evading Prompt Shield - Report ID: VULN-127342. Case Number - 88558

18th June 2024: Disclosure Closed
The vulnerability was publicly disclosed. Microsoft assigned the issue a moderate severity level and indicated that their team is working on fixes to be released within upcoming updates.

Summing Up

We have presented and disclosed two new attack techniques to uncover security vulnerabilities within Azure AI Content Safety guardrails, whereby an attacker can evade detection and bypass LLM guardrails. This issue allows harmful content to slip through LLM guardrail detection, undermining the reliability, integrity, and safety of AI applications deployed. As of October 2024, Microsoft have deployed stronger mitigations to reduce the impact of this vulnerability. We strongly recommend heightened caution when utilizing guardrails, handling user generated inputs, and consider supplementary safeguards such as other AI moderation tools or deploying LLMs which are less susceptible to harmful content and jailbreaks.

We plan to publish a more detailed research paper on this topic in December 2024.
‍
‍

Blog References

Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.J., Srivastava, M. and Chang, K.W., 2018. Generating natural language adversarial examples. arXiv preprint. Available at: https://arxiv.org/abs/1804.07998

Boucher, N., Shumailov, I., Anderson, R. & Papernot, N., 2021. Bad Characters: Imperceptible NLP Attacks. arXiv. Available at: https://arxiv.org/abs/2106.09898

Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y., Hu, J., Meng, J., Bensalem, S., & Huang, X. (2024) 'Safeguarding large language models: A survey'. arXiv. Available at: https://arxiv.org/abs/2406.02622

Garg, S. and Ramakrishnan, G., 2020. BAE: BERT-based adversarial examples for text classification. In: B. Webber, T. Cohn, Y. He, and Y. Liu, eds. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 6174-6181. Available at: https://aclanthology.org/2020.emnlp-main.498

Gao, J., Lanchantin, J., Soffa, M.L. and Qi, Y., 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. arXiv preprint. Available at: https://arxiv.org/abs/1801.04354

Jin, D., Jin, Z., Zhou, J.T. and Szolovits, P., 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. arXiv preprint. Available at: https://arxiv.org/abs/1907.11932

Li, L., Ma, R., Guo, Q., Xue, X. and Qiu, X., 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint. Available at: https://arxiv.org/abs/2004.09984

Pruthi, D., Dhingra, B. and Lipton, Z.C., 2019. Combating adversarial misspellings with robust word recognition. arXiv preprint. Available at: https://arxiv.org/abs/1905.11268

Ren, S., Deng, Y., He, K. and Che, W., 2019. Generating natural language adversarial examples through probability weighted word saliency. In: A. Korhonen, D. Traum and L. Màrquez, eds. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 1085-1097. Available at: https://aclanthology.org/P19-1103

About Mindgard

Mindgard is a cybersecurity company specializing in security for AI.
Founded in 2022 at world-renowned Lancaster University and is now based in London, Mindgard empowers enterprise security teams to deploy AI and GenAI securely. Mindgard’s core product – born from ten years of rigorous R&D in AI security – offers an automated platform for continuous security testing and red teaming of AI.

Next Steps

Thank you for taking the time to explore our latest blog article.

Trial Our Free Platform: Experience how our Automated Red Teaming platform swiftly identifies and remediates AI security vulnerabilities. Try our Trial today!
Follow Mindgard: Stay updated by following us on LinkedIn and X, or join our AI Security community on Discord.
Get in Touch: Have questions or want to explore collaboration opportunities? Reach out to us, and let's secure your AI together.
Please, feel free to request a demo to learn about the full benefits of Mindgard Enterprise

The Double-Edged Sword of GenAI: How Attackers and Defenders Are Battling for Cybersecurity

GenAI is transforming cybersecurity, enabling both attackers and defenders to scale their capabilities. While threat actors use AI for sophisticated phishing and automation, enterprises must adopt AI-driven security measures to stay ahead in the evolving cyber arms race.

5 Biggest AI Agent Security Challenges (and How To Protect Against Them)

AI agents introduce unique security threats—like memory poisoning, prompt injection, and privilege misuse—that require targeted defenses such as RBAC, input validation, rate limiting, and continuous monitoring to ensure safe deployment and operation.

How To Measure the Effectiveness of a Red Teaming Assessment

Measuring the effectiveness of a red teaming assessment is crucial for ensuring continuous improvement in cybersecurity defenses.