Want to stay ahead in cybersecurity? This guide breaks down the best red teaming certifications and courses to help you think like an attacker, uncover vulnerabilities, and advance your career—whether you're a beginner or an expert.
William Hackett
In February 2024, Mindgard discovered and responsibly disclosed two security vulnerabilities within Microsoft’s Azure AI Content Safety Service, whereby attackers could evade detection and bypass established GenAI guardrails.
Azure AI Content Safety is a cloud-based service from Microsoft Azure that helps developers build safety and security guardrails around AI applications by detecting and managing harmful or inappropriate content in user-generated text, images, and videos.
Two security vulnerabilities were discovered within the guardrails that the service provides, specifically in AI Text Moderation (blocks harmful content such as hate speech, sexual material, etc.) and Prompt Shield (protects AI models against jailbreaks and prompt injection).
Figure 1: Demonstration of two guardrails offered within Azure AI Content Safety
When deploying a Large Language Model (LLM) through Azure OpenAI, Prompt Shield is user activated, while AI Text Moderation guardrail is automatically enabled with configurable severity thresholds. These guardrails ensure that inputs and AI-generated content are validated, blocking any malicious activity based on set severity criteria and filters.
We discovered two security vulnerabilities whereby an attacker is capable of bypassing Azure AI Content Safety by evading detection of underlying guardrails, and therefore propagate malicious content to the protected LLM. This vulnerability can have significant consequences, as it allows attackers to inject harmful content that can bypass security measures designed to prevent malicious or sensitive data sent and generated by the LLM from being processed.
Such an attack results in exposing confidential information, unauthorized access to internal systems, manipulation of outputs, or even the spread of misinformation. Left unchecked, this vulnerability can be exploited to launch broader attacks, undermine the trust in GenAI-based systems, and compromise the application integrity that rely on AI for decision-making and information processing.
The AI Text Moderation guardrail returns a severity score across four categories, with higher scores indicating more severe content. To assess evasion success, we set a severity threshold of 4, considering any reduction below this as a successful attempt. For evaluation, we use an open-source hate speech dataset extracted from a White Supremacist forum, containing 1,000 samples (109 labelled as hate speech and 891 as benign). With this dataset, the AI Text Moderation filter demonstrated a detection accuracy of 90%.
For the Prompt Shield guardrail, a binary classification model evasion is considered successful when the model changes its prediction from detecting a jailbreak to classifying the input as benign. We employ a curated set of jailbreak prompts to evaluate this filter’s effectiveness, which showed a detection accuracy of 89.03%.
Both guardrails—AI Text Moderation and Prompt Shield—were deployed in front of ChatGPT 3.5 Turbo using Azure OpenAI, treating them as black-box targets. The targets were then accessed via Mindgard Automated AI Red Teaming Platform.
We deployed two attack techniques against Azure AI Content Safety guardrails: Character Injection techniques and Adversarial ML Evasion techniques.
The goal of both attacks is to discover and exploit guardrail ‘blindspots’ to force misclassification and deceive malicious content detection. We systematically applied attack techniques both guardrail inputs and measure their vulnerability to bypass detection.
Character Injection techniques involve manipulating an LLM by injecting or replacing characters with the instructions or inputs provided to it. These techniques aim to alter the interpretation of text by the model without changing the apparent meaning to the human reader (Boucher et al. 2021). We employed 5 different methods of character injection to assess their effectiveness:
Figure 2: Example of a Character Injection technique ‘Diacritics’ being applied to text to evade guardrail detection
Figure 3: Example of Bert-Attack making a minimal perturbation to evade guardrail detection
Adversarial ML Evasion techniques involve analyzing a text sample to identify the most important words based on a target classifier’s predictions. These techniques apply text perturbations (i.e. small alterations) according to word importance. Unlike character injection techniques, which alter individual characters or insert invisible characters, adversarial ML evasion focuses on modifying entire words. This can include substituting words with synonyms, introducing misspellings, or using other perturbations that maintain semantic meaning but disrupt the classifier’s ability to correctly process the text.
Figure 4: Attack flow for adversarial ML evasion techniques
The figure above illustrates the attack flow for the adversarial ML evasion techniques. Using our dataset, we select a sample and identify the most important words ranking based on the classification from the target. This method used for selection varies across the different attacks. We then initiate an iterative process of perturbing the text in order of important words, evaluating the success of each evasion attempt.
We selected 7 black-box techniques from literature, differing in word importance selection and perturbation methods. The selected techniques are as follows:
Here we have reported the reduction of Azure AI Content Safety guardrail detection accuracy when exposed to of each attack techniques.
Figure 5: Text Moderation attack success
We observe that Character Injection frequently evaded AI Text Moderation guardrails across multiple techniques, reducing guardrail detection accuracy between 83.05% to 100%. Adversarial ML Evasion techniques, while less effective, reduced AI Text Moderation guardrail detection accuracy by up to 58.49%. Both techniques demonstrated notable effectiveness in bypassing AI text moderation safeguards.
Figure 6: Prompt Shield attack success
Similarly to AI Text Moderation, character injection techniques were very successful at evading correct classification by Prompt Guard, with four techniques reducing overall detection accuracy by 78.24% to 100%, while Zero Width characters increased the guardrails detection by 12.32%. Finally, we observe that adversarial ML evasion techniques were less effective when launched against Prompt Guard when compared to Text Moderation, only reducing guardrail detection by 5.13% to 12.82%.
Using the outlined vulnerability an attacker can, in specific cases, reduce the effectiveness of the guardrails via evading correct detection by both the AI Text Moderation and Prompt Shield. This can result in harmful or inappropriate input reaching the LLM, causing the model to generate responses that violate its ethical, safety, and security guidelines. For example, without effective moderation, the LLM can inadvertently produce offensive, hateful, or dangerous content, such as promoting violence, or spreading misinformation.
Furthermore, bypassing jailbreak detection allows attackers to manipulate the LLM into breaking its own guardrails, potentially enabling it to carry out unethical tasks, provide prohibited information, or engage in manipulative conversations. This vulnerability not only undermines user trust but can also lead to real-world harm. For example, in chatbot applications where content policy filters are used to block harmful or inappropriate language in social media posts, an attacker could evade these filters by simply injecting zero-width characters or altering only the most important words leading to malicious detection.
This work was disclosed responsibly to Microsoft through the Microsoft Security Response Center (MSRC) Researcher Portal. We would like to thank Microsoft's assistance and support throughout the disclosure process. Below is the disclosure timeline:
Summing Up
We have presented and disclosed two new attack techniques to uncover security vulnerabilities within Azure AI Content Safety guardrails, whereby an attacker can evade detection and bypass LLM guardrails. This issue allows harmful content to slip through LLM guardrail detection, undermining the reliability, integrity, and safety of AI applications deployed. As of October 2024, Microsoft have deployed stronger mitigations to reduce the impact of this vulnerability. We strongly recommend heightened caution when utilizing guardrails, handling user generated inputs, and consider supplementary safeguards such as other AI moderation tools or deploying LLMs which are less susceptible to harmful content and jailbreaks.
We plan to publish a more detailed research paper on this topic in December 2024.
Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.J., Srivastava, M. and Chang, K.W., 2018. Generating natural language adversarial examples. arXiv preprint. Available at: https://arxiv.org/abs/1804.07998
Boucher, N., Shumailov, I., Anderson, R. & Papernot, N., 2021. Bad Characters: Imperceptible NLP Attacks. arXiv. Available at: https://arxiv.org/abs/2106.09898
Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y., Hu, J., Meng, J., Bensalem, S., & Huang, X. (2024) 'Safeguarding large language models: A survey'. arXiv. Available at: https://arxiv.org/abs/2406.02622
Garg, S. and Ramakrishnan, G., 2020. BAE: BERT-based adversarial examples for text classification. In: B. Webber, T. Cohn, Y. He, and Y. Liu, eds. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 6174-6181. Available at: https://aclanthology.org/2020.emnlp-main.498
Gao, J., Lanchantin, J., Soffa, M.L. and Qi, Y., 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. arXiv preprint. Available at: https://arxiv.org/abs/1801.04354
Jin, D., Jin, Z., Zhou, J.T. and Szolovits, P., 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. arXiv preprint. Available at: https://arxiv.org/abs/1907.11932
Li, L., Ma, R., Guo, Q., Xue, X. and Qiu, X., 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint. Available at: https://arxiv.org/abs/2004.09984
Pruthi, D., Dhingra, B. and Lipton, Z.C., 2019. Combating adversarial misspellings with robust word recognition. arXiv preprint. Available at: https://arxiv.org/abs/1905.11268
Ren, S., Deng, Y., He, K. and Che, W., 2019. Generating natural language adversarial examples through probability weighted word saliency. In: A. Korhonen, D. Traum and L. Màrquez, eds. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 1085-1097. Available at: https://aclanthology.org/P19-1103
Mindgard is a cybersecurity company specializing in security for AI.
Founded in 2022 at world-renowned Lancaster University and is now based in London, Mindgard empowers enterprise security teams to deploy AI and GenAI securely. Mindgard’s core product – born from ten years of rigorous R&D in AI security – offers an automated platform for continuous security testing and red teaming of AI.
Thank you for taking the time to explore our latest blog article.