December 9, 2024
Audio-Based Jailbreak Attacks on Multi-Modal LLMs
Explore the risks of audio-based jailbreak attacks on multi-modal LLMs and discover defense strategies to protect AI systems from adversarial exploits.
Author:
Lewis Birch

Large language models (LLMs) are increasingly integrating various data types, such as text, images, and audio, into a single framework, model or system known as multi-modal LLMs. These are capable of understanding and generating human-like responses, making them invaluable in numerous applications ranging from customer service to content creation.

Despite their advanced capabilities, these LLMs are also susceptible to the same jailbreaks and adversarial attacks as traditional LLMs. In fact, their multi-modality increases the vectors through which they can be attacked. For example, jailbreaks—techniques used to bypass the models' intended constraints and safety measures—can be delivered within hidden payloads that exploit the model's ability to process not just text, but also audio and images.

In this blog, we will explore a sophisticated form of jailbreak attack: embedding secret audio messages within audio inputs that are undetectable by human listeners but recognized and executed by LLMs.

{% video_player "embed_player" overrideable=False, type='hsvideo2', hide_playlist=True, viral_sharing=False, embed_button=False, autoplay=False, hidden_controls=False, loop=False, muted=False, full_width=False, width='1920', height='1080', player_id='171858688591', style='' %}

 

What we did?

To showcase the vulnerabilities of multi-modal LLMs, we embedded hidden messages within audio files. These altered audio files were then provided to different chatbots, with the goal of triggering a jailbreak when the audio was processed by an audio-to-text model feeding into an LLM.

We used the Basic Iterative Method (BIM) attack to embed these hidden messages in the audio. These messages are part of the 'EvilConfidant' jailbreak, specifically crafted to bypass the safety mechanisms of an LLM and produce malicious output. The LLM targeted in this case was Mistral 7B, paired with Meta’s Wave2Vec audio transcription model.

We demonstrated this attack in various contexts to show its broad applicability. For example, we showcased a scenario involving a medical chatbot where the spoken audio from a human contained the hidden jailbreak message. In another scenario involving a banking chatbot, we used text-to-speech generated audio to embed the hidden message. These highlight that the attack can be effective regardless of the audio source, whether it be human speech or synthetic audio. With other research showcasing the ability to embed these messages into music.

Each embedded message contained a tailored jailbreak command along with a question designed for a specific scenario. For instance, in a medical chatbot scenario, the hidden message could prompt the chatbot to provide dangerous instructions, such as how to synthesize methamphetamine, which could result in severe reputational damage if the chatbot's response were taken seriously.

Consequences

The multi-modality of LLMs significantly broadens the avenues through which attacks like jailbreaks can be embedded and executed. This capability means that malicious commands can be hidden not only within audio files but also across various other mediums. For instance, an attacker could embed a hidden command within the metadata of an image, which, when processed by an LLM, could prompt it to take unintended actions.

Similarly, emails or documents could contain concealed instructions within their text or formatting, triggering malicious behaviour when interpreted by the model. We previously demonstrated this using a word document containing hidden text, which then resulted in an LLM based agent executing code that enabled Remote Code Execution (RCE) on the device running the LLM.

The business impact of these type of vulnerabilities can be severe, ranging from financial loss due to compromised systems, to reputational damage if sensitive information is mishandled or malicious actions are performed under the guise of legitimate operations. Therefore, it is critical for organisations to proactively address these security challenges to product their and assets and maintain trust with their stakeholders.

How to defend against this type of attack?

Protecting LLMs from such attacks remains an open research challenge and a combination of defensive strategies may be necessary, tailored to the specific use case and design of the LLM systems under threat.

In the case of this embedded audio example, there are several potential approaches to mitigate the risk of such attacks. Combined in a multistep approach could mitigate the risk of the adversarial attacks demonstrated.

The audio-to-text model for instance, this research paper describes how retraining the model using examples generated by adversarial attacks can significantly reduce their effectiveness. This method not only improves the model's robustness, but also increases the cost to the attacker two-fold. For the LLM itself, various methods can help mitigate the generation of undesired content. One effective approach is using a secondary LLM to act as an additional check on the content. The Bergeron framework, for instance, employs a secondary LLM to insert disclaimers and corrections into the primary LLM's responses, ensuring that the original output is acceptable within the context of the LLM system.

Unfortunately, no silver bullet exists for protecting LLM systems from adversarial attacks. These systems require a comprehensive, tailored approach to defence, addressing the specific vulnerabilities of each component and modality they support. This potentially means implementing specialized security measures for text, audio, image inputs and any other format they can ingest. As well as ensuring robust protection at each stage of data processing. By customizing defences to the unique characteristics and risks of each modality, we can create a more resilient overall security framework for LLM systems.

Conclusion

The exploration of audio-based jailbreak attacks on LLMs highlights a critical vulnerability in these advanced systems. By embedding hidden commands within audio inputs, attackers can exploit the multi-modal capabilities of LLMs to bypass safety measures and trigger unintended, potentially harmful actions. This type of attack underscores the need for robust security measures tailored to the unique challenges posed by multi-modal data processing.

As we continue to develop and deploy increasingly capable LLM systems, it is imperative to prioritize security research and implement comprehensive defence mechanisms. By doing so, we can ensure that these powerful tools are used safely and responsibly, mitigating the risks posed by adversarial exploits and enhancing the overall trustworthiness of AI technologies.

If you’d like to experiment with the examples discussed in this blog, check out the repository here.

About Mindgard 

Mindgard is a cybersecurity company specializing in security for AI.
Founded in 2022 at world-renowned Lancaster University and is now based in London, Mindgard empowers enterprise security teams to deploy AI and GenAI securely. Mindgard’s core product – born from ten years of rigorous R&D in AI security – offers an automated platform for continuous security testing and red teaming of AI.

In 2023, Mindgard secured $4 million in funding, backed by leading investors such as IQ Capital and Lakestar.

Next Steps

Thank you for reading about how to perform audio-based jailbreaking!

  1. Test Our Free Platform: Experience how our Automated Red Teaming platform swiftly identifies and remediates AI security vulnerabilities. Start for free today!
  2. Follow Mindgard: Stay updated by following us on LinkedIn and X, or join our AI Security community on Discord.
  3. Get in Touch: Have questions or want to explore collaboration opportunities? Reach out to us, and let's secure your AI together.
  4. Please, feel free to request a demo to learn about the full benefits of Mindgard Enterprise.