December 9, 2024
Jailbreak and Encoding Risks in Pixtral-Large-Instruct-2411
Discover the latest findings on vulnerabilities in Pixtral-Large-Instruct-2411, including jailbreak and encoding risks, and learn how to safeguard your AI applications effectively.
TABLE OF CONTENTS
Key Takeaways
Key Takeaways

We recently tested Pixtral Large (specifically Pixtral-Large-Instruct-2411 announced on Nov 18, 2024) and found it exhibits vulnerabilities to multiple known jailbreak techniques, and its support for different encoding can facilitate further exploits.

Pixtral Large is a multi-modal model recently released by Mistral. Using the Mindgard platform, we were able to discover the following vulnerabilities:

  • Consistently jailbroken by AntiGPT and Dev Mode v2 techniques
  • Susceptible to recently discovered vulnerability in ANSI generation 
  • Facilitates several encoding attacks

We suggest that Pixtral Large users check or update their existing guardrails and input/output filtering systems to ensure they are effective against these specific attack types.

 

Clarifying Model Vulnerabilities vs. Application Risks

It’s important to clarify that the issues highlighted here are not equivalent to traditional software vulnerabilities, such as those tracked with CVEs. A “vulnerable” model does not inherently pose a risk to end users; the risk emerges when these characteristics are not addressed in the context of an application. These findings are best understood as inherent traits of the model that can lead to vulnerabilities in applications lacking appropriate safeguards. 

A useful analogy is SQL databases: an SQL injection attack is not a flaw in the database itself but rather a vulnerability in applications that fail to use parameterized queries. Similarly, prompt injection and encoding techniques are risks inherent to AI models, and it is the application’s design and safeguards that determine whether those risks become actual vulnerabilities. 

This post highlights specific behaviors of Pixtral Large that developers should be aware of to prevent application-level vulnerabilities through thoughtful design and effective guardrails.

 

Jailbroken by AntiGPT and Dev Mode v2

Pixtral Large refuses various types of jailbreaks, however it is consistently jailbroken by AntiGPT and Dev Mode v2.

dev-mode-v2

AntiGPT and Dev Mode v2 are jailbreak techniques designed to bypass safeguards and restrictions in AI models. AntiGPT employs a combination of contextual manipulation and prompt engineering to provoke the model into generating responses that align with otherwise restricted topics or behaviors. Dev Mode v2 is a jailbreak technique simulating an alternative operational state for AI models, often convincing them to act as if certain restrictions or programming constraints have been lifted.

 

Exploitation of log-injection vulnerabilities

Pixtral Large facilitates exploitation of log-injection vulnerabilities, including the ANSI generation technique recently discovered by Leon Derczynski

ansi-injection-evidence

Using Mindgard we demonstrated that Pixtral Large generates both raw and escaped ANSI sequences, which can be directly rendered in a user’s terminal. A log injection vulnerability could be exploited by malicious actors by injecting ANSI escape sequences through LLM-generated content. When developers later view these logs in certain terminal environments, the injected ANSI sequences is rendered and executed, potentially allowing an attacker to manipulate the terminal environment or run malicious commands.

 

Vulnerable to encoding-based attacks

An encoding attack exploits vulnerabilities in how AI models process and interpret encoded or obfuscated text. By manipulating the input—such as embedding zero-width (injecting ASCII code U+200B), using diacritics (change ‘a’ to ‘á’), or encoding text in hexadecimal—these attacks bypass content moderation and filtering systems designed to prevent unsafe or restricted outputs. The AI, interpreting the obfuscated input as legitimate text, may generate responses it would otherwise block.

 

pixtral-large-encoding-attack

 

What is AntiGPT?

AntiGPT is a jailbreak framework designed to bypass safeguards and restrictions in AI models, such as OpenAI's GPT systems. It employs a combination of contextual manipulation and prompt engineering to provoke the model into generating responses that align with otherwise restricted topics or behaviors. By exploiting nuances in AI alignment and filtering mechanisms, AntiGPT can effectively "trick" models into bypassing their ethical or safety protocols. This makes it a notable tool for evaluating the robustness of AI safety measures against adversarial inputs.

 

What is Dev Mode v2?

Dev Mode v2 is a jailbreak methodology that simulates an alternative operational state for AI models, often convincing them to act as if certain restrictions or programming constraints have been lifted. It typically incorporates phrasing that redefines the AI's function or objectives, allowing users to elicit outputs that violate predefined content policies. Dev Mode v2 is known for its adaptability and effectiveness in circumventing multiple layers of AI safety, making it a key benchmark for testing the resilience of language models like Pixtral Large against jailbreak attacks.