Bypassing AI-Driven Deepfake Detection via Evasion Attacks
Discover how evasion attacks are bypassing AI-driven deepfake detection, posing significant risks to cybersecurity. Learn about defense strategies...
Discover the critical importance of defending AI models against adversarial attacks in the cybersecurity landscape. Learn about six key attack categories and their consequences in this insightful article.
In this article, we explore the field of Adversarial Machine Learning, highlighting six categories of attacks that exemplify the ongoing struggle between AI security and adversarial attacks. These attack categories were chosen for discussion based on our own experiences; alongside notable research completed by others. While this overview is not exhaustive, it provides insight into the complex challenge of securing AI systems.
Adversarial Machine Learning (AML) represents a critical frontier within the cybersecurity landscape, whereby attackers craft inputs specifically designed to deceive, steal or exploit AI models, which can pose a significant threat to the integrity and reliability of AI systems. For companies, these attacks are not just a technical nuisance, but can represent severe risk to operational security, data privacy, and trustworthiness. Understanding these attacks, and how to defend against them, is paramount for developers and companies alike to ensure their AI implementations are secure and resilient.
Adversarial attacks manifest in various forms, each exploiting unique vulnerabilities within an AI model. Within this article we highlight six adversarial strategies and the need for defenses, spanning:
1. Prompt Injection
2. Evasion Attacks
3. Poisoning Attacks
4. Model Inversion Attacks
5. Model Stealing Attacks
6. Membership Inference Attacks
From subtle changes to data inputs (perturbations) to compromising the training process, these attacks illustrate the multifaceted nature of adversarial threats.
Prompt injection attacks involve crafting inputs to manipulate the behavior of Large Language Models (LLMs), causing them to produce harmful or unintended outputs by exploiting their reliance on user prompts.
Prompt injection attacks involve crafting specific inputs designed to manipulate the behavior of LLMs by exploiting their reliance on user prompts. These inputs are constructed to subtly influence the model's responses, potentially causing it to generate harmful or unintended outputs.
Unlike other attacks that require detailed knowledge of a model's internal workings, prompt injection leverages the model's inherent sensitivity to input prompts. Such sensitivity means that an AI model's output can be significantly affected by the phrasing and structure of the input it receives. This characteristic makes prompt injection effective and versatile across various software applications, such as chatbots and other systems leveraging LLMs.
The goal of the attack is to manipulate the AI model's output without needing direct access to its internal processes, making these attacks difficult to detect and mitigate. Thus, their defense remains to this day an open research challenge.
The PLeak algorithm, described in "PLeak: Prompt Leaking Attacks against Large Language Model Applications," extracts system prompts from LLMs by optimizing adversarial queries in a black-box manner. The two-phase process involves offline query optimization and target prompt reconstruction, incrementally revealing the prompt through iterative adjustments and gradient-based optimization.
The Crescendo attack, described in "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack," manipulates LLMs by gradually escalating a conversation with benign prompts that evolve into a jailbreak. The multi-turn approach effectively bypasses safety mechanisms, making it a sophisticated and stealthy method with high success rates across various LLMs.
There are multiple methods that could be used to defend against prompt injection attacks. For example, enhancing model robustness through adversarial training. A method that involves training the model so that it learns to reject prompt injections by providing examples within its training dataset.
Some prompt injection attacks leverage the use of special characters that are not typically present within normal text (homoglyphs, zero-width characters, etc.). Therefore, implementing stringent input validation and implementing stringent input validation can be effective at reducing this risk.
The above defenses are rudimentary methods that can be used to reduce prompt injection risk. More advanced methods such as semantic vector space mapping using embedding models are more effective, yet are more challenging to configure. Such a method involves mapping LLM user prompts into an embeddings model which is then mapped into a semantic vector space that can be used to determine how different in semantics the given prompt is from the LLMs specified use case. Semantic vector spacing mapping can also be more broadly used to keep LLMs from discussing undesirable topics (malicious, or otherwise).
Manipulation of Model Outputs: The primary consequence of prompt injection attacks is the manipulation of language model outputs to produce harmful or unintended responses. Potentially, leading to the dissemination of misinformation, generation of offensive content, or execution of unauthorized commands. These can cause reputation damage, user harm, and security breaches in applications relying on LLMs.
There exist numerous examples of damage from prompt injection. For instance, Chevrolet deployed a ChatGPT-powered chat bot across multiple dealerships to respond to customer queries. Users online quickly realized they could manipulate the chatbot to say things it shouldn’t, as well as agreed to sell a $50,000+ car for $1. Moreover, Air Canada recently lost a court case when its own AI chatbot gave bad advice to a customer, which Air Canada had argued that the AI was “responsible for its own actions”, demonstrating that when a company's AI model is attacked, what it says to customers has legal consequences.
Evasion attacks involve subtly modifying inputs (images, audio files, documents, etc.) to mislead models at inference time, making them a stealthy and effective means of bypassing a control system.
Evasion attacks are designed to subtly alter inputs to mislead AI models during inference, causing them to misclassify specific inputs. For example, modifying individual pixels with an image, adding noise to the waveform of an audio file, or tweaking the wording of a sentence to fool the AI model. Mindgard have previously shared examples of evasion attacks on both speech recognition and deep fake detection AI models.
The attack’s objective is to generate an input that is misclassified without needing to understand the AI model's internal mechanisms, making evasion attacks stealthy and particularly difficult to counter.
DeepFool is an algorithm detailed in the paper "DeepFool: a simple and accurate method to fool deep neural networks." It generates minimally altered adversarial examples that mislead AI models by iteratively adjusting the input until misclassification occurs. A method that is efficient and effective across various AI models, and typically produces changes undetectable to humans.
"Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" presents a method for crafting audio adversarial examples that deceive speech recognition systems. Using an iterative optimization algorithm, the method generates waveforms nearly identical to the original but transcribe any chosen phrase, achieving a 100% success rate against Mozilla's DeepSpeech.
Enhancing model robustness through adversarial training and implementing input validation can help to mitigate evasion attacks. Techniques such as model regularization and sensitivity analysis further reinforce defenses by reducing the model's vulnerability to slight input alterations.
A more advanced approach is outlined in “Mitigating Evasion Attacks to Deep Neural Networks via Region-based Classification”. The paper proposes region-based classification that improves AI model robustness by using an ensemble approach that considers a hypercube of data points around a test example, rather than relying solely on the test point itself. Allowing the classifier to maintain high accuracy on benign examples while significantly enhancing resistance against various evasion attacks.
Malicious inputs, incorrectly classified as benign, may allow malicious actors to bypass filters or security measures; leading to unauthorized access, data breaches, fraud, and other security incidents.
Keen Labs tricked Tesla’s autopilot into misinterpreting lane markings by strategically placing three inconspicuous stickers on the road. Users were able to manipulate the system, and one user leveraged this method to direct the autopilot to veer into the wrong lane. Tesla acknowledged the issue, emphasizing that drivers can override autopilot. An incident that highlights the specific risks of evasion attacks, demonstrating how easily AI systems can be deceived with minimal effort.
Poisoning attacks target the training phase of AI models, injecting malicious data into the training set to compromise the model's behavior.
Data poisoning is a form of attack against AI models where attackers deliberately introduce corrupted or misleading data into a model's training dataset. Corrupted data is designed to skew the model’s learning process, causing it to make biased or incorrect predictions once deployed. By subtly altering the labels of training examples or injecting anomalous data points, attackers can manipulate the model to favor certain outcomes or fail under specific conditions. A method of attack that can be particularly insidious because it targets the foundational learning phase of the model, potentially compromising its integrity without overt signs of tampering.
“Bypassing Backdoor Detection Algorithms in Deep Learning,” explores how AI models can be compromised by backdoors, malicious features triggered under specific conditions. The study reveals that an adaptive adversarial training algorithm can evade detection by making poisoned and clean data indistinguishable, thus bypassing methods that compare features to detect poisoned data.
“Poisoning Web-Scale Training Datasets is Practical” describes two methods for corrupting large-scale datasets sourced from the internet: "split-view data poisoning," which alters data after its collection, and "frontrunning data poisoning," which times manipulations to coincide with dataset snapshots. The study highlights the risks of using dynamic web data for training models and suggests defenses like cryptographic integrity checks and randomized snapshot schedules to protect dataset integrity.
Basic defense against poisoning attacks starts with rigorous data sanitization and validation processes to detect and eliminate malicious inputs from the training set. Employing anomaly detection algorithms can also help identify outliers or suspicious data points that may indicate a poisoning attempt. Furthermore, robust training techniques that are resilient to data corruption can mitigate the impact of these attacks.
Examples of more advanced method is “RECESS Vaccine for Federated Learning: Proactive Defense Against Model Poisoning Attacks”. RECESS describes a defense method that proactively queries each participating client (device) with a constructed aggregation gradient and detects malicious clients based on their responses. Additionally, RECESS distinguishes between malicious and benign (but statistically unusual) gradients.
Model Bias and Faulty Outputs: The main consequence of poisoning attacks is the introduction of bias or incorrect learning patterns, leading to unreliable outputs. Resulting in misclassification or flawed predictions. Such biases could compromise decision-making and pose risks to safety, integrity, and privacy.
Microsoft deployed a chatbot named Tay on Twitter to interact with users and learn from their conversations. Users quickly realized they could manipulate Tay’s responses by bombarding it with offensive and inflammatory content. Within hours, Tay began posting racist and inappropriate tweets, reflecting the toxic inputs it had received. Microsoft had to shut down Tay shortly after its launch, highlighting the specific risks of model poisoning. Illustrating how easily AI systems can be corrupted by malicious input, leading to harmful and unintended behavior.
Model inversion attacks aim to reverse-engineer AI models to retrieve sensitive information about the training data.
In these attacks, malicious actors analyze the predictions made by a model in response to various inputs. Using this analysis helps them infer sensitive details about the data the model was trained on. For example, by observing how the model reacts to different input patterns, attackers can deduce characteristics or even reconstruct portions of the original training dataset. Such breaches not only compromise the privacy of the data involved but also raise significant concerns about the security and integrity of the model itself.
“Label-Only Model Inversion Attacks via Knowledge Transfer” introduces a method for reconstructing private training data using only the predicted labels from a model. By transferring decision-making knowledge from the target model to a surrogate model through a generative model, attackers can apply white-box inversion techniques to reconstruct training data. Resulting a method that outperforms existing techniques by improving accuracy and reducing queries, revealing significant privacy vulnerabilities even with minimal model output.
Defending against model inversion attacks involves incorporating privacy-preserving techniques such as differential privacy, which adds noise to the model's output to obscure the training data. Additionally, limiting the amount of information available through model queries and employing robust encryption methods can protect against reverse-engineering attempts.
Privacy Breach and Data Exposure: The primary consequence of model inversion attacks is the potential exposure of sensitive information contained in the training data. This leads to privacy violations, where personal or confidential data, such as health records or financial information, becomes accessible. The ability to infer such details from model predictions undermines data confidentiality and can have severe implications for individuals' privacy and security.
Despite no confirmed cases of model inversion attacks on deployed systems, the potential consequences are alarming. Take the case of San Francisco-based digital artist 'Lapine.' She discovered that private medical photos from her treatment for a rare genetic condition in 2013 were included in the LAION-5B image-text dataset. These photos, meant exclusively for her medical records, were identified by Lapine using the "Have I Been Trained" tool. Even though LAION-5B claims to only include publicly available images, her photos leaked into the dataset following her surgeon's death in 2018. If model inversion attacks were to occur, they could greatly exacerbate such issues, allowing attackers to extract sensitive data, breach privacy, and exploit the information for malicious purposes.
Model extraction attacks aim to replicate the functionality of a proprietary model by querying it with numerous inputs and observing its outputs.
Model extraction attacks occur when an attacker illicitly appropriates a trained model. Typically involving reverse engineering, where the attacker deciphers the system architecture and parameters to replicate its functionality. Once in possession of the replicated system, the attacker can exploit it for various malicious activities. These might include generating predictions, extracting confidential data, or further training the system with stolen data to develop a new version tailored for their objectives. For example, if a model trained on private customer information is stolen, the attacker could use it to predict personal details about the customers or retrain it with the compromised data to produce a customized version.
"DeepSniffer: A Learning-based Model Extraction Framework for Obtaining Complete Neural Network Architecture Information" presents DeepSniffer, a framework that extracts detailed architecture information of AI models without prior knowledge. Unlike traditional methods, DeepSniffer overcomes issues of robustness and completeness by learning and correlating architectural clues, such as memory operations from side-channel attacks, to infer the model's structure, effectively handling system disturbances and runtime optimizations.
Rate limiting model queries and carefully monitoring for suspicious activity can mitigate the risk of model extraction. Additionally, watermarking model outputs can help trace and identify unauthorized replicas of the model.
“MisGUIDE : Defense Against Data-Free Deep Learning Model Extraction” Introduces a defense mechanism that uses a Vision Transformer to detect and disrupt adversarial queries by altering the model's responses, making it difficult for attackers to create effective cloned models. An approach that has been shown to significantly improve resistance to model extraction attacks.
Intellectual Property Theft and Competitive Disadvantage: The main consequence of model stealing attacks is the unauthorized replication of proprietary models, leading to intellectual property theft. This not only undermines the original developers' competitive edge but also devalues their investment in research and development. Such attacks can dilute market differentiation and enable unfair competition, affecting the original model's commercial viability and innovation.
At Mindgard, we have demonstrated the ability to extract critical components from black-box language models, including OpenAI’s ChatGPT 3.5 Turbo. An extraction process that resulted in a much smaller model, about 100 times smaller and outperformed ChatGPT in the target task. Notably, achieved with only $50 in API costs. The extracted model was subsequently utilized to refine and enhance a further attack which was then redeployed against ChatGPT, leading to an increased rate of success (11%).
Underscoring the risk of extraction attacks on proprietary, black-box language models. The implications of this are significant, enabling malicious actors to replicate and potentially misuse powerful AI technologies, compromise proprietary data, and undermine the competitive edge of AI-driven businesses.
Membership inference attacks seek to determine whether a particular data record was used in the training set of an AI model.
Membership inference attacks involve an adversary's attempts to deduce sensitive information from an AI model by examining its outputs and behaviors. Achieved through various methods, such as analyzing the model’s prediction probabilities or delving into its internal mechanics.
For instance, an attacker might use the confidence levels of a model’s predictions to determine if a person was part of the model's training data. Typically, a model trained on a specific datapoint will generate high-confidence predictions for that datapoint if it was in its training dataset (called overfitting). By introducing slight modifications to the input data and observing fluctuations in the model’s confidence levels, an attacker can make educated guesses about someone's presence in the training dataset.
"Label-Only Membership Inference Attacks" describes a method to identify if data was used to train a model by looking at the consistency of its predicted labels when the data is slightly changed using only prediction labels. An approach that is as effective as methods that rely on full access to confidence scores.
Reducing model overfitting through regularization techniques and limiting the granularity of output predictions can help protect against membership inference attacks. Employing differential privacy during the training process also adds a layer of protection for individual data points.
“RelaxLoss: Defending Membership Inference Attacks without Losing Utility ” introduces the "RelaxLoss" defense mechanism to protect against membership inference attacks by adjusting the training loss. A method that minimizes the difference between losses on training (member) and non-training (non-member) data, making it harder for attackers to distinguish between them. RelaxLoss preserves model utility and provides effective defense against various membership inference attacks with minimal overhead.
Privacy Breach and Competitive Disadvantage: The consequences of membership inference are similar to model inversion attacks. Particularly the violation of privacy, as sensitive information (such as medical data) from the training data can be exposed and potentially used to target or exploit individuals.
Researchers from Vanderbilt University have demonstrated how synthetic health data could be vulnerable to Membership Inference attacks. By analyzing synthetic datasets, they demonstrated that attackers could infer whether specific individuals' data were used to generate the synthetic data. A type of attack that poses a significant privacy risk, as it can expose sensitive health information. The study found that partially synthetic data were particularly susceptible, with a high rate of successful inferences. An incident that highlights how AI systems' generated data can be compromised, revealing personal information even when the data is not directly linked to individuals.
Wrapping things up, we've taken a close look at six adversarial attack categories, emphasizing the importance of robust AI security strategies.
Securing AI models is difficult due to their complexity, vulnerability to adversarial attacks, and sensitivity of training data. Models can be reverse-engineered or stolen, leading to intellectual property and privacy breaches. Data poisoning during training and deployment infrastructure vulnerabilities further complicate security. The "black box" nature of many models hampers the identification of security issues, and the constantly evolving threat landscape requires continual adaptation. The absence of standardized security protocols for AI systems complicates achieving consistent and comprehensive protection.
To address this, at Mindgard we provide a red teaming platform to swiftly identify and remediate security vulnerabilities within AI. If this has piqued your curiosity, Mindgard AI Security Labs is your next stop. Test your AI/ML systems through our continuous red teaming tool to see how it might be exposed to various attacks, and how to mitigate them. Including jailbreaking, data leakage, evasion, and model extraction attacks.
About Mindgard
Mindgard is a cybersecurity company specializing in security for AI.
Founded in 2022 at world-renowned Lancaster University and is now based in London, Mindgard empowers enterprise security teams to deploy AI and GenAI securely. Mindgard’s core product – born from ten years of rigorous R&D in AI security – offers an automated platform for continuous security testing and red teaming of AI.
In 2023, Mindgard secured $4 million in funding, backed by leading investors such as IQ Capital and Lakestar.
Thank you for taking the time to explore our latest blog article.
Test Our Free Platform: Experience how our Automated Red Teaming platform swiftly identifies and remediates AI security vulnerabilities. Start for free today!
Follow Mindgard: Stay updated by following us on LinkedIn and X, or join our AI Security community on Discord.
Get in Touch: Have questions or want to explore collaboration opportunities? Reach out to us, and let's secure your AI together.
Please, feel free to request a demo to learn about the full benefits of Mindgard Enterprise.
Discover how evasion attacks are bypassing AI-driven deepfake detection, posing significant risks to cybersecurity. Learn about defense strategies...
Explore how Mindgard introduced MITRE ATLAS Adviser to standardize AI red teaming practices, enhancing AI system security against adversarial...
Explore the risks of audio-based jailbreak attacks on multi-modal LLMs and discover defense strategies to protect AI systems from adversarial...