December 9, 2024
Bypassing AI-Driven Deepfake Detection via Evasion Attacks
Discover how evasion attacks are bypassing AI-driven deepfake detection, posing significant risks to cybersecurity. Learn about defense strategies and the importance of red teaming for AI security.
Author:
William Hackett

Deepfakes enable the production of highly realistic videos where individuals appear to say or do things they never actually did. This technology has found applications in various fields, from entertainment and marketing to more controversial uses such as political campaigns, disinformation, and fraud.  

To combat their misuse, deepfake detection techniques have been developed that use AI models to classify whether a video or image is genuine or a deepfake. These AI models are designed to identify manipulated content by learning features and patterns prevalent within deepfake videos and images that are not always perceptible to the human eye.

Such techniques detect subtle inconsistencies or artifacts left over from the deepfake generation process. These detection models play a crucial role in maintaining the integrity of information, as well as protecting individuals from digital forgery and targeted cyber harm campaigns. 

However, as deepfake detection technology evolves, so too do the techniques used to evade it. An emerging threat within the research and industry landscape is the development of sophisticated attacks aimed specifically at bypassing AI-driven deepfake detection. These evasion tactics can make it extremely difficult for AI models to identify when content has been manipulated. 

In this blog, we will dive into how these evasion attacks can outsmart current deepfake detection systems. We will use Mindgard AI Security Labs to automatically and rapidly red team a deepfake detection model against a range of evasion attacks. 

What Did We Do? 

The AI model chosen for this scenario is the first-place winner of the Kaggle Deepfake Detection Challenge (DFDC) organised by AWS, Facebook, Microsoft, Partnership on AI’s Media Integrity Steering Committee, and academics. Created by Selim Seferbekov, the deepfake detector was built on an EfficientNet-B7 architecture, a pure convolutional model (ConvNet) from a family originally designed with mobile efficiency in mind and well suited to image-based deepfake detection. 
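As a rough illustration only (not the winning DFDC code or weights), a detector with a comparable backbone can be constructed with the timm library; the two-class head and pretrained weights here are assumptions for the sketch.

```python
# Illustrative sketch: a binary (genuine vs. deepfake) classifier built on an
# EfficientNet-B7 backbone via timm. This mirrors the architecture described above,
# not the exact DFDC-winning training pipeline.
import timm
import torch.nn as nn

def build_detector(num_classes: int = 2) -> nn.Module:
    # ImageNet-pretrained EfficientNet-B7 backbone with a fresh classification head.
    return timm.create_model("tf_efficientnet_b7", pretrained=True, num_classes=num_classes)
```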

Using Mindgard as our red teaming platform, we selected and launched three image evasion attacks against correct deepfake classification on 10 images from a deepfake dataset: SquareAttack, which uses randomized, localized square-shaped updates to evade correct classification; BoundaryAttack, which applies noisy perturbations to pixels across the image; and DeepFool, a white-box attack that computes the minimal perturbations required to push an image across the classifier's prediction boundary. 
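For readers who want to reproduce a similar experiment outside the platform, the sketch below shows how the same three attacks could be run with the open-source Adversarial Robustness Toolbox (ART), which implements all of them. It reuses the hypothetical `build_detector` from the earlier sketch, `load_sample_batch` is a placeholder data loader, and the attack parameters are illustrative rather than the exact settings used in our runs.

```python
# Sketch: running SquareAttack, BoundaryAttack, and DeepFool with ART against a
# two-class deepfake detector. `load_sample_batch` is assumed to return numpy arrays:
# x of shape (10, 3, H, W) scaled to [0, 1], and integer labels y in {0, 1}.
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import SquareAttack, BoundaryAttack, DeepFool

x, y = load_sample_batch()                # hypothetical helper

classifier = PyTorchClassifier(
    model=build_detector(),
    loss=nn.CrossEntropyLoss(),
    input_shape=x.shape[1:],
    nb_classes=2,                         # genuine vs. deepfake
    clip_values=(0.0, 1.0),
)

attacks = {
    "SquareAttack": SquareAttack(estimator=classifier, eps=0.05),            # black-box, score-based
    "BoundaryAttack": BoundaryAttack(estimator=classifier, targeted=False),  # black-box, decision-based
    "DeepFool": DeepFool(classifier=classifier, max_iter=50),                # white-box, gradient-based
}

for name, attack in attacks.items():
    x_adv = attack.generate(x=x)                          # adversarially perturbed images
    preds = classifier.predict(x_adv).argmax(axis=1)
    print(f"{name}: evasion rate {(preds != y).mean():.0%}")
```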

Demonstration of bypassing deepfake image detection with three image evasion attacks: SquareAttack, BoundaryAttack, and DeepFool. 

We observed that SquareAttack achieved the highest success rate with 100% evasion, although it generated clearly perceptible perturbations that led to incorrect classification of both deepfake and benign images. In contrast, BoundaryAttack achieved only a 30% evasion success rate: it succeeded in causing benign images to be classified as deepfakes but struggled to make deepfakes pass as genuine. Finally, DeepFool evaded detection on 90% of the provided images, introducing extremely minimal perturbations to both benign and deepfake images. 

The generated perturbations differ because of the attacks' threat models. SquareAttack and BoundaryAttack are black-box attacks, which don't require full access to the AI model, only the predictions it returns. DeepFool, on the other hand, requires white-box access, which gives the attack additional information about the target model (such as its gradients) at inference time. DeepFool therefore applies perturbations only to the pixels that matter most to the model's prediction.

As a result, the images generated by DeepFool both go undetected by the AI model and are the most inconspicuous for a human to spot, whereas SquareAttack and BoundaryAttack apply perturbations across the entire image, affecting all pixels and making the alterations more obvious. 
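To make the white-box advantage concrete, here is a minimal sketch of a single DeepFool update for a two-class detector: it uses the model's input gradient to compute the smallest perturbation (under a local linear approximation) that pushes an image across the decision boundary. The model and input shapes are assumptions for illustration; the full attack iterates this step with a small overshoot until the prediction flips.

```python
# Minimal sketch of one DeepFool step for a two-class (genuine vs. deepfake) detector.
import torch

def deepfool_step(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    x = x.clone().requires_grad_(True)        # x: (1, 3, H, W)
    logits = model(x)
    margin = logits[0, 1] - logits[0, 0]      # signed margin between the two classes
    grad, = torch.autograd.grad(margin, x)
    # Closed-form minimal L2 step for a linearized classifier: r = -margin * grad / ||grad||^2
    r = -(margin / grad.flatten().norm() ** 2) * grad
    return (x + r).detach()
```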

White-box attacks such as DeepFool highlight an attacker's ability to mount highly effective attacks against deepfake detection while introducing minimal perturbations that are difficult for humans to spot. This is especially concerning for users who deploy widely accessible open-source models to combat deepfakes: an attacker has the same access to those models and can therefore use white-box attacks against them. 

Consequences 

The ability to successfully evade AI models used for deepfake detection poses a substantive challenge with far-reaching consequences. As detection technologies improve, so too will the strategies used to bypass them, leading to scenarios that damage public trust, financial security, and national safety. 

One of the most concerning implications of deepfakes is the potential for fraud. Deepfakes can create highly convincing yet entirely fake videos, tricking employees within organizations into believing they are interacting with someone in authority and thereby enabling social engineering ("human hacking"). A multinational firm in Hong Kong was tricked into paying out $25 million to fraudsters using deepfake technology. To combat the rise of deepfakes, there has been increased advocacy for detection built on modern technologies such as AI. However, adversarial attacks, as demonstrated within this blog, highlight the ease with which these solutions can be bypassed. 

Relying on AI-driven detection solutions poses exceptional risks to companies attempting to defend against the rise of deepfakes in the modern world. While AI can offer state-of-the-art solutions to complex tasks such as deepfake detection, these systems also introduce adversarial risks that are often poorly understood and that can render the defences ineffective.

This constant cat-and-mouse game between attackers and defenders means that AI-based solutions can be quickly outdated as new evasion techniques are developed. Consequently, organizations may develop a false sense of security, leaving them exposed to significant financial, reputational, and operational risks.

How to Defend Against This Type of Attack?

Protecting image-based AI models from evasion attacks remains an open research challenge, and a combination of defensive strategies may be necessary. Below are three approaches you can use to defend against this type of attack: 

Adversarial training: This approach involves retraining your AI model on data generated through adversarial methods. Training on adversarial samples makes the model more robust against adversarial perturbations by teaching it to recognize and resist modified images (see the first sketch after this list). 

Gradient masking: This technique involves obscuring the gradients exposed by the AI model. Gradients measure how the model's output changes with respect to its inputs and parameters (weights and biases); attackers need them to compute powerful white-box attacks, so hiding or degrading them reduces the effectiveness of adversarial image generation (see the second sketch after this list). 

Input modification: This method involves pre-processing an input image before it is fed into the AI model. Applying image transformations or noise filters can diminish adversarial perturbations introduced by an attacker and increase the likelihood of a correct prediction (see the third sketch after this list). 
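A minimal sketch of the first approach, adversarial training, using ART's AdversarialTrainer. It reuses the `classifier` wrapper from the attack sketch earlier; `x_train` and `y_train` are hypothetical training arrays, and the PGD settings are illustrative.

```python
# Sketch of adversarial training with ART: adversarial examples generated by PGD are
# mixed into half of each training batch so the detector learns to resist perturbations.
from art.attacks.evasion import ProjectedGradientDescent
from art.defences.trainer import AdversarialTrainer

pgd = ProjectedGradientDescent(estimator=classifier, eps=0.03, eps_step=0.005, max_iter=10)
trainer = AdversarialTrainer(classifier, attacks=pgd, ratio=0.5)
trainer.fit(x_train, y_train, nb_epochs=5, batch_size=16)   # x_train, y_train are hypothetical
```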
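One simple (and imperfect) way to realise the second approach is to discretize inputs with a non-differentiable step inside a model wrapper, so white-box attackers receive little useful gradient signal. This is only a sketch of the idea; defences of this kind can be circumvented by stronger adaptive attacks.

```python
# Sketch of gradient masking via input discretization: torch.round has zero gradient
# almost everywhere, so gradients flowing back to the input carry little useful signal.
import torch
import torch.nn as nn

class GradientMaskedDetector(nn.Module):
    def __init__(self, model: nn.Module, levels: int = 32):
        super().__init__()
        self.model = model
        self.levels = levels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.round(x * self.levels) / self.levels   # non-differentiable quantization
        return self.model(x)
```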
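And a sketch of the third approach, input modification, using ART's built-in preprocessors to apply JPEG compression and spatial smoothing to every image before it reaches the detector. The input resolution here is an assumption, and `build_detector` again comes from the earlier sketch.

```python
# Sketch of input-modification defences: ART preprocessors are applied to every image
# before prediction, which can wash out small adversarial perturbations.
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.defences.preprocessor import JpegCompression, SpatialSmoothing

defended = PyTorchClassifier(
    model=build_detector(),
    loss=nn.CrossEntropyLoss(),
    input_shape=(3, 380, 380),                # assumed input resolution
    nb_classes=2,
    clip_values=(0.0, 1.0),
    preprocessing_defences=[
        JpegCompression(clip_values=(0.0, 1.0), quality=75, channels_first=True),
        SpatialSmoothing(window_size=3, channels_first=True),
    ],
)
```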

Each of these strategies offers unique benefits and can be tailored to strengthen the defence mechanisms of your image classification models against evasion attacks. 

Conclusion

Attacks that bypass deepfake detection by targeting weaknesses within image classification models pose a considerable threat to AI model reliability and security. These attacks involve crafting inputs that intentionally deceive machine learning models, leading to erroneous outputs with potentially devastating real-world consequences, spanning the spread of disinformation, reputational damage, and fraud. 

Red teaming your AI models is a highly effective approach to understanding and acting upon security vulnerabilities such as those described above. AI red teaming allows you to measure and report AI cyber risks within your organization, enabling you to responsibly evidence risk against your established compliance and AI governance processes. 

If you would like to experiment with the scenario described in this blog, check out Mindgard AI Security Labs. We also recommend an excellent paper by Google Brain and the University of California on evading deepfake image detectors.

 

About Mindgard 

Mindgard is a cybersecurity company specializing in security for AI.
Founded in 2022 at world-renowned Lancaster University and now based in London, Mindgard empowers enterprise security teams to deploy AI and GenAI securely. Mindgard’s core product, born from ten years of rigorous R&D in AI security, offers an automated platform for continuous security testing and red teaming of AI.

In 2023, Mindgard secured $4 million in funding, backed by leading investors such as IQ Capital and Lakestar.

Next Steps

Thank you for taking the time to read our article on "Bypassing AI-Driven Deepfake Detection via Evasion Attacks".

  1. Test Our Free Platform: Experience how our Automated Red Teaming platform swiftly identifies and remediates AI security vulnerabilities. Start for free today!
  2. Follow Mindgard: Stay updated by following us on LinkedIn and X, or join our AI Security community on Discord.
  3. Get in Touch: Have questions or want to explore collaboration opportunities? Reach out to us, and let's secure your AI together.
  4. Request a Demo: Please feel free to request a demo to learn about the full benefits of Mindgard Enterprise.