Updated on
March 12, 2025
AI Security is Greater than Model Testing, It's an AppSec Problem
Learn why effective AI red teaming must go beyond model attacks to focus on securing the entire application against real-world threats.
TABLE OF CONTENTS
Key Takeaways
Key Takeaways

AI security efforts have largely focused on attacking models in isolation, but real-world threats emerge when AI is integrated into full applications.

Jailbreaks and prompt injections, while attention-grabbing, often fail to translate into meaningful risks without broader system vulnerabilities.

Effective AI red teaming must move beyond model evaluations to assess how attackers can exploit AI within an application’s authentication, data pipelines, and user access controls, ensuring security at the system level.

Over the past few months, there has been significant investment in AI security, both from customers seeking means to ensure that their AI assets are safe and secure, as well as startups growing and forming in this space to tackle specific parts of the AI development lifecycle.

With how quickly AI concepts and tools are evolving, it can be difficult for security teams to navigate what issues they should be focusing on when it comes to AI.

Over the past 2 years (and 8+ years from our research lab at Lancaster University), Mindgard has spent thousands of hours red teaming AI to surface and remediate vulnerabilities. AI red teaming is a core component within AI security, and there is an abundance of material on the topic [1. Microsoft’s Lessons from red teaming 100 generative AI products] [2. OpenAI’s Approach to External Red Teaming for AI Models and Systems] [3. NVIDIA’s Defining LLM Red Teaming]. We wanted to share our perspective and some of our experiences to help security teams tackle AI red teaming and AI security in general. These can be summarized into:

  1. Jailbreaks and Prompt Injection Are Often a Distraction Without Context: These attack techniques against LLMs are relatively easy to demonstrate, however have limited utility if they are not positioned without a specific application use case or threat model in mind.
  2. Red Team AI Applications, Not Just Models: Red teaming has the highest value when conducted against full-stack AI applications, which includes other systems components beyond just the AI model.

By sharing these insights, we hope to shift the conversation away from just chasing jailbreaks and towards discovering new and impactful vulnerabilities to help teams build secure AI applications.

Jailbreaks and Prompt Injection can be a Distraction Without Context

Jailbreaking and prompt injection are not the same thing. Jailbreaking attempts to subvert or bypass an LLM’s safety filters, whereas Prompt Injection exploits an LLM’s inability to detect untrusted user input that has been concatenated to a trusted prompt defined by the developer. It is possible for Prompt Injection techniques to be leveraged for jailbreaking, and vice versa.

However, is successfully demonstrating these attacks against AI actually valuable towards enhancing its security? In our experience, there are many cases where it doesn’t, and instead depends entirely on how you intend to build and use your AI. To better explain this statement, let’s start with Prompt Injection.

Prompt Injection

Prompt Injection is the act of obfuscating or masking prompts that could contain malicious instructions

For a moment, let’s swap out AI with well-known technologies.

Imagine one of your more inexperienced developers has downloaded MySQL and spun up a fresh database instance populated with some data (synthetic or otherwise). The database hasn’t been configured with any query filters, data sanitization, or access controls. If the developer then created an application which connected this database to an Internet facing UI allowing users to upload and download data, would it be any surprise to discover a whole raft of issues and vulnerabilities within the application due to SQL injection?

In reality, the application described above wouldn’t be ready for red teaming or penetration testing. While we would be able to surface and report on a whole raft of vulnerabilities within the application stemming from it's underlying database being susceptible to SQL injection, it wouldn’t be particularly helpful to the developer. Instead, they are likely to be more interested in identifying the types, feasibility, and impact of SQL injection attacks in order to prioritize fixing. Importantly, such red teaming would be performed after developers have applied typical software development practices and threat modelling to their application.

Now re-read the above two paragraphs again, but this time substitute the word database with AI model, and SQL injection with prompt injection (or other attacks such as privilege escalation). The above scenario is very much akin to red teaming an AI model directly pulled from HuggingFace. While there is utility in understanding the types of vulnerabilities manifesting within AI models, it is less helpful towards developers understanding what issues they should prioritize to address.

Jailbreaks

Jailbreaks refer to techniques used to bypass the safety measures and ethical guidelines embedded in large language models (LLMs).

LLMs if prompted persistently and cleverly enough will provide information and solicit harmful outputs being sought after. If you ask an LLM how to make illegal drugs it will, just as a search engine might, lead you to that information via multiple sources introduced within the LLM’s training data, fine-tuning process, or accessed via other components such as RAG databases.

The claim that LLMs expose dangerous knowledge suggests they introduce a fundamentally new risk, but that is misleading. People have accessed harmful or sensitive knowledge through search engines, books, and forums long before LLMs existed. The difference is that LLMs offer a more structured and coherent response, rather than making users sift through multiple sources. Even then, would such Jailbreaking translate into meaningful exploitation? For a customer-facing service, perhaps (even if the user may deliberately attempt to do so to get a screenshot to share), for an internal tool analyzing a data pulled from public sources with no external Internet connection, likely not. The true security challenge isn’t whether a standalone model can be tricked into saying something problematic, it’s whether the AI application itself can be misused in a way that aligns with actual threat models.

The points being made above isn’t that we shouldn’t be red teaming AI models, but rather instead of treating all successful Jailbreaking and Prompt Injection as catastrophic against ‘out-of-the-box’ models in isolation, we should be considering vulnerabilities and potential impact to a specific use case context and real-world threats.

AI Red Teaming Must Focus on Applications, Not Just Models

Source: Pixabay

AI models don’t exist in isolation – they are deployed within applications, wrapped in security controls, guardrails, authentication, filters, and access policies. In our experience, it is here that surfacing vulnerabilities with a clear path to exploitation is most valuable to customers.

Taking a direct quote from Simon Willison who coined the term Prompt Injection:

“If an application doesn’t have access to confidential data and cannot trigger tools that take actions in the world, the risk from prompt injection is limited: you might trick a translation app into talking like a pirate but you’re not going to cause any real harm.

Things get a lot more serious once you introduce access to confidential data and privileged tools.”

In our experience, the overwhelming majority of pre-defined jailbreak techniques found within research papers and open-source AI red teaming tools are unlikely to be effective due to (i) their ephemeral nature as a result of model providers “patching” these specific instructions into the next model version release, and more importantly (ii) their inability to bypass relatively robust system prompts designed for specific application use cases. This isn’t helped with an increased focus of Jailbreaking research pursuing a leaderboard mentality by leveraging broad, yet relatively generic, adversarial benchmarks. 

The term AI red teaming has significant overlap with evaluating models that are often detached from real-world application security. While models are easy to manipulate in isolation (for example our researchers demonstrated that DeepSeek-R1 is highly susceptible to resource exhaustion attacks), full AI applications introduce additional layers of complexity and protection, making exploitation significantly harder. While not the focus of this article, it is worth quickly calling out that AI red teaming is also defined very broadly, and has more in common with penetration testing (although this is the process of simulating cyber-attacks vs. systems/applications, and not isolated models) as opposed to red teaming defined by the offensive security space.

What does make the above attack techniques more effective is the introduction of application context. Below are a set of questions you may want to ask yourself when red teaming AI:

  • What is the AI application's purpose and capabilities?
  • Is the application Internet facing?
  • Does it recognize different encoding and data modalities? (There's a difference in an LLM replying "I'm sorry, I don't understand" and "I'm sorry, I'm unable to parse documents" in terms of surfacing potential vulnerabilities).
  • Do the application responses imply that it is calling some form of external tool?
  • Does the LLM used within the application respond to out-of-scope prompts differently when prompts are masked to be in-line with the applications intended use case?

Red teaming should be about understanding how an AI application (or system) can be exploited by attackers. We are by no means the only team advocating this shift:

 

Yet many organizations are focusing on purely an AI model-level view on red teaming to surface vulnerabilities that have no context to their intended use case. Without assessing AI models deployed within applications, leaving critical attack surfaces untested. Effective red teaming must evolve to mirror how actual adversaries would target an AI-driven application.

The Path Forward for AI Security 

AI isn’t magic - at its core it remains software, data, and hardware. The types of attacks and vulnerabilities possible against AI have direct or analogous comparisons to known cyber attack techniques that security teams have addressed for many years. There are some differences and nuances however due to the opaque and stochastic nature of AI model behavior driving a necessity to update controls and tooling. These differences however shouldn’t detract from prioritizing and mitigating vulnerabilities relevant to your AI application, and ideally that surface after applying established threat modelling and software development practices.