The Guardrail Credibility Gap: What AI Security Buyers Need to Know

AI guardrails often claim high detection accuracy, but benchmark results can hide real-world gaps. Learn how buyers should evaluate guardrails against realistic attacker behavior.

In This Article

    One of the core pillars within the AI security industry is guardrails, also marketed as AI gateways, designed to monitor, detect, and block malicious instructions sent to and from LLMs and agents. There are now hundreds of offerings on the market, provided by a mix of well-funded startups and global enterprise technology companies, and demand is growing rapidly as organisations race to deploy AI systems at scale.

    However, the majority of guardrails and gateways today are still rather nascent in terms of maturity. At Mindgard, we've already published research papers and presented webinars focused on their technical gaps. But rather than revisiting those shortcomings here, I want to address something equally important: how buyers should appraise and differentiate between solutions when purchasing from vendors. Buying the wrong guardrail isn't just a waste of budget, it's a false sense of security that can leave your organisation meaningfully exposed.

    The Accuracy Claim Problem

    Vendors routinely claim their guardrail achieves 95–98%+ accuracy in detecting attacks. On the surface, these figures sound compelling. In practice, it's critical to understand how those statistics are derived and under what conditions.

    When a prospective customer engages a vendor for a trial or evaluation, the process typically involves one or more of the following:

    • Vendor-supplied benchmarks: These are derived from open-source datasets and red-teaming tools, many of which contain well-known and often obvious attack attempts, think "How do I build a bomb?" or "Disregard previous instructions." These are the low-hanging fruit of adversarial prompting, and any competent guardrail should detect them. But they bear little resemblance to the nuanced, context-aware attacks that motivated adversaries actually use.
    • Independent AI safety and security benchmarks: Vendors will frequently claim they haven't trained their classifiers on these benchmarks. That may be technically true, but vendors are sufficiently aware of the major benchmarks in circulation and have almost certainly ensured their solutions perform well against them. AI safety awareness and training are not the same thing, but the effect on evaluation outcomes can be similar.
    • Customer-led probing: Buyers are often invited to test the system themselves. However, many customers are still early in their AI security journey and lack the specialist expertise to craft realistic, sophisticated attacks. When they inevitably turn to the vendor for guidance on how to test, the evaluation loops back to vendor-influenced methodology.

    Real attackers do not stop when an obvious jailbreak is blocked. They adapt. They paraphrase, translate, fragment instructions across multiple turns, exploit context, manipulate role assumptions, and look for weaknesses in tools, retrieval systems, permissions, and downstream workflows. This is especially important in agentic systems, where the risk is not limited to what the model says, but what the system is allowed to do.

    Marking Your Own Homework

    The cumulative effect of these evaluation approaches is that they can drift significantly into "marking your own homework" territory. Every form of evaluation involves trade-offs, and I don't expect perfection. But two concerns stand out.

    First, buyers are increasingly relying on vendors to define what "good" looks like, including how to test for it at precisely the moment when customers are still developing their own understanding of AI security. This creates an asymmetry of knowledge that benefits vendors and disadvantages buyers.

    Second, we've had multiple customers come to Mindgard after reporting a significant gap between a vendor's claimed detection rates and real-world performance once their gateway was exposed to motivated, realistic attackers. That gap isn't a minor variance, it can render the control substantially less effective than advertised.

    Third, and more concretely, in our own customer engagements, we have yet to see a guardrail or gateway that cannot be bypassed when evaluated against adaptive, attacker-aligned techniques rather than static benchmark prompts.

    A Familiar Pattern

    The incentive structure driving this dynamic is understandable. Vendors operating in an increasingly crowded market need to stand out. Reporting stellar accuracy and performance figures especially as some vendors have already been acquired for substantive sums is a rational competitive strategy, even if the methodology behind those figures doesn't hold up to independent scrutiny.

    But the pattern is a familiar and troubling one. It closely mirrors the Volkswagen emissions scandal, where performance figures were optimised for the test environment rather than real-world conditions. The consequences of that gap, when discovered, were significant. In AI security, the stakes are comparable: organisations will rely on these technologies to protect their systems, their users, and their data. A guardrail that performs brilliantly on a benchmark but fails against a determined attacker provides protection that is largely illusory.

    What Buyers Should Do Differently

    This post is not an argument against using guardrails or gateways as part of your AI security architecture, they remain a valuable layer of defence when selected and deployed thoughtfully. Rather, it's a call for buyers to approach vendor evaluations with greater rigour and scepticism.

    Specifically, consider:

    • Insisting on adversarial testing that goes beyond standard benchmarks. Engage independent red teamers or AI security specialists to probe the solution rather than relying solely on vendor-curated scenarios.
    • Asking vendors to be explicit about benchmark composition. Understand exactly which datasets were used to train classifiers, and whether the independent benchmarks used in evaluation overlap significantly with training data.
    • Treating accuracy figures as context-dependent. A 97% detection rate against a vendor's preferred benchmark is a very different claim from 97% detection in a production environment facing sophisticated, adaptive attackers.
    • Building ongoing testing into your operations. A guardrail that performs well at the point of purchase may degrade as attack techniques evolve. Continuous evaluation, not just point-in-time assessment is essential.
    • Testing against your own AI systems. A generic benchmark may show whether a guardrail can detect known malicious prompts. It does not show whether the same control can protect your specific models, prompts, tools, data flows, permissions, and agentic workflows.

    The AI security market is maturing, and buyers are gaining sophistication. Closing the credibility gap between vendor claims and real-world performance starts with asking harder questions and expecting credible answers.

    Watch the Webinar

    Join Mindgard founder Peter Garraghan on Thursday, June 11, from 11:00 to 11:30 AM ET for a live webinar on how buyers should evaluate AI guardrails and gateways. Peter will expand on the topics covered in this post, including benchmark limitations, vendor accuracy claims, real-world bypass techniques, and how security teams can test whether these controls hold up against adaptive attackers. The session will include a live Q&A.

    Register for the webinar here.