Who Guards the Guardrails? Five Questions AI Security Buyers Should Ask

AI guardrails and gateways are becoming a core part of the enterprise AI security stack. They are designed to monitor, detect and block malicious instructions sent to and from large language models (LLMs), AI applications and agents.

They can provide a valuable layer of defence. But they should not be treated as infallible.

In a recent Mindgard webinar, I discussed one of the most important questions facing security teams evaluating these technologies: how can buyers tell whether a guardrail will actually hold up against a motivated attacker?

The answer cannot be reduced to a single accuracy score.

‍

The Problem With Headline Accuracy Claims

Some guardrail vendors report detection accuracy of 95–98% or higher. On the surface, these numbers sound compelling. But an accuracy figure is only meaningful when you understand how it was calculated.

What attacks were included in the test? Which datasets were used? Were the prompts obvious or adaptive? Did the evaluation measure false positives as well as successful detections? Was the guardrail tested in a generic lab environment or against the buyer’s actual AI system?

These details matter because the typical evaluation process can flatter the control being tested.

A vendor-led evaluation will often include a combination of:

Vendor-supplied benchmarks derived from open-source datasets or red-teaming tools.
Widely used independent AI security and safety benchmarks.
Manual probing by the prospective customer.

Each approach can provide useful information. None is sufficient on its own.

Many benchmark datasets contain well-known and relatively obvious attempts, such as instructions to disregard previous prompts or generate prohibited content. A competent guardrail should detect these attacks. But blocking the most recognisable examples is not the same as resisting a determined adversary.

‍

Real Attackers Adapt

An attacker does not stop when an obvious jailbreak fails.

They paraphrase instructions. They translate them. They fragment attacks across multiple turns. They exploit context, manipulate role assumptions and test how the system behaves when instructions are obfuscated or disguised as legitimate requests.

In agentic systems, the stakes are even higher. The risk is not limited to what the model says. An attacker may also attempt to manipulate tool access, permissions, retrieval systems and downstream workflows.

This creates a credibility gap. Buyers may be shown strong benchmark results during an evaluation, only to discover that the same guardrail performs very differently when exposed to more realistic attacker behaviour.

The problem is not that benchmarks are useless. The problem is treating benchmark performance as proof of real-world resilience.

‍

Five Questions Buyers Should Ask

When evaluating a guardrail or AI gateway, security teams should ask:

1. What exactly does the accuracy figure measure?

Ask which datasets, attack categories and thresholds were used. Understand whether the reported score reflects a narrow benchmark or a broader evaluation of realistic adversarial behaviour.

2. Has the guardrail been tested against adaptive attacks?

Static prompts are not enough. Evaluations should include techniques such as multi-turn manipulation, contextual obfuscation, character-level evasion and attempts to disguise malicious intent.

3. Has the control been tested in the environment it will protect?

A generic lab test cannot fully reflect the behaviour of your models, prompts, tools, data flows, permissions and agentic workflows. Guardrails should be evaluated in context.

4. Has the evaluation been independently validated?

Vendor-led testing has an obvious limitation: the vendor controls the methodology. Independent testing can help buyers understand where the control performs well and where its defences begin to break down.

5. How will the guardrail be tested after deployment?

AI systems change. Models are updated. Applications gain new capabilities. Attack techniques evolve. A point-in-time evaluation is not enough. Guardrails require continuous testing against emerging threats.

‍

Guardrails Are a Layer of Defence, Not a Guarantee

The conclusion is not that organisations should avoid guardrails or gateways. These controls can play an important role within a layered AI security architecture.

But buyers should be wary of treating a high detection score as a substitute for evidence.

A guardrail that performs well against familiar benchmark prompts but fails against adaptive attackers can create something more dangerous than an obvious security gap: a false sense of confidence.

Security teams should expect vendors to demonstrate how their controls perform under realistic conditions, against the systems they are intended to protect.

Download the slides from the webinar: Who Guards the Guardrails?

‍