Mindgard’s GitHub Action example repository shows how to integrate automated AI security testing into CI/CD pipelines so every model or code change is validated against the latest Mindgard capabilities.
Fergal Glynn

Modern AI red teaming demands a shift in how security teams reason about risk, assurance, and impact. Traditional application security models assume deterministic behavior, stable interfaces, and clearly defined trust boundaries. AI systems violate all three assumptions simultaneously.
In his recent Boston Security Meetup presentation, Aaron Portnoy outlined a practical framework for modern AI red teaming grounded in three interdependent pillars: probabilities, vulnerabilities, and psychometrics. Rather than treating AI security as a subset of model safety or evaluation, the talk reframed it as a real-world offensive discipline focused on how probabilistic cognition, conventional software weaknesses, and behavioral steering intersect to produce exploitable outcomes.

This framing aligns with Mindgard’s attacker-aligned AI security philosophy. Effective AI security does not emerge from abstract benchmarks or static evaluations. It emerges from understanding how attackers actually discover, coerce, and exploit AI behavior in deployed systems.
This article summarizes Aaron’s talk, connects it to Mindgard’s broader philosophy, and explains why modern AI red teaming must operate simultaneously across models, systems, and behavioral incentives.
Download the slide deck here.
Classical software security is built on determinism. Given the same input and system state, a program produces the same output. Vulnerability discovery revolves around identifying reachable bugs through exposed interfaces, then reproducing and exploiting those flaws.
AI systems break this model at a foundational level.
Modern AI applications are probabilistic software systems. Their outputs are sampled from learned distributions, not derived from fixed control flow. The system state is not limited to memory and variables, but includes prompts, latent representations, retrieved context, tool outputs, long-term memory, and external environment signals.
As Aaron emphasized, two executions of an AI system with identical user input can diverge meaningfully in behavior. This divergence is not a bug. It is an inherent property of probabilistic inference.
From a security perspective, this changes the definition of failure.
Traditional software fails when an attacker creates a path to a bug. AI systems fail when an attacker coerces an undesirable outcome from a probability distribution. If a dangerous action occurs one percent of the time, an attacker will simply iterate until they reach it.
This reframing has deep implications for security assurance. Patching no longer means fixing code paths. It often means adjusting prompts, schemas, or policies and hoping that the probability mass shifts in the intended direction. Assurance becomes statistical, not absolute.
AI models do not “reason” in the human sense, but they do follow plausibility paths shaped by learned patterns. Small linguistic changes can shift how the model interprets intent, authority, urgency, or scope.
Aaron highlighted several recurring phenomena relevant to red teaming:
In modern AI products, these probabilistic interpretations do not remain confined to text generation. Most enterprise deployments are no longer simple chatbots. They are policy-driven agents capable of calling tools, writing files, querying internal systems, and executing workflows.
Probabilistic cognition therefore becomes probabilistic action selection. When an agent chooses whether to call a tool, fetch a URL, or execute a command, it is sampling from a distribution shaped by its prompts, policies, and context.
From an attacker’s perspective, this is fertile ground. A misclassification that occurs rarely is still exploitable. AI red teaming must therefore focus on identifying low-probability but high-impact action paths and demonstrating how they can be reached reliably through iteration.
This principle sits at the core of Mindgard’s philosophy: attackers do not need certainty, only repeatability at scale.
Probabilistic behavior alone is not what makes AI systems dangerous. The real security impact emerges when probabilistic decisions intersect with conventional software vulnerabilities.
As Aaron noted, AI systems integrate deeply with familiar infrastructure:
The vulnerability classes exposed through these integrations are not novel. Injection flaws, broken authorization, over-privileged tokens, SSRF-like access patterns, and secrets exposure all persist.
What AI introduces is path explosion. There are dramatically more ways to reach these vulnerabilities because language itself becomes an interface. Any text the model ingests may influence control decisions.
Aaron shared real examples uncovered using Mindgard's technology where seemingly benign tools were abused once an attacker learned how to steer the model’s behavior. In one case, a web fetching capability could be coerced into scanning internal networks, accessing cloud instance metadata, or retrieving sensitive internal dashboards by exploiting alternative encodings of restricted addresses.
These are not theoretical risks. They are manifestations of classic weaknesses amplified by probabilistic control flow.
This reinforces a central Mindgard principle: AI security failures are system-level failures. Evaluating models in isolation misses the true attack surface.
In traditional systems, control and data planes are distinct. Commands, policies, and permissions operate separately from content.
AI systems collapse this separation.
Modern agents routinely ingest ambient context from repositories, tickets, documentation, configuration files, and user messages. All of this context is processed through a shared language interface. The model must infer which inputs are instructions and which are merely data.
As Aaron described, this ambiguity dissolves familiar trust boundaries. Content can become control. Documentation can influence behavior. Comments can override safeguards if framed persuasively enough.
From a defensive standpoint, this means AI systems must be designed assuming occasional misclassification of control versus data. No prompt, policy, or classifier will be perfect.
The only viable compensating controls are strong containment, least privilege, and rigorous system-level testing that assumes failures will occur.
The most novel pillar of Aaron’s framework is psychometrics.
Psychometrics is the science of measuring traits and behavioral propensities. In human security, it underpins decades of research into social engineering, persuasion, and compliance.
AI systems reproduce these dynamics because they are trained on human language and behavior. Authority cues, urgency framing, reciprocity, and consequence signaling all influence how models interpret and act on requests.
Aaron argued that psychometrics provides a structured way to reason about behavioral attack surfaces in AI systems.
Rather than treating prompt injection or jailbreaking as ad hoc tricks, psychometric framing allows red teams to probe measurable dimensions such as:
These traits can be tested systematically across models, prompts, and deployments.
Critically, many AI applications amplify susceptibility through their own incentives. Phrases like “be helpful,” “reduce friction,” “take initiative,” or “assume good intent” are often encoded directly into system prompts, tool descriptions, and UI copy.
These defaults shape baseline behavior. Attackers do not need to invent novel exploits. They search for framing that aligns with the system’s own incentives.
Psychometric traits manifest as consistent, security-relevant failure modes under pressure.
Aaron highlighted recurring patterns observed in real systems:
These are not random failures. They are predictable outcomes of trait encoding combined with probabilistic behavior.
In one example discussed in the talk, specific wording and capitalization choices in a system prompt materially altered how the agent interpreted rules, enabling exploitation. This underscores a key Mindgard insight: system prompts are security-critical configuration, not mere UX artifacts.
Psychometric testing therefore becomes a first-class security activity. By measuring susceptibility directly, defenders can identify the easiest paths attackers will take to reach vulnerable behavior.
The strength of the framework lies in how the pillars reinforce one another.
Modern AI red teaming must operate across all three simultaneously. Focusing on only one produces false confidence.
This integrated view maps directly to Mindgard’s attacker-aligned philosophy. Security is not about proving that systems behave correctly under ideal conditions. It is about demonstrating how they fail under realistic adversarial pressure, then providing concrete evidence of risk reduction.
Aaron concluded with several defensive takeaways that align closely with Mindgard’s approach:
These recommendations move beyond surface-level safeguards. They require continuous, system-level testing that mirrors attacker behavior.
This is why Mindgard emphasizes automated AI red teaming grounded in real exploitation techniques. Static evaluations, safety datasets, and policy checks are necessary but insufficient. Security teams need visibility into how probabilistic behavior, system integrations, and behavioral incentives combine in their actual deployments.
Modern AI red teaming is not an extension of traditional AppSec, nor is it a subset of model safety research. It is a distinct discipline that treats models, machines, and minds as equally exploitable terrain.
Aaron Portnoy’s framework of probabilities, vulnerabilities, and psychometrics provides a practical lens for understanding where real AI security risk emerges. It reinforces a core Mindgard belief: effective AI security must be attacker-aligned, system-level, and grounded in measurable impact.
As AI systems continue to expand their autonomy and integration into critical workflows, the cost of misunderstanding these dynamics will only grow. The teams that succeed will be those that test AI the way attackers do, not the way specifications assume it should behave.
Aaron Portnoy is a security researcher, reverse engineer, speaker, and educator. Aaron has delivered over 30 invited talks across global stages, including Black Hat, USENIX, Recon, Microsoft BlueHat, and the NSA Distinguished Speaker Series. Presentation topics have spanned reverse engineering methodologies, economic drivers of zero-day trade, large-scale vulnerability discovery, and exploit automation. Recognized by academic institutions such as NYU and Dartmouth for his contributions to offensive computing education. His work has also been featured on the cover of TIME Magazine and referenced in numerous university-level curricula. Aaron ran the Zero Day Initiative at TippingPoint/3Com (later Hewlett-Packard and now part of Trend Micro), where he was the architect and judge of the first six competitions of Pwn2Own, a premier hacking event that awards prizes for demonstrating zero-day exploits.