Modern AI Red Teaming: Probabilities, Vulnerabilities, and Psychometrics

Updated on

January 22, 2026

A technical exploration of modern AI red teaming, examining how probabilistic behavior, classic vulnerabilities, and psychometric steering combine to create real-world AI security risk.

Fergal Glynn

TABLE OF CONTENTS

Key Takeaways

AI security failures are probabilistic, not deterministic. Modern AI systems fail when attackers repeatedly coerce low-probability behaviors until a high-impact action occurs, which breaks traditional assumptions about reproducibility, patching, and assurance.

Classic vulnerabilities still drive real impact, but AI multiplies the paths to reach them. Probabilistic decision-making, collapsed control and data boundaries, and deep system integrations dramatically expand how attackers can trigger familiar weaknesses at scale.

Psychometric susceptibility is a first-class attack surface. Behavioral incentives encoded in system prompts, tools, and UX create measurable failure modes that attackers exploit, making continuous, attacker-aligned AI red teaming essential for managing real-world AI risk.

Modern AI red teaming demands a shift in how security teams reason about risk, assurance, and impact. Traditional application security models assume deterministic behavior, stable interfaces, and clearly defined trust boundaries. AI systems violate all three assumptions simultaneously.

In his recent Boston Security Meetup presentation, Aaron Portnoy outlined a practical framework for modern AI red teaming grounded in three interdependent pillars: probabilities, vulnerabilities, and psychometrics. Rather than treating AI security as a subset of model safety or evaluation, the talk reframed it as a real-world offensive discipline focused on how probabilistic cognition, conventional software weaknesses, and behavioral steering intersect to produce exploitable outcomes.

Aaron Portnoy

‍

This framing aligns with Mindgard’s attacker-aligned AI security philosophy. Effective AI security does not emerge from abstract benchmarks or static evaluations. It emerges from understanding how attackers actually discover, coerce, and exploit AI behavior in deployed systems.

This article summarizes Aaron’s talk, connects it to Mindgard’s broader philosophy, and explains why modern AI red teaming must operate simultaneously across models, systems, and behavioral incentives.

Download the slide deck here.

‍

From Deterministic Software to Probabilistic Systems

Classical software security is built on determinism. Given the same input and system state, a program produces the same output. Vulnerability discovery revolves around identifying reachable bugs through exposed interfaces, then reproducing and exploiting those flaws.

AI systems break this model at a foundational level.

Modern AI applications are probabilistic software systems. Their outputs are sampled from learned distributions, not derived from fixed control flow. The system state is not limited to memory and variables, but includes prompts, latent representations, retrieved context, tool outputs, long-term memory, and external environment signals.

As Aaron emphasized, two executions of an AI system with identical user input can diverge meaningfully in behavior. This divergence is not a bug. It is an inherent property of probabilistic inference.

From a security perspective, this changes the definition of failure.

Traditional software fails when an attacker creates a path to a bug. AI systems fail when an attacker coerces an undesirable outcome from a probability distribution. If a dangerous action occurs one percent of the time, an attacker will simply iterate until they reach it.

This reframing has deep implications for security assurance. Patching no longer means fixing code paths. It often means adjusting prompts, schemas, or policies and hoping that the probability mass shifts in the intended direction. Assurance becomes statistical, not absolute.

‍

Probabilistic Cognition and Action Selection

AI models do not “reason” in the human sense, but they do follow plausibility paths shaped by learned patterns. Small linguistic changes can shift how the model interprets intent, authority, urgency, or scope.

Aaron highlighted several recurring phenomena relevant to red teaming:

Ambiguity collapse, where subtle phrasing differences cause the model to infer different implied goals.
Context sensitivity, where salience is altered by surrounding language or retrieved documents.
Instruction weighting, where cues associated with authority or obligation are prioritized over others.

In modern AI products, these probabilistic interpretations do not remain confined to text generation. Most enterprise deployments are no longer simple chatbots. They are policy-driven agents capable of calling tools, writing files, querying internal systems, and executing workflows.

Probabilistic cognition therefore becomes probabilistic action selection. When an agent chooses whether to call a tool, fetch a URL, or execute a command, it is sampling from a distribution shaped by its prompts, policies, and context.

From an attacker’s perspective, this is fertile ground. A misclassification that occurs rarely is still exploitable. AI red teaming must therefore focus on identifying low-probability but high-impact action paths and demonstrating how they can be reached reliably through iteration.

This principle sits at the core of Mindgard’s philosophy: attackers do not need certainty, only repeatability at scale.

‍

When Probabilities Meet Classic Vulnerabilities

Probabilistic behavior alone is not what makes AI systems dangerous. The real security impact emerges when probabilistic decisions intersect with conventional software vulnerabilities.

As Aaron noted, AI systems integrate deeply with familiar infrastructure:

Internal REST APIs and RPC services
Filesystems, repositories, and CI/CD pipelines
Cloud consoles, IAM roles, and metadata services
Ticketing systems, document stores, and knowledge bases

The vulnerability classes exposed through these integrations are not novel. Injection flaws, broken authorization, over-privileged tokens, SSRF-like access patterns, and secrets exposure all persist.

What AI introduces is path explosion. There are dramatically more ways to reach these vulnerabilities because language itself becomes an interface. Any text the model ingests may influence control decisions.

Aaron shared real examples uncovered using Mindgard's technology where seemingly benign tools were abused once an attacker learned how to steer the model’s behavior. In one case, a web fetching capability could be coerced into scanning internal networks, accessing cloud instance metadata, or retrieving sensitive internal dashboards by exploiting alternative encodings of restricted addresses.

These are not theoretical risks. They are manifestations of classic weaknesses amplified by probabilistic control flow.

This reinforces a central Mindgard principle: AI security failures are system-level failures. Evaluating models in isolation misses the true attack surface.

‍

The Collapse of Control and Data Boundaries

In traditional systems, control and data planes are distinct. Commands, policies, and permissions operate separately from content.

AI systems collapse this separation.

Modern agents routinely ingest ambient context from repositories, tickets, documentation, configuration files, and user messages. All of this context is processed through a shared language interface. The model must infer which inputs are instructions and which are merely data.

As Aaron described, this ambiguity dissolves familiar trust boundaries. Content can become control. Documentation can influence behavior. Comments can override safeguards if framed persuasively enough.

From a defensive standpoint, this means AI systems must be designed assuming occasional misclassification of control versus data. No prompt, policy, or classifier will be perfect.

The only viable compensating controls are strong containment, least privilege, and rigorous system-level testing that assumes failures will occur.

‍

Psychometrics: Measuring Susceptibility at Scale

The most novel pillar of Aaron’s framework is psychometrics.

Psychometrics is the science of measuring traits and behavioral propensities. In human security, it underpins decades of research into social engineering, persuasion, and compliance.

AI systems reproduce these dynamics because they are trained on human language and behavior. Authority cues, urgency framing, reciprocity, and consequence signaling all influence how models interpret and act on requests.

Aaron argued that psychometrics provides a structured way to reason about behavioral attack surfaces in AI systems.

Rather than treating prompt injection or jailbreaking as ad hoc tricks, psychometric framing allows red teams to probe measurable dimensions such as:

Deference to authority
Sensitivity to urgency
Helpfulness bias
Risk tolerance
Policy adherence
Verification appetite

These traits can be tested systematically across models, prompts, and deployments.

Critically, many AI applications amplify susceptibility through their own incentives. Phrases like “be helpful,” “reduce friction,” “take initiative,” or “assume good intent” are often encoded directly into system prompts, tool descriptions, and UI copy.

These defaults shape baseline behavior. Attackers do not need to invent novel exploits. They search for framing that aligns with the system’s own incentives.

‍

Trait Encoding and Measurable Failure Modes

Psychometric traits manifest as consistent, security-relevant failure modes under pressure.

Aaron highlighted recurring patterns observed in real systems:

Verification collapse, where the model stops checking assumptions.
Boundary drift, where scope expands incrementally.
Authority mis-weighting, where certain phrasing overrides safeguards.
Action inflation, where the agent takes unnecessary or overly broad actions.
Disclosure creep, where sensitive information is gradually revealed.

These are not random failures. They are predictable outcomes of trait encoding combined with probabilistic behavior.

In one example discussed in the talk, specific wording and capitalization choices in a system prompt materially altered how the agent interpreted rules, enabling exploitation. This underscores a key Mindgard insight: system prompts are security-critical configuration, not mere UX artifacts.

Psychometric testing therefore becomes a first-class security activity. By measuring susceptibility directly, defenders can identify the easiest paths attackers will take to reach vulnerable behavior.

‍

Connecting the Three Pillars

The strength of the framework lies in how the pillars reinforce one another.

Probabilities determine how often a risky behavior occurs.
Vulnerabilities determine the blast radius when it does.
Psychometrics determine how easily attackers can steer the system toward those behaviors.

Modern AI red teaming must operate across all three simultaneously. Focusing on only one produces false confidence.

This integrated view maps directly to Mindgard’s attacker-aligned philosophy. Security is not about proving that systems behave correctly under ideal conditions. It is about demonstrating how they fail under realistic adversarial pressure, then providing concrete evidence of risk reduction.

‍

Defensive Implications for AI Security Teams

Aaron concluded with several defensive takeaways that align closely with Mindgard’s approach:

Design AI systems assuming predictable persuasion attempts will occur.
Enforce explicit confirmation and verification for side-effect-producing actions.
Treat susceptibility testing as a routine part of security evaluation.
Contain behavior through least privilege and strong isolation

These recommendations move beyond surface-level safeguards. They require continuous, system-level testing that mirrors attacker behavior.

This is why Mindgard emphasizes automated AI red teaming grounded in real exploitation techniques. Static evaluations, safety datasets, and policy checks are necessary but insufficient. Security teams need visibility into how probabilistic behavior, system integrations, and behavioral incentives combine in their actual deployments.

‍

Conclusion

Modern AI red teaming is not an extension of traditional AppSec, nor is it a subset of model safety research. It is a distinct discipline that treats models, machines, and minds as equally exploitable terrain.

Aaron Portnoy’s framework of probabilities, vulnerabilities, and psychometrics provides a practical lens for understanding where real AI security risk emerges. It reinforces a core Mindgard belief: effective AI security must be attacker-aligned, system-level, and grounded in measurable impact.

As AI systems continue to expand their autonomy and integration into critical workflows, the cost of misunderstanding these dynamics will only grow. The teams that succeed will be those that test AI the way attackers do, not the way specifications assume it should behave.

‍

About the Speaker

Aaron Portnoy is a security researcher, reverse engineer, speaker, and educator. Aaron has delivered over 30 invited talks across global stages, including Black Hat, USENIX, Recon, Microsoft BlueHat, and the NSA Distinguished Speaker Series. Presentation topics have spanned reverse engineering methodologies, economic drivers of zero-day trade, large-scale vulnerability discovery, and exploit automation. Recognized by academic institutions such as NYU and Dartmouth for his contributions to offensive computing education. His work has also been featured on the cover of TIME Magazine and referenced in numerous university-level curricula. Aaron ran the Zero Day Initiative at TippingPoint/3Com (later Hewlett-Packard and now part of Trend Micro), where he was the architect and judge of the first six competitions of Pwn2Own, a premier hacking event that awards prizes for demonstrating zero-day exploits.

‍

PINCH framework: mapping how model stealing attacks really work

PINCH is an automated framework that runs large-scale extraction attacks across deep learning architectures to reveal how and when model stealing actually succeeds.

Hunting AI Application Vulnerabilities With Burp

In this article we’ll walk through hunting for AI application vulnerabilities. We’ll use Mindgard to find application vulnerabilities in a deliberately-vulnerable LLM lab application made available by PortSwigger.

Model Leeching: cheaply cloning LLM skills for attack staging

Model Leeching shows how attackers can distill ChatGPT-class task knowledge into smaller models for about fifty dollars, then use them to tune follow on attacks.

Mindgard, the leading provider of Artificial Intelligence security solutions, helps enterprises secure their AI models, agents, and systems across the entire lifecycle. Mindgard’s solution uncovers shadow AI, conducts automated AI red teaming by emulating adversaries, and delivers runtime protection against attacks like prompt injection and agentic manipulation. Trusted by leading organizations in finance, healthcare, and technology, Mindgard is backed by investors including .406 Ventures, IQ Capital, Atlantic Bridge, and Lakestar.