Stefan Trawicki

When organizations integrate large language models into products, they often assume that keeping the model behind an API is enough to protect intellectual property and limit attacker experimentation. The Model Leeching research challenges that assumption.
In this work, we show how a weak adversary, armed only with API access to a production LLM, can distill that system’s task knowledge into a smaller model that runs locally. The attacker can then use that clone as a test bed for staging more powerful attacks.
From Mindgard’s standpoint, this is exactly the kind of “amplifier” behavior defenders should expect. If you give adversaries repeated, low-friction access to a powerful model, they can cheaply bootstrap their own copy and refine attacks out of sight.
Model Leeching treats the production LLM as a black box. The attacker does not know its architecture, training data, or parameters. They only know that they can send prompts and receive responses.
The attack proceeds in four main phases:

- Prompt design: craft task-specific prompts that reliably elicit the behavior the attacker wants to copy.
- Data generation: query the target API at scale and record its responses as labels.
- Extracted model training: fine-tune a smaller local model on those input-output pairs.
- Attack staging: use the local clone to develop and refine attacks before transferring them to the target.
This pipeline is general. Any LLM with an accessible API and a task that can be expressed as input-output pairs is a candidate target.
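To make the data-generation phase concrete, here is a minimal sketch of what querying the target through its public API might look like, assuming the OpenAI Python SDK and the Hugging Face `datasets` library are available; the prompt wording, model name, and output file are illustrative stand-ins rather than the exact setup from the paper.

```python
import json
from datasets import load_dataset   # assumed: Hugging Face `datasets`
from openai import OpenAI           # assumed: OpenAI Python SDK v1

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load ground-truth questions and contexts; only the inputs are reused,
# the labels come from the target model's responses.
squad = load_dataset("squad", split="train[:1000]")  # small slice for illustration

def label_with_target(context: str, question: str) -> str:
    """Ask the target LLM to answer with a short span copied from the context."""
    prompt = (
        "Answer the question using only a short span copied verbatim from the context.\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical stand-in for the production target
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Build the distillation dataset: (context, question) -> target model's answer.
with open("leeched_squad.jsonl", "w") as f:
    for row in squad:
        answer = label_with_target(row["context"], row["question"])
        f.write(json.dumps({
            "context": row["context"],
            "question": row["question"],
            "answer": answer,
        }) + "\n")
```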
To make the analysis concrete, we focused on ChatGPT-3.5-Turbo as the target model for question answering. Using the SQuAD dataset as ground truth, we generated labels by querying ChatGPT and then trained a RoBERTa-Large extracted model on those outputs.
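A rough sketch of the extracted-model training step is shown below, assuming the distilled labels from the previous sketch are available as a JSON Lines file; the hyperparameters, file names, and span-alignment logic are illustrative, not the paper’s exact training recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments, default_data_collator)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForQuestionAnswering.from_pretrained("roberta-large")

# Labels produced by querying the target model (see previous sketch).
raw = load_dataset("json", data_files="leeched_squad.jsonl", split="train")

def preprocess(example):
    # Tokenize question + context together and locate the target's answer span.
    answer = example["answer"]
    start_char = example["context"].find(answer)
    enc = tokenizer(
        example["question"], example["context"],
        truncation="only_second", max_length=384,
        return_offsets_mapping=True, padding="max_length",
    )
    offsets = enc.pop("offset_mapping")
    start_tok = end_tok = 0  # non-span answers are treated as unanswerable
    if start_char != -1:
        end_char = start_char + len(answer)
        seq_ids = enc.sequence_ids()
        for i, (s, e) in enumerate(offsets):
            if seq_ids[i] != 1:   # skip question and padding tokens
                continue
            if s <= start_char < e:
                start_tok = i
            if s < end_char <= e:
                end_tok = i
    enc["start_positions"] = start_tok
    enc["end_positions"] = end_tok
    return enc

train_ds = raw.map(preprocess, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="extracted-roberta", num_train_epochs=2,
                           per_device_train_batch_size=8, learning_rate=3e-5),
    train_dataset=train_ds,
    data_collator=default_data_collator,
)
trainer.train()

# Save the extracted model locally for unrestricted offline use.
trainer.save_model("extracted-roberta")
tokenizer.save_pretrained("extracted-roberta")
```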
The results are striking: for roughly the price of a nice dinner in API fees, an attacker can train a smaller local model that closely tracks the production LLM’s decisions on a widely used benchmark.
This undermines the notion that hosted LLM APIs provide strong protection from cloning. Even if the proprietary model remains hidden, its behavior can be replicated with high fidelity for specific tasks.
The second half of the paper explores why a stolen clone is useful for adversaries who want more than just free inference. Once a local model exists, attackers can perform unrestricted adversarial testing.
We optimized a language-based attack against the extracted model and then transferred it back to the original ChatGPT-3.5-Turbo target, observing an 11% increase in attack success compared to optimizing directly against the production model.
This matters because it shows transferability in practice. Even though the clone is smaller and architecturally different, it preserves enough of the target’s decision boundary that attacks tuned against it continue to work.
In a realistic scenario, an adversary could:

- extract a task-specific clone of the target model using nothing but ordinary API access;
- develop and refine prompt-based attacks against that clone offline, free of rate limits, usage monitoring, and per-query costs;
- iterate until the attack succeeds reliably against the local copy.
Once they are satisfied with the attack, they can send only the most promising payloads against the real LLM, reducing noise in logs and making detection harder.
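The sketch below illustrates the general shape of that workflow: candidate adversarial prompts are scored against the local clone, and only the top performers would ever be forwarded to the production endpoint. The candidate prompts, success criterion, and helper functions here are hypothetical placeholders; the actual attack optimization in the paper is considerably more sophisticated.

```python
from transformers import pipeline

# Local clone trained in the previous step; querying it costs nothing and leaves no logs.
clone = pipeline("question-answering", model="extracted-roberta")

def attack_succeeds(answer: str) -> bool:
    """Hypothetical success criterion, e.g. the model is steered to an attacker-chosen span."""
    return "attacker-chosen" in answer.lower()

def score_candidate(prompt: str, probes: list[dict]) -> float:
    """Fraction of probe contexts on which an adversarial prompt succeeds against the clone."""
    hits = 0
    for probe in probes:
        result = clone(question=prompt + " " + probe["question"], context=probe["context"])
        hits += attack_succeeds(result["answer"])
    return hits / len(probes)

def stage_attack(candidate_prompts: list[str], probes: list[dict], top_k: int = 3) -> list[str]:
    """Rank candidates offline; only the top_k would ever be sent to the real LLM."""
    ranked = sorted(candidate_prompts,
                    key=lambda p: score_candidate(p, probes),
                    reverse=True)
    return ranked[:top_k]
```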
Model Leeching intentionally assumes a weak adversary. The attacker:

- has no knowledge of the target’s architecture, weights, or training data;
- interacts with the model only through its public API, sending prompts and receiving responses;
- works with a modest budget and commodity hardware, with no insider access.
Despite these constraints, the attack succeeds because modern LLM APIs are designed for high availability and high throughput. When providers let customers fine-tune or chain prompts at scale, they effectively grant the same capability to attackers.
For defenders, the important point is that you do not need a nation-state threat actor with custom hardware to face this risk. Any motivated group with a modest budget and patience can run a similar extraction campaign.
From Mindgard’s lens, Model Leeching reinforces several design and governance principles:

- treat a model’s task behavior as extractable intellectual property: if it can be queried at scale, it can be cloned;
- monitor API usage for the systematic, dataset-scale query patterns that extraction requires, and rate limit accordingly;
- assume that attacks refined against local clones will transfer, and red team production models with that threat model in mind.
The research behind Model Leeching sends a clear signal. Hiding models behind an API does not eliminate model stealing. It simply changes the mechanics. Security teams need to plan for a world where attackers routinely bootstrap local clones and use them to sharpen their tools before they ever touch a production endpoint.