Updated on December 10, 2025
Model Leeching: cheaply cloning LLM skills for attack staging
Model Leeching shows how attackers can distill ChatGPT-class task knowledge into smaller models for about fifty dollars, then use them to tune follow on attacks.
Key Takeaways
  • Model Leeching extracts task specific behavior from a target LLM using automated prompts and a labelled dataset.
  • We cloned ChatGPT-3.5-Turbo’s question answering behavior into a RoBERTa-Large model using SQuAD and only about fifty dollars in queries.
  • The extracted model reaches 73 percent exact match similarity to ChatGPT responses, with strong SQuAD EM and F1 scores. 
  • Attacks optimized against the extracted model transfer back and raise attack success against ChatGPT-3.5-Turbo by eleven percent.

When organizations integrate large language models into products, they often assume that keeping the model behind an API is enough to protect intellectual property and limit attacker experimentation. The Model Leeching research challenges that assumption.

In this work, we show how a weak adversary, armed only with API access to a production LLM, can distill that system’s task knowledge into a smaller model that runs locally, then use that clone as a test bed for staging more powerful attacks.

From Mindgard’s standpoint, this is exactly the kind of “amplifier” behavior defenders should expect. If you give adversaries repeated, low friction access to a powerful model, they can cheaply bootstrap their own copy and refine attacks out of sight.

From black box API to local clone

Model Leeching treats the production LLM as a black box. The attacker does not know its architecture, training data, or parameters. They only know that they can send prompts and receive responses.

The attack proceeds in four main phases:

  1. Prompt design
    The attacker builds a prompt template that forces the LLM to respond in a strict and machine readable format. For the experiments in the paper, the team targeted question answering on the SQuAD dataset and instructed the model to output a JSON object containing both the short answer span and the full sentence where it appears.
  2. Data generation
    Using this template, the attacker feeds every SQuAD context and question pair to the target LLM and records its responses, automatically filtering failures or malformed outputs. The result is a large adversarial dataset that pairs inputs with the target model’s answers (a sketch of these first two phases follows the list).
  3. Extracted model training
    The attacker then fine-tunes a base model, such as RoBERTa-Large, on this generated dataset, treating the target’s responses as labels. This produces an extracted model that learns to mimic the original LLM’s behavior for that specific task (a training sketch also appears below).
  4. Attack staging
    Finally, the attacker uses the extracted model as a sandbox. They can query it freely, craft adversarial prompts, and iterate on attack strategies without rate limits or monitoring. Once they find a high success attack against the clone, they replay it against the original LLM.
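
To make the first two phases concrete, here is a minimal sketch of prompt design and data generation, assuming an OpenAI style chat completions client. The template wording, the query_target helper, and the output file name are illustrative choices for this post, not the exact prompt or tooling used in the paper.

```python
import json

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # assumes an API key in OPENAI_API_KEY; any chat-style API would do

# Illustrative prompt template: it forces a strict, machine readable JSON reply
# containing both the short answer span and the full sentence it appears in.
TEMPLATE = (
    "Answer the question using only the context.\n"
    'Reply with a JSON object of the form {{"answer": "<shortest answer span>", '
    '"sentence": "<full sentence containing it>"}}.\n\n'
    "Context: {context}\nQuestion: {question}"
)

def query_target(context, question):
    """Send one SQuAD context/question pair to the target and parse its reply."""
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": TEMPLATE.format(context=context, question=question)}],
    )
    try:
        record = json.loads(reply.choices[0].message.content)
        # Automatic filtering: the claimed span must actually appear in the context.
        if record.get("answer") and record["answer"] in context:
            return record
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    return None  # dropped as a failure or malformed output

squad = load_dataset("squad", split="train")
leeched = []
for row in squad:
    labelled = query_target(row["context"], row["question"])
    if labelled is not None:
        leeched.append({"id": row["id"], "context": row["context"],
                        "question": row["question"], "answer": labelled["answer"]})

with open("leech_dataset.json", "w") as f:
    json.dump(leeched, f)
```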

This pipeline is general. Any LLM with an accessible API and a task that can be expressed as input output pairs is a candidate target.
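
The training phase itself is ordinary supervised fine-tuning for extractive question answering. The outline below uses the Hugging Face Trainer on the dataset file produced in the previous sketch; the roberta-large base model matches the paper, but the preprocessing details and hyperparameters here are illustrative rather than taken from it.

```python
import json

from datasets import Dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments, default_data_collator)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForQuestionAnswering.from_pretrained("roberta-large")

# "leech_dataset.json" is the file written by the data generation sketch above.
with open("leech_dataset.json") as f:
    records = [r for r in json.load(f) if r["answer"] in r["context"]]
dataset = Dataset.from_list(records)

def to_features(batch):
    """Tokenize and map ChatGPT's answer spans to start/end token positions."""
    enc = tokenizer(batch["question"], batch["context"], truncation="only_second",
                    max_length=384, padding="max_length", return_offsets_mapping=True)
    starts, ends = [], []
    for i, offsets in enumerate(enc["offset_mapping"]):
        start_char = batch["context"][i].find(batch["answer"][i])
        end_char = start_char + len(batch["answer"][i])
        seq_ids = enc.sequence_ids(i)
        ctx_start = seq_ids.index(1)                          # first context token
        ctx_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)   # last context token
        if offsets[ctx_end][1] < end_char:
            starts.append(0)  # answer truncated away: point both labels at token 0
            ends.append(0)
        else:
            tok = ctx_start
            while tok <= ctx_end and offsets[tok][0] <= start_char:
                tok += 1
            starts.append(tok - 1)
            tok = ctx_end
            while tok >= ctx_start and offsets[tok][1] >= end_char:
                tok -= 1
            ends.append(tok + 1)
    enc["start_positions"] = starts
    enc["end_positions"] = ends
    enc.pop("offset_mapping")
    return enc

train_dataset = dataset.map(to_features, batched=True,
                            remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="leeched-roberta-squad",
                           learning_rate=3e-5, num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
trainer.train()
trainer.save_model("leeched-roberta-squad")
tokenizer.save_pretrained("leeched-roberta-squad")
```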

How effective is the clone?

To make the analysis concrete, we focused on ChatGPT-3.5-Turbo as the target model for question answering. Using SQuAD contexts and questions as inputs, we generated labels with ChatGPT and then trained a RoBERTa-Large extracted model on those outputs.

The results are striking:

  • The extracted model achieves 73 percent exact match similarity with ChatGPT-3.5-Turbo’s predictions on SQuAD.
  • It posts SQuAD exact match and F1 scores of 75 percent and 87 percent respectively.
  • The total API cost for data generation is around fifty dollars.

In other words, for roughly the price of a nice dinner, an attacker can train a smaller local model that closely tracks a production LLM’s decisions on a widely used benchmark.

This undermines the notion that hosted LLM APIs provide strong protection from cloning. Even if the proprietary model remains hidden, its behavior can be replicated with high fidelity for specific tasks.
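
For readers who want to sanity check these numbers, exact match and token level F1 are simple to compute. The sketch below follows the usual SQuAD recipe in simplified form; measuring similarity to the target just means treating ChatGPT’s answer as the reference, while the SQuAD scores use the dataset’s gold answers instead.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1(prediction, reference):
    """Token level F1 between a predicted span and a reference span."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
print(f1("the tall Eiffel Tower", "Eiffel Tower"))      # 0.8
```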

Using the clone to improve attacks

The second half of the paper explores why a stolen clone is useful for adversaries who want more than just free inference. Once a local model exists, attackers can perform unrestricted adversarial testing.

We optimized a language based attack against the extracted model and then transferred that attack back to the original ChatGPT-3.5-Turbo target, observing an eleven percent increase in attack success compared to optimizing directly against the production model.

This matters because it shows transferability in practice. Even though the clone is smaller and architecturally different, it preserves enough of the target’s decision boundary that attacks tuned against it continue to work.

In a realistic scenario, an adversary could:

  • Use the extracted model to find prompts that bypass guardrails or trigger undesired behaviors.
  • Explore multi step attack chains that would be risky to try repeatedly against the real system.
  • Iterate quickly without tripping rate limits or abuse detection on the production API.

Once they are satisfied with the attack, they can send only the most promising payloads against the real LLM, reducing noise in logs and making detection harder.
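
A deliberately simplified sketch of that staging loop is shown below. The word insertion in perturb is a toy placeholder rather than the language based attack evaluated in the paper; the point is the workflow, which searches freely against the local clone and only replays the strongest candidates against the real endpoint.

```python
import random
from transformers import pipeline

# The clone trained earlier; querying it is free and invisible to the API owner.
clone = pipeline("question-answering", model="leeched-roberta-squad")

def perturb(question):
    """Toy perturbation for illustration; real attacks use far stronger search
    strategies such as synonym swaps, gradient guided edits, or jailbreak templates."""
    words = question.split()
    words.insert(random.randrange(len(words) + 1),
                 random.choice(["really", "exactly", "basically"]))
    return " ".join(words)

def stage_attack(context, question, true_answer, budget=500):
    """Search offline for inputs that flip the clone's answer, then keep the best few."""
    successes = []
    for _ in range(budget):
        candidate = perturb(question)
        prediction = clone(question=candidate, context=context)
        if true_answer.lower() not in prediction["answer"].lower():
            successes.append((prediction["score"], candidate))
    # Only the most promising payloads are ever replayed against the production API,
    # which keeps the attacker's footprint in its logs small.
    return [candidate for _, candidate in sorted(successes, reverse=True)[:5]]
```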

Threat model and practical constraints

Model Leeching intentionally assumes a weak adversary. The attacker:

  • Has API level access to the target LLM.
  • Does not know architecture or training data.
  • May be subject to rate limits, but can potentially distribute queries across multiple keys.

Despite these constraints, the attack succeeds because modern LLM APIs are designed for high availability and high throughput. Because legitimate customers expect to fine tune and chain prompts at scale, providers effectively grant the same capability to attackers.

For defenders, the important point is that you do not need a nation state threat actor with custom hardware to face this risk. Any motivated group with modest budget and patience can run a similar extraction campaign.

What Mindgard takes from this work

From Mindgard’s lens, Model Leeching reinforces several design and governance principles:

  • Assume attack staging happens off platform
    Once an adversary has cloned your model or approximated its behavior, much of the dangerous experimentation moves to their own infrastructure. Traditional monitoring that only watches API calls to the production model will miss this activity.
  • Treat task specific behavior as an asset
    Even if an attacker cannot reproduce your full general purpose LLM, they can cheaply distill narrow skills such as domain QA, summarization, or classification. You should model risk at the task level, not only at the model level.
  • Combine behavioral monitoring with adversarial testing
    Owners of LLM powered systems should perform their own adversarial testing to understand which tasks are easiest to emulate and which behaviors transfer most readily to downstream attacks.

The research behind Model Leeching sends a clear signal. Hiding models behind an API does not eliminate model stealing. It simply changes the mechanics. Security teams need to plan for a world where attackers routinely bootstrap local clones and use them to sharpen their tools before they ever touch a production endpoint.

Read the full paper on arXiv.