Introduction
As we’ve covered previously, AI red teaming is a highly effective means of assessing and improving the security of AI systems. The term “red teaming” appears many times throughout recent public policy briefings regarding AI, including:
- Voluntary commitments made by leading AI companies to the US Government
- President Biden’s executive order regarding AI security
- A briefing introducing the UK Government’s AI Safety Institute
- The EU Artificial Intelligence Act
Unfortunately, the term “red teaming” is currently doing triple duty in conversations about security for AI, which can be confusing. In this post, we tease apart these three different types of AI red teaming. Each type plays a crucial but distinct role in improving security for AI. Using precise language is an important step towards building a mature ecosystem of AI red teaming services.
Adversary Simulation: Identifying Vulnerabilities in Deployed AI
It is often highly informative to simulate the tactics, techniques, and procedures of threat actors who target deployed AI systems and seek to make the AI behave in unintended ways. This type of red teaming engagement might include efforts to alter the model (e.g., by injecting ransomware into a machine learning model file), bypass it (e.g., by crafting adversarial examples), or steal it (e.g., through a carefully crafted sequence of queries). It could also include attacks specific to LLMs, such as various forms of prompt injection and jailbreaking.
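As an illustration of the first of these (tampering with a model file), here is a minimal sketch of how a red teamer or defender might statically inspect a pickle-serialized model file for imports commonly abused to execute code at load time. The module list and the `scan_model_pickle` helper are illustrative choices, not part of any particular tool or of HiddenLayer's methodology.

```python
import pickletools

# Modules a benign serialized model rarely needs, but injected payloads often do.
# This list is illustrative, not exhaustive.
SUSPICIOUS_MODULES = {"os", "subprocess", "builtins", "posix", "nt", "socket"}

def scan_model_pickle(path: str):
    """Statically list pickle imports that could indicate injected code."""
    findings = []
    with open(path, "rb") as f:
        data = f.read()
    for opcode, arg, pos in pickletools.genops(data):
        # GLOBAL/INST opcodes carry a "module name" string pair as their argument.
        if opcode.name in ("GLOBAL", "INST") and isinstance(arg, str):
            module = arg.split()[0].split(".")[0]
            if module in SUSPICIOUS_MODULES:
                findings.append((pos, opcode.name, arg))
    return findings

# Example: review anything reported here before ever unpickling an untrusted file.
# print(scan_model_pickle("model.pkl"))
```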
This type of red teaming is the most common and widely applicable. In nearly all cases where an organization uses AI for a business critical function, it is wise to perform regular, comprehensive stress testing to minimize the chances that an adversary could compromise the system. Here is an illustrative example of this style of red teaming applied by HiddenLayer to a model used by a client in the financial services industry.
In contrast, the second and third categories of AI red teaming almost always involve frontier AI models and the labs that train them. By “frontier AI model,” we mean a model with state-of-the-art performance on key capability metrics. A “frontier AI lab” is a company that actively researches, designs, trains, and deploys frontier AI models. For example, DeepMind is a frontier AI lab, and its current frontier model is the Gemini 1.5 model family.
Model Evaluations: Identifying Dangerous Capabilities in Frontier Models
Given a compute budget C, training dataset size T, and number of model parameters P, scaling laws can be used to predict fairly accurately the overall level of performance (averaged across a wide variety of tasks) that a large generative model will achieve once it has been trained. On the other hand, the level of performance the model will achieve on any particular task appears to be difficult to predict (although this has been disputed). It would be incredibly useful, both for frontier AI labs and for policymakers, if there were standardized, accurate, and reliable tests that could measure specific capabilities in large generative models.
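To make the contrast concrete, here is a minimal sketch of the Chinchilla-style functional form for predicted training loss as a function of parameter count and training tokens (Hoffmann et al., 2022). The constants below are rough illustrative approximations, not exact fitted coefficients.

```python
# A sketch of a Chinchilla-style scaling law: aggregate loss falls off as a power
# law in model parameters N and training tokens D. Constants are illustrative
# approximations, not exact fitted values.
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 410.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta"""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Aggregate performance is predictable from scale alone...
print(predicted_loss(7e9, 1.4e12))    # smaller model
print(predicted_loss(70e9, 1.4e12))   # 10x more parameters, same data
# ...but whether a specific capability emerges at a given loss level is much
# harder to predict.
```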
High-quality tests for measuring the degree to which a model possesses dangerous capabilities, such as CBRN (chemical, biological, radiological, and nuclear) and offensive cyber capabilities, are of particular interest. Every time a new frontier model is trained, it would be beneficial to be able to answer the following question: To what extent would white box access to this model increase a bad actor’s ability to do harm at a scale above and beyond what they could do just with access to the internet and textbooks? Regulators have been asking for these tests for months:
- Voluntary AI commitments to the White House
“Commit to internal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns, such as bio, cyber, and other safety areas.”
- President Biden’s executive order on AI security
Companies must provide the Federal Government with “the results of any red-team testing that the company has conducted relating to lowering the barrier to entry for the development, acquisition, and use of biological weapons by non-state actors; the discovery of software vulnerabilities and development of associated exploits. . .”
- UK AI Safety Institute
“Dual-use capabilities: As AI systems become more capable, there could be an increased risk that malicious actors could use these systems as tools to cause harm. Evaluations will gauge the capabilities most relevant to enabling malicious actors, such as aiding in cyber-criminality, biological or chemical science, human persuasion, large-scale disinformation campaigns, and weapons acquisition.”
Frontier AI labs are also investing heavily in the development of internal model evaluations for dangerous capabilities:
“As AI models become more capable, we believe that they will create major economic and social value, but will also present increasingly severe risks. Our RSP focuses on catastrophic risks – those where an AI model directly causes large scale devastation. Such risks can come from deliberate misuse of models (for example use by terrorists or state actors to create bioweapons)…”
“We believe that frontier AI models, which will exceed the capabilities currently present in the most advanced existing models, have the potential to benefit all of humanity. But they also pose increasingly severe risks. Managing the catastrophic risks from frontier AI will require answering questions like: How dangerous are frontier AI systems when put to misuse, both now and in the future? How can we build a robust framework for monitoring, evaluation, prediction, and protection against the dangerous capabilities of frontier AI systems? If our frontier AI model weights were stolen, how might malicious actors choose to leverage them?”
“Identifying capabilities a model may have with potential for severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm.”
A healthy, truth-seeking debate about the level of risk from misuse of advanced AI will be critical for navigating mitigation measures that are proportional to the risk while not hindering innovation. That being said, here are a few reasons why frontier AI labs and governing bodies are dedicating a lot of attention and resources to dangerous capabilities evaluations for frontier AI:
- Developing a mature science of measurement for frontier model capabilities will likely take a lot of time and many iterations to figure out what works and what doesn’t. Getting this right requires planning ahead so that if and when models with truly dangerous capabilities arrive, we will be well-equipped to detect these capabilities and avoid allowing the model to land in the wrong hands.
- Many desirable AI capabilities fall under the definition of “dual-use,” meaning that they can be leveraged for both constructive and destructive aims. For example, in order to be useful for aiding in cyber threat mitigation, a model must learn to understand computer networking, cyber threats, and computer vulnerabilities. This capability can be put to use by threat actors seeking to attack computer systems.
- Frontier AI labs have already begun to develop dangerous capabilities evaluations for their respective models, and in each case early signs of dangerous capabilities were detected:
- Anthropic: “Taken together, we think that unmitigated LLMs could accelerate a bad actor’s efforts to misuse biology relative to solely having internet access, and enable them to accomplish tasks they could not without an LLM. These two effects are likely small today, but growing relatively fast. If unmitigated, we worry that these kinds of risks are near-term, meaning that they may be actualized in the next two to three years, rather than five or more.”
- OpenAI: “Overall, especially given the uncertainty here, our results indicate a clear and urgent need for more work in this domain. Given the current pace of progress in frontier AI systems, it seems possible that future systems could provide sizable benefits to malicious actors. It is thus vital that we build an extensive set of high-quality evaluations for biorisk (as well as other catastrophic risks), advance discussion on what constitutes ‘meaningful’ risk, and develop effective strategies for mitigating risk.”
- DeepMind: “More broadly, the stronger models exhibited at least rudimentary abilities across all our evaluations, hinting that dangerous capabilities may emerge as a byproduct of improvements in general capabilities. . . We commissioned a group of professional forecasters to predict when models will first obtain high scores on our evaluations, and their median estimates were between 2025 and 2029 for different capabilities.”
NIST recently published a draft report on managing the risks of foundation model misuse. It emphasizes two key properties that model evaluations should have: (1) Threat actors will almost certainly extend a frontier model’s capabilities by giving it access to tools such as a Python interpreter, an Internet search engine, and a command prompt, so models should be given access to the best available tools during the evaluation period; even if a model by itself can’t complete a task indicative of dangerous capabilities, that same model with access to tools may be more than capable. (2) The evaluations must not leak into the model’s training data, or else the dangerous capabilities of the model could be overestimated.
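To illustrate the second property, here is a minimal sketch of a naive evaluation-contamination check based on exact n-gram overlap between eval items and a training corpus. The 13-token window and the helper names are arbitrary illustrative choices; real decontamination pipelines are considerably more sophisticated.

```python
# A naive sketch of checking whether evaluation items leaked into training data
# via exact n-gram overlap. Window size and helper names are illustrative.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(eval_items: list, training_docs: list, n: int = 13) -> list:
    """Return eval items that share at least one n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in eval_items if ngrams(item, n) & train_grams]

# Flagged items should be removed or rewritten; otherwise the model may simply
# recall memorized text and its capabilities will be mismeasured.
```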
Adversary Simulation: Stealing Frontier Model Weights
Whereas the first two types of AI red teaming are relatively new (especially model evaluations), the third applies tried-and-true network, human, and physical red teaming to the information security controls that frontier AI labs put in place to safeguard frontier model weights. Frontier AI labs are thinking hard about how to prevent model weight theft:
“Harden security such that non-state attackers are unlikely to be able to steal model weights and advanced threat actors (e.g., states) cannot steal them without significant expense.”
“Here, we outline our current architecture and operations that support the secure training of frontier models at scale. This includes measures designed to protect sensitive model weights within a secure environment for AI innovation.”
“To allow us to tailor the strength of the mitigations to each [Critical Capability Level], we have also outlined a set of security and deployment mitigations. Higher level security mitigations result in greater protection against the exfiltration of model weights…”
What are model weights, and why are frontier labs so keen on preventing them from being stolen? Model weights are simply numbers that encode everything the model learned during training. To “train” a machine learning model is to iteratively tune the values of its weights so that the model performs better and better on the training task.
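A toy example makes this concrete: below, the entire “model” is just three numbers, and training is a loop that repeatedly nudges those numbers to reduce error. The synthetic data and dimensions are made up purely for illustration.

```python
import numpy as np

# A toy model with three weights, trained by gradient descent on synthetic data.
# Frontier models do the same thing, only at the scale of billions of weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                              # the model's weights, untrained
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
    w -= 0.1 * grad                          # iteratively tune the weights

print(w)  # ends up close to [2.0, -1.0, 0.5]; everything "learned" lives in w
```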
Frontier models have a tremendous number of weights (GPT-3, for example, has approximately 175 billion), and learning more weights takes more time and money. If an adversary were to steal the files containing the weights of a frontier AI model (whether through traditional cyber operations, social engineering of employees, or physical access to a frontier lab’s computing infrastructure), that would amount to intellectual property theft worth tens or even hundreds of millions of dollars.
Additionally, recall that Anthropic, OpenAI, DeepMind, the White House, and the UK AI Safety Institute, among many others, believe it is plausible that scaling up frontier generative models could create both incredibly helpful and incredibly destructive capabilities. Ensuring that model weights stay on secure servers closes off one of the major routes by which a bad actor could unlock the full offensive capabilities of these future models. The effects of safety fine-tuning techniques such as reinforcement learning from human feedback (RLHF) and Constitutional AI are encoded in the model’s weights and put up a barrier against using a stolen model for harm. But this barrier is flimsy: techniques such as LoRA and directional ablation can quickly, cheaply, and surgically remove these safeguards. A threat actor with access to a model’s weights has full access to any and all offensive capabilities the model may have learned during training.
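As a rough illustration of why “quickly and cheaply” is no exaggeration, consider the parameter counts involved in a low-rank (LoRA-style) update versus retraining full weight matrices. The dimensions below are illustrative, not taken from any particular frontier model.

```python
# Back-of-the-envelope arithmetic: a LoRA-style update replaces each d x d weight
# matrix update with two low-rank factors (d x r and r x d). All dimensions here
# are illustrative, not those of any specific model.
d_model, n_layers, rank = 8192, 80, 16
matrices_per_layer = 4                                  # e.g., attention projections

full_per_layer = matrices_per_layer * d_model * d_model
lora_per_layer = matrices_per_layer * 2 * d_model * rank

print(f"full fine-tune: {full_per_layer * n_layers / 1e9:.1f}B trainable weights")
print(f"LoRA adapters:  {lora_per_layer * n_layers / 1e6:.1f}M trainable weights")
print(f"ratio:          {lora_per_layer / full_per_layer:.4f}")
```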
A recent report from RAND takes a deep dive into this particular threat model and lays out what it might take to prevent even highly resourced and cyber-capable state actors from stealing frontier model weights. The term “red-team” appears 26 times in the report. To protect their model weights, “OpenAI uses internal and external red teams to simulate adversaries and test our security controls for the research environment.”
Note the synergy between the second and third types of AI red teaming. A mature science of model evaluations for dangerous capabilities would allow policymakers and frontier labs to make more informed decisions about what level of public access is proportional to the risks posed by a given model, as well as what intensity of red teaming is necessary to ensure that the model’s weights remain secure. If we can’t determine with high confidence what a model is capable of, we risk locking down a model that turns out to have no dangerous capabilities, forfeiting the scientific benefits of allowing at least somewhat open access to it, including the valuable research on making AI more secure that white-box access to frontier models enables. The other, much more sinister side of the coin is that we could put too few controls around the weights of a model we erroneously believe to be harmless, only to later have its previously latent offensive firepower aimed at us by a threat actor.
Conclusion
As frontier labs and policymakers have rightly emphasized, AI red teaming is one of the most powerful tools at our disposal for enhancing the security of AI systems. However, the language currently used in these conversations obscures the fact that AI red teaming is not a single approach; rather, it comprises three distinct strategies, each addressing different security needs:
- Simulating adversaries who seek to alter, bypass, or steal (through inference-based attacks) a model deployed in a business-critical context is an invaluable method of discovering and remediating vulnerabilities.
- Advancing the science of measuring dangerous capabilities in frontier AI models is critical for policymakers and frontier AI labs who seek to apply regulations and security controls that are proportional to the risks of misuse posed by a given model.
- Traditional network, human, and physical red teaming with the objective of stealing frontier model weights from frontier AI labs is an indispensable tool for assessing the readiness of frontier labs to prevent bad actors from taking and abusing their most powerful dual-use models.
Contact us here to start a conversation about AI red teaming for your organization.