Summary
Prompt injections and other malicious textual inputs remain persistent and serious threats to large language model (LLM) systems. In this blog, we use the term attacks to describe adversarial inputs designed to override or redirect the intended behavior of LLM-powered applications, often for malicious purposes.
Introduction
Prompt injections, jailbreaks, and malicious textual inputs to LLMs in general continue to pose real-world threats to generative AI systems. Informally, in this blog, we use the word “attacks” to refer to a mix of text inputs that are designed to over-power or re-direct the control and security mechanisms of an LLM-powered application to effectuate a malicious goal of an attacker.
Despite improvements in alignment methods and control architectures, Large Language Models (LLMs) remain vulnerable to text-based attacks. These textual attacks induce an LLM-enabled application to take actions that the developer of the LLM (e.g., OpenAI, Anthropic, or Google) or developer using the LLM in a downstream application (e.g., you!) clearly do not want the LLM to do, ranging from emitting toxic content to divulging sensitive customer data to taking dangerous action, like opening the pod bay doors.
Many of the techniques to override the built-in guardrails and control mechanisms of LLMs rely on exploiting the pre-training objective of the LLM (which is to predict the next token) and the post-training objective (which is to respond to follow and respond to user requests in a helpful-but-harmless way).
In particular, in attacks known as prompt injections, a malicious user prompts the LLM so that it believes it has received new developer instructions that it must follow. These untrusted instructions are concatenated with trusted instructions. This co-mingling of trusted and untrusted input can allow the user to twist the LLM to his or her own ends. Below is a representative prompt injection attempt.
Sample Prompt Injection
The intent of this example seems to be inducing data exfiltration from an LLM. This example comes from the Qualifire-prompt-injection benchmark, which we will discuss later.
These attacks play on the instruction-following ability of LLMs to induce unauthorized action. These actions may be dangerous and inappropriate in any context, or they may be typically benign actions which are only harmful in an application-specific context. This dichotomy is a key aspect of why mitigating prompt injections is a wicked problem.
Jailbreaks, in contrast, tend to focus on removing the alignment protections of the base LLM and exhibiting behavior that is never acceptable, i.e., egregious hate speech.
We focus on prompt injections in particular because this threat is more directly aligned with application-specific security and an attacker’s economic incentives. Unfortunately, as others have noted, jailbreak and prompt injection threats are often intermixed in casual speech and data sets.
Accurately assessing this vulnerability to prompt injections before significant harm occurs is critical because these attacks may allow the LLM to jump out of the chat context by using tool-calling abilities to take meaningful action in the world, like exfiltrating data.
While generative AI applications are currently mostly contained within chatbots, the economic risks tied to these vulnerabilities will escalate as agentic workflows become widespread.
This article examines how existing public datasets can be used to evaluate defense models, meant to detect primarily prompt injection attacks. We aim to equip security-focused individuals with tools to critically evaluate commercial and open-source prompt injection mitigation solutions.
The Bad: Limitations of Existing Prompt Injection Datasets
How should one evaluate a prompt injection defensive solution? A typical approach is to download benchmark datasets from public sources such as HuggingFace and assess detection rates. We would expect a high True Positive Rate (recall) for malicious data and a low False Positive Rate for benign data.
While these static datasets provide a helpful starting point, they come with significant drawbacks:
Staleness: Datasets quickly become outdated as defenders train models against known attacks, resulting in artificially inflated true positive rates.
The dangerousness of an attack is a moving target as base LLMs patch low-hanging vulnerabilities and attackers design novel and stronger attacks. Many datasets over-represent attacks that are weak varieties of DAN (do anything now) or basic instruction-following attacks.
As models evolve, many known attacks are quickly patched, leading to outdated datasets that inflate defensive model performance.
Labeling Biases: Dataset creators often mix distinct problems. For example, prompts that request the LLM to generate content with clear political biases or toxic content, often without an attack technique. Other examples in the dataset may truly be prompt injections that combine a realistic attack technique with a malicious objective.
These political biases and toxic examples are often hyper-local to a specific cultural context and lack a meaningful attack technique. This makes high true positive rates on this data less aligned with a realistic security evaluation.
CTF Over-Representation: Capture-the-flags are cybersecurity contests where white-hat hackers attempt to break a system and test its defenses. Such contests have been extensively used to generate data that is used for training defensive models. These data, while a good start, typically have very narrow attack objectives that do not align well with real-world data. The classic example is inducing an LLM to emit “I have been pwned” with an older variant of Do-Anything-Now.
Although private evaluation methods exist, publicly accessible benchmarks remain essential for transparency and broader accessibility.
The Good: Effective Public Datasets
To navigate the complex landscape of public prompt injection datasets we offer data recommendations categorized by quality. These recommendations are based on our professional opinion as researchers who manage and develop a prompt injection detection model.
Recommended Datasets
- qualifire/Qualifire-prompt-injection-benchmark
- Size: 5,000 rows
- Language: Mostly English
- Labels: 60% benign, 40% jailbreak
This modestly sized dataset is well suited to evaluate chatbots on mostly English prompts. While it is a small dataset relative to others, the data is labeled, and the label noise appears to be low. The ‘jailbreak’ samples contain a mixture of prompt injections and roleplay-centric jailbreaks.
- xxz224/prompt-injection-attack-dataset
- Size: 3,750 rows
- Language: Mostly English
- Labels: None
This dataset combines benign inputs with a variety of prompt injection strategies, culminating in a final “combine attack” that merges all techniques into a single prompt.
- yanismiraoui/prompt_injections
- Size: 1,000 rows
- Languages: Multilingual (primarily European languages)
- Labels: None
This multilingual dataset, primarily featuring European languages, contains short and simple prompt injection attempts. Its diversity in language makes it useful for evaluating multilingual robustness, though the injection strategies are relatively basic.
- jayavibhav/prompt-injection-safety
- Size: 50,000 train, 10,000 test rows
- Labels: Benign (0), Injection (1), Harmful Requests (2)
This dataset consists of a mixture of benign and malicious data. The samples labeled ‘0’ are benign, ‘1’ are prompt injections, and ‘2’ are direct requests for harmful behavior.
Use With Caution
- jayavibhav/prompt-injection
- Size: 262,000 train, 65,000 test rows
- Labels: 50% benign, 50% injection
The dataset is large and features an even distribution of labels. Examples labeled as ‘0’ are considered benign, meaning they do not contain prompt injections, although some may still occasionally provoke toxic content from the language model. In contrast, examples labeled as ‘1’ include prompt injections, though the range of injection techniques is relatively limited. This dataset is generally useful for benchmarking purposes, and sampling a subset of approximately 10,000 examples per class is typically sufficient for most use cases.
- deepset/prompt-injections
- Size: 662 rows
- Languages: English, German, French
- Labels: 63% benign, 37% malicious
This smaller dataset primarily features prompt injections designed to provoke politically biased speech from the target language model. It is particularly useful for evaluating the effectiveness of political guardrails, making it a valuable resource for focused testing in this area.
Not Recommended
- hackaprompt/hackaprompt-dataset
- Size: 602,000 rows
- Languages: Multilingual
- Labels: None
This dataset lacks labels, making it challenging to distinguish genuine prompt injections or jailbreaks from benign or irrelevant data. A significant portion of the prompts emphasize eliciting the phrase “I have been PWNED” from the language model. Despite containing a large number of examples, its overall usefulness for model evaluation is limited due to the absence of clear labeling and the narrow focus of the attacks.
Sample Prompt/Responses from Hackaprompt GPT4o
Here are some GPT4o responses to representative prompts from Hackaprompt.
Informally, these attacks are ‘not even wrong’ in that they are too weak to induce truly malicious or truly damaging content from an LLM. Focusing on this data means focusing on a PWNED detector rather than a real-world threat.
- cgoosen/prompt_injection_password_or_secret
- Size: 82 rows
- Language: English
- Labels: 14% benign, 86% malicious.
This is a small dataset focused on prompting the language model to leak an unspecified password in response to an unspecified input. It appears to be the result of a single individual’s participation in a capture-the-flag (CTF) competition. Due to its narrow scope and limited size, it is not generally useful for broader evaluation purposes.
- cgoosen/prompt_injection_ctf_dataset_2
- Size: 83 rows
- Language: English
This is another CTF dataset, likely created by a single individual participating in a competition. Similar to the previous example, its limited scope and specificity make it unsuitable for broader model evaluation or benchmarking.
- geekyrakshit/prompt-injection-dataset
- Size: 535,000 rows
- Languages: Mostly English
- Labels: 50% ‘0’, 50% ‘1’.
This large dataset has an even label distribution and is an amalgamation of multiple prompt injection datasets. While the prompts labeled as ‘1’ generally represent malicious inputs, the prompts labeled as ‘0’ are not consistently acceptable as benign, raising concerns about label quality. Despite its size, this inconsistency may limit its reliability for certain evaluation tasks.
- imoxto/prompt_injection_cleaned_dataset
- Size: 535,000 rows
- Languages: Multilingual.
- Labels: None.
This dataset is a re-packaged version of the HackAPrompt dataset, containing mostly malicious prompts. However, it suffers from label noise, particularly in the higher difficulty levels (8, 9, and 10). Due to these inconsistencies, it is generally advisable to avoid using this dataset for reliable evaluation.
- Lakera/mosscap_prompt_injection
- Size: 280,000 rows total
- Languages: Multilingual.
- Labels: None.
This large dataset originates from an LLM redteaming CTF and contains a mixture of unlabelled malicious and benign data. Due to the narrow objective of the attacker, lack of structure, and frequent repetition, it is not generally suitable for benchmarking purposes.
The Intriguing: Empirical Refusal Rates
As a sanity check for our opinions of data quality, we tested three good and one low-quality datasets from above by prompting three typical LLMs with the data and computed the models’ refusal rates. A refusal is when an LLM thinks a request is malicious based on its post-training and declines to answer or comply with the request.
Refusal rates provide a rough proxy for how threatening the input appears to the model, but beware: the most dangerous attacks don’t trigger refusals because the model silently complies.
Note that this measured refusal rate is only a proxy for the real-world threat. For the strongest real-world jailbreak and prompt injection attacks, the refusal rate will be very low, obviously, because the model quietly complies with the attacker’s objective. So we are really testing that the data is of medium quality (i.e., threatening enough to induce a refusal but not so dangerous that it actually forces the model to comply).
The high-quality benign data does have these very low refusal fractions, as expected, so that is a good sanity check.
When we compare Hackaprompt with the higher-quality malicious data in Qualifire/Yanismiraoui, we see that the Hackaprompt data has a substantially lower refusal fraction than the higher malicious-quality data, confirming our qualitative impressions that models do not find it threatening. See the representative examples above.
Dataset | Label | GPT-4o | Claude 3.7 Sonnet | Gemini 2.0 Flash | Average |
Casual Conversation | 0 | 1.6% | 0% | 4.4% | 2.0% |
Qualifire | 0 | 10.4% | 6.4% | 10.8% | 9.2% |
Hackaprompt | 1 | 30.4% | 24.0% | 26.8% | 27.1% |
Yanismiraoui | 1 | 72.0% | 32.0% | 74.0% | 59.3% |
Qualifire | 1 | 73.2% | 61.6% | 63.2% | 66.0% |
Average Refusal Rates by Model/Label/Dataset Source, each bin has an average of 250 samples.
Interestingly, Claude 3.7 Sonnet has systematically lower refusal rates than other models, suggesting stronger discrimination between benign and malicious inputs, which is an encouraging sign for reducing false positives.
The low refusal rate for Yanismiraoui and Claude 3.7 Sonnet is an artifact of our refusal grading system for this on-off experiment, rather than an indication that the dataset is low quality.
Based on this sanity check, we advocate that security-conscious users of LLMs continue to seek out more extensive evaluations to align the LLM’s inductive bias with the data they see in their exact application. In this specific experiment, we are testing how much this public data aligns or does not align with the specific helpfulness/harmlessness tradeoff encoded in the base LLM by a model provider’s specific post-training choices. That might not be the right trade-off for your application.
What to Make of These Numbers
We do not want to publish truly dangerous data publicly to avoid empowering attackers, but we can confirm from our extensive experience cracking models that even average-skill attackers have many effective tools to twist generative models to their own ends.
Evals are very complicated in general and are an active research topic throughout generative AI. This blog provides rough and ready guidance for security professionals who need to make tough decisions in a timely manner. For application-specific advice, we stand ready to provide detailed advice and solutions for our customers in the form of datasets, red-teaming, and consulting.
It is hard to effectively evaluate model security, especially as attackers adapt to your specific AI system and protective models (if any). Historical trends suggest a tendency to overestimate defense effectiveness, echoing issues seen previously in supervised classification contexts (Carlini et al., 2020). The flawed nature of existing datasets compounds this issue, necessitating careful and critical usage of available resources.
In particular, testing LLM defenses in an application-specific context is truly necessary to test for real-world security. General-purpose public jailbreak datasets are not generally suited for that requirement. Effective and truly harmful attacks on your system are likely to be far more domain-specific and harder to distinguish from benign traffic than anything you’d find in a publicly sourced prompt dataset. This alignment is a key part of our company’s mission and will be a topic of future blogging.
The risk of overconfidence in weak public evaluation datasets points to the need for protective models and red-teaming from independent AI security companies like HiddenLayer to fully realize AI’s economic potential.
Conclusion
Evaluating prompt injection defensive models is complex, especially as attackers continuously adapt. Public datasets remain essential, but their limitations must be clearly understood. Recognizing these shortcomings and leveraging the most reliable resources available enables more accurate assessments of generative AI security. Improved benchmarks and evaluation methods are urgently needed to keep pace with evolving threats moving forward.
HiddenLayer is responding to this security challenge today so that we can prevent adversaries from attacking your model tomorrow.