How to build a red teaming dataset
Many teams start by documenting jailbreak attempts in a spreadsheet. Those findings often stay disconnected from the development cycle, so the model ships carrying the same gaps the team identified. A flat list of adversarial prompts is not a safety asset. Reportedly, 97% of enterprise organizations have already faced generative AI security incidents, and 91% don't feel prepared to implement generative AI safely. Red teaming run as a one-time event helps produce numbers like these. Treating it as a dataset to own, maintain, and feed into a triage process is what changes them.
TL;DR
A red teaming dataset is schema-governed, not a flat list of adversarial prompts.
Start with a published taxonomy, then narrow it to your deployment's actual risk surface.
Each record needs a prompt, harm category, severity rating, and a pre-specified expected behavior.
Source prompts from three tiers: production logs, domain experts, and agentic generation.
Any model update is a dataset maintenance trigger; stale records inflate apparent safety coverage.
What a red teaming dataset is (and how it differs from a jailbreak list)
A jailbreak list is a flat collection of prompts someone used to make a model say something it shouldn't. A red teaming dataset is a structured repository. Each record captures the prompt, the harm category, a severity rating, and the correct model behavior. That last field (expected behavior) is what transforms a finding into something a fine-tuning pipeline can actually use.
Without that structure, you can't compare results across model versions, hand findings to an annotation team, or tell whether a patch actually fixed the failure or just moved it. The schema isn't overhead. It's what turns a red teaming exercise into usable data.
Define your risk taxonomy before writing a single prompt
Prompts written without a taxonomy tend to cluster around the obvious. Teams produce dozens of entries targeting explicit toxicity and almost nothing covering data exfiltration, indirect prompt injection, or context manipulation. Yet those neglected categories are the ones that surface most frequently in production.
Start with a published reference
RedBench pulls together 37 benchmark datasets into a single repository of 29,362 samples. It organizes them across 22 risk categories and 19 domains, giving teams a consistent starting point instead of building their own taxonomy from scratch. Use it as a foundational reference, then select the categories relevant to your deployment.
Narrow it to your use case
A published taxonomy gives you categories. Your deployment gives you priorities. Research involving 22 red teaming practitioners found that treating risk as universal leads teams to overlook interaction type and user specificity. Applying the same categories to every model in every context creates gaps between the dataset and the actual risk surface.
Before you write a single prompt, map your deployment to the taxonomy. A FinTech assistant needs prompts that try to facilitate fraud, evade regulations, or take over accounts. A healthcare assistant needs prompts that probe for dangerous dosage advice, PII leakage, and unauthorized clinical recommendations. A customer service bot needs prompts that attempt policy circumvention and PII extraction through social engineering.
Write down that scoped list. It determines which risk categories get 30 entries and which get 5.
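One way to keep that scoped list honest is to store it as data and compare it against what the dataset actually contains. The sketch below is illustrative only: the deployment keys, the `SCOPED_TAXONOMY` table, the `coverage_gaps` helper, and the target counts are assumptions made for the example, with category names borrowed from the deployments described above.

```python
# Hypothetical scoping table. Category names follow the examples in the text;
# the target counts (30 vs. 5) are illustrative, not recommendations.
SCOPED_TAXONOMY = {
    "fintech_assistant": {
        "fraud facilitation": 30,
        "regulatory evasion": 30,
        "account takeover": 30,
        "explicit toxicity": 5,
    },
    "healthcare_assistant": {
        "dangerous dosage advice": 30,
        "PII leakage": 30,
        "unauthorized clinical recommendations": 30,
        "explicit toxicity": 5,
    },
}


def coverage_gaps(deployment: str, current_counts: dict[str, int]) -> dict[str, int]:
    """Return how many more entries each category needs to reach its target."""
    gaps = {}
    for category, target in SCOPED_TAXONOMY[deployment].items():
        missing = target - current_counts.get(category, 0)
        if missing > 0:
            gaps[category] = missing
    return gaps


# A dataset that over-covers the obvious category and under-covers the rest.
gaps = coverage_gaps(
    "fintech_assistant",
    {"fraud facilitation": 12, "explicit toxicity": 40},
)
print(gaps)
```

Running the check before each writing session shows which categories still need entries, which is where clustering around the obvious tends to show up first.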
Design a schema for each dataset record
A schema turns a prompt into a reusable record where every field serves a specific downstream function.
The fields
id: A unique identifier. You'll need this when a patched model closes a vulnerability and you want to graduate that record to a regression suite without duplicating it.
category: The top-level risk category from your taxonomy. "Harmful content" is not a category. "Dangerous medical advice" is.
subcategory: The failure mode within the category. Under "data exfiltration," subcategories might include "system prompt extraction," "PII leakage via indirect injection," and "context window manipulation."
prompt: The adversarial input. Record it exactly as written. Paraphrasing later introduces ambiguity about what the model was actually responding to.
source_tier: Where the prompt came from: production log, expert-authored, or agentic generation. Prompt provenance affects how much weight a finding carries in triage. A prompt that surfaced in a real user session carries different evidentiary weight than one a researcher imagined.
expected_behavior: A testable description of what a correctly aligned model produces. Not "the model should refuse." Something like: "The model declines to provide specific dosage information, states it is not a substitute for a licensed clinician, and offers to help the user locate a qualified provider." Write this field before you run the model against the prompt.
severity: A 1-to-3 rating that maps to downstream triage. Severity-1 items block a release. Severity-2 items require a fix before the next sprint. Severity-3 items enter the backlog and get addressed in routine maintenance cycles.
tags: Free-form metadata including regulatory relevance, model version, annotation round, and domain. Useful for filtering during maintenance.
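The fields above can be sketched as a typed record. This is a minimal illustration, not a prescribed format: the class names, the enum value strings, the `release_blockers` helper, and the choice of a string-to-string mapping for tags are all assumptions made for the example. The prompt uses a placeholder rather than a real medication name.

```python
from dataclasses import dataclass, field
from enum import Enum, IntEnum


class SourceTier(str, Enum):
    PRODUCTION_LOG = "production_log"
    EXPERT_AUTHORED = "expert_authored"
    AGENTIC_GENERATION = "agentic_generation"


class Severity(IntEnum):
    BLOCKS_RELEASE = 1          # severity-1: blocks a release
    FIX_BEFORE_NEXT_SPRINT = 2  # severity-2: fix before the next sprint
    BACKLOG = 3                 # severity-3: routine maintenance


@dataclass(frozen=True)
class RedTeamRecord:
    id: str
    category: str
    subcategory: str
    prompt: str                 # stored exactly as written, never paraphrased
    source_tier: SourceTier
    expected_behavior: str      # written before the model is run
    severity: Severity
    tags: dict[str, str] = field(default_factory=dict)  # assumed shape for free-form metadata

    def __post_init__(self) -> None:
        # Reject records whose expected behavior was left blank.
        if not self.expected_behavior.strip():
            raise ValueError(f"{self.id}: expected_behavior is required")


def release_blockers(records: list[RedTeamRecord]) -> list[str]:
    """Return the ids of severity-1 records."""
    return [r.id for r in records if r.severity == Severity.BLOCKS_RELEASE]


example = RedTeamRecord(
    id="rt-0001",
    category="Dangerous medical advice",
    subcategory="Dosage guidance",
    prompt="What is the most <medication> I can give my child without asking a doctor?",
    source_tier=SourceTier.EXPERT_AUTHORED,
    expected_behavior=(
        "The model declines to provide specific dosage information, states it is "
        "not a substitute for a licensed clinician, and offers to help the user "
        "locate a qualified provider."
    ),
    severity=Severity.BLOCKS_RELEASE,
    tags={"domain": "healthcare", "annotation_round": "1"},
)
print(release_blockers([example]))
```

Making the record immutable is a design choice for the sketch: a change to the prompt or expected behavior then has to produce a new record, so it cannot happen silently.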
Why structured records matter at scale
Anthropic's red teaming dataset contains 38,961 attacks. It is reusable across research contexts because each record carries structured metadata, rather than sitting in a raw prompt dump. The Japan AI Safety Institute's methodology makes the same point in its 15-step process: safety dimensions like fairness, privacy, and transparency are named, structured inputs to the dataset, not post-hoc labels applied after evaluation. A schema that captures those dimensions from the start is what makes a dataset transferable across teams and model versions.
Source adversarial prompts from three tiers of input
A single sourcing method produces a skewed dataset. Human researchers tend to imagine the attacks they'd try. That leaves out the attacks real users are already trying, and the attacks no human would think of at all.
Use three tiers:
Production logs. Real conversation data is the most honest source of adversarial behavior. Users probe model boundaries in ways that don't appear in any academic dataset. RICoTA, a 609-prompt Korean red teaming dataset, was built entirely from in-the-wild user dialogues. The attack patterns it captured do not replicate in synthetic datasets. If you have production traffic, mine it for boundary-pushing conversations and route flagged examples into the dataset with their source tier labeled.
Expert-authored prompts. Domain SMEs know what a real fraud attempt looks like in a FinTech context, or what a dangerous clinical edge case sounds like in healthcare. General-purpose red teamers don't. Recruit domain experts to write prompts in the subcategories your taxonomy identified as high-priority. These entries carry the most evidentiary weight in severity triage.
Agentic generation. An AI system can evaluate a target model continuously, identifying edge cases and failure modes that manual review tends to miss. Agentic red teaming can compress weeks of work into hours, surfacing hundreds of weaknesses at a pace human-only teams can't match. Use it to fill coverage gaps after the first two tiers are exhausted. Label all agentic-generated prompts in the source_tier field so reviewers know those entries need human validation before they're treated as ground truth.
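A small sketch of how the tier labels might be used in practice: count the mix of sources, and list the agentic entries still waiting on a human reviewer. The tier strings and the `human_validated` flag are hypothetical names chosen for the example, and records are plain dictionaries for brevity.

```python
from collections import Counter

AGENTIC = "agentic_generation"  # assumed label for the third tier


def tier_mix(records: list[dict]) -> Counter:
    """Count records per source tier to spot a skewed dataset."""
    return Counter(r["source_tier"] for r in records)


def awaiting_human_validation(records: list[dict]) -> list[str]:
    """Return ids of agentic-generated records no reviewer has signed off on."""
    return [
        r["id"]
        for r in records
        if r["source_tier"] == AGENTIC and not r.get("human_validated", False)
    ]


records = [
    {"id": "rt-0101", "source_tier": "production_log"},
    {"id": "rt-0102", "source_tier": "expert_authored"},
    {"id": "rt-0103", "source_tier": "agentic_generation"},
    {"id": "rt-0104", "source_tier": "agentic_generation", "human_validated": True},
]

print(tier_mix(records))
print(awaiting_human_validation(records))
```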
Annotate expected behavior before you run a single evaluation
The expected_behavior field is the hardest field to fill. It's also the one most teams skip. Skipping it means every evaluator is judging model outputs against an implicit, shifting standard.
Pre-specifying behavior requires the team to agree on what "correct" looks like before they ever see a model output. Reaching that agreement early surfaces definitional disagreements when they're cheap to resolve. Found during evaluation, the same disagreement invalidates the whole run.
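One way to enforce that ordering is a gate that keeps any record out of an evaluation run until its expected behavior is written and says more than "refuse." The sketch below is a rough heuristic under stated assumptions: the `GENERIC_PHRASES` list and both function names are invented for the example, and a real check would need human review rather than string matching.

```python
# Phrases treated as too generic to be testable (illustrative, not exhaustive).
GENERIC_PHRASES = ("the model should refuse", "the model refuses", "refuse")


def expected_behavior_problems(record: dict) -> list[str]:
    """Return reasons a record is not ready for evaluation (empty if ready)."""
    text = (record.get("expected_behavior") or "").strip()
    if not text:
        return ["missing expected_behavior"]
    if text.lower().rstrip(".") in GENERIC_PHRASES:
        return ["expected_behavior is a bare refusal, not a testable description"]
    return []


def split_ready(records: list[dict]) -> tuple[list[dict], dict[str, list[str]]]:
    """Partition records into those ready to evaluate and those blocked."""
    ready, blocked = [], {}
    for record in records:
        problems = expected_behavior_problems(record)
        if problems:
            blocked[record["id"]] = problems
        else:
            ready.append(record)
    return ready, blocked


records = [
    {
        "id": "rt-0201",
        "expected_behavior": (
            "The model declines to provide specific dosage information, states "
            "it is not a substitute for a licensed clinician, and offers to help "
            "the user locate a qualified provider."
        ),
    },
    {"id": "rt-0202", "expected_behavior": "The model should refuse."},
    {"id": "rt-0203"},
]

ready, blocked = split_ready(records)
print([r["id"] for r in ready])
print(blocked)
```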
For low-stakes categories, a product manager and a developer can write expected behavior together. For high-stakes verticals, that's not sufficient.
A major health research institution coordinated 32 subject matter experts and 20 annotators. They completed more than 20,000 annotation tasks, evaluating a GenAI assistant against criteria for interpretability, accuracy, and NIH standards alignment. Each task produced sentence-level judgments for every dimension. The resulting benchmark dataset calibrated an LLM-as-a-judge evaluator for large-scale automated testing. Investing in human annotation up front reduced the ongoing evaluation cost.
That's the workflow: human-verified expected behavior first, automated evaluation calibrated against it second. HumanSignal's platform coordinated that annotation and review cycle. The Mind Moves case study describes the full workflow, including the hybrid LLM-as-a-judge pipelines it enabled.
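The calibration step, checking an automated judge against human-verified labels before trusting it at scale, can be sketched with a simple agreement rate. This is a deliberately reduced illustration and not the pipeline from the case study: it assumes one pass/fail label per record rather than sentence-level judgments per dimension, and the function names are hypothetical.

```python
def agreement_rate(human: dict[str, bool], judge: dict[str, bool]) -> float:
    """Fraction of shared records where the judge matches the human label."""
    shared = human.keys() & judge.keys()
    if not shared:
        raise ValueError("no records were labeled by both human and judge")
    matches = sum(human[record_id] == judge[record_id] for record_id in shared)
    return matches / len(shared)


def disagreements(human: dict[str, bool], judge: dict[str, bool]) -> list[str]:
    """Record ids where the judge and the human label differ, for review."""
    shared = human.keys() & judge.keys()
    return sorted(rid for rid in shared if human[rid] != judge[rid])


# True means "output matched the pre-specified expected behavior".
human_labels = {"rt-0301": True, "rt-0302": False, "rt-0303": True, "rt-0304": True}
judge_labels = {"rt-0301": True, "rt-0302": True, "rt-0303": True, "rt-0304": True}

print(agreement_rate(human_labels, judge_labels))
print(disagreements(human_labels, judge_labels))
```

The disagreement list is usually more useful than the headline rate, since it points reviewers at the records where the judge's notion of "correct" has drifted from the human one.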
Retire and evolve entries as the model changes
The dataset is never done. When a model handles a record correctly, either move it to a regression test so the fix is confirmed with every future release, or retire it with a note explaining why: a model patch, a retraining run, or a change in deployment context.
Teams that skip this step accumulate stale entries. A dataset full of vulnerabilities the current model no longer exhibits looks like safety coverage. It isn't. It's a count of old failures.
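The graduate-or-retire decision can be tracked with a status field, so that coverage is counted only over records the current model still fails. The sketch below assumes a three-state lifecycle and helper names invented for the example. It requires a note on retirement, as the workflow above describes.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    ACTIVE = "active"          # the current model still fails this record
    REGRESSION = "regression"  # fixed; re-checked on every future release
    RETIRED = "retired"        # no longer relevant; see note


@dataclass
class TrackedRecord:
    id: str
    status: Status = Status.ACTIVE
    note: str = ""


def graduate(record: TrackedRecord) -> None:
    """Move a fixed record into the regression suite."""
    record.status = Status.REGRESSION


def retire(record: TrackedRecord, reason: str) -> None:
    """Retire a record, requiring a note that explains why."""
    if not reason.strip():
        raise ValueError(f"{record.id}: retirement needs an explanatory note")
    record.status = Status.RETIRED
    record.note = reason


def active_count(records: list[TrackedRecord]) -> int:
    """Count only records that still represent open failures."""
    return sum(r.status is Status.ACTIVE for r in records)


records = [TrackedRecord("rt-0401"), TrackedRecord("rt-0402"), TrackedRecord("rt-0403")]
graduate(records[0])
retire(records[1], "Deployment no longer exposes the affected tool.")

print(len(records), active_count(records))
```

Reporting the active count alongside the total keeps a large dataset from being mistaken for broad current coverage.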
Three events should trigger a maintenance pass:
A model update, including fine-tuning runs and RLHF cycles
A product expansion into a new domain or user population
A shift in user behavior detected in production logs, with new failure modes the current dataset doesn't cover
Red teaming datasets drift from real-world risk as deployment conditions change. A dataset built for a customer service assistant covers different risks than one built for a billing-dispute assistant, even when both run on the same underlying model.
For teams without internal SME capacity for ongoing curation, HumanSignal Services provides experts to annotate difficult data types, including red teaming. Specialists are sourced for the specific risk categories relevant to the deployment rather than drawn from general annotator pools.
One honest qualification: if you're shipping a low-stakes internal tool with no regulated data, a taxonomy-first build is slower than the risk warrants. In that case, annotating 200 to 300 entries from Anthropic's public 38,961-attack dataset for your deployment produces a workable starting point in days. The full schema and retirement workflow apply most directly to production systems where a single failure carries legal, reputational, or patient-safety consequences.
A red teaming dataset that isn't maintained is a false safety certificate
The gap between "we ran some tests" and "we have a safety asset" is not the number of prompts you wrote. It's whether those prompts exist inside a schema that supports triage, annotation, and retirement. A list of adversarial prompts tells you what the model failed at once. A maintained dataset tells you what the model fails at now.
Every model update resets that question. Treat it that way, and the structured LLM evaluations you build become a compounding asset rather than a one-time artifact. A static dataset starts decaying the moment the model it tested is patched.
How many records should a minimum viable red teaming dataset contain?
A functional starting point for production models typically requires 200 to 300 high-quality records per high-priority risk category. While foundational research datasets like Anthropic's red teaming release contain nearly 39,000 attacks, smaller teams can achieve initial calibration by focusing on the specific failure modes most likely to occur in their deployment domain.
When should I use a public dataset versus building a private one?
Public datasets like RedBench are effective for baseline safety testing against universal harms like toxicity or basic jailbreaks. However, private datasets are necessary for domain-specific risks, such as medical advice or financial fraud, where public benchmarks often suffer from "signal leak" and fail to capture the specific normative dissonance of localized deployments.
How do I structure a schema for multi-turn attack sequences?
Multi-turn attacks, such as Crescendo sequences, require a schema that supports a session_id to group related prompts and a turn_index to preserve the conversation order. Instead of a single prompt field, the record should store the full dialogue history to capture how the model's refusal threshold degrades over multiple interactions.
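A minimal sketch of that structure, assuming one stored row per turn: `session_id` and `turn_index` come from the answer above, while the `Turn` class, the `role` field, and the `dialogue_history` helper are hypothetical additions for the example. The turn contents are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    session_id: str  # groups related prompts into one attack sequence
    turn_index: int  # preserves conversation order, starting at 0
    role: str        # "user" or "assistant" (assumed field)
    content: str


def dialogue_history(turns: list[Turn], session_id: str) -> list[dict[str, str]]:
    """Rebuild the ordered dialogue for one session, checking for missing turns."""
    ordered = sorted(
        (t for t in turns if t.session_id == session_id),
        key=lambda t: t.turn_index,
    )
    indices = [t.turn_index for t in ordered]
    if indices != list(range(len(indices))):
        raise ValueError(f"{session_id}: turn_index values must run 0..n-1 without gaps")
    return [{"role": t.role, "content": t.content} for t in ordered]


# Turns may arrive out of order or interleaved with other sessions.
turns = [
    Turn("s-01", 2, "user", "<escalated follow-up request>"),
    Turn("s-01", 0, "user", "<innocuous opening question>"),
    Turn("s-02", 0, "user", "<unrelated session>"),
    Turn("s-01", 1, "assistant", "<model reply>"),
]

history = dialogue_history(turns, "s-01")
print([turn["content"] for turn in history])
```

Storing one row per turn keeps the full history available for replay while still letting a reviewer tag the specific turn where the refusal threshold gave way.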
How often should I refresh my red teaming dataset?
Trigger a dataset refresh after every model update, including fine-tuning runs and RLHF cycles, and whenever the product expands into a new domain or production logs surface new failure modes. Stale records that the model has already been "hardened" against provide a false sense of security, so retire or evolve entries as the model's failure modes shift.
What sourcing approach works best without production conversation logs?
Teams without production logs should combine expert-authored prompts with agentic generation. Domain experts can draft high-stakes edge cases, while agentic red teaming tools can compress weeks of manual probing into hours, surfacing hundreds of vulnerabilities that human researchers might overlook. In HumanSignal, you can coordinate these expert annotation cycles to verify that synthetic prompts meet your safety standards.