Red Teaming Challenges
DEFCON Red Teaming Exercise
August 14, 2023
This event was developed in collaboration with Seed AI and DEFCON AI Village, and held at DEFCON 2023. Over 2.5 days, 2,244 hackers evaluated 8 LLMs and produced over 17,000 conversations on 21 topics ranging from cybersecurity hacks to misinformation and human rights. Our winners received a GPU provided by our partners at NVIDIA.
Our event and analysis, the first of their kind, studies the performance of eight state-of-the-art large language models (LLMs) by approximating at-scale real-world scenarios where harmful outcomes may occur.
Participating companies were: Anthropic, Cohere, Google, Hugging Face, Meta, Nvidia, OpenAI, Scale AI, Stability.ai.
Policy partners included: White House Office of Science and Technology Policy, National Science Foundation, Congressional Artificial Intelligence Caucus, National Institute of Standards and Technology.
Community and Expert partners were: Houston Community College, Black Tech Street, AVID, Wilson Center, Taraaz, MITRE.
This initiative has opened new doors for external entities, including government and civil society groups, to engage in the practice of red teaming—traditionally a closed-door exercise among major AI labs—thereby offering fresh perspectives on the oversight and improvement of AI technologies.
Government or civil society entities can utilize public red teaming as a practice to create smarter policies and evidence-based guidance, regulation and standards. Red teaming outside of companies serves a different purpose from red teaming at companies and should seek to augment, not replace or compete with existing corporate red teaming practices. We demonstrated how these types of exercises can be used to operationalize a set of values, the White House Office of Science and Technology Policy Blueprint for an AI Bill of Rights. We are grateful for their collaboration..
Red teaming models for biases and other social harms is difficult as their context can make them hard to define. Methods of structured public feedback, such as public red teaming, engages a larger audience in order to gather more nuance.
Our analysis divided the questions into four broad categories: Factuality, Bias, Misdirection, and Cybersecurity.
Key findings from the data:
The most successful strategies were ones that are hard to distinguish from traditional prompt engineering, emphasizing the dual nature of this technology. Asking the model to role play, or ‘write a story’ were successful. In addition, the user acting authoritatively on a topic could engineer the model to provide ‘agreeable’ output, even if incorrect.
Human behavior can inadvertently result in biased outcomes. People interact with language models in a more conversational manner than with search engines. As a result, methods of social engineering used by hackers are similar to the ‘natural’ and ‘conversational’ way people interact with LLMs - where they share their preferences or personal details to provide context. In other words, innocent actors may accidentally socially engineer the model to give them the answer they want to hear, rather than a factual answer.
Unlike other algorithmic systems - notably social media models - the LLMs did not further radicalize users when provided with aggressive content. In most cases, it matched the harmfulness of the user query, which can result in reinforcing their world view. In a few cases, the model even de-escalated.
In the spirit of open science, Humane Intelligence is sharing the full anonymized dataset as well as analysis code on our GitHub repository. In addition, 11 research organizations were granted early access to the dataset to conduct their own analysis.
The full report is available here.
Questions can be directed to press@humane-intelligence.org.