Check out the data directly! You are welcome to explore the full dataset, hosted on GitHub and accessible here.
Generative AI Red Teaming Challenge
Overall Summary
Humane Intelligence, a tech nonprofit dedicated to building community around algorithmic assessment, is publishing the findings from the largest-ever public generative AI red teaming event for closed-source API models. This event was developed in collaboration with Seed AI and the DEFCON AI Village, and held at DEFCON 2023. Over 2.5 days, 2,244 hackers evaluated 8 LLMs and produced over 17,000 conversations on 21 topics ranging from cybersecurity hacks to misinformation and human rights. Our winners received a GPU provided by our partners at NVIDIA.
Our event and analysis, the first of their kind, study the performance of eight state-of-the-art large language models (LLMs) by approximating, at scale, real-world scenarios where harmful outcomes may occur.
The largest ever generative AI red teaming exercise!
Participating companies were:
Anthropic, Cohere, Google, Hugging Face, Meta, Nvidia, OpenAI, Scale AI, Stability.ai.
Policy partners included:
White House Office of Science and Technology Policy, National Science Foundation, Congressional Artificial Intelligence Caucus, National Institute of Standards and Technology.
Community and expert partners were:
Houston Community College, Black Tech Street, AVID, Wilson Center, Taraaz, MITRE.
“Red teaming is a key part of Meta's responsible approach to AI and we appreciate the opportunity to participate in DEFCON 31’s GRT event with a diverse group of external experts,” commented Meta, a participating company.
This initiative has opened new doors for external entities, including government and civil society groups, to engage in the practice of red teaming—traditionally a closed-door exercise among major AI labs—thereby offering fresh perspectives on the oversight and improvement of AI technologies.
“As access to AI grows, its societal impacts will also grow. However, the demographics of AI labs do not reflect the broader population. Nor do we believe that AI developers should be solely or primarily responsible for determining the values that guide the behavior of AI systems...As the technology advances, it will be crucial that people from all backgrounds can help ensure that AI aligns with their values,” commented Anthropic, a participating company.
Government or civil society entities can use public red teaming as a practice to create smarter policies and evidence-based guidance, regulation, and standards. Red teaming outside of companies serves a different purpose from red teaming at companies and should seek to augment, not replace or compete with, existing corporate red teaming practices. We demonstrated how these types of exercises can be used to operationalize a set of values: the White House Office of Science and Technology Policy's Blueprint for an AI Bill of Rights. We are grateful for their collaboration.
“As a non-profit organization working at the intersection of technology and human rights, we find the red teaming dataset helpful to further our mission in researching and advocating for human rights in the digital age,” commented Taraaz, a community partner.
Red teaming models for biases and other social harms is difficult because these harms are context-dependent and can be hard to define. Methods of structured public feedback, such as public red teaming, engage a larger audience in order to gather more nuance.
“We must continue finding ways to make red teaming inclusive beyond just industry, and inspire the next generation of red teamers. Ideally, this would combine subject matter expertise with the pluralistic and diverse perspectives of general society. Our hope at Anthropic is that the GRT Challenge and events like it will get a wider group of people excited about model safety and AI red teaming as a career,” commented Anthropic.
Our analysis divided the questions into four broad categories: Factuality, Bias, Misdirection, and Cybersecurity.
Key findings from the data:
- The most successful strategies were ones that are hard to distinguish from traditional prompt engineering, emphasizing the dual nature of this technology. Asking the model to role play or to 'write a story' was successful. In addition, a user speaking authoritatively on a topic could engineer the model into providing 'agreeable' output, even if incorrect.
- Human behavior can inadvertently result in biased outcomes. People interact with language models in a more conversational manner than with search engines. As a result, the social engineering methods used by hackers resemble the 'natural', 'conversational' way people interact with LLMs, where they share their preferences or personal details to provide context. In other words, innocent actors may accidentally socially engineer the model into giving them the answer they want to hear, rather than a factual answer.
- Unlike other algorithmic systems, notably social media models, the LLMs did not further radicalize users when provided with aggressive content. In most cases, the model matched the harmfulness of the user's query, which can reinforce the user's world view. In a few cases, the model even de-escalated.
“Alongside our industry peers, data and insights garnered from our participation at DEFCON were used to help inform AI safety strategies for existing models and current protections. We gained new perspectives from the event’s activities that complemented our existing red teaming practices and expertise. As the industry looks to replicate similar live exercises in the future, we see opportunities targeted for more specific audiences to bring pointed focus,” commented Google, an event partner.
In the spirit of open science, Humane Intelligence is sharing the full anonymized dataset as well as analysis code on our GitHub repository. In addition, 11 research organizations were granted early access to the dataset to conduct their own analysis.
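For readers who want to begin exploring, below is a minimal Python sketch of one way the dataset could be loaded and summarized. It is not the analysis code from the repository: the file name grt_conversations.jsonl and the field name challenge are hypothetical placeholders, and the actual file layout and schema in the GitHub repository may differ.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical file name -- replace with the actual export from the
# Humane Intelligence GitHub repository.
DATASET_PATH = Path("grt_conversations.jsonl")


def load_conversations(path: Path) -> list[dict]:
    """Read one JSON object per line (JSON Lines).

    Adjust this if the repository ships plain JSON or CSV instead.
    """
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_by_topic(records: list[dict]) -> None:
    """Count conversations per challenge topic.

    The field name 'challenge' is an assumption; check the dataset's
    documentation for the real column/key names.
    """
    topics = Counter(rec.get("challenge", "unknown") for rec in records)
    for topic, count in topics.most_common():
        print(f"{topic}: {count}")


if __name__ == "__main__":
    records = load_conversations(DATASET_PATH)
    print(f"Loaded {len(records)} conversations")
    summarize_by_topic(records)
```

The repository's own analysis code remains the authoritative starting point; this sketch only scaffolds loading the data and taking a first look at how conversations are distributed across challenge topics.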
The full report is available here.
Questions can be directed to press@humane-intelligence.org.