Red Teaming Challenges

Royal Society Red Teaming Exercise

October 25, 2023

In an ornate room lined with marble busts of famous scientists, around 40 experts in climate science and disease were hunched over their laptops on October 25, 2023, coaxing a powerful AI system into generating misinformation.

By the end of the day, attendees had overcome the guardrails on the AI system, Meta’s Llama 2, and gotten it to argue that ducks could absorb air pollution, to say that garlic and “miraculous herbs” could help prevent COVID-19 infection, to generate libelous information about a specific climate scientist, and to encourage children to take a vaccine not recommended for them.

The event, held under a gilded ceiling at the prestigious Royal Society in London, was co-organized by Humane Intelligence and highlighted the ways that the world’s most cutting-edge AI systems are still vulnerable to abuse. It came just one week ahead of the world’s first AI Safety Summit, organized by the U.K. government, where global policymakers will convene with AI scientists to discuss the dangers of the fast-moving technology.

The event was carried out with the participation of Meta, which sent an observer and said it would use the findings to strengthen the guardrails of its AI systems.

Building better safety guardrails

Large language models (LLMs), the AI systems that power AI chatbots like ChatGPT, usually come with guardrails to prevent them from generating unsavory or dangerous content, whether that’s misinformation, sexually explicit material, or advice on how to build bioweaponry or malware. But these guardrails have sometimes proved brittle. Computer scientists and hackers have repeatedly shown it is possible to “jailbreak” LLMs (that is, get around their safety features) by prompting them in creative ways. According to critics, these vulnerabilities show the limitations of so-called AI alignment, the nascent practice of ensuring AIs only act in ways that their creators intend.

The tech companies behind LLMs often patch vulnerabilities when they become known. To speed up this process, AI labs have begun encouraging a practice known as red-teaming, in which experts try their hardest to jailbreak LLMs so that their vulnerabilities can be patched. In September, OpenAI launched a “Red Teaming Network” of experts to stress-test its systems. And yesterday the Frontier Model Forum, an industry group set up by Microsoft, OpenAI, Google, and Anthropic, announced a $10 million AI Safety Fund to support safety research, including red-teaming efforts.

Attendees at the London red-teaming event managed to get Llama 2 to generate misleading news articles and tweets containing conspiracy theories worded to appeal to specific audiences, demonstrating how AI systems can be used not only to generate misinformation but also to devise ways to spread it more widely.

Bethan Cracknell Daniels, an expert in dengue fever at Imperial College London who attended the event, successfully prompted the model to generate an ad campaign encouraging all children to get the dengue vaccine, even though the vaccine is not recommended for individuals who have not previously had the disease. The model also fabricated data to support a misleading claim that the vaccine is entirely safe and has performed well in real-world settings.

Large language models have previously been shown to be vulnerable to “adversarial attacks,” where motivated bad actors can, for example, add a specific long string of characters to the end of a prompt in order to jailbreak certain models. The red-teaming event, however, focused on different kinds of vulnerabilities that are more applicable to everyday users.

Attendees agreed, before starting, to a rule that they would “do no harm” with the information they learned at the event.
