Red Teaming Challenges
The Chief Digital and Artificial Intelligence Office (CDAO) + Humane Intelligence: Advancing AI Assurance in Military Medicine through Crowdsourced Red Teaming
January 2, 2025
The Chief Digital and Artificial Intelligence Office (CDAO) has successfully concluded a Crowdsourced AI Red-Teaming (CAIRT) Assurance Program pilot focused on the use of Large Language Model (LLM) chatbots in the context of military medicine. The CAIRT program supports the Department of Defense (DoD) in generating grassroots, crowdsourced approaches to AI Assurance and AI Risk Mitigation. Through crowdsourcing, projects can elicit a large volume of data and involve a wide variety of stakeholders.
This CAIRT LLM pilot was conducted by Humane Intelligence, a tech company building a community of practice around algorithmic evaluations, in collaboration with the Defense Health Agency (DHA) and the Program Executive Office, Defense Healthcare Management Systems (PEO DHMS). Through red-teaming methodology (using adversarial techniques to internally test system robustness), Humane Intelligence detected specific system vulnerabilities. Additionally, red-teaming attracts participants who want to engage with new technologies and who, as possible future beneficiaries, gain the opportunity to contribute to improving the systems. Previously, in the spring of 2024, the CDAO held a valuable red-teaming CAIRT exercise utilizing a financial bounty.
In the latest pilot program, Humane Intelligence utilized crowdsourced red-teaming for two prospective use cases in the context of military medicine: clinical note summarization and a medical advisory chatbot. Over 200 participants, including clinical providers and healthcare analysts from DHA, the Uniformed Services University of the Health Sciences, and the Services, took part in the exercise, which compared three popular LLMs. The exercise uncovered over 800 findings of potential vulnerabilities and biases related to employing LLMs in these prospective use cases. This exercise will result in repeatable and scalable output via the development of benchmark datasets, which can be used to evaluate future vendors and tools for alignment with performance expectations. Furthermore, these findings will play a crucial role in shaping DoD policies and best practices for responsible use of Generative AI (GenAI), ultimately improving military medical care. If, when fielded, these prospective use cases constitute covered AI as defined in OMB M-24-10, they will adhere to all required risk management practices.
“Since applying GenAI for such purposes within the DoD is in earlier stages of piloting and experimentation, this program acts as an essential pathfinder for generating a mass of testing data, surfacing areas for consideration, and validating mitigation options that will shape future research, development, and assurance of GenAI systems that may be deployed in the future,” remarked CDAO’s lead for this initiative, Dr. Matthew Johnson.
As the recent pilot and others have revealed, continued testing of LLMs and AI systems through the CAIRT Assurance Program will be critical to accelerating the CDAO’s AI Rapid Capabilities Cell, improving GenAI mission effectiveness, and contributing to justified confidence across DoD use cases.