Adversarial testing for generative AI safety – Google Research Blog

Posted by Kathy Meier-Hellstern, Building Responsible AI & Data Systems, Director, Google Research

The Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research is committed to advancing the theory and practice of responsible human-centered AI through a lens of culturally-aware research, to meet the needs of billions of users today, and blaze the path forward for a better AI future. The BRAIDS (Building Responsible AI Data and Solutions) team within RAI-HCT aims to simplify the adoption of RAI practices through the utilization of scalable tools, high-quality data, streamlined processes, and novel research with a current emphasis on addressing the unique challenges posed by generative AI (GenAI).

GenAI models have enabled unprecedented capabilities leading to a rapid surge of innovative applications. Google actively leverages GenAI to enhance its products’ utility and to improve lives. While enormously beneficial, GenAI also presents risks for disinformation, bias, and security. In 2018, Google pioneered the AI Principles, emphasizing beneficial use and prevention of harm. Since then, Google has focused on effectively implementing our principles in Responsible AI practices through 1) a comprehensive risk assessment framework, 2) internal governance structures, 3) education, empowering Googlers to integrate AI Principles into their work, and 4) the development of processes and tools that identify, measure, and analyze ethical risks throughout the lifecycle of AI-powered products. The BRAIDS team focuses on the last area, creating tools and techniques for identification of ethical and safety risks in GenAI products that enable teams within Google to apply appropriate mitigations.

What makes GenAI challenging to build responsibly?

The unprecedented capabilities of GenAI models have been accompanied by a new spectrum of potential failures, underscoring the urgency for a comprehensive and systematic RAI approach to understanding and mitigating potential safety concerns before the model is made broadly available. One key technique used to understand potential risks is adversarial testing, which is testing performed to systematically evaluate the models to learn how they behave when provided with malicious or inadvertently harmful inputs across a range of scenarios. To that end, our research has focused on three directions:

Scaled adversarial data generation
Given the diverse user communities, use cases, and behaviors, it is difficult to comprehensively identify critical safety issues prior to launching a product or service. Scaled adversarial data generation with humans-in-the-loop addresses this need by creating test sets that contain a wide range of diverse and potentially unsafe model inputs that stress the model capabilities under adverse circumstances. Our unique focus in BRAIDS lies in identifying societal harms to the diverse user communities impacted by our models.
Automated test set evaluation and community engagement
Scaling the testing process so that many thousands of model responses can be quickly evaluated to learn how the model responds across a wide range of potentially harmful scenarios is aided with automated test set evaluation. Beyond testing with adversarial test sets, community engagement is a key component of our approach to identify “unknown unknowns” and to seed the data generation process.
Rater diversity
Safety evaluations rely on human judgment, which is shaped by community and culture and is not easily automated. To address this, we prioritize research on rater diversity.

Scaled adversarial data generation

High-quality, comprehensive data underpins many key programs across Google. Initially reliant on manual data generation, we’ve made significant strides to automate the adversarial data generation process. A centralized data repository with use-case and policy-aligned prompts is available to jump-start the generation of new adversarial tests. We have also developed multiple synthetic data generation tools based on large language models (LLMs) that prioritize the generation of data sets that reflect diverse societal contexts and that integrate data quality metrics for improved dataset quality and diversity.

Our data quality metrics include:

Analysis of language styles, including query length, query similarity, and diversity of language styles.
Measurement across a wide range of societal and multicultural dimensions, leveraging datasets such as SeeGULL, SPICE, the Societal Context Repository.
Measurement of alignment with Google’s generative AI policies and intended use cases.
Analysis of adversariality to ensure that we examine both explicit (the input is clearly designed to produce an unsafe output) and implicit (where the input is innocuous but the output is harmful) queries.

One of our approaches to scaled data generation is exemplified in our paper on AI-Assisted Red Teaming (AART). AART generates evaluation datasets with high diversity (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions), steered by AI-assisted recipes to define, scope and prioritize diversity within an application context. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality. Separately, we are also working with MLCommons to contribute to public benchmarks for AI Safety.

Adversarial testing and community insights

Evaluating model output with adversarial test sets allows us to identify critical safety issues prior to deployment. Our initial evaluations relied exclusively on human ratings, which resulted in slow turnaround times and inconsistencies due to a lack of standardized safety definitions and policies. We have improved the quality of evaluations by introducing policy-aligned rater guidelines to improve human rater accuracy, and are researching additional improvements to better reflect the perspectives of diverse communities. Additionally, automated test set evaluation using LLM-based auto-raters enables efficiency and scaling, while allowing us to direct complex or ambiguous cases to humans for expert rating.

Beyond testing with adversarial test sets, gathering community insights is vital for continuously discovering “unknown unknowns”. To provide high quality human input that is required to seed the scaled processes, we partner with groups such as the Equitable AI Research Round Table (EARR), and with our internal ethics and analysis teams to ensure that we are representing the diverse communities who use our models. The Adversarial Nibbler Challenge engages external users to understand potential harms of unsafe, biased or violent outputs to end users at scale. Our continuous commitment to community engagement includes gathering feedback from diverse communities and collaborating with the research community, for example during The ART of Safety workshop at the Asia-Pacific Chapter of the Association for Computational Linguistics Conference (IJCNLP-AACL 2023) to address adversarial testing challenges for GenAI.

Rater diversity in safety evaluation

Understanding and mitigating GenAI safety risks is both a technical and social challenge. Safety perceptions are intrinsically subjective and influenced by a wide range of intersecting factors. Our in-depth study on demographic influences on safety perceptions explored the intersectional effects of rater demographics (e.g., race/ethnicity, gender, age) and content characteristics (e.g., degree of harm) on safety assessments of GenAI outputs. Traditional approaches largely ignore inherent subjectivity and the systematic disagreements among raters, which can mask important cultural differences. Our disagreement analysis framework surfaced a variety of disagreement patterns between raters from diverse backgrounds including also with “ground truth” expert ratings. This paves the way to new approaches for assessing quality of human annotation and model evaluations beyond the simplistic use of gold labels. Our NeurIPS 2023 publication introduces the DICES (Diversity In Conversational AI Evaluation for Safety) dataset that facilitates nuanced safety evaluation of LLMs and accounts for variance, ambiguity, and diversity in various cultural contexts.

Summary

GenAI has resulted in a technology transformation, opening possibilities for rapid development and customization even without coding. However, it also comes with a risk of generating harmful outputs. Our proactive adversarial testing program identifies and mitigates GenAI risks to ensure inclusive model behavior. Adversarial testing and red teaming are essential components of a Safety strategy, and conducting them in a comprehensive manner is essential. The rapid pace of innovation demands that we constantly challenge ourselves to find “unknown unknowns” in cooperation with our internal partners, diverse user communities, and other industry experts.