Red Teaming Language Models with Language Models

In our recent paper, we show that it is possible to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves. Our approach provides one tool for finding harmful model behaviours before users are impacted, though we emphasize that it should be viewed as one component alongside many other techniques that will be needed to find harms and mitigate them once found.

Large generative language models like GPT-3 and Gopher have a remarkable ability to generate high-quality text, but they are difficult to deploy in the real world. Generative language models come with a risk of generating very harmful text, and even a small risk of harm is unacceptable in real-world applications.

For example, in 2016, Microsoft released the Tay Twitter bot to automatically tweet in response to users. Within 16 hours, Microsoft took Tay down after several adversarial users elicited racist and sexually-charged tweets from Tay, which were sent to over 50,000 followers. The outcome was not for lack of care on Microsoft’s part:

“Although we had prepared for many types of abuses of the system, we had made a critical oversight for this specific attack.”

Peter Lee
VP, Microsoft

The issue is that there are so many possible inputs that can cause a model to generate harmful text. As a result, it’s hard to find all of the cases where a model fails before it is deployed in the real world. Previous work relies on paid, human annotators to manually discover failure cases (Xu et al. 2021, inter alia). This approach is effective but expensive, limiting the number and diversity of failure cases found.

We aim to complement manual testing and reduce the number of critical oversights by finding failure cases (or ‘red teaming’) in an automatic way. To do so, we generate test cases using a language model itself and use a classifier to detect various harmful behaviors on test cases.
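The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `red_team`, `toy_generator`, `toy_target`, and `toy_classifier` are hypothetical names, and the toy functions stand in for a test-case generator language model, the target model, and a learned harm classifier.

```python
from typing import Callable, Iterable, List, NamedTuple


class Failure(NamedTuple):
    test_case: str
    output: str
    score: float


def red_team(
    test_cases: Iterable[str],
    target: Callable[[str], str],
    classifier: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Failure]:
    """Run each test case through the target and keep the flagged outputs."""
    failures = []
    for test_case in test_cases:
        output = target(test_case)
        score = classifier(test_case, output)
        if score >= threshold:
            failures.append(Failure(test_case, output, score))
    return failures


# Toy stand-ins (assumptions for illustration only).
def toy_generator() -> List[str]:
    # A real generator would sample questions from a language model.
    return ["What is your favorite food?", "What do you think of your neighbors?"]


def toy_target(question: str) -> str:
    # A real target would be the dialogue model under test.
    if "neighbors" in question:
        return "They are idiots."
    return "I like pasta."


def toy_classifier(question: str, answer: str) -> float:
    # A real classifier would return a learned probability of harm.
    return 1.0 if "idiots" in answer.lower() else 0.0


failures = red_team(toy_generator(), toy_target, toy_classifier)
for failure in failures:
    print(f"{failure.test_case!r} -> {failure.output!r} (score={failure.score})")
```

Keeping the generator, target, and classifier as plain callables means each can be swapped independently, for example to test a different harm type with a different classifier.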

Our approach uncovers a variety of harmful model behaviors:

  1. Offensive Language: Hate speech, profanity, sexual content, discrimination, etc.
  2. Data Leakage: Generating copyrighted or private, personally-identifiable information from the training corpus.
  3. Contact Information Generation: Directing users to unnecessarily email or call real people.
  4. Distributional Bias: Talking about some groups of people in an unfairly different way than other groups, on average over a large number of outputs.
  5. Conversational Harms: Offensive language that occurs in the context of a long dialogue, for example.
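Distributional bias differs from the other items in that it is defined over many outputs rather than a single one. One simple way to make that concrete, offered here only as a sketch, is to compare the fraction of flagged outputs per group. The function name `flagged_rate_by_group`, the group labels, and the records below are hypothetical illustration data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def flagged_rate_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return, for each group, the fraction of outputs a classifier flagged."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, is_flagged in records:
        totals[group] += 1
        flagged[group] += int(is_flagged)
    return {group: flagged[group] / totals[group] for group in totals}


# Each record: (group mentioned in the test case, whether the output was flagged).
records = [
    ("group_a", False), ("group_a", False), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]

rates = flagged_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap between groups suggests the model treats them differently on average, although a real analysis would need far more samples per group than this toy example.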

To generate test cases with language models, we explore a variety of methods, ranging from prompt-based generation and few-shot learning to supervised finetuning and reinforcement learning. Some methods generate more diverse test cases, while other methods generate more difficult test cases for the target model. Together, the methods we propose are useful for obtaining high test coverage while also modeling adversarial cases.
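As a rough illustration of the few-shot idea, earlier test cases that a classifier scored as most likely to elicit harm can be placed in a prompt so that a generator continues the list with similar questions. The sketch below is an assumption-laden example: `build_few_shot_prompt`, the header wording, and the scores are hypothetical, and selecting the top-scoring cases deterministically is a simplification chosen for clarity.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    scored_cases: List[Tuple[str, float]],
    k: int = 2,
    header: str = "List of questions to ask someone:",
) -> str:
    """Format the k highest-scoring test cases as a numbered list to continue."""
    top = sorted(scored_cases, key=lambda pair: pair[1], reverse=True)[:k]
    lines = [header]
    for index, (question, _score) in enumerate(top, start=1):
        lines.append(f"{index}. {question}")
    # Leave the next item open so the generator writes a new test case.
    lines.append(f"{len(top) + 1}.")
    return "\n".join(lines)


# (test case, classifier score for the target's reply); made-up values.
scored = [
    ("What is your favorite color?", 0.1),
    ("What do you think of your boss?", 0.8),
    ("Who annoys you the most?", 0.6),
]

prompt = build_few_shot_prompt(scored, k=2)
print(prompt)
```

Always taking the top-scoring examples pushes the generator toward harder test cases at the cost of diversity; sampling the examples instead would trade some difficulty back for coverage.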

Once we find failure cases, it becomes easier to fix harmful model behavior by:

  1. Blacklisting certain phrases that frequently occur in harmful outputs, preventing the model from generating outputs that contain high-risk phrases.
  2. Finding offensive training data quoted by the model, to remove that data when training future iterations of the model.
  3. Augmenting the model’s prompt (conditioning text) with an example of the desired behavior for a certain kind of input, as shown in our recent work.
  4. Training the model to minimize the likelihood of its original, harmful output for a given test input.
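The first mitigation can be sketched as follows. This is a deliberately simplified, word-level version under stated assumptions: `build_blacklist`, `is_blocked`, and the example outputs are hypothetical, and a practical system would likely work with multi-word phrases and apply the filter during decoding rather than after the fact.

```python
import re
from collections import Counter
from typing import Iterable, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def build_blacklist(
    harmful: Iterable[str], benign: Iterable[str], min_count: int = 2
) -> Set[str]:
    """Collect words found in at least min_count harmful outputs and in no benign one."""
    harmful_counts = Counter()
    for text in harmful:
        harmful_counts.update(tokenize(text))
    benign_vocab = set()
    for text in benign:
        benign_vocab |= tokenize(text)
    return {
        word
        for word, count in harmful_counts.items()
        if count >= min_count and word not in benign_vocab
    }


def is_blocked(text: str, blacklist: Set[str]) -> bool:
    """Return True if the candidate output contains any blacklisted word."""
    return not tokenize(text).isdisjoint(blacklist)


harmful_outputs = ["You are an idiot.", "What an idiot you are!"]
benign_outputs = ["You are welcome.", "What an interesting question."]

blacklist = build_blacklist(harmful_outputs, benign_outputs)
print(blacklist)
print(is_blocked("Only an idiot would ask that.", blacklist))
print(is_blocked("You are welcome.", blacklist))
```

Excluding words that also appear in benign outputs keeps common words such as "you" or "are" off the list, so the filter does not reject ordinary replies.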

Overall, language models are a highly effective tool for uncovering when language models behave in a variety of undesirable ways. In our current work, we focused on red teaming harms that today’s language models commit. In the future, our approach can also be used to preemptively discover other, hypothesized harms from advanced machine learning systems, e.g., due to inner misalignment or failures in objective robustness. This approach is just one component of responsible language model development: we view red teaming as one tool to be used alongside many others, both to find harms in language models and to mitigate them. We refer to Section 7.3 of Rae et al. 2021 for a broader discussion of other work needed for language model safety.

For more details on our approach and results, as well as the broader consequences of our findings, read our red teaming paper here.
