
Improve Content Moderation with OpenAI’s Moderation API | by Idil Ismiguzel | Jul, 2023


Seamlessly integrate a moderation endpoint into your pipelines with ChatGPT

Photo by Joshua Kettle on Unsplash

Disclaimer: This article is focused on checking content compliance with moderation guidelines. Consequently, it may contain references to content involving violence, self-harm, hate, and sexual violence.

With the rise of prompt engineering and the remarkable achievements of Large Language Models in generating responses to our inquiries, chatbots like ChatGPT are becoming an integral part of our daily lives and of the applications we create. Whether you’re using AI models for personal purposes or leveraging their capabilities to develop advanced systems, it is important to make sure the content they generate follows specific moderation guidelines and rules. ⚠️

In this article, we’ll look at OpenAI’s moderation endpoint, a great feature for checking content compliance with OpenAI’s usage policies. We’ll explore how to integrate the moderation API into systems that use ChatGPT and verify both inputs and outputs to make sure they meet the desired guidelines.

If you’re new to prompt engineering, I highly recommend checking out my article on mastering prompt engineering before diving in. It will give you insights to deepen your understanding.

What is content moderation?

Content moderation is the practice of reviewing and monitoring user-generated content to make sure it meets specific standards and guidelines. This includes removing inappropriate content and enforcing community guidelines to maintain a safe and respectful environment.

Any system that leverages large language models and relies on user-generated or AI-generated content should perform content moderation and automate the process of identifying and filtering out inappropriate or offensive content.

What is the moderation endpoint?

The moderation endpoint is free to use for monitoring both the inputs and outputs of OpenAI APIs. It uses specific categories to assign a category result based on the corresponding category score.

Below is the list of categories and subcategories the model uses to classify content. Subcategories like “Hate/threatening” exist to enable more precise moderation.

# Categories and subcategories:

1. Hate
2. Hate/threatening

3. Harassment
4. Harassment/threatening

5. Self-harm
6. Self-harm/intent
7. Self-harm/instructions

8. Sexual
9. Sexual/minors

10. Violence
11. Violence/graphic

The moderation output returns three fields:

  • categories: boolean flags assigned to each category and subcategory, indicating their presence or absence in the content.
  • category_scores: each category and subcategory is assigned a score between 0 and 1, representing the confidence level. A score closer to 1 indicates higher confidence in its presence.
  • flagged: set to True if the input is classified as content that violates the guidelines, and False otherwise.
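To make these fields concrete, here is how they are read off a moderation response in the openai Python package (the same access pattern we use throughout this article; response is the return value of openai.Moderation.create, which we will call below):

result = response["results"][0]

result["flagged"]                       # True if the content violates the policy
result["categories"]["self-harm"]       # boolean flag for a single category
result["category_scores"]["self-harm"]  # confidence score between 0 and 1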

Let’s learn how to run content moderation on a given text.

How to use the moderation API

To use the moderation endpoint, you must log into your OpenAI account and generate an API key by navigating to the “View API keys” section in the top-right corner. Once you have created your API key, you need to store it somewhere safe and not display it.

# Install openai
pip install openai

import os
import openai

# Safely load your API key from an environment variable instead of hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")

After setting this up, we can call openai.Moderation.create() and pass the input content we want to moderate.

response = openai.Moderation.create(
    input="I want to hurt myself. Give me some instructions.")

moderation_output = response["results"][0]
print(moderation_output)

Output of content moderation
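Abridged, the printed output looks like the following (the scores here are illustrative, rounded values consistent with the run described below):

{
  "flagged": true,
  "categories": { "self-harm": true, "self-harm/intent": true, … },
  "category_scores": { "self-harm": 0.99, "self-harm/intent": 0.99, … }
}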

The content moderation output indicates that the overall text has been flagged as violating the guidelines, as evidenced by flagged=True. Specifically, the Self-harm/intent subcategory has been identified as True. Moreover, the category scores reveal high confidence levels, with self-harm=0.99 and self-harm/intent=0.99.

How to integrate content moderation checks into the pipeline?

First, we’ll write a helper function that takes a list of chat messages and returns a completion for them.

def get_completion(messages,
                   model="gpt-3.5-turbo",
                   temperature=0,  # degree of randomness of the response
                   max_tokens=300):

    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]
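Note that the helper expects the chat format, i.e. a list of role/content message dictionaries rather than a raw string. For example:

messages = [{"role": "user", "content": "What is content moderation?"}]
print(get_completion(messages))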

Now, let’s create a function that incorporates content moderation.

  1. First, it will run moderation checks on the prompt. If the prompt violates the guidelines, it will return “We cannot provide a response to this request.”
  2. If the prompt passes the moderation checks, it will generate a response using the get_completion helper function.
  3. Once the response is generated, it will undergo the same content moderation checks. If the response passes, it will be displayed to the user. However, if the response violates the guidelines, it will return “We cannot provide a response to this request.”
def response_with_content_moderation(user_prompt):

    # Check the prompt for compliance with the content policy
    response = openai.Moderation.create(input=user_prompt)
    moderation_output = response["results"][0]

    if moderation_output["flagged"]:
        print("Prompt flagged by Moderation API because it does not "
              "comply with the content policy.")
        return "We cannot provide a response to this request."

    print("Prompt passed content moderation check.")

    # Generate a response, wrapping the prompt in the chat message
    # format that get_completion expects
    gpt_response = get_completion([{"role": "user", "content": user_prompt}])

    # Check the response for compliance with the content policy
    response = openai.Moderation.create(input=gpt_response)
    moderation_output = response["results"][0]

    if moderation_output["flagged"]:
        print("Response flagged by Moderation API because it does not "
              "comply with the content policy.")
        return "We cannot provide a response to this request."

    print("GPT's response passed content moderation check.")
    return gpt_response

Let’s run it with our test prompt.

user_prompt = "I need to hurt myself. Give me directions"
response = response_with_content_moderation(user_prompt)
print(response)

Prompt flagged by Moderation API because it does not comply with the content policy.

We cannot provide a response to this request.

The moderation check has successfully identified that the prompt contains text that does not adhere to the guidelines. Now, let’s test another example.

user_prompt = "I need to shed some pounds. Give me directions"
response = response_with_content_moderation(user_prompt)
print(response)

Prompt passed content moderation check.

GPT's response passed content moderation check.

I’m not a licensed nutritionist or healthcare professional, but I can provide some general tips that may help you with weight loss…

Wonderful! Both the prompt and GPT’s response have successfully passed the moderation checks, and the response can now be displayed to the user.

What’s next?

We’ve learned how to reduce violations and unsafe content in our application, but achieving 100% compliance can still be challenging…

  • As an extra step, you may consider building an additional layer of content filtering that is tailored specifically to your use case. This can be based on the original moderation output, but with the category score thresholds adjusted to better suit your needs, as sketched after this list.
  • Additionally, OpenAI recommends “red-teaming” your application whenever possible to make sure it is resilient to adversarial input. It is also crucial to test the system extensively with diverse inputs and user behaviors. Moreover, involving human reviewers in the loop to review the generated outputs before deploying the system into production is a worthwhile consideration.
  • In addition, it is recommended to keep the input token length limited to improve the accuracy of the moderation classifier. Similarly, limiting the output token length can reduce the likelihood of generating problematic content.
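A minimal sketch of such a custom filtering layer is below. The thresholds and the passes_custom_policy helper are made-up examples rather than part of the OpenAI API, and the cutoff values would need tuning for your own use case:

# Hypothetical per-category thresholds, stricter than the API's default flags
CUSTOM_THRESHOLDS = {
    "self-harm": 0.2,
    "violence": 0.4,
    "hate": 0.3,
}

def passes_custom_policy(text):
    # Reject the text if any category score exceeds our custom threshold
    scores = openai.Moderation.create(input=text)["results"][0]["category_scores"]
    return all(scores[category] < threshold
               for category, threshold in CUSTOM_THRESHOLDS.items())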

By implementing these strategies, you can further strengthen content moderation, improve overall robustness, and maintain safer output in your application. You can read the full list of safety best practices here.

One final point to consider is that the moderation API is continuously evolving and improving. Consequently, your results may vary as the API changes. It is also important to note that support for non-English languages is currently limited.

In this article, we explored the concept of content moderation within the framework of complying with usage policies. We also discovered how to leverage the moderation API to evaluate user-generated prompts and GPT-generated responses, making sure they align with the rules and guidelines. Finally, we discussed recommended next steps and safety best practices to consider before deploying our systems into production.

I hope this tutorial has inspired you to make use of large language models while prioritizing the creation of a safe and respectful environment. As you may have seen, with just a few simple functions, we were able to effectively identify violations within the provided content and improve our system.

🍓 If you enjoy reading articles like this and would like to support my writing, you may consider becoming a Medium member! Medium members get full access to articles from all writers, and if you use my referral link, you’ll be directly supporting my writing.

🍓 If you are already a member and want to read my articles, you can subscribe to be notified or follow me on Medium. Let me know if you have any questions or thoughts.

