
RLHF on Google Cloud


Foundation models are large neural network models that can be adapted with minimal tuning to a wide range of tasks and exhibit impressive capabilities in generating high-quality text, images, speech, code, and more. Enterprises are leveraging foundation models to power different generative AI use cases, such as generating creative blog articles or improving customer support.

But what counts as a high-quality result varies. For foundation models to best serve specific needs, organizations need to tune them so that they behave appropriately and deliver the desired responses. Reinforcement Learning from Human Feedback (RLHF) is a popular method through which foundation models like large language models (LLMs), initially trained on a general corpus of text data, can be aligned to complex human values. In the context of enterprise use cases, RLHF leverages human feedback to help the model produce outputs that meet unique requirements.

What is RLHF?

RLHF tuning consists of two phases: reward modeling and reinforcement learning.

1. Reward modeling
For reward modeling, data is collected in the form of comparisons. First, we feed the same prompt into one or more LLMs to create multiple responses. Then, we ask human raters to rank these responses from good to bad. We take all possible pairs of these responses; within each pair, the ranking tells us that one response is preferred over the other. We do this for many prompts, and in this way we create the “human preference dataset.”
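
To make this concrete, here is a minimal Python sketch of how a human ranking for one prompt can be expanded into pairwise comparisons. The function and field names (`build_preference_pairs`, `chosen`, `rejected`) are illustrative, not a specific dataset schema:

```python
from itertools import combinations

def build_preference_pairs(prompt, ranked_responses):
    """Expand a human ranking (best first) into pairwise comparisons.

    Each pair records which of two responses the raters preferred for
    the same prompt; these rows form the human preference dataset.
    """
    pairs = []
    for better, worse in combinations(ranked_responses, 2):
        # combinations() preserves list order, so `better` always comes
        # earlier in the ranking and is therefore the preferred response.
        pairs.append({
            "prompt": prompt,
            "chosen": better,
            "rejected": worse,
        })
    return pairs

# Example: 3 ranked responses yield 3 comparison pairs.
pairs = build_preference_pairs(
    "Summarize our refund policy in one sentence.",
    ["Clear, accurate summary", "Partially correct summary", "Off-topic reply"],
)
print(len(pairs))  # 3
```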

We train the reward model to act as a scoring function, so that it scores how good a response is for a given prompt. Recall that for each prompt, we have a ranked list of multiple responses. The reward model’s scores need to agree with the ranking as much as possible. We formulate this into a loss function and train the reward model to make reward predictions that are consistent with the ground truth ranking.
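
A standard way to turn rankings into a trainable objective is a pairwise loss: for each comparison, the reward model should score the preferred response higher than the rejected one. Below is a minimal NumPy sketch of such a loss; in practice the scores would come from the reward model itself, and the exact loss used by any given system may differ:

```python
import numpy as np

def pairwise_ranking_loss(chosen_scores, rejected_scores):
    """Loss over a batch of comparison pairs.

    For each pair, the loss is -log(sigmoid(r_chosen - r_rejected)):
    small when the reward model scores the preferred response higher,
    large when it gets the order wrong. np.logaddexp(0, -x) computes
    -log(sigmoid(x)) in a numerically stable way.
    """
    margin = np.asarray(chosen_scores) - np.asarray(rejected_scores)
    return np.mean(np.logaddexp(0.0, -margin))

# Reward model agrees with the human ranking -> low loss.
print(pairwise_ranking_loss([2.0, 1.5], [0.5, -0.3]))
# Reward model disagrees with the ranking -> high loss.
print(pairwise_ranking_loss([0.1, -1.0], [1.2, 0.8]))
```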

2. Reinforcement learning
Once we have a reward model, we can score the quality of any arbitrary <prompt, response> pair. In this step, we need the “prompt dataset,” which only contains the prompt (i.e., it is unlabeled). We draw a prompt from the dataset, use the LLM to generate a response, and use the reward model to score the quality of the response. If the response is high-quality, then all the tokens in the response (conditional on the prompt) are going to be “reinforced,” i.e., they will have a higher probability of being generated in the future. In this way, we can optimize the LLM to generate responses that maximize the reward. This algorithm is known as reinforcement learning (RL).
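
The toy sketch below illustrates the reinforcement step at a miniature scale: a softmax “policy” over a four-word vocabulary samples a response, a stand-in reward function scores it, and a REINFORCE-style update raises the probability of tokens that appeared in high-reward responses. A production RLHF system (e.g., PPO applied to an LLM) is far more involved, but the reinforcement signal works the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["great", "helpful", "meh", "spam"]
logits = np.zeros(len(vocab))          # toy "policy": one logit per token

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(tokens):
    # Stand-in for the learned reward model: prefers "great"/"helpful".
    good = {"great", "helpful"}
    return sum(1.0 if t in good else -1.0 for t in tokens)

learning_rate = 0.1
for step in range(200):
    probs = softmax(logits)
    # "Generate a response": sample 3 tokens from the current policy.
    idx = rng.choice(len(vocab), size=3, p=probs)
    r = reward([vocab[i] for i in idx])
    # REINFORCE: gradient of the log-prob of the sampled tokens, scaled
    # by the reward, nudges the policy toward high-reward responses.
    grad = np.zeros_like(logits)
    for i in idx:
        one_hot = np.zeros_like(logits)
        one_hot[i] = 1.0
        grad += (one_hot - probs) * r
    logits += learning_rate * grad

print({t: round(p, 3) for t, p in zip(vocab, softmax(logits))})
# Tokens the reward function likes end up with higher probability.
```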

RLHF tuning requires orchestrating these two phases, handling large-scale distributed training on multi-host TPUs or GPUs using data parallelism and model partitioning, and optimizing for efficient throughput via computational graph compilation. The intensive computation also requires top-notch hardware accelerators for fast training. Vertex AI customers can implement RLHF using a Vertex AI Pipeline that encapsulates the RLHF algorithm to tune PaLM 2, FLAN-T5, and Llama 2 models. This helps align the LLM with the enterprise’s nuanced preferences and values for specific use cases.
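
As a rough sketch, submitting such an RLHF tuning pipeline from the Vertex AI SDK could look like the following. The project, bucket paths, template file, and the keys inside `parameter_values` are illustrative assumptions, not the definitive interface; consult the current Vertex AI documentation for the exact pipeline template and supported parameters:

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                    # illustrative project ID
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

# Parameter names below are illustrative; the actual RLHF pipeline
# template and its inputs are documented in the Vertex AI docs.
job = aiplatform.PipelineJob(
    display_name="rlhf-tuning",
    template_path="rlhf_pipeline.yaml",      # compiled RLHF pipeline template
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values={
        "preference_dataset": "gs://my-bucket/data/preference.jsonl",
        "prompt_dataset": "gs://my-bucket/data/prompts.jsonl",
        "large_model_reference": "llama-2-7b",
        "reward_model_train_steps": 1000,
        "reinforcement_learning_train_steps": 500,
    },
)
job.run()
```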

