Training Diffusion Models with Reinforcement Learning

Diffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood estimation problem, where the model is trained to generate samples that match the training data as closely as possible.

However, most use cases of diffusion models are not directly concerned with matching the training data, but instead with a downstream objective. We don’t just want an image that looks like existing images, but one that has a specific type of appearance; we don’t just want a drug molecule that is physically plausible, but one that is as effective as possible. In this post, we show how diffusion models can be trained on these downstream objectives directly using reinforcement learning (RL). To do this, we finetune Stable Diffusion on a variety of objectives, including image compressibility, human-perceived aesthetic quality, and prompt-image alignment. The last of these objectives uses feedback from a large vision-language model to improve the model’s performance on unusual prompts, demonstrating how powerful AI models can be used to improve each other without any humans in the loop.

A diagram illustrating the prompt-image alignment objective. It uses LLaVA, a large vision-language model, to evaluate generated images.

Denoising Diffusion Policy Optimization

When turning diffusion into an RL problem, we make only the most basic assumption: given a sample (e.g. an image), we have access to a reward function that we can evaluate to tell us how “good” that sample is. Our goal is for the diffusion model to generate samples that maximize this reward function.

Diffusion models are typically trained using a loss function derived from maximum likelihood estimation (MLE), meaning they are encouraged to generate samples that make the training data look more likely. In the RL setting, we no longer have training data, only samples from the diffusion model and their associated rewards. One way we can still use the same MLE-motivated loss function is by treating the samples as training data and incorporating the rewards by weighting the loss for each sample by its reward. This gives us an algorithm that we call reward-weighted regression (RWR), after existing algorithms from RL literature.
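As a rough illustration of the reward-weighting idea, the sketch below scales each sample's denoising loss by a weight computed from its reward. The exponential (softmax) weighting, the `temperature` parameter, and the function names are assumptions made for this example; the post itself only says that the loss for each sample is weighted by its reward.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Turn rewards into non-negative weights that sum to 1 (softmax over the batch)."""
    # Subtract the maximum reward before exponentiating for numerical stability.
    max_reward = max(rewards)
    exps = [math.exp((r - max_reward) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_loss(per_sample_losses, rewards, temperature=1.0):
    """Weighted sum of per-sample denoising losses; high-reward samples count more."""
    weights = reward_weights(rewards, temperature)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))
```

In a real training loop, `per_sample_losses` would be the usual diffusion training losses evaluated on samples drawn from the model itself.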

However, there are a few problems with this approach. One is that RWR is not a particularly exact algorithm: it maximizes the reward only approximately (see Nair et al., Appendix A). The MLE-inspired loss for diffusion is also not exact and is instead derived using a variational bound on the true likelihood of each sample. This means that RWR maximizes the reward through two levels of approximation, which we find significantly hurts its performance.

We evaluate two variants of DDPO and two variants of RWR on three reward functions and find that DDPO consistently achieves the best performance.

The key insight of our algorithm, which we call denoising diffusion policy optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent only gets a reward on the final step of each denoising trajectory when the final sample is produced. This framework allows us to apply many powerful algorithms from RL literature that are designed specifically for multi-step MDPs. Instead of using the approximate likelihood of the final sample, these algorithms use the exact likelihood of each denoising step, which is extremely easy to compute.
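To make the per-step likelihood concrete: in common diffusion samplers, each denoising step draws the next latent from a Gaussian whose mean is predicted by the model, so its log-likelihood has a closed form. The sketch below assumes an isotropic Gaussian step with a known standard deviation `sigma`; the flat-list representation of latents and the function names are illustrative assumptions, not taken from the authors' code.

```python
import math


def step_log_prob(x_prev, mean, sigma):
    """Log-density of x_prev under N(mean, sigma^2 I), summed over dimensions."""
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(
        log_norm - 0.5 * ((x - m) / sigma) ** 2 for x, m in zip(x_prev, mean)
    )


def trajectory_step_log_probs(latents, means, sigmas):
    """Per-step log-probabilities along one denoising trajectory.

    latents[t + 1] is the sample drawn at step t from N(means[t], sigmas[t]^2 I).
    """
    return [
        step_log_prob(latents[t + 1], means[t], sigmas[t])
        for t in range(len(means))
    ]
```

Each entry of the returned list plays the role of the log-probability of one action in the MDP view, with the reward arriving only after the last step.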

We chose to apply policy gradient algorithms due to their ease of implementation and past success in language model finetuning. This led to two variants of DDPO: DDPO_SF, which uses the simple score function estimator of the policy gradient, also known as REINFORCE; and DDPO_IS, which uses a more powerful importance sampled estimator. DDPO_IS is our best-performing algorithm and its implementation closely follows that of proximal policy optimization (PPO).
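The two estimators can be sketched as surrogate losses over the per-step log-probabilities of one trajectory. This is a minimal illustration under stated assumptions: `advantage` stands for a (possibly normalized) reward for the trajectory, `clip_range=0.2` is a conventional PPO default rather than a value from the post, and plain floats are used here, whereas a real implementation would use differentiable tensors so that gradients flow through the new log-probabilities.

```python
import math


def score_function_loss(log_probs, advantage):
    """REINFORCE-style surrogate: minimizing it raises the log-probability of
    denoising steps that belong to trajectories with positive advantage."""
    return -advantage * sum(log_probs)


def clipped_is_loss(new_log_probs, old_log_probs, advantage, clip_range=0.2):
    """PPO-style clipped importance-sampled surrogate, averaged over steps."""
    losses = []
    for new_lp, old_lp in zip(new_log_probs, old_log_probs):
        # Importance weight between the current policy and the one that sampled.
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Take the more pessimistic (larger) of the unclipped and clipped losses.
        losses.append(max(-advantage * ratio, -advantage * clipped))
    return sum(losses) / len(losses)
```

The importance-sampled form allows several optimization steps on the same batch of trajectories, with the clipping limiting how far the policy can move from the one that generated the samples.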

Finetuning Stable Diffusion Using DDPO

For our main results, we finetune Stable Diffusion v1-4 using DDPO_IS. We have four tasks, each defined by a different reward function:

Compressibility: How easy is the image to compress using the JPEG algorithm? The reward is the negative file size of the image (in kB) when saved as a JPEG.
Incompressibility: How hard is the image to compress using the JPEG algorithm? The reward is the positive file size of the image (in kB) when saved as a JPEG.
Aesthetic Quality: How aesthetically appealing is the image to the human eye? The reward is the output of the LAION aesthetic predictor, which is a neural network trained on human preferences.
Prompt-Image Alignment: How well does the image represent what was asked for in the prompt? This one is a bit more complicated: we feed the image into LLaVA, ask it to describe the image, and then compute the similarity between that description and the original prompt using BERTScore.
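The two JPEG-based rewards are simple enough to sketch directly. The example below uses Pillow to encode an image into an in-memory buffer and measures its size; the `quality=95` setting and the convention of 1 kB = 1000 bytes are assumptions made for this sketch, since the post does not state them.

```python
import io

from PIL import Image


def jpeg_size_kb(image, quality=95):
    """Size in kB of the image when encoded as a JPEG in memory."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    return len(buffer.getvalue()) / 1000.0


def compressibility_reward(image):
    """Smaller JPEG files give a higher (less negative) reward."""
    return -jpeg_size_kb(image)


def incompressibility_reward(image):
    """Larger JPEG files give a higher reward."""
    return jpeg_size_kb(image)
```

A flat single-color image should score higher on compressibility than an image of random noise, which is the behavior the finetuned models end up exploiting.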

Since Stable Diffusion is a text-to-image model, we also need to pick a set of prompts to give it during finetuning. For the first three tasks, we use simple prompts of the form “a(n) [animal]”. For prompt-image alignment, we use prompts of the form “a(n) [animal] [activity]”, where the activities are “washing dishes”, “playing chess”, and “riding a bike”. We found that Stable Diffusion often struggled to produce images that matched the prompt for these unusual scenarios, leaving plenty of room for improvement with RL finetuning.
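The prompt templates above could be generated along the following lines. The three activities come from the post; the helper names and the vowel-based choice between “a” and “an” are illustrative assumptions, and the animal list must be supplied by the caller because the post does not enumerate it.

```python
import random

# The three activities named in the post.
ACTIVITIES = ["washing dishes", "playing chess", "riding a bike"]


def with_article(noun):
    """Prefix a noun with "a" or "an" using a simple first-letter vowel check."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def sample_prompt(animals, rng, with_activity=False):
    """Sample "a(n) [animal]" or, optionally, "a(n) [animal] [activity]"."""
    prompt = with_article(rng.choice(animals))
    if with_activity:
        prompt += " " + rng.choice(ACTIVITIES)
    return prompt
```

For example, `sample_prompt(["cat", "owl"], random.Random(0), with_activity=True)` returns a prompt such as “an owl playing chess”, depending on the seed.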

First, we illustrate the performance of DDPO on the simple rewards (compressibility, incompressibility, and aesthetic quality). All of the images are generated with the same random seed. In the top left quadrant, we illustrate what “vanilla” Stable Diffusion generates for nine different animals; all of the RL-finetuned models show a clear qualitative difference. Interestingly, the aesthetic quality model (top right) tends towards minimalist black-and-white line drawings, revealing the kinds of images that the LAION aesthetic predictor considers “more aesthetic”.

Next, we demonstrate DDPO on the more complex prompt-image alignment task. Here, we show several snapshots from the training process: each series of three images shows samples for the same prompt and random seed over time, with the first sample coming from vanilla Stable Diffusion. Interestingly, the model shifts towards a more cartoon-like style, which was not intentional. We hypothesize that this is because animals doing human-like activities are more likely to appear in a cartoon-like style in the pretraining data, so the model shifts towards this style to more easily align with the prompt by leveraging what it already knows.

Unexpected Generalization

Surprising generalization has been found to arise when finetuning large language models with RL: for example, models finetuned on instruction-following only in English often improve in other languages. We find that the same phenomenon occurs with text-to-image diffusion models. For example, our aesthetic quality model was finetuned using prompts that were selected from a list of 45 common animals. We find that it generalizes not only to unseen animals but also to everyday objects.

Our prompt-image alignment model used the same list of 45 common animals during training, and only three activities. We find that it generalizes not only to unseen animals but also to unseen activities, and even novel combinations of the two.

Overoptimization

It’s well-known that finetuning on a reward function, especially a learned one, can lead to reward overoptimization, where the model exploits the reward function to achieve a high reward in a non-useful way. Our setting is no exception: in all the tasks, the model eventually destroys any meaningful image content to maximize reward.

We also discovered that LLaVA is susceptible to typographic attacks: when optimizing for alignment with respect to prompts of the form “[n] animals”, DDPO was able to successfully fool LLaVA by instead generating text loosely resembling the correct number.

There is currently no general-purpose method for preventing overoptimization, and we highlight this problem as an important area for future work.

Conclusion

Diffusion models are hard to beat when it comes to producing complex, high-dimensional outputs. However, so far they’ve mostly been successful in applications where the goal is to learn patterns from lots and lots of data (for example, image-caption pairs). What we’ve found is a way to effectively train diffusion models in a way that goes beyond pattern-matching, and without necessarily requiring any training data. The possibilities are limited only by the quality and creativity of your reward function.

The way we used DDPO in this work is inspired by the recent successes of language model finetuning. OpenAI’s GPT models, like Stable Diffusion, are first trained on huge amounts of Internet data; they are then finetuned with RL to produce useful tools like ChatGPT. Typically, their reward function is learned from human preferences, but others have more recently figured out how to produce powerful chatbots using reward functions based on AI feedback instead. Compared to the chatbot regime, our experiments are small-scale and limited in scope. But considering the enormous success of this “pretrain + finetune” paradigm in language modeling, it certainly seems like it’s worth pursuing further in the world of diffusion models. We hope that others can build on our work to improve large diffusion models, not just for text-to-image generation, but for many exciting applications such as video generation, music generation, image editing, protein synthesis, robotics, and more.

Furthermore, the “pretrain + finetune” paradigm is not the only way to use DDPO. As long as you have a good reward function, there’s nothing stopping you from training with RL from the start. While this setting is as-yet unexplored, this is a place where the strengths of DDPO could really shine. Pure RL has long been applied to a wide variety of domains ranging from playing games to robotic manipulation to nuclear fusion to chip design. Adding the powerful expressivity of diffusion models to the mix has the potential to take existing applications of RL to the next level, or even to discover new ones.

This post is based on the following paper:

If you want to learn more about DDPO, you can check out the paper, website, original code, or get the model weights on Hugging Face. If you want to use DDPO in your own project, check out my PyTorch + LoRA implementation, where you can finetune Stable Diffusion with less than 10GB of GPU memory!

If DDPO inspires your work, please cite it with:

@misc{black2023ddpo,
      title={Training Diffusion Models with Reinforcement Learning},
      author={Kevin Black and Michael Janner and Yilun Du and Ilya Kostrikov and Sergey Levine},
      year={2023},
      eprint={2305.13301},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}