The NExT Research Center at the National University of Singapore (NUS) recently open-sourced NExT-GPT, an “any-to-any” multi-modal large language model (LLM) that can handle text, images, videos, and audio as input or output. NExT-GPT is built on existing pre-trained models and required updating only about 1% of its total parameters during training.
NExT-GPT has a chat-based interface where users can type text or upload image, video, or audio files. The model can understand the content of the input and answer questions about it, or generate text, images, video, or audio in response to user requests. The system is built from open-source pre-trained encoders and decoders, including Vicuna and Stable Diffusion, connected by trainable neural network layers. Those layers are trained with a technique the NExT team devised called Modality-switching Instruction Tuning (MosIT). According to the NExT team:
Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
Multi-modal AI based on LLMs is an active research area. In 2022, InfoQ covered DeepMind’s Flamingo, which combines separately pre-trained vision and language models and can answer questions about input images and videos. Earlier this year, InfoQ covered OpenAI’s GPT-4, which can handle image input, as well as two vision-language models from Microsoft: Visual ChatGPT, which uses ChatGPT to invoke different visual foundation models to perform tasks; and LLaVA, which combines CLIP for vision and LLaMA for language with an additional network layer to tie the two together and is trained end-to-end on visual instruction tuning.
NExT-GPT Architecture (Source: NExT-GPT Project Page)
NExT-GPT’s architecture consists of three tiers: an encoding stage plus linear projection layers, which map the various input modalities into a shared representation space; a Vicuna LLM “core,” which generates tokens, including signal tokens that indicate which output modality to use; and a decoding stage consisting of modality-specific transformer layers plus decoders. Because the encoders, decoders, and Vicuna model are frozen, only about 1% of the model’s total parameters are updated during training, which reduces training cost.
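The article does not reproduce the exact layer shapes, but the general pattern of freezing the pre-trained components and training only small projection layers can be sketched in PyTorch as follows; the encoder and LLM stand-ins and all dimensions here are hypothetical placeholders, not the actual NExT-GPT modules:

```python
import torch
import torch.nn as nn

class ProjectionBridge(nn.Module):
    """Trainable linear projection mapping encoder features into the LLM's embedding space."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

# Hypothetical stand-ins for a pre-trained image encoder and the Vicuna LLM core
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))
llm_core = nn.Linear(4096, 4096)

# Freeze the pre-trained components so they are not updated during training
for module in (image_encoder, llm_core):
    for param in module.parameters():
        param.requires_grad = False

# The projection bridge is the only trainable piece in this sketch
bridge = ProjectionBridge(encoder_dim=1024, llm_dim=4096)

trainable = sum(p.numel() for p in bridge.parameters())
frozen = sum(p.numel() for m in (image_encoder, llm_core) for p in m.parameters())
print(f"Trainable share: {trainable / (trainable + frozen):.1%}")
```

In the real system the frozen encoders, decoders, and LLM are far larger than the projection layers, which is what drives the roughly 1% trainable-parameter figure.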
Like Microsoft’s LLaVA, NExT-GPT is trained using instruction tuning. For NExT-GPT, the NUS team crafted a dataset of example dialogues between a human user and a chatbot. The dialogues contain three to seven turns and cover scenarios with multiple modalities in both the input and output. The team used ChatGPT and other generative AI models, including Midjourney, to help produce the examples. Overall, the dataset contains around 5k dialogues.
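The MosIT data format itself is not shown in the article; purely as an illustration, one multi-turn, multi-modal dialogue might be organized like the following Python structure, with all field names and file names being hypothetical:

```python
# Hypothetical structure for one MosIT-style training dialogue; the keys and
# filenames are illustrative, not taken from the released dataset.
mosit_example = {
    "dialogue_id": "demo-0001",
    "turns": [
        {"role": "human", "text": "Here is a photo of my garden.",
         "attachments": [{"modality": "image", "file": "garden.jpg"}]},
        {"role": "assistant", "text": "It looks like a vegetable garden with tomato plants.",
         "attachments": []},
        {"role": "human", "text": "Generate a short video of it in autumn.",
         "attachments": []},
        {"role": "assistant", "text": "Here is the video you asked for.",
         "attachments": [{"modality": "video", "file": "garden_autumn.mp4"}]},
    ],
}
```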
The researchers evaluated NExT-GPT on several multi-modal generation benchmarks. While it did not achieve state-of-the-art results, the model performed “on par” with baseline models. The team also asked human judges to score the model’s output in several scenarios on a scale of 1 to 10. The judges rated the results for image generation scenarios higher than those for video and audio.
Several users asked technical questions about NExT-GPT in a Hugging Face discussion thread. One user asked how the model generates the modality-signaling tokens. NExT-GPT lead author Shengqiong Wu replied:
[T]he special tokens are generated by LLM when users ask NExT-GPT to show images, videos, or sounds. Actually, during the training, we insert the pre-defined special tokens into the vocabulary of LLM. For more details, please check the code.
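Inserting pre-defined special tokens into an LLM’s vocabulary is a common pattern; with the Hugging Face transformers library it might look like the sketch below, where the token strings and checkpoint name are illustrative rather than taken from the NExT-GPT code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder signal tokens; NExT-GPT's actual token strings are defined in its code.
signal_tokens = ["[IMG]", "[VID]", "[AUD]"]

model_name = "lmsys/vicuna-7b-v1.5"  # example Vicuna checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the tokens in the vocabulary and grow the embedding matrix to match.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": signal_tokens})
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} signal tokens; vocabulary size is now {len(tokenizer)}")
```

During training, the embeddings for these new tokens are learned so the LLM can emit them whenever the user asks for an image, video, or audio output.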
The NExT-GPT code and model files are available on GitHub. An interactive demo is also available on the web.