AI Telephone: A Battle of Multimodal Models | by Jacob Marks, Ph.D. | Jun, 2023

DALL-E2, Stable Diffusion, BLIP, and more!


Artistic rendering of a game of AI Telephone. Image generated by the author using DALL-E2.

Generative AI is on fire right now. The past few months especially have seen an explosion in multimodal machine learning: AI that connects concepts across different “modalities” such as text, images, and audio. For instance, Midjourney is a multimodal text-to-image model, because it takes in natural language and outputs images. The magnum opus of this recent renaissance in multimodal synergy was Meta AI’s ImageBind, which can take inputs of 6(!) types and represent them in the same “space”.

With all of this excitement, I wanted to put multimodal models to the test and see how good they actually are. In particular, I wanted to answer three questions:

  1. Which text-to-image model is the best?
  2. Which image-to-text model is the best?
  3. What’s more important: image-to-text, or text-to-image?

Of course, each model brings its own biases to the table, from training data to model architecture, so there isn’t really ever one BEST model. But we can still put models to the test in a general context!

To answer these questions, I decided to play a game of AI Telephone, inspired by the board game Telestrations, which my family and I love to play together.

Telestrations is much like the game of telephone: players go around in a circle, taking in communication from the person on one side, and in turn communicating their interpretation to the person on their other side. As the game ensues, the original message is invariably altered, if not lost entirely. Telestrations differs, however, by adding bimodal communication: players alternate between drawing (or illustrating) a description, and describing (in text) a drawing.

Given that I was more interested in comparing models, I adapted the game to suit this purpose.

Here’s how the game of AI Telephone works:

  1. Each “game” will pair up an image-to-text (I2T) model with a text-to-image (T2I) model.
  2. Given an initial prompt, we use the T2I model to generate an image.
  3. We then pass this image into the I2T model to generate a description.
  4. We repeat steps 2 and 3 a fixed number of times n (in our case n=10), feeding each new description back in as the next prompt.
  5. Finally, we quantify the difference between the original prompt and the final description.
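
The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: `fake_t2i` and `fake_i2t` are hypothetical stand-ins rather than real models, and the word-overlap `difference` is a placeholder for the embedding distance used later in the post.

```python
def fake_t2i(text):
    """Stand-in for a text-to-image model: returns a fake image 'URL'."""
    return f"image://{text}"

def fake_i2t(image_url):
    """Stand-in for an image-to-text model: loses the last word each time."""
    words = image_url[len("image://"):].split()
    return " ".join(words[:-1]) if len(words) > 1 else " ".join(words)

def difference(a, b):
    """Placeholder metric: 1 - Jaccard overlap of the two word sets."""
    wa, wb = set(a.split()), set(b.split())
    return 1 - len(wa & wb) / len(wa | wb)

def play_game(prompt, t2i, i2t, n=10):
    texts = [prompt]
    for _ in range(n):
        image = t2i(texts[-1])      # step 2: text -> image
        texts.append(i2t(image))    # step 3: image -> text
    # step 5: compare the original prompt with the final description
    return texts, difference(texts[0], texts[-1])

texts, score = play_game("a red apple on a wooden table", fake_t2i, fake_i2t, n=3)
print(texts[-1], round(score, 3))
```

Because the stand-in captioner drops one word per turn, the score grows with the number of turns, which is the kind of degradation the real experiment tries to measure.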

In this post, I’ll walk you through this entire process, so that you can play AI Telephone too! At the end, I’ll answer the three motivating questions.

Note: This game of AI Telephone is intimately connected with the notion of cycle consistency. By incorporating a cycle consistency term in the loss function during training, models can be incentivized to, effectively, minimize degradation over a game of telephone. To my knowledge, none of the models considered in this experiment were trained with cycle consistency as a consideration.
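
One way to write this connection down, in notation introduced here purely for illustration (it is not taken from any of the models in this experiment): let G be the T2I model, F the I2T model, and d some distance between texts. A game of telephone iterates the composed map, and a cycle consistency term penalizes the drift of a single round trip:

```latex
t_{k+1} = F\big(G(t_k)\big), \qquad k = 0, 1, \dots, n-1

\mathcal{L}_{\text{cyc}} = \mathbb{E}_{t}\Big[\, d\big(t,\; F(G(t))\big) \Big]
```

Under these assumptions, the quantity measured at the end of a game is d(t_0, t_n), the accumulated drift after n round trips.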

The post is structured as follows:

  1. Choosing the Multimodal Models
  2. Generating the Prompts
  3. Creating Telephone Lines
  4. Carrying out the Conversations
  5. Visualizing and Analyzing the Results

All of the code to run this experiment and play AI Telephone can be found here.

To run this code, you will need to install the FiftyOne open source library for dataset curation, the OpenAI Python Library, and the Replicate Python client.

pip install fiftyone openai replicate

Progression of images in a game of AI Telephone between DALL-E2 and BLIP.

The space of multimodal models is huge: at the time of writing, Hugging Face alone has 4,425 T2I models and 155 I2T models. Playing AI Telephone with all of these models, or even a non-negligible fraction of them, would be completely infeasible. My first task was to pare down this space of potential candidates to a more manageable set of competitors.

Choosing APIs

To start this project, I knew that I’d be working with many models. Some of the potential models were quite large, and many required their own environments, with a unique set of requirements. Given that I planned to pair up each T2I model with each I2T model, installing these models locally to play games of AI Telephone presented a potential dependency purgatory, especially because I work on a MacBook Pro M1!

To circumvent this problem, I decided to stick to models that were accessible via APIs. In particular, I chose to primarily use Replicate, whose simple interface allowed me to work with T2I and I2T models in plug-and-play fashion. Almost every model that I used is open source, so if you are braver than I, you can run these models locally and avoid the charges. That being said, in total this experiment cost < $15 USD.

Text-to-Image Models

When selecting T2I models, I chose from the models in Replicate’s Text to image collection. My selection criteria were that the model needed to be cheap, fast, and relatively popular (judged by the number of “runs” of the model on Replicate). Additionally, the model needed to be general purpose, meaning that I wasn’t going to consider outpainting, logo generation, or anime styling models. You are more than welcome to try playing AI Telephone with these types of models if you’d like!

Given these requirements, I chose Stable Diffusion and Feed forward VQGAN CLIP. Initially, I also worked with DALL-E Mini, but in early tests I was disappointed by the model’s performance, so I swapped the model out for OpenAI’s DALL-E2, which I accessed through OpenAI’s image generations endpoint.

As a side note, restricting my attention to API-accessible models meant that I didn’t consider Midjourney. There is no official API, and I didn’t want to use an unofficial API, nor did I want to enter prompts into Discord one by one and download the generated images one at a time.

To make this process as plug-and-play as possible, I took an object oriented approach. I defined a base Text2Image class, which exposes a method generate_image(text):

import replicate

class Text2Image(object):
    """Wrapper for a Text2Image model."""
    def __init__(self):
        self.name = None
        self.model_name = None

    def generate_image(self, text):
        # Run the model on Replicate; some models return a list of image URLs
        response = replicate.run(self.model_name, input={"prompt": text})
        if isinstance(response, list):
            response = response[0]
        return response

For Replicate models, all that’s needed is then setting the model_name attribute, identifying the model on Replicate. For Stable Diffusion, for instance, the class definition looks like this:

class StableDiffusion(Text2Image):
    """Wrapper for a StableDiffusion model."""
    def __init__(self):
        self.name = "stable-diffusion"
        self.model_name = "stability-ai/stable-diffusion:27b93a2413e7f36cd83da926f3656280b2931564ff050bf9575f1fdf9bcd7478"

For other models, such as DALL-E2, the generate_image(text) method can be overridden:

import openai

class DALLE2(Text2Image):
    """Wrapper for a DALL-E 2 model."""
    def __init__(self):
        self.name = "dalle-2"

    def generate_image(self, text):
        # Uses the pre-1.0 interface of the OpenAI Python library
        response = openai.Image.create(
            prompt=text,
        )
        return response['data'][0]['url']

Each of these T2I models returns the URL of the generated image, which we can then pass on to our I2T models.

Image-to-Text Models

I followed a similar process to determine the I2T competitors, evaluating candidates in Replicate’s Image to text collection. After looking at the examples for all of the models in the collection, six models stood out: BLIP, BLIP-2, CLIP prefix captioning, Fine-grained Image Captioning with CLIP Reward, mPLUG-Owl, and MiniGPT-4. Other models were enticing, such as CLIP Interrogator, which tries to reverse engineer a prompt you can then use to generate a similar image. But this felt a bit like cheating as far as AI Telephone was concerned!

Playing around with the six I2T candidates, I was able to quickly eliminate two models from contention: BLIP-2 generated responses that were consistently too short to be useful, and the CLIP Caption Reward model generated responses which were often incoherent.

In direct analogy with the T2I models, I defined a base Image2Text class exposing a generate_text(image_url) method:

class Image2Text(object):
    """Wrapper for an Image2Text model."""
    def __init__(self):
        self.name = None
        self.model_name = None
        self.task_description = "Write a detailed description of this image."

    def generate_text(self, image_url):
        # Ask the model on Replicate to describe the image at image_url
        response = replicate.run(
            self.model_name,
            input={
                "image": image_url,
                "prompt": self.task_description,
            }
        )
        return response

I then created subclasses for each model. Here is what the BLIP subclass looks like:

class BLIP(Image2Text):
    """Wrapper for a BLIP model."""
    def __init__(self):
        # Call the base initializer so that task_description is set
        super().__init__()
        self.name = "blip"
        self.model_name = "salesforce/blip:2e1dddc8621f72155f24cf2e0adbde548458d3cab9f00c0139eea840d0ac4746"

All of the models are instantiated with the same task description: to “write a detailed description of this image”.

Progression of images in a game of AI Telephone between DALL-E2 and mPLUG-Owl.

To be as “scientific” as possible, I thought it best not to generate the initial prompts myself. Instead (and just for fun), I outsourced the task to ChatGPT. I asked:

I am playing a game of telephone using text-to-image and image-to-text AI models.
I want to evaluate these models based on their ability to retain complex semantic
information over the course of long conversations. Your job is to give me 10 text
prompts that I can use to run these games of telephone. You must give me one 3
easy, 3 medium, 3 hard, and 1 ultra-hard (“impossible”) prompt

Here are some of the prompts ChatGPT generated:

Easy:

"A red apple sitting on a wooden table with sunlight streaming in from a window."

Medium:

"An astronaut floating in the International Space Station, looking out at Earth through the window, with a space capsule docked in the background."

Hard:

"A bustling marketplace in an ancient Middle Eastern city. Traders haggling over spices and silks, camels carrying goods, the sun setting behind a mosque with a crescent moon visible."

Impossible:

"A panoramic scene of an advanced alien civilization on a distant exoplanet. Interstellar vehicles flying in an indigo sky above towering crystalline structures. Aliens with varying physical features are interacting, engaging in activities like exchanging energy orbs, communicating through light patterns, and tending to exotic, bio-luminescent flora. The planet’s twin moons are visible in the horizon over a glistening alien ocean."

A more rigorous scientific approach would be much more intentional with the prompts used, as well as their categorization.

I then took the text prompts generated by ChatGPT and constructed Prompt objects, which contained the text for the prompt, and the “level” of difficulty assigned by ChatGPT:

class Prompt(object):
    def __init__(self, text, level):
        self.text = text
        self.level = level

levels = ["easy", "medium", "hard", "impossible"]
# Each *_texts variable is a list of prompt strings for that level
level_prompts = [easy_texts, medium_texts, hard_texts, impossible_texts]

def get_prompts():
    prompts = []
    for level, texts in zip(levels, level_prompts):
        for text in texts:
            prompts.append(Prompt(text, level))
    return prompts

Progression of images in a game of AI Telephone between VQGAN-CLIP and MiniGPT-4.

The final component to playing AI Telephone was the “telephone line” itself. I created a TelephoneLine class to encapsulate the connection between a T2I model and an I2T model. Given a single telephone line, a “game” of telephone is played by calling play(prompt, nturns=10), where the conversation evolves from prompt, and runs for nturns back-and-forth turns.

import os
import hashlib
import fiftyone as fo
from fiftyone import ViewField as F

class TelephoneLine(object):
    """Class for playing telephone with AI."""
    def __init__(self, t2i, i2t):
        self.t2i = t2i
        self.i2t = i2t
        self.name = f"{t2i.name}_{i2t.name}"
        self.conversations = {}

    def get_conversation_name(self, text):
        full_name = f"{self.name}{text}"
        hashed_name = hashlib.md5(full_name.encode())
        return hashed_name.hexdigest()[:6]

    def play(self, prompt, nturns = 10):
        """Play a game of telephone."""
        print(f"Connecting {self.t2i.name} <-> {self.i2t.name} with prompt: {prompt.text[:20]}...")
        texts = [prompt.text]
        image_urls = []

        for _ in range(nturns):
            image_url = self.t2i.generate_image(texts[-1])
            text = self.i2t.generate_text(image_url)
            texts.append(text)
            # Keep the image URL so the image can be downloaded and saved later
            image_urls.append(image_url)

        conversation_name = self.get_conversation_name(prompt.text)
        self.conversations[conversation_name] = {
            "texts": texts,
            "image_urls": image_urls,
            "level": prompt.level
        }

For each game played, the conversation is logged with a unique name, generated by hashing the T2I model name, I2T model name, and the prompt text (get_conversation_name() method).
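
As a standalone illustration of this naming scheme, the sketch below mirrors get_conversation_name() as a plain function. The line names and prompt text are made-up inputs.

```python
import hashlib

def conversation_name(line_name, prompt_text):
    """First 6 hex characters of the MD5 digest of line name + prompt text."""
    full_name = f"{line_name}{prompt_text}"
    return hashlib.md5(full_name.encode()).hexdigest()[:6]

# The same inputs always give the same name, so a game can be looked up later
name_a = conversation_name("dalle-2_blip", "A red apple")
name_b = conversation_name("dalle-2_blip", "A red apple")
# A different telephone line with the same prompt gets its own name
name_c = conversation_name("stable-diffusion_blip", "A red apple")
print(name_a, name_b, name_c)
```

Six hex characters give 16^6 (about 16.8 million) possible names, so collisions are unlikely across the 120 conversations here, though not impossible.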

I also equipped the class with a save_conversations_to_dataset() method, which saves the images and descriptions from all games played on the telephone line to a FiftyOne Dataset:

    def save_conversations_to_dataset(self, dataset):
        """Save conversations to a dataset."""
        for conversation_name in self.conversations.keys():
            conversation = self.conversations[conversation_name]
            prompt = conversation["texts"][0]
            level = conversation["level"]
            image_urls = conversation["image_urls"]
            texts = conversation["texts"]

            for i in range(len(image_urls)):
                filename = f"{conversation_name}_{i}.jpg"
                filepath = os.path.join(IMAGES_DIR, filename)
                download_image(image_urls[i], filepath)

                # One sample per step: the image plus the text on either side of it
                sample = fo.Sample(
                    filepath = filepath,
                    conversation_name = conversation_name,
                    prompt = prompt,
                    level = level,
                    t2i_model = self.t2i.name,
                    i2t_model = self.i2t.name,
                    step_number = i,
                    text_before = texts[i],
                    text_after = texts[i+1]
                )
                dataset.add_sample(sample)
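
The method above relies on a download_image() helper and an IMAGES_DIR constant that are not shown in this post. A minimal sketch of what they might look like, using only the standard library, follows. Both the function body and the directory name are assumptions, not the original code.

```python
import os
import shutil
import urllib.request

IMAGES_DIR = "telephone_images"  # assumed location; any writable directory works

def download_image(image_url, filepath):
    """Fetch image_url and write the raw bytes to filepath."""
    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
    with urllib.request.urlopen(image_url) as response, open(filepath, "wb") as f:
        shutil.copyfileobj(response, f)
    return filepath
```

Downloading matters here because the generated image URLs are typically temporary, while the FiftyOne samples point at files on disk.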


Progression of images in a game of AI Telephone between Stable Diffusion and CLIP Prefix Captioning.

With all of the building blocks in place, playing AI Telephone is child’s play!

We can instantiate T2I and I2T models:

## Image2Text models
mplug_owl = MPLUGOwl()
blip = BLIP()
clip_prefix = CLIPPrefix()
mini_gpt4 = MiniGPT4()
image2text_models = [mplug_owl, blip, clip_prefix, mini_gpt4]

## Text2Image models
vqgan_clip = VQGANCLIP()
sd = StableDiffusion()
dalle2 = DALLE2()
text2image_models = [sd, dalle2, vqgan_clip]

And then create a telephone line for each pair:

combos = [(t2i, i2t) for t2i in text2image_models for i2t in image2text_models]
lines = [TelephoneLine(*combo) for combo in combos]
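
As a quick sanity check on the size of the experiment, the pairing can be reproduced with plain strings. The labels below are illustrative stand-ins for the model objects.

```python
from itertools import product

t2i_names = ["stable-diffusion", "dalle-2", "vqgan-clip"]
i2t_names = ["mplug-owl", "blip", "clip-prefix", "minigpt-4"]

# Every T2I model is paired with every I2T model
pairs = list(product(t2i_names, i2t_names))
n_prompts = 10
n_games = len(pairs) * n_prompts
print(len(pairs), n_games)
```

Three T2I models times four I2T models gives 12 telephone lines, and with 10 prompts each that is 120 games.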

We then load in our prompts:

prompts = get_prompts()

And create a FiftyOne Dataset which we will use to store the generated images and all relevant information from the conversations:

import fiftyone as fo

dataset = fo.Dataset(name = 'telephone', persistent=True)
dataset.add_sample_field("conversation_name", fo.StringField)
dataset.add_sample_field("prompt", fo.StringField)
dataset.add_sample_field("level", fo.StringField)
dataset.add_sample_field("t2i_model", fo.StringField)
dataset.add_sample_field("i2t_model", fo.StringField)
dataset.add_sample_field("step_number", fo.IntField)
dataset.add_sample_field("text_before", fo.StringField)
dataset.add_sample_field("text_after", fo.StringField)

We can then run all 120 games of telephone:

from tqdm import tqdm

for line in tqdm(lines):
    for prompt in prompts:
        line.play(prompt, nturns = 10)
    # Store this line's images and descriptions in the dataset
    line.save_conversations_to_dataset(dataset)

session = fo.launch_app(dataset)

In the FiftyOne App, click on the splitting icon in the menu bar to group images by conversation, select conversation_name from the dropdown, then toggle the selector to ordered and select step_number.

To assess the quality of a conversation, purely in terms of how closely the meaning of the final description approximated the meaning of the initial prompt, I decided to generate embeddings for the prompts and descriptions, and compute the cosine distance (in [0, 2]) between the two.

from scipy.spatial.distance import cosine as cosine_distance
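
For reference, cosine distance is one minus the cosine of the angle between two vectors. The pure-Python sketch below (the name cosine_dist is hypothetical, and the real analysis uses SciPy's implementation) shows where the [0, 2] range comes from.

```python
import math

def cosine_dist(u, v):
    """1 - cos(angle between u and v); lies in [0, 2] for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

same = cosine_dist([1, 2, 3], [2, 4, 6])   # same direction: distance near 0
orthogonal = cosine_dist([1, 0], [0, 1])   # unrelated directions: distance 1
opposite = cosine_dist([1, 0], [-1, 0])    # opposite directions: distance 2
print(same, orthogonal, opposite)
```

Only the direction of each embedding matters, not its length, which is why scaling a vector leaves the distance unchanged.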

For an embedding model, I wanted a model that could embed both text and images, given the multimodal nature of the exercise. I ended up choosing to use ImageBind for three reasons:

  1. Other popular joint image-text embedding models like CLIP and BLIP are related to some of the models I used in the experiment (BLIP and CLIP prefix captioning), and I wanted to avoid any possible biases from using the same types of models for evaluation.
  2. Many text embedding models have a small max_token_count, the maximum number of tokens allowed in a text to be embedded. CLIP, for instance, has max_token_count=77. Some of our descriptions are significantly longer than this. Fortunately, ImageBind has a much longer maximum token count.
  3. I’d been meaning to try ImageBind, and this was a great opportunity!

I wrapped Replicate’s ImageBind API in a function embed_text(text):

import numpy as np

MODEL_NAME = "daanelson/imagebind:0383f62e173dc821ec52663ed22a076d9c970549c209666ac3db181618b7a304"
def embed_text(text):
    response = replicate.run(
        MODEL_NAME,
        input={"text_input": text, "modality": "text"},
    )
    return np.array(response)

To avoid redundant computations, I hashed the prompts and stored the prompt embeddings in a dictionary. This way, instead of embedding each prompt for each of the 12 telephone lines, we only need to embed each once:

import hashlib
def hash_prompt(prompt):
    return hashlib.md5(prompt.encode()).hexdigest()[:6]

### Embed initial prompts
prompt_embeddings = {}
dataset.add_sample_field("prompt_hash", fo.StringField)

## Group samples by initial prompt
## Add hash to all samples in group
prompt_groups = dataset.group_by("prompt")
for pg in prompt_groups.iter_dynamic_groups():
    prompt = pg.first().prompt
    prompt_hash = hash_prompt(prompt)
    prompt_embeddings[prompt_hash] = embed_text(prompt)
    view = pg.set_field("prompt_hash", prompt_hash)
    view.save("prompt_hash")
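
The caching idea can be seen in isolation in the sketch below. Here fake_embed is a hypothetical stand-in for the paid embedding API call, used only to count how many times an embedding is actually computed.

```python
import hashlib

def hash_prompt(prompt):
    return hashlib.md5(prompt.encode()).hexdigest()[:6]

calls = {"n": 0}

def fake_embed(text):
    """Stand-in for the embedding API: counts calls and returns a toy vector."""
    calls["n"] += 1
    return [float(len(text)), float(text.count(" "))]

embedding_cache = {}

def cached_embed(prompt):
    """Embed a prompt once; later requests reuse the stored embedding."""
    key = hash_prompt(prompt)
    if key not in embedding_cache:
        embedding_cache[key] = fake_embed(prompt)
    return embedding_cache[key]

sample_prompts = ["a red apple", "an astronaut"]
for _line in range(12):            # 12 telephone lines share the same prompts
    for p in sample_prompts:
        cached_embed(p)
n_calls = calls["n"]
print(n_calls)
```

With the cache in place, the number of embedding calls scales with the number of distinct prompts rather than with prompts times telephone lines.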

We can then group samples by conversation name, iterate through these groups, compute the text embedding for each step, and record the cosine distance (smaller is better!) between the text embedding and the initial prompt embedding:

dataset.add_sample_field("text_after_dist", fo.FloatField)

conversation_groups = dataset.group_by("conversation_name")
for cg in conversation_groups.iter_dynamic_groups(progress=True):
    prompt_hash = cg.first().prompt_hash
    prompt_embedding = prompt_embeddings[prompt_hash]

    ordered_samples = cg.sort_by("step_number")
    for sample in ordered_samples.iter_samples(autosave=True):
        text_embedding = embed_text(sample.text_after)
        sample["text_embedding"] = text_embedding
        sample.text_after_dist = cosine_distance(
            prompt_embedding,
            text_embedding
        )

I then computed the average scores for each T2I-I2T pair across all prompts at a certain level of difficulty and plotted the results. In each of the videos, the I2T and T2I models are printed on the generated images, as well as the text used to generate that image (red), and the description generated from that image (green).
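
The averaging step amounts to a group-by over (T2I model, I2T model, difficulty level). A minimal sketch follows. The records are made-up numbers for illustration only, not results from this experiment.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (t2i_model, i2t_model, level, cosine distance)
records = [
    ("dalle-2", "blip", "easy", 0.25),
    ("dalle-2", "blip", "easy", 0.75),
    ("dalle-2", "blip", "hard", 1.0),
    ("vqgan-clip", "blip", "easy", 1.5),
]

# Collect the distances for each (pair, level) combination
grouped = defaultdict(list)
for t2i, i2t, level, dist in records:
    grouped[(t2i, i2t, level)].append(dist)

avg_scores = {key: mean(dists) for key, dists in grouped.items()}
print(avg_scores)
```

Repeating this per step number, rather than only for the final step, gives one curve per model pair at each difficulty level.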


Easy

For easy prompts, performance tends to depend most strongly on the text-to-image model. DALL-E2 and Stable Diffusion dramatically outperform VQGAN-CLIP. MiniGPT-4 is a member of both of the top-performing pairs.

Here are some examples for the easy prompt introduced above:

AI Telephone for an easy prompt, with pairs of text-to-image and image-to-text models.

In the games with MiniGPT-4 (and to a slightly lesser extent BLIP), the apple remains front and center, whereas for games involving CLIP Prefix, the apple gets phased out over time.


Medium

When the prompts become a bit more difficult, the situation starts to change.

AI Telephone for a medium difficulty prompt, with pairs of text-to-image and image-to-text models.

For nearly all of the games, the subject changes somewhere around the fourth or fifth step. Early on, MiniGPT-4 holds an advantage. But by the end of the game, that advantage seems to have been entirely lost.


Hard

By the time the prompts become challenging, we start to see something interesting: for early steps, the image-to-text model is most important (MiniGPT-4 is best, and CLIP Prefix is for the most part the worst). By later stages, however, the text-to-image model becomes most important. And to complicate the situation further, VQGAN-CLIP is best here!

One might worry that “better” could just mean that consistency is maintained, without accurately representing the original concept. However, when we look at examples, we can see that this is not the case.

AI Telephone for a hard prompt, with pairs of text-to-image and image-to-text models.

Take the example highlighted in the video, where the initial prompt is the “hard” prompt introduced above concerning a “bustling marketplace”. While the images generated by VQGAN-CLIP are without a doubt grainy, the subject can still be made out, and matches the original prompt fairly closely.

Impossible

Unsurprisingly, none of our competitors do terribly well here. One might argue that VQGAN-CLIP is the winner. But for the most part, this is all just noise. In the video, even for games involving VQGAN-CLIP, the subject is effectively unrecognizable.

AI Telephone for an “impossible” prompt, with pairs of text-to-image and image-to-text models.

This exploration was far from scientific: I only looked at ten prompts, without true validation of their difficulty level; I only ran the conversations out to ten back-and-forth steps; and I only evaluated performance on one metric.

It’s clear that which T2I and I2T models fare best depends largely on the complexity of the prompt, and how long you want to keep the models talking. Still, it’s worth noting a few key observations:

  1. VQGAN-CLIP may fare better for more challenging prompts, but this doesn’t mean it is a better T2I model. The images produced by VQGAN-CLIP are often far less coherent and globally consistent than those produced by Stable Diffusion or DALL-E2.
  2. The analysis above is all about semantic similarity; it doesn’t take style into account. The style of these images can change a ton over the course of a game of AI Telephone. Anecdotally, I found that the style is much more consistent for I2T models like mPLUG-Owl, which give long descriptions, than for models like BLIP, whose descriptions are more subject focused.
  3. By around 5 to 6 iterations, the games had mostly converged to stable equilibria.
  4. Even though the embedding model, ImageBind, was multimodal, the distances between consecutive image embeddings and text embeddings were far greater than the distances between consecutive images or consecutive descriptions. In general, they followed the same trends, but in less pronounced fashion, which is why I didn’t include these in the plots.

I hope this inspires you to run your own experiments with generative AI, whether you’re playing AI Telephone, or doing something else entirely!

If you try out a variation of this and get interesting results, comment on this post!
