This week, OpenAI released a surprisingly flirty new large language model (LLM) called GPT-4o (that's a lowercase "o," for "omni") that it claims can react to live video from a smartphone's camera, lending its demos a fluid sense of a chatbot present in the world in a way previous attempts haven't managed.
The new AI, which will power the upcoming free version of ChatGPT, sounds astonishingly natural in the demos, making off-the-cuff quips, snarky comments, and otherwise human-like noises that inject plenty of emotion into its outputs.
But the illusion is far from perfect. In one demo that involves a tester introducing GPT-4o to his adorable dog, the AI responds by making a bizarre and unnervingly robotic series of screeches that coalesce into the kind of “aww” a real human might make when meeting a furry friend.
In other words, it’s as if an AI is trying its best to pass as a human — but not quite getting there.
“I want to introduce you to somebody,” the tester tells ChatGPT.
“Whh-ell hee-eekkhhh-ello there cutie,” the AI responds, sounding more natural after it gets past the more emotive part of the phrase.
The company claims its latest version of ChatGPT will be “much faster” thanks to a new model that was trained “end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.”
A big part of its Monday announcement is ChatGPT's new "video mode," which allows it to "see" through the user's webcam or smartphone camera, giving it the ability to react to the physical world around it in real time.
The demo was impressive, but it didn't quite sell the illusion. Beyond failing to convincingly emulate a human's reaction to a cute dog, the new LLM also struggled to generate outputs that were "dripping with sarcasm," for instance.
"Oooh, that just sounds amazing," ChatGPT responded, in a heavy tone that could generously be described as sarcastic. "I'm sooo excited for this, huh."
A different demo involving ChatGPT whispering was even more terrifying. There’s something deeply unnerving about an AI that resembles an overenthusiastic high school drama teacher lowering its voice and quietly singing a lullaby.
"The model is able to generate voice in a variety of different emotive styles," OpenAI researcher Mark Chen explained during Monday's presentation, "and it really has a wide dynamic range."
In a demo designed to highlight the new AI’s “voice variation,” ChatGPT read a bedtime story in a highly exaggerated tone reminiscent of a kindergarten teacher. It was even able to tell the story while singing.
In short, it shouldn't come as a surprise that OpenAI CEO Sam Altman cited the 2013 sci-fi film "Her" as a source of inspiration, with ChatGPT sounding more like Scarlett Johansson's disembodied voice than ever before.
But the mirage of a cheerful human assistant trapped inside a smartphone isn’t foolproof just yet, as the “dog meets GPT-4o” demo shows. Voice assistants are still stuck deep in the uncanny valley, and it’s not clear when they’ll reach the other side.
Needless to say, once the tool finally lands in the hands of users outside of OpenAI — the company has been vague about exactly when that'll be — we're expecting plenty more emotional AI weirdness to come.
More on the new ChatGPT: New ChatGPT AI Watches Through Your Camera, Offers Advice on What It Sees