OpenAI's Video-Generating AI Is "Doomed to Failure," Says Meta's Top AI Scientist

Harsh words from one of AI’s godfathers.

Pixel Imperfect

Sora, OpenAI’s new AI model for generating video, has become the talk of the town since it was unveiled last week. But Yann LeCun, chief AI scientist at Meta, doesn’t think the much-hyped text-to-video model is all that.

In particular, LeCun takes issue with OpenAI’s claims that its work with Sora will eventually enable the building of “general purpose simulators of the physical world.” If that’s the case, LeCun argues, its approach to creating a “world simulator” is dead wrong.

“Modeling the world for action by generating pixels is as wasteful and doomed to failure as the largely-abandoned idea of ‘analysis by synthesis,’” he wrote in a post on X, formerly Twitter.

Generation Complication

LeCun is one of the so-called godfathers of AI — and perhaps the bluntest and most outspoken one at that. While the other two godfathers lament what they’ve unleashed, LeCun has pushed on with his work at Meta, never afraid to criticize his competitors.

With his comments here, he’s referring to an age-old debate in machine learning between generative models and discriminative models. LeCun believes that the former approach, generating pixels “from explanatory latent variables,” is too inefficient, and can’t adequately deal with the uncertainty that arises from these complex predictions in a 3D space.

In layman’s terms, he’s arguing that these models try to “infer” too many details that aren’t relevant. It’s sort of like calculating the trajectory of a soccer ball by trying to understand how every material it’s made of comes into play, instead of just focusing on stuff like its mass and velocity.
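
To put rough numbers on the soccer ball analogy, here is a minimal Python sketch. It is an illustration only, not anything LeCun, Meta or OpenAI has published: the function name `ball_position`, the starting values and the 1080p frame size are all assumptions chosen for the example. It also ignores air resistance, which means the ball's mass drops out entirely and four numbers are enough to describe its flight.

```python
def ball_position(x0, y0, vx, vy, t, g=9.81):
    """Position of a ball t seconds into its flight, ignoring air resistance."""
    x = x0 + vx * t
    y = y0 + vy * t - 0.5 * g * t * t
    return x, y

# Compact description: starting position and velocity (hypothetical values,
# in metres and metres per second).
state = (0.0, 0.0, 10.0, 10.0)
state_size = len(state)

# Pixel description: every colour value in one 1080p RGB frame
# (an assumed resolution, chosen only for comparison).
pixels_per_frame = 1920 * 1080 * 3

x, y = ball_position(*state, t=1.0)
print(f"Ball after 1 second: x={x:.2f} m, y={y:.2f} m")
print(f"Numbers needed: {state_size} for the state vs {pixels_per_frame:,} per frame")
```

The gap between four values and several million per frame is the kind of waste the analogy points at: most of those pixels say nothing about where the ball goes next.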

“There is nothing wrong with that if your purpose is to actually generate videos,” he said in a reply to his post. “But if your purpose is to understand how the world works, it’s a losing proposition.”

The Alternative

LeCun concedes that, by and large, the generative approach has worked with large language models like ChatGPT so far “because text is discrete with a finite number of symbols.” But if you’re going to simulate the world like Sora is supposed to, you’re dealing with far more than a finite set of characters.

Competing with OpenAI’s approach, LeCun has been working on his own model at Meta called the Video Joint Embedding Predictive Architecture (V-JEPA), which was unveiled last week.

“Unlike generative approaches that try to fill in every missing pixel,” Meta claims in a blog post, “V-JEPA has the flexibility to discard unpredictable information, which leads to improved training and sample efficiency by a factor between 1.5x and 6x.”

LeCun’s work may not get all the hype that OpenAI’s products do with their flashy image and text generation, but it is interesting to see such a prominent AI researcher diverging from the same old approaches currently being developed by OpenAI and its slew of imitators.
