
System 2 AI: Learning to Move with Vayu Robotics’ LLM


Multimodal Large Language Models (MLLMs) are now powering robotic navigation systems, and they’re compact enough to run at 10 frames per second on the edge. Vayu Robotics is building one to power autonomous delivery robots, with plans to expand its use across the full gamut of autonomous robots and vehicles.

The advent of LLMs brought a slew of use cases for enterprise and media-related tasks. By combining them with sensor arrays, synthetic data, and cutting-edge RAG-like techniques, Vayu Robotics has fashioned them into a high-powered operating system for robotic perception, reasoning, and navigation.

As a mark of validation, the Palo Alto-based startup just signed a deal with an e-commerce company to deploy 2,500 of its delivery robots.

“The goal for Vayu is to be the force to drive all the machines in the world,” Nitish Srivastava, co-founder and CTO of Vayu Robotics, told StartupHub.ai in an exclusive interview. “The reason why foundation models work here is because it’s largely a data problem.”

The corner cases of data

One area where traditional neural networks struggle is the ‘corner case problem.’ Srivastava describes it as ‘the long tail problem.’ While deep learning can manage most scenarios—navigating from point A to point B—it’s the “long tail” of rare, complex situations that pose the greatest challenges. “The hardness lies in the tail,” he says, referring to unexpected events that are difficult to predict and classify.

Vayu tackles this challenge by leveraging foundation models that can generalize well across diverse scenarios. Srivastava explains, “If you wanted to build a really good text summarization model, you’d take an LLM trained to do next token prediction on all the data in the world, and it would outperform a model trained solely for that task. Similarly, a mobility foundation model trained on multiple navigation domains and robot form factors will do better at navigating a car because it has a better general representation space; it understands how to move around things. Foundation models will give us the power to solve all robotics problems.”

At a high level, the focus is on understanding what’s happening around the robot. This is “System 1,” which involves quick, instinctive decisions such as recognizing obstacles. The more complex aspect, “System 2,” deals with deeper reasoning and problem-solving, like determining how to navigate around unexpected obstacles or how objects in the environment might behave. Srivastava notes, “Our mobility foundation model is solving System 2 problems, like identifying a new object in front of you and predicting its movement and behavior.”
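To make the division of labor concrete, here is a minimal sketch of how a fast reactive loop and a slower reasoning loop might be paired, with System 1 running on every frame and System 2 consulted only periodically. All function names, frequencies, and return values are illustrative assumptions, not Vayu’s actual software.

```python
# Minimal sketch of a two-tier "System 1 / System 2" control loop.
# All names here are hypothetical placeholders, not Vayu's actual API.
import time
import random

def system1_reactive_step(frame):
    """Fast, instinctive layer: a cheap obstacle check runs on every frame."""
    obstacle_near = random.random() < 0.1            # stand-in for a lightweight detector
    return "brake" if obstacle_near else "cruise"

def system2_reasoning_step(frame, context):
    """Slow, deliberative layer: an MLLM-style call that reasons about novel
    objects and predicts their behavior. Stubbed out for illustration."""
    return {"plan": "steer_around", "confidence": 0.8}

def control_loop(frames, fps=10, reason_every=5):
    budget = 1.0 / fps                               # e.g. 100 ms per frame at 10 fps
    plan = {"plan": "cruise", "confidence": 1.0}
    for i, frame in enumerate(frames):
        start = time.time()
        action = system1_reactive_step(frame)        # runs every frame
        if i % reason_every == 0:                    # System 2 consulted less often
            plan = system2_reasoning_step(frame, context=action)
        print(f"frame {i}: system1={action}, system2_plan={plan['plan']}")
        time.sleep(max(0.0, budget - (time.time() - start)))

if __name__ == "__main__":
    control_loop(frames=range(10))
```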

500 miles per GPU per day

Whereas developing such models has typically cost the industry upwards of $250 million, Vayu has adopted a more efficient, cost-effective strategy. “These new LLMs, like Meta Llama 3.1 405B, are open source and serve as building blocks for the startup community,” says Srivastava. Rather than using these LLMs as-is, Vayu applies its ingenuity to generating synthetic data with them and compacting them into smaller form factors that meet the demands of high-speed motion inference.
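One common recipe for that kind of compaction is knowledge distillation, in which a small student network learns to match the softened outputs of a much larger teacher. The sketch below illustrates the general pattern with toy stand-in networks; it is an assumption about the technique in general, not Vayu’s architecture or training code.

```python
# Minimal knowledge-distillation sketch (one common way to "compact" a large
# model for edge inference). Teacher and student are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 32))  # large, frozen
student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 32))    # small, trainable
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0  # softmax temperature for "soft" targets

for step in range(100):
    x = torch.randn(64, 512)                        # stand-in for fused sensor features
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    # KL divergence between softened teacher and student distributions
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```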

“We’re currently fine-tuning as well as training some specific parts of the mobility foundation model from scratch,” revealed Srivastava. “This model builds on large vision-language models (VLMs) that have already distilled vast amounts of information from the internet. While these VLMs are proficient at ‘seeing’ and understanding what’s in front of them, our focus is on the ‘learning to move’ problem in robotics. Essentially, we’re developing a layer that integrates this visual understanding with real-world motion, geometry, and actuation. This requires fine-tuning the existing models to prioritize relevant information, like immediate surroundings and distances, while adding new architectural components to enhance robotic perception and movement.”
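One generic way to read that description is a pretrained VLM backbone, kept frozen or lightly fine-tuned, with a new trainable head that fuses its visual embeddings with ego state to produce motion commands. The sketch below uses toy modules and made-up dimensions to show that pattern; it illustrates the general idea, not Vayu’s actual model.

```python
# Sketch of a "learning to move" layer on top of a frozen vision-language
# backbone. The backbone is a toy stand-in; the action head is one common
# pattern, not Vayu's actual architecture.
import torch
import torch.nn as nn

class ToyVLMBackbone(nn.Module):
    """Stand-in for a pretrained VLM that emits per-frame embeddings."""
    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.Linear(3 * 64 * 64, dim)
    def forward(self, images):
        return self.encoder(images.flatten(1))

class ActionHead(nn.Module):
    """New trainable layer mapping visual embeddings + ego state to controls."""
    def __init__(self, dim=768, state_dim=8, actions=2):   # e.g. steering, speed
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim + state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, actions))
    def forward(self, embedding, ego_state):
        return self.mlp(torch.cat([embedding, ego_state], dim=-1))

backbone, head = ToyVLMBackbone(), ActionHead()
for p in backbone.parameters():
    p.requires_grad = False                                 # keep the VLM's knowledge frozen

images, ego = torch.randn(4, 3, 64, 64), torch.randn(4, 8)
with torch.no_grad():
    emb = backbone(images)
controls = head(emb, ego)                                   # -> (4, 2) steering/speed commands
print(controls.shape)
```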

Equipped with a proprietary plenoptic sensor, Vayu Sense integrates AI, cameras, and advanced photon sensing technology to deliver precise 3D perception. Unlike traditional lidar or radar, it uses a combination of CMOS imaging and photon-level data processing to excel in low light, adverse weather, and complex environments. The system includes an SDK for developers, allowing customization and integration into various robotics and AI applications, making it a versatile and scalable solution for next-gen autonomous systems. Credit: Vayu Robotics.

Additionally, Vayu’s innovations in sensor technology are designed to handle the limitations of traditional cameras, such as glare, high-dynamic-range scenes, and materials like glass that cameras struggle to see. By enhancing the capabilities of their sensors, Vayu ensures that their robots can operate effectively even in complex environments with varied lighting conditions and reflective surfaces.

A key innovation for Vayu is the generation of synthetic data to train their models. “We are generating about 500 miles per GPU per day right now,” Srivastava notes, “which scales much faster than anybody else can generate by just driving around physically in the world. But it’s not just about the number of miles—it’s about the quality of the data. We generate data where something interesting can happen, like a car door being opened, a child running in front of the vehicle, or a trash can placed in a difficult spot. This type of data is incredibly useful for training the model and providing valuable gradients, rather than just collecting empty miles where nothing is happening.”
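In practice, that amounts to weighting scenario generation toward events that produce useful training signal. The toy sampler below illustrates the idea; the event names echo the quote, while the weights and the sampler itself are hypothetical.

```python
# Illustrative sketch of event-weighted scenario sampling: bias synthetic
# miles toward situations worth learning from rather than "empty miles".
import random

SCENARIOS = {
    "empty_road":            0.05,   # low weight: few useful gradients
    "car_door_opens":        0.30,
    "child_runs_into_road":  0.35,
    "trash_can_blocks_path": 0.30,
}

def sample_scenarios(n):
    names = list(SCENARIOS)
    weights = [SCENARIOS[name] for name in names]
    return random.choices(names, weights=weights, k=n)

if __name__ == "__main__":
    batch = sample_scenarios(1000)
    for name in SCENARIOS:
        print(f"{name:24s} {batch.count(name):4d} episodes")
```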

Vayu has developed techniques to transfer policies learned in synthetic environments to real-world applications. “Synthetic data isn’t always realistic, but we’ve figured out how to transfer it into real-world contexts. We can generate high-quality synthetic data for corner cases, and then use real-world data to align and validate the models at the final stage,” Srivastava explains.
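A generic version of that recipe is to pretrain a policy on plentiful synthetic episodes and then align it on a small real-world set at a much lower learning rate. The sketch below shows that two-stage pattern with random toy data; it is not Vayu’s actual pipeline.

```python
# Generic two-stage sim-to-real sketch: pretrain on synthetic data, then
# align on a small real-world set at a lower learning rate.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
loss_fn = nn.MSELoss()

def train(dataset, lr, epochs):
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for features, target_controls in dataset:
            opt.zero_grad()
            loss_fn(policy(features), target_controls).backward()
            opt.step()

# Stage 1: large synthetic corpus (cheap, covers corner cases)
synthetic = [(torch.randn(32, 64), torch.randn(32, 2)) for _ in range(200)]
train(synthetic, lr=1e-3, epochs=3)

# Stage 2: small real-world set for alignment and validation at the final stage
real = [(torch.randn(32, 64), torch.randn(32, 2)) for _ in range(10)]
train(real, lr=1e-5, epochs=5)
```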

Ultimately, Vayu’s foundation model, coupled with their advanced sensor technology, offers a versatile solution for robotics and autonomous vehicles. “A model that’s trained to do everything… will probably do better at solving even that one problem because it just has a much more general representation space.”

The startup’s roots in deep tech are second to none: its founders, Nitish Srivastava, Mahesh Krishnamurthi, and Anand Gopalan, bring expertise from Apple’s perception unit and from leading Velodyne. The team also brought Geoffrey Hinton, Srivastava’s PhD advisor and one of the godfathers of deep learning, onto its advisory board.

According to Bloomberg Intelligence, the market for Generative AI Computer Vision could reach $22 billion by 2027 and $61 billion by 2032. Since that research predates the rise of vision LLMs, the true market potential is likely even greater.

Only this year has the AI community built a benchmark for vision LLMs on LMSYS, and new models and benchmark results are surfacing at a dizzying pace.

Nvidia’s CEO and co-founder, Jensen Huang, is also a strong advocate for the future of robotics, particularly humanoids. At GTC 2024, the company introduced new tools, including NIM inference microservices and simulation tooling that lets developers train robots in virtual 3D worlds before transferring to real-world tasks. NVIDIA’s Osmo platform, which orchestrates workloads such as synthetic data generation and reinforcement learning, supports these advancements.

Embodied Chain-of-Thought and Humanoids

This is more than just a moment for LLMs: the developer community and academia are racing to refine how vision inputs are processed for autonomous motion. A key technique emerging from this effort is Embodied Chain-of-Thought Reasoning (ECoT). Much as Retrieval Augmented Generation (RAG) curbs hallucinations in language models, ECoT improves robotic decision-making by integrating task understanding with real-time perception of the environment. Unlike traditional models that map visual inputs directly to actions, ECoT lets robots break a task into smaller steps and reason through each one within the context of their surroundings. It’s especially useful in complex or unfamiliar situations, and it is baked into Vayu’s foundation model.
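Conceptually, an ECoT step asks the model for intermediate reasoning, such as the current task, a short plan, and grounded objects, before a low-level command is extracted. The sketch below stubs out the model call to show the shape of such a rollout; the prompt fields and JSON schema are illustrative assumptions, not Vayu’s implementation.

```python
# Sketch of an embodied chain-of-thought (ECoT) style rollout: the model is
# asked for intermediate reasoning (task, plan, object grounding) before the
# final low-level action. The fields and the stubbed model call are
# illustrative only.
import json

ECOT_PROMPT = """You control a sidewalk delivery robot.
Scene: {scene_description}
Reason step by step, then answer in JSON with keys:
  task, plan, visible_objects, next_action
"""

def query_vlm(prompt):
    # Stand-in for a real multimodal model call.
    return json.dumps({
        "task": "deliver package to 4th Ave",
        "plan": ["follow sidewalk", "yield to pedestrian", "resume route"],
        "visible_objects": [{"name": "pedestrian", "bearing_deg": -15, "moving": True}],
        "next_action": {"steer": 0.1, "speed_mps": 0.5},
    })

def ecot_step(scene_description):
    raw = query_vlm(ECOT_PROMPT.format(scene_description=scene_description))
    reasoning = json.loads(raw)
    return reasoning["next_action"], reasoning   # action plus an auditable rationale

if __name__ == "__main__":
    action, rationale = ecot_step("pedestrian crossing ahead, wet pavement")
    print(action, rationale["plan"])
```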

Srivastava did admit that while promising, ECoT currently achieves around “70% accuracy,” suggesting there’s still room for improvement.

To that end, the startup’s fine-tuned foundation model is a compact distillation of multiple leading models. With further gains in inference speed, both from Vayu’s own engineering and from hardware such as Groq’s latest LPU, Srivastava expects the model to power navigation at 30 frames per second in the near future. In other words, fast enough to serve autonomous vehicles.
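As a back-of-envelope check, those frame rates translate directly into a per-frame compute budget for the entire perception-to-control stack:

```python
# Rough per-frame latency budget implied by the frame rates quoted above.
for fps in (10, 30):
    print(f"{fps} fps -> {1000 / fps:.1f} ms per frame for perception, reasoning, and control")
```

At 10 fps the stack has roughly 100 ms per frame; at 30 fps the budget shrinks to about 33 ms, which is why both model compaction and faster inference hardware matter.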

Another, more immediate use case for Vayu’s robotic foundation model is powering the humanoids slated for commercial release in 2025. Names like 1X, Boston Dynamics, Field AI, Figure, Fourier, Galbot, LimX Dynamics, Mentee Robotics, Neura Robotics, RobotEra, and Skild AI make up the current roster of humanoid developers, some of which are experimenting with Nvidia’s humanoid-dedicated foundation model, GR00T, announced in March this year. But Srivastava believes there are more immediate needs for robotics in other form factors.

“It’s been an amazing journey to start from somewhere where people are trying to classify on these digits, to right now, just solving all the problems in the world in a single model.”

Vayu Robotics is backed by $12.7 million in seed funding from Khosla Ventures, Lockheed Martin Ventures, and ReMY Investors, among others. With their foundation model, Vayu is poised to push beyond last-mile delivery robots into other cutting-edge applications. Looking further ahead, under Vayu’s guidance, LLMs may soon extend beyond Earth’s roads.

