
Towards Agentic AI Workflows with New LLM Architectures


We’re approaching a new phase in the AI ecosystem, one not without uncertainty and debate. What does seem certain is that over the next several months we’ll see the introduction of Agentic AI workflows that can actually deliver on the hyperbolic proclamations pundits have been bending over backwards to make.

Indeed, the Agentic AI ecosystem is slowly taking shape, paving a smoother road to enterprise adoption. Frameworks and developer tools, like LangChain, are gaining momentum and acceptance, allowing developers to build workflows where Large Language Models (LLMs) and fine-tuned LLMs can interact with each other and with APIs/services to execute a task. This phenomenon of building Agents will become the focal point of the industry. In the words of LangChain’s CEO Harrison Chase, it’s like “running LLMs in a for-loop, and asking the LLM to reason and plan what the next best step is to achieve the task at hand.” But running LLMs at scale, as required in a complex workflow, isn’t clear-cut yet.
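To make the “for-loop” idea concrete, here is a minimal sketch of an agent loop: the model proposes the next step, a tool executes it, and the observation is fed back until the model declares it is done. The `call_llm` stub and the tool registry are hypothetical stand-ins, not any particular framework’s API.

```python
# Minimal agent-loop sketch. `call_llm` is a placeholder for any
# chat-completion call (OpenAI, AI21, a local model, ...).
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"(search results for '{q}')",  # toy tool
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = f"Task: {task}\n"
    for _ in range(max_steps):  # "LLMs in a for-loop"
        decision = call_llm(
            history + "\nDecide the next step as 'tool: input' or 'FINISH: answer'."
        )
        if decision.startswith("FINISH:"):
            return decision.removeprefix("FINISH:").strip()
        tool_name, _, tool_input = decision.partition(":")
        tool = TOOLS.get(tool_name.strip(), lambda x: "unknown tool")
        history += f"\n{decision}\nObservation: {tool(tool_input.strip())}\n"
    return "Stopped: step budget exhausted."
```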

We’re well past the debut of ChatGPT in November 2022, but we’ve arrived at a strange phase of limbo. New foundational models are delivering only marginal gains, and the critical components of these would-be workflows can’t feasibly ingest and process the steps of an Agentic AI workflow, like reflection or planning, let alone multi-agent collaboration. Why? The Transformer architecture doesn’t quite fit the bill, and the growing demands are straining it. Context windows and costs are the main points of contention.

Among the LLM leaderboards, the dust is beginning to settle, hinting at which architecture might lead us towards Agentic AI.

Or Dagan, VP of Foundational Models at AI21 Labs, has been working on this pioneering technology since 2018, having led the development of Wordtune and the company’s latest family of foundational models, which is receiving acclaim for steering the industry in a new direction. “Transformers have been pivotal in advancing Natural Language Processing, yet their reliance on extensive memory, quadratic processing, and demanding compute resources poses significant hurdles,” explained Dagan. “These challenges make it costly and impractical to scale Transformers for tasks involving lengthy documents or vast datasets.”

Long Context and Accuracy Over Multimodality

If there’s been a differentiating moment in the industry, it isn’t multimodality, at least not yet. New releases like OpenAI’s GPT-4o and even Apple’s 4M model are beyond impressive, pushing the envelope with capabilities for creating and editing images and video, and even 3D asset creation with Meta’s 3D TextureGen. But it’s still the basics that matter. For the Agentic AI realm, enterprise adoption hinges on text, data analysis and coding tasks, combined with accurate outputs from long inputs.

The current problem with long-context prompts is that models aren’t reliably able to generate a coherent response grounded in internal data; the result is what the industry calls hallucinations. Retrieval-Augmented Generation (RAG) is touted as a solution to hallucinations. It excels in scenarios where keyword-based searches suffice, like retrieving factual information about well-defined topics. Yet challenges arise in more complex tasks requiring abstract reasoning, where keyword searches may fail to pinpoint relevant documents, or where models ignore retrieved context in favor of internal knowledge. Moreover, implementing it is resource-intensive. RAG alone cannot entirely eliminate model hallucinations, and newer model architectures need to make better use of RAG-retrieved data.
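For readers unfamiliar with the pattern, here is a bare-bones RAG sketch: score documents against the query, prepend the best matches to the prompt, and have the model answer from that context. The `call_llm` stub is hypothetical, and the toy keyword-overlap scoring stands in for the embedding similarity a real system would use.

```python
# Minimal RAG sketch under the assumptions above (not a production pipeline).
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for any completion API

def retrieve(query: str, docs: List[str], k: int = 3) -> List[str]:
    # Toy relevance score: shared lowercase words between query and document.
    q_terms = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)[:k]

def rag_answer(query: str, docs: List[str]) -> str:
    context = "\n\n".join(retrieve(query, docs))
    prompt = (
        "Answer using only the context below. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```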

Some newer directions in RAG include the variant GraphRAG, which uses graph-based representations of data and knowledge, that is, a network of interconnected nodes and edges. It’s a technique industry experts claim will alleviate hallucinations in long-context prompts that require reasoning and understanding across the entire prompt. But we’re still not free to feed our models the lengthy prompts that are part and parcel of a typical Agentic AI workflow.
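A toy illustration of the GraphRAG idea: knowledge is stored as entities (nodes) and relations (edges), and retrieval pulls in the neighborhood of the entities mentioned in the query, so multi-hop connections make it into the prompt. The graph contents and traversal here are deliberately simplified assumptions, not any vendor’s implementation.

```python
# Sketch of graph-based retrieval: collect relation triples near seed entities.
from collections import deque
from typing import Dict, List, Set, Tuple

# Adjacency list: entity -> list of (relation, neighbor). Example data only.
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "Jamba": [("built_by", "AI21 Labs"), ("combines", "Mamba"), ("combines", "Transformer")],
    "Mamba": [("is_a", "state space model")],
}

def neighborhood(entities: Set[str], hops: int = 2) -> List[str]:
    facts, frontier, seen = [], deque((e, 0) for e in entities), set(entities)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for relation, neighbor in GRAPH.get(node, []):
            facts.append(f"{node} {relation} {neighbor}")
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return facts

# The retrieved triples would be serialized into the prompt instead of raw chunks.
print(neighborhood({"Jamba"}))
```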

Mamba Hybrid State Space Model LLM Architecture

In a departure from pure Transformer-based architectures, AI21 Labs’ new architecture offers a robust, scalable solution to the lengthy texts and accurate outputs we desperately need now. It’s called the Jamba model, with an instruction-tuned Jamba-Instruct variant available to the commercial community. It combines Transformer layers with Mamba layers, along with several mixture-of-experts (MoE) modules. The advantages of such an architecture are high throughput for very long contexts and a reduced memory footprint. In other words, it offers ultra-low inference cost and accepts huge text inputs.
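To visualize the hybrid stacking idea, here is a schematic layer plan (not AI21’s actual code): many Mamba layers interleaved with occasional attention layers, and some feed-forward blocks swapped for MoE blocks. The 1:7 attention-to-Mamba ratio and every-other-layer MoE used below are assumptions loosely based on the published Jamba description; the sketch is illustrative only.

```python
# Schematic of a hybrid Transformer/Mamba/MoE layer stack (illustrative only).
def build_layer_plan(n_blocks: int = 4, layers_per_block: int = 8) -> list:
    plan = []
    for _ in range(n_blocks):
        for i in range(layers_per_block):
            mixer = "attention" if i == 0 else "mamba"  # 1 attention : 7 mamba (assumed ratio)
            ffn = "moe" if i % 2 == 1 else "dense"      # MoE on every other layer (assumed)
            plan.append(f"{mixer} + {ffn}")
    return plan

for idx, layer in enumerate(build_layer_plan(n_blocks=1)):
    print(idx, layer)
```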

Mamba is a novel deep learning architecture, built on selective state space models (SSMs), that aims to address some of the limitations of traditional Transformer models. Unlike pure Transformers, which attend over every token in the context, Mamba compresses the sequence into a fixed-size hidden state that is updated as the model reads, with input-dependent (selective) parameters deciding what to keep and what to forget. This lets the model maintain context over long sequences, and improve coherence in generated text, with compute and memory that grow linearly rather than quadratically, making it a promising alternative to current architectures.

Because that state carries historical information forward, the Mamba architecture can access and use earlier context effectively while remaining accurate and contextually relevant. And because it dispenses with the Transformer’s ever-growing key-value cache, training and inference on long sequences are cheaper. These features make Mamba a strong contender for next-generation LLMs, providing improved performance and flexibility.
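The core mechanism can be shown in a few lines. Below is a bare-bones linear state space recurrence of the kind Mamba builds on: the sequence is folded into a fixed-size hidden state, so memory does not grow with context length the way a Transformer’s KV cache does. Mamba’s actual selective, input-dependent parameterization and hardware-aware scan are omitted; this is only a sketch with toy values.

```python
# Toy linear state space recurrence (the building block Mamba refines).
import numpy as np

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_in); returns outputs of shape (seq_len, d_out)."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:                 # linear in sequence length
        h = A @ h + B @ x_t       # update the fixed-size state
        outputs.append(C @ h)     # read out from the state
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_state, d_in, d_out, seq_len = 16, 4, 4, 1024
y = ssm_scan(rng.normal(size=(seq_len, d_in)),
             A=0.9 * np.eye(d_state),
             B=0.1 * rng.normal(size=(d_state, d_in)),
             C=0.1 * rng.normal(size=(d_out, d_state)))
print(y.shape)  # (1024, 4) -- constant state size regardless of seq_len
```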

“Jamba-Instruct offers an inference context window of 256,000 tokens, and can actually use all those 256,000 tokens effectively,” said Dagan. Balancing long context with low cost, accuracy and speed is the name of the game right now, and most competitors’ promises are not being met. RULER, a recent analysis by NVIDIA, illustrates that leaderboard models ostensibly claim long context windows with lightning latency, but under thorough testing their “effective” context lengths, where the model still produces good results, are half of what’s advertised or less. Moreover, many surreptitiously corral their models to accept longer context only in the lab, or on a small percentage of usage. It could be considered the ‘window washing’ of the LLM community. Case in point: Google’s Gemini models, with a dubious 2-million-token context window, are taking a beating in the community.

The Jamba model is built on a hybrid State Space Model (SSM)-Transformer architecture that allows for a well-balanced combination of the priorities we need for Agentic AI workflows. Speaking to the cost of inference, Dagan noted that “a 100K token prompt clocks in at 5 cents,” on par with models like Claude 3 Haiku, Command R and Phi-3.
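How would you check an “effective” context claim yourself? Here is a rough probe in the spirit of needle-in-a-haystack tests: bury a fact at varying depths in long filler text and see whether the model can still retrieve it. The `call_llm` stub is hypothetical, and real evaluations like RULER use many tasks, needles and lengths before drawing conclusions.

```python
# Rough effective-context probe (illustrative assumption, not the RULER suite).
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in

def needle_test(context_tokens: int, depth: float, filler: str = "The sky is blue. ") -> bool:
    needle = "The secret passphrase is 'jacaranda-42'."
    n_fill = max(context_tokens // 5, 1)           # crude words-per-token estimate
    haystack = [filler] * n_fill
    haystack.insert(int(depth * n_fill), needle)   # bury the needle at a given depth
    prompt = "".join(haystack) + "\nWhat is the secret passphrase?"
    return "jacaranda-42" in call_llm(prompt)

# Sweep lengths and depths; the length where retrieval collapses approximates the
# model's effective context, which may be far below the advertised window.
```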

To alleviate model hallucinations, Jamba’s effective long context window coalesces with their Retrieval-Augmented Generation (RAG) engine, allowing enterprises to actually build agents based on their own data.

An instruct-tuned model is a type of AI model that has been fine-tuned specifically to understand and follow human instructions more effectively. This process involves additional training on a dataset that contains various types of instructions and corresponding responses, making the model better suited for tasks that require interpreting and acting on user commands. This fine-tuning enhances the model’s ability to generate appropriate, context-aware responses in line with the given instructions. Credit: AI21 Labs.

“Most model developers are cost-aware today, and it’s grounded in enterprise clients’ priority for production usage. We have to negotiate between accuracy, throughput and price, and to address that, we built and released Jamba commercially.”

xLSTM, JEPA and Safe Superintelligence

AI21 Labs isn’t the only one trekking off the beaten path. xLSTM and JEPA are also gaining attention for tackling the growing pains inherent to today’s pure Transformer-based LLM architectures.

Long Short-Term Memory (LSTM), once dismissed by experts, is experiencing a resurgence. Sepp Hochreiter recently spoke at the Cyber Valley Days hosted by the ELLIS Institute in Tübingen, where he said his PhD advisor totally overlooked his work on LSTM in his dissertation, and yet, today, companies profit billions off of his architecture. “We published the ELU activation function too and I got a mojito out of it.” Commenting on the future of LLMs, he said: “We need to get rid of this Transformer nonsense. It’s very inefficient, and people don’t pay attention to a large window of words, they remember abstract representations of past text.” In light of this, Hochreiter proposed the xLSTM architecture to do just that, but alluded to the costly aspects of testing it at a large scale. He is pursuing this new direction with his research startup, NXAI.

In the Meta world, Joint Embedding Predictive Architecture (JEPA) represents another departure from traditional Transformer models. Proposed by Yann LeCun, JEPA addresses the limitations observed in current Transformers, particularly in understanding context and predicting future states. Unlike conventional models, JEPA employs self-supervised learning techniques to generate abstract representations of the world. It processes pairs of related inputs, such as sequential video frames, focusing on essential features while filtering out irrelevant details. This strategy enables JEPA to effectively predict future states and infer missing information, adeptly managing uncertainty in data.
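The training signal behind JEPA can be sketched in a few lines: encode a context view and a target view, predict the target’s embedding from the context, and penalize the distance in representation space rather than reconstructing raw pixels or tokens. The linear encoders and predictor below are placeholder assumptions, not Meta’s implementation.

```python
# Minimal JEPA-style objective sketch (placeholder encoders, toy data).
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_emb = 32, 8
context_encoder = 0.1 * rng.normal(size=(d_emb, d_obs))
target_encoder = 0.1 * rng.normal(size=(d_emb, d_obs))   # typically an EMA copy in practice
predictor = 0.1 * rng.normal(size=(d_emb, d_emb))

def jepa_loss(context_view: np.ndarray, target_view: np.ndarray) -> float:
    z_context = context_encoder @ context_view
    z_target = target_encoder @ target_view              # treated as a fixed target
    z_pred = predictor @ z_context                        # predict the target's embedding
    return float(np.mean((z_pred - z_target) ** 2))       # distance in latent space

frame = rng.normal(size=d_obs)                            # e.g. one video frame
next_frame = frame + 0.01 * rng.normal(size=d_obs)        # a related, later frame
print(jepa_loss(frame, next_frame))
```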

NVIDIA also recently released DoRA, which tackles the inefficiencies of traditional fine-tuning methods like LoRA by enhancing learning capacity and stability without introducing additional inference cost. It’s a fine-tuning technique rather than a full model architecture, but it reflects the same industry priority: overcoming the limitations of current approaches.
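For context, here is a sketch of the low-rank adaptation family DoRA builds on: LoRA trains a small low-rank update B·A instead of the full weight matrix, while DoRA additionally splits the weight into a per-column magnitude and a unit-norm direction and applies the low-rank update to the direction. Both can be merged back into the weight after training, which is why inference cost is unchanged. The shapes and values below are toy assumptions.

```python
# LoRA / DoRA merged-weight sketch with toy values (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = 0.01 * rng.normal(size=(rank, d_in))    # trainable low-rank factor
B = np.zeros((d_out, rank))                 # zero init so the update starts at 0

# LoRA: merged weight is W + B @ A.
W_lora = W + B @ A

# DoRA-style decomposition: per-column magnitude times unit-norm direction,
# with the low-rank update applied inside the direction term.
magnitude = np.linalg.norm(W, axis=0, keepdims=True)    # trainable scale per column
direction = W + B @ A
direction /= np.linalg.norm(direction, axis=0, keepdims=True)
W_dora = magnitude * direction                           # same shape as W, mergeable

print(W_lora.shape, W_dora.shape)
```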

In Developer Activity We Trust

Aside from model performance, another key factor that will carry us into an Agentic AI era is uninhibited, organic developer activity. To date, our trust in foundational model providers is almost blind. We are de facto trusting them with the entirety of our data, and such responsibility warrants true integrity. And yet only a fraction of them divulge the details needed for reproducibility. Open source code, open weights, training data: there’s a penchant for opacity.

OpenAI’s CTO, Mira Murati, prevaricated when pressed on the origins of the company’s training data earlier this year. And while that was merely less than transparent, all major web applications for LLMs experienced a nearly 48-hour blackout last month, without explanation, leading some industry cyber experts to ascribe it to an industry-wide cyber attack that remains undeclared.

Transparency and safety are paramount for any user, especially enterprises attempting to feed their entire database into an Agentic AI workflow.

“We really want to keep pushing the envelope with AI developments, and chose to publish Jamba’s base model as open-weights because we believe the developer community plays a seminal role in bringing us to better training and inference methods,” added Dagan. “Our new architecture is a major power-up from our previous models. And with people discovering new things and experimenting with it, we’re really excited to foster a better, collaborative direction, marked with fidelity to safety and transparency.”

Dagan and his team are among a small pocket of foundational model companies with integrity as their guiding north star. Ilya Sutskever, former co-founder and Chief Scientist of OpenAI, recently launched a new startup with the descriptive name, Safe Superintelligence. While details about the specific model architecture remain under a tight lid, the company’s mission is also a clarion call on the growing importance of safety and ethics in LLM development. Sutskever’s experience and vision suggest that the startup will prioritize creating robust and secure AI foundational models capable of handling complex challenges while ensuring alignment with human values.

While we move forward at a clip of innovation-by-the-week, Agentic AI is certainly on the horizon, and its coming of age will be inextricably tied to modern working life. It remains to be seen which architecture will carry us to the finish line, and industry experts, experts though they are, often get it wrong. Even if the next wave of foundational models costs $100 billion to train, adhering to the exponential curve with chip advancements in parallel, the bells and whistles they bring may still fall short on the basics of cost and context.

“I think one of the great things about Jamba is that it solves some of the current pains in the LLM space by trying something new and different,” Dagan reflected. “When we look at the key challenges in the current phase from LLMs into AI systems and Agents, we identify a few other key places where similar, novel approaches are in dire need for the technology to successfully make the next big jump. It’s exciting.”
